Recap from Part II

In Part II we built our “Train” dataset using three predictors: total win percentage, number of wins in the last six games, and tournament seed. Our response is the actual results of the NCAA tournament for seasons A through M. In Part III we build our machine learning model using classification trees. First we train the model on the “Train” dataset, then we create our “Test” dataset and make predictions with the trained model.


  1. Applying a Classification Tree Model
  2. Creating our “Test” dataset
  3. Making Predictions using our Model
  4. Conclusion

Step One

Applying a Classification Tree Model

For a more in-depth explanation of how classification tree models work, see this YouTube video.

R has a package called “rpart” which lets us apply the classification tree algorithm quite easily.

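library(rpart)
# Win is the response; replace ... with the predictor columns built in Part II.
train_rpart <- rpart(Win ~ ..., data = trainData, method = "class")

NOTE: WordPress does not display the ~ in the formula above correctly, so you need to type it in manually for the R code to work.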
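To make the model-fitting step concrete, here is a minimal sketch that fits a classification tree with rpart() on a small synthetic dataset. The column names (A_TWPCT, B_SEED, and so on) and the simulated outcomes are assumptions made up for illustration; they are not the actual columns or results from our “Train” data.

```r
library(rpart)

set.seed(42)
n <- 200

# Hypothetical predictor columns for team A and team B in each matchup.
toyTrain <- data.frame(
  A_TWPCT = runif(n, 0.4, 0.95),              # total win percentage
  B_TWPCT = runif(n, 0.4, 0.95),
  A_WST6  = sample(0:6, n, replace = TRUE),   # wins in the last six games
  B_WST6  = sample(0:6, n, replace = TRUE),
  A_SEED  = sample(1:16, n, replace = TRUE),  # tournament seed
  B_SEED  = sample(1:16, n, replace = TRUE)
)

# Synthetic response: the better (lower) seed wins most of the time.
p_win <- plogis(0.3 * (toyTrain$B_SEED - toyTrain$A_SEED))
toyTrain$Win <- factor(rbinom(n, 1, p_win), levels = c(0, 1))

# method = "class" asks rpart for a classification tree.
toy_rpart <- rpart(Win ~ A_TWPCT + B_TWPCT + A_WST6 + B_WST6 + A_SEED + B_SEED,
                   data = toyTrain, method = "class")

print(toy_rpart)
```

Printing the fitted object lists the splits the tree chose, which is a quick way to check which predictors the model is actually using.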

Step Two

Creating our Test Data Set

The process for creating our “Test” dataset is very similar to the one we used to create the “Train” dataset. The difference is that, whereas we used historical tournament results as our response, we now want to predict the outcome of every possible matchup between teams in tournaments N through R. Those matchups determine which team IDs we need to extract statistics for, and we predict their results using the model we built from the “Train” data.

We have again created a function that you can use to expedite this process. All you need to do is loop over the seasons to create the “Test” data for each one.

testData <- data.frame()
for (i in LETTERS[14:18]) {
  testData <- rbind(testData, ...)  # ... = call to the helper function for season i
}
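As a rough illustration of what “all possible combinations of teams” means, the sketch below enumerates every unordered pair of teams within each season. The team IDs, the buildMatchups() helper, and the season_teamA_teamB ID format are all hypothetical stand-ins, not the function we used; a real version would also attach each team's statistics.

```r
# Hypothetical tournament teams per season (real IDs come from the seed data).
teamsBySeason <- list(N = c(503, 511, 527, 540), O = c(505, 511, 530))

# Build one row per unordered pair of teams, lower team ID first.
buildMatchups <- function(season, teams) {
  pairs <- t(combn(sort(teams), 2))  # choose(n, 2) rows, 2 columns
  data.frame(
    Matchup = paste(season, pairs[, 1], pairs[, 2], sep = "_"),
    season  = season,
    teamA   = pairs[, 1],
    teamB   = pairs[, 2],
    Win     = NA,                    # unknown: this is what we will predict
    stringsAsFactors = FALSE
  )
}

# Stack the matchups for every season into one data frame.
toyTest <- do.call(rbind, lapply(names(teamsBySeason), function(s) {
  buildMatchups(s, teamsBySeason[[s]])
}))

print(head(toyTest))
```

Four teams give six matchups and three teams give three, so this toy example produces nine rows. With the number of teams in a real tournament field, the row count per season grows quickly, which is why a helper function and a loop are convenient.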

Step Three

Making Predictions

You should notice that our “Test” and “Train” datasets are in exactly the same format, except that the “Win” column in the “Test” dataset is NA. We can again use a handy R function, predict(), to make predictions with the model we built.

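predictions_rpart <- predict(train_rpart, newdata = testData)
predictions <- ...  # win probabilities extracted from predictions_rpart
subfile <- ...      # submission data frame built from predictions
write.csv(subfile, file = "model.csv", row.names = FALSE)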
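The prediction step can be sketched end to end on toy data as follows. Everything here is an assumption for illustration: the single seedDiff predictor, the Matchup IDs, and the id/pred column names of the output file are made up, and the file is written to a temporary path rather than model.csv.

```r
library(rpart)

set.seed(1)

# Toy training data: seedDiff = team A seed minus team B seed (hypothetical).
toyTrain <- data.frame(seedDiff = sample(-15:15, 300, replace = TRUE))
toyTrain$Win <- factor(rbinom(300, 1, plogis(-0.25 * toyTrain$seedDiff)),
                       levels = c(0, 1))

# Toy test data in the same format, with Win unknown.
toyTest <- data.frame(Matchup  = paste0("N_", 1:5),
                      seedDiff = c(-12, -3, 0, 4, 10),
                      Win      = NA)

fit <- rpart(Win ~ seedDiff, data = toyTrain, method = "class")

# type = "prob" returns one column of class probabilities per factor level.
probs <- predict(fit, newdata = toyTest, type = "prob")

# Keep the probability of class "1" (team A wins) for each matchup.
subfile <- data.frame(id = toyTest$Matchup, pred = probs[, "1"])

outFile <- tempfile(fileext = ".csv")
write.csv(subfile, file = outFile, row.names = FALSE)
print(subfile)
```

Selecting the probability column by its level name ("1") rather than by position avoids picking the wrong class if the factor levels are ever ordered differently.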

Step Four

Conclusion

Thanks for reading; we hope it was informative! We would also appreciate any feedback from Kagglers on our process and variable selection. We also implemented SVM and GLM models, and we achieved the highest accuracy with Random Forests!
