In Part I, we spent a lot of time familiarizing ourselves with the context of the project and the datasets we are working with. Doing that thoroughly up front can save hours later when you begin working with the data, which is exactly what we will do in this post!

Part II Organization

  1. Introduction to Supervised Machine Learning
  2. Understanding the “Train” dataset we need to create to build our model
  3. Creating the “Train” dataset

Step One

Supervised Machine Learning

In trying to predict March Madness matchups from historical results, we are engaging in a “supervised learning” task. This is the machine learning term for inferring a function from training data. That function is then used to make predictions on new examples (our “Test” dataset).

This is an awesome chart of the workflow involved in a supervised machine learning task. A couple of things to note:

  1. We scaled our data by using win percentage as a predictor rather than total wins or total losses. Scaling your data is important for accurate analysis.
  2. The validation set in the chart is the same as our “Test” dataset: the most recent five seasons of the NCAA tournament, not including the one going on right now!
  3. The new data in the chart would be the regular season statistics for the current season, allowing us to make March Madness predictions! Note that we don’t do this here, but you can go to Kaggle to see the data!
  4. We do this for fun so forget about the profit part 🙂

If you went through our first blog post on “Titanic: Machine Learning from Disaster,” you will recognize that logistic regression is again an appropriate statistical model for making inferences from our “Train” dataset, since what we are predicting, win or lose, is binary. From now on we will refer to what we are predicting as the response.
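As a quick preview of the modeling step, here is a minimal sketch of logistic regression in R using glm(). The data and the single predictor column (TWPCT) are made up for illustration; they are not the real “Train” dataset we build later in this post.

```r
# A minimal sketch of logistic regression with glm().
# The data and the TWPCT predictor are made up for illustration only.
toy_train <- data.frame(
  TWPCT = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2),
  Win   = c(1,   1,   0,   1,   0,   0,   1,   0)
)

# family = binomial makes glm() fit a logistic regression,
# which is appropriate for a binary (0/1) response
fit <- glm(Win ~ TWPCT, data = toy_train, family = binomial)

# Predicted probability of a win for a team with a 0.85 win percentage
predict(fit, newdata = data.frame(TWPCT = 0.85), type = "response")
```

The fitted function maps a predictor value to a probability between 0 and 1, which is exactly the “inferred function” the workflow chart describes.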

Step Two

Understanding the “Train” dataset

Let’s go back and remember the variables you may have brainstormed that could predict the outcome of any given matchup. We will refer to these variables as the predictors, which are used in our model to predict the response (win or lose). Some of the variables we came up with that we will implement are:

  • Total Season Win Percentage
  • Number of Wins in a Team’s Last 6 Games

Using Regular Season Statistics as Predictors

You should notice that our predictors are all team statistics for the regular season. Our assumption is that regular season performance is predictive of performance in the playoffs, so for any given playoff game we will use statistics from the regular season to predict the result. An added level of depth is that we are predicting matchups, which involve two teams. Therefore, we can use the regular season statistics from both teams to predict the result of the game. This data becomes the “Train” dataset which we use to build our model.

Step Three

Creating the “Train” dataset

Since we have the regular season statistics and the playoff results for the past 18 seasons, we can take a subset of this data and use it to “train a model”. The Kaggle competition directs us to the subset we should take: because they ask for predictions for the last 5 seasons, we can use the first 13 as data to train our model. This process of taking a subset of the data for analysis and then verifying your analysis on the remaining data is known as validation; when it is repeated multiple times with different subsets of the data and the results averaged, it is known as cross validation. Reducing variability in this way matters because it tells us our results are stable rather than an artifact of one particular split.
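Concretely, the split described above looks like this (a sketch, assuming the 18 seasons are labeled with consecutive letters as in the Kaggle data files):

```r
# Sketch of the hold-out split: 18 seasons, labeled with consecutive letters
seasons       <- LETTERS[1:18]
train_seasons <- seasons[1:13]    # first 13 seasons: used to train the model
test_seasons  <- seasons[14:18]   # last 5 seasons: held out for validation
```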

Determining our Response Values

We can use the results from each of the matchups in the 13 playoff tournaments as our response. In each of those games we know who played, their regular season statistics, and who won. Therefore, the first step is to create a data frame in R that lists these matchups in one column and the result of the game in another.

Let’s just try doing this with season A first. In the code below we first select the games from season A and name that data frame season_matches. Then we loop through each row of season_matches and concatenate the season “A” with the teamIDs of the two teams, always placing the lower teamID first. We place these newly formed strings into a new data frame train_data_frame along with the result of the game (Win = 1 when the lower-ID team won, 0 otherwise).

season_matches <- tourneyRes[which(tourneyRes$season == "A"), ]
team <- vector()
result <- vector()
for(i in c(1:nrow(season_matches))) {
  row <- season_matches[i, ]
  if(row$wteam < row$lteam) {
    vector <- paste("A", "_", row$wteam, "_", row$lteam, sep = "")
    team <- c(team, vector)
    result <- c(result, 1)
  } else {
    oth <- paste("A", "_", row$lteam, "_", row$wteam, sep = "")
    team <- c(team, oth)
    result <- c(result, 0)
  }
}
train_data_frame <- data.frame("Matchup" = team, "Win" = result)
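For reference, the same matchup and result columns can be built without an explicit loop using the vectorized pmin(), pmax(), and ifelse() functions. This is a sketch with made-up teamIDs standing in for the season-A rows of tourneyRes:

```r
# Toy stand-in for the season-A rows of tourneyRes (wteam beat lteam)
season_matches <- data.frame(
  wteam = c(1104, 1272, 1328),
  lteam = c(1328, 1104, 1140)
)

# The lower teamID always goes first in the matchup string;
# Win = 1 when the first-listed (lower-ID) team won, else 0
low  <- pmin(season_matches$wteam, season_matches$lteam)
high <- pmax(season_matches$wteam, season_matches$lteam)
train_data_frame <- data.frame(
  Matchup = paste("A", low, high, sep = "_"),
  Win     = ifelse(season_matches$wteam < season_matches$lteam, 1, 0)
)
```

Vectorized code like this also avoids growing vectors inside a loop, which gets slow as the number of games increases.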

Your console in RStudio should look something like this.

Creating Your Predictors

Now that we have our response values we can begin to organize our predictors. To do this we first need to create a dataset which is organized by teamID rather than season. Again we will just do this for season A for brevity.

#Installing and loading the stringr package (used below for str_match)
#install.packages("stringr")
library(stringr)

#Selecting and sorting the playoff teamIDs least to greatest for season A
playoff_teams <- sort(tourneySeeds$team[which(tourneySeeds$season == "A")])

#Selecting the seeds for season A
playoff_seeds <- tourneySeeds[which(tourneySeeds$season == "A"), ]

#Selecting the regular season statistics for season A
season <- regSeason[which(regSeason$season == "A"), ]

#Wins by team
win_freq_table <-$wteam))
wins_by_team <- win_freq_table[win_freq_table$Var1 %in% playoff_teams, ]

#Losses by team
loss_freq_table <-$lteam))
loss_by_team <- loss_freq_table[loss_freq_table$Var1 %in% playoff_teams, ]

#Total Win Percentage
gamesplayed <- as.vector(wins_by_team$Freq + loss_by_team$Freq)
total_winpct <- round(wins_by_team$Freq / gamesplayed, digits = 3)
total_winpct_by_team <- data.frame(cbind(as.vector(wins_by_team$Var1), total_winpct))
colnames(total_winpct_by_team) <- c("Var1", "Freq")

#Num of wins in last 6 games
wins_last_six_games_by_team <- data.frame()
for(i in playoff_teams) {
  games <- season[which(season$wteam == i | season$lteam == i), ]
  numwins <- sum(tail(games$wteam) == i)
  put <- c(i, numwins)
  wins_last_six_games_by_team <- rbind(wins_last_six_games_by_team, put)
}
colnames(wins_last_six_games_by_team) <- c("Var1", "Freq")

#Seed in tournament
pattern <- "[A-Z]([0-9][0-9])"
team_seeds <- as.data.frame(str_match(playoff_seeds$seed, pattern))
seeds <- as.numeric(team_seeds$V2)
playoff_seeds$seed  <- seeds
seed_col <- vector()
for(i in playoff_teams) {
  val <- match(i, playoff_seeds$team)
  seed_col <- c(seed_col, playoff_seeds$seed[val])
}
team_seed <- data.frame("Var1" = playoff_teams, "Freq" = seed_col)

#Combining columns together
team_metrics <- data.frame()
team_metrics <- cbind(total_winpct_by_team, wins_last_six_games_by_team$Freq, team_seed$Freq)
colnames(team_metrics) <- c("TEAMID", "A_TWPCT","A_WST6", "A_SEED")
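A side note on the bookkeeping above: the win and loss tables are paired up by row position, which assumes both tables list the playoff teams in the same order. A merge() on the teamID column is a safer way to pair them. Here is a sketch with made-up teamIDs and win/loss counts:

```r
# Toy win/loss frequency tables keyed by teamID (Var1)
wins_by_team <- data.frame(Var1 = c(1104, 1140, 1328), Freq = c(20, 18, 25))
loss_by_team <- data.frame(Var1 = c(1104, 1140, 1328), Freq = c(10, 12, 5))

# merge() joins on Var1, so rows pair up by teamID rather than by position
wl <- merge(wins_by_team, loss_by_team, by = "Var1", suffixes = c(".w", ".l"))
wl$TWPCT <- round(wl$Freq.w / (wl$Freq.w + wl$Freq.l), digits = 3)
```

If a team somehow appeared in one table but not the other, merge() would drop it (or flag it with all = TRUE) instead of silently misaligning the rows.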

If you want to look at any of the data frame pieces we create, just utilize the head() function. For example:

head(team_metrics)

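Incidentally, the seed-parsing step above can also be done without stringr, using base R's sub() with the same "[A-Z]([0-9][0-9])" pattern. A sketch on made-up seed strings:

```r
# Toy seed strings in the tournament-seed format: region letter + 2-digit seed
seed_strings <- c("W01", "X16", "Y09")

# Keep only the captured two digits and convert them to numbers
seeds <- as.numeric(sub("[A-Z]([0-9][0-9]).*", "\\1", seed_strings))
```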
NOTE: For the actual competition we used the following predictors:

  • Away Game Winning Percentage
  • Wins by margin less than 2
  • Losses by margin less than 2
  • Wins by margin greater than 7
  • Losses by margin greater than 7
  • Win Percentage in last 4 weeks
  • Win Percentage against playoff teams
  • Number of wins in last 6 games of the season
  • Seed in Tournament
  • Wins in Tournament

Putting the Pieces Together and Creating Your “Train” Dataset

So now we have our predictors, organized by individual teams’ regular season statistics, and we have our response, the results of tournament play.

Take a look at the image below, which shows the data frames we have created and how we want to combine them to create our “Train” dataset. Data frame (DF) 1 and DF 2 are the predictors organized by individual team (the team_metrics data frame); DF 2 is a copy of DF 1 with changed column names. You can also think of these as the home team and away team data frames. DF 3 is the response, Win or Loss, along with the teams involved in each game (train_data_frame). Remember, we only show the first six rows of each of these data frames!


DF 4 is our “Train” dataset. How did we get there? If you look at the teamIDs on the left side of each “Matchup” in DF 3, you should notice that that is the order in which we need to re-order DF 1. Likewise, the teamIDs on the right side of the “Matchups” in DF 3 give the order in which we need to re-order DF 2. Once you do so, you have two new data frames organized in the correct order, and when you bind their columns together you create the correct matchups and, ultimately, the “Train” data frame!

This is quite confusing, so we will explain it step by step:
1. Sort DF 1 by the teamIDs defined in the column titled “A_ID” in DF 4
2. Sort DF 2 by the teamIDs defined in the column titled “B_ID” of DF 4
3. Combine the columns from the two data frames in 1 and 2 to create your “Train” data set!
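The three steps above can be sketched with match(). This is a toy version using the team_metrics and train_data_frame shapes from earlier in this post; the B_* column names are illustrative, not prescribed by the competition:

```r
# Toy versions of the data frames built earlier in this post
team_metrics <- data.frame(
  TEAMID  = c(1104, 1140, 1328),
  A_TWPCT = c(0.667, 0.600, 0.833)
)
train_data_frame <- data.frame(
  Matchup = c("A_1104_1328", "A_1140_1328"),
  Win     = c(1, 0)
)

# Split each "season_teamA_teamB" string into its two teamIDs
parts <- strsplit(as.character(train_data_frame$Matchup), "_")
a_ids <- as.numeric(sapply(parts, `[`, 2))
b_ids <- as.numeric(sapply(parts, `[`, 3))

# Steps 1-2: re-order the per-team stats into the matchup order
df1 <- team_metrics[match(a_ids, team_metrics$TEAMID), ]
df2 <- team_metrics[match(b_ids, team_metrics$TEAMID), ]

# Step 3: bind the columns together to form the "Train" rows
trainData <- data.frame(
  A_ID = df1$TEAMID, A_TWPCT = df1$A_TWPCT,
  B_ID = df2$TEAMID, B_TWPCT = df2$A_TWPCT,
  Win  = train_data_frame$Win
)
```

Each row of trainData now holds the left team's predictors, the right team's predictors, and the game result, which is exactly the shape a model can be fit to.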

We provide utility functions that do this process over all seasons A–M. To use these functions and create your “Train” dataset, download the file “blog_utility.R” and place it in your “Kaggle” folder on your desktop, then use the following code. You can download the “blog_utility.R” file at this link.

source("blog_utility.R")

trainData <- data.frame()
for(i in LETTERS[1:13]) {
  trainData <- rbind(trainData, train_frame_model(i))
}
Take a look at the “Train” dataset we have just created with the head() function.


You have now created your “Train” dataset with the following predictors: winning percentage, number of wins in the last six games of the season, and seed in the tournament! In the final post we will build our “Test” dataset and build a machine learning model! Thanks!

You can go on to Part III here!

If you want more practice data projects, be sure to check out