Recap

In the last post, we prepared our working environment. We also got R/RStudio working, loaded our data, and followed it up with a bit of visualization and exploration. Now that we have a better understanding of the data, we’re ready to move on to the next part: manipulating the data to prepare to plug it into a model.

Cleaning the TRAIN Data

After doing some exploratory analysis of the data, we now need to clean it before we can create our model. Exploring the data first is important because it shows you which elements need to be cleaned. For example, you might have noticed that there are missing values in the data set, especially in the Age column.

Removing Variables Not Used for the Model

At this point, we remove the variables that we do not want to use in the training data for the model: PassengerId, Ticket, Fare, Cabin, and Embarked. To do so, we index our data set trainData with [ ]. The c() function combines the column numbers into a vector. By passing this vector with a negative sign, we tell R to drop those columns and keep the rest.

trainData = trainData[-c(1,9:12)]
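
To see negative indexing in isolation, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and only stand in for the real Titanic columns.

```r
# Toy data frame standing in for a few Titanic columns (values are made up).
toy = data.frame(PassengerId = 1:3,
                 Survived = c(0, 1, 1),
                 Ticket = c("A1", "B2", "C3"),
                 Fare = c(7.25, 71.28, 7.92))

# c(1, 3:4) builds the vector 1, 3, 4; the minus sign drops those columns.
toy_small = toy[-c(1, 3:4)]
names(toy_small)
```

Only the Survived column should remain, because columns 1, 3, and 4 were dropped.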

Replacing Gender variable (Male/Female) with a Dummy Variable (0/1)

Additionally, we need to convert qualitative variables (such as gender) into quantitative ones (0 for male, 1 for female, etc.) in order to fit our model. Note that some models can handle qualitative variables directly. We use the R function gsub(), which replaces every match of a text pattern with a value of our choosing. The result is stored as text, so Sex holds the strings "0" and "1" after this step.

trainData$Sex = gsub("female", 1, trainData$Sex)
trainData$Sex = gsub("^male", 0, trainData$Sex)
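
The sketch below runs the same two calls on a small made-up vector to show why the pattern for male is anchored with "^". The names sex, unanchored, and sex_numeric are hypothetical, and the conversion with as.numeric() is an optional extra step that assumes a numeric column is wanted later.

```r
sex = c("male", "female", "female", "male")

# Without the anchor, "male" also matches the tail of "female".
unanchored = gsub("male", 0, "female")   # gives "fe0"

# "^" ties the pattern to the start of the string, so only "male" matches.
sex = gsub("female", 1, sex)
sex = gsub("^male", 0, sex)
sex   # a character vector: "0" "1" "1" "0"

# Assumption: a numeric column is wanted downstream.
sex_numeric = as.numeric(sex)
```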

Making Inferences on Missing Age Values

Lastly, upon examining our dataset, we see that many entries for “age” are missing. Because age could be an important variable, we try inferring the missing values from a relationship between title and age; we’re essentially assuming that Mrs. X will be older than Miss X. Moreover, we’re (naively) assuming that people with the same title are close together in age.

So first, we collect the row indices of the passengers with each title into a vector for further processing. In R, we use the grep() function, which returns a vector of the row numbers whose Name contains the specified title.

master_vector = grep("Master.",trainData$Name, fixed=TRUE)
miss_vector = grep("Miss.", trainData$Name, fixed=TRUE)
mrs_vector = grep("Mrs.", trainData$Name, fixed=TRUE)
mr_vector = grep("Mr.", trainData$Name, fixed=TRUE)
dr_vector = grep("Dr.", trainData$Name, fixed=TRUE)
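
The fixed=TRUE argument matters here. Without it, grep() treats the pattern as a regular expression, in which "." matches any character, so "Mr." would also match "Mrs". A small sketch on three sample names (the vector names_toy is hypothetical):

```r
names_toy = c("Braund, Mr. Owen Harris",
              "Cumings, Mrs. John Bradley",
              "Heikkinen, Miss. Laina")

# Literal match: only the name containing exactly "Mr." is returned.
literal_hits = grep("Mr.", names_toy, fixed = TRUE)   # 1

# Regex match: "." is a wildcard, so "Mrs" is matched as well.
regex_hits = grep("Mr.", names_toy)                   # 1 2
```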

You might have noticed that there are other less frequent titles such as Reverend or Colonel which we are ignoring for now.

Next, we replace each name with a shortened tag. This means that the full name of an individual, such as “Allison, Master. Hudson Trevor”, is shortened to “Master”. This gives us a standardized column. This is done in the following code:

for(i in master_vector) {
  trainData$Name[i] = "Master"
}
for(i in miss_vector) {
  trainData$Name[i] = "Miss"
}
for(i in mrs_vector) {
  trainData$Name[i] = "Mrs"
}
for(i in mr_vector) {
  trainData$Name[i] = "Mr"
}
for(i in dr_vector) {
  trainData$Name[i] = "Dr"
}

Note that we used a for loop, which we explain below.

FOR LOOP
A for loop runs the same block of code once for each element in a range of data.
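
As a minimal sketch of the idea, the loop below doubles each value in a small made-up vector. The names ages and doubled are hypothetical and not part of the Titanic data.

```r
ages = c(22, 38, 26)
doubled = numeric(length(ages))   # empty numeric vector to hold the results

# seq_along(ages) yields 1, 2, 3, so the body runs once per element.
for (i in seq_along(ages)) {
  doubled[i] = ages[i] * 2
}
doubled
```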

Now that we have a series of standardized titles, we calculate the average age of each title.

Making Inference on Missing Age Values: Imputing Title-Group Averages

We replace the missing ages with their respective title-group average. This means that if we have a missing age entry for a man named Mr. Bond, we substitute the average age of all passengers with the title Mr. The same goes for Master, Miss, Mrs, and Dr. We then write a for loop that goes through the entire Train data set and checks whether the age value is missing. If it is, we assign an age according to the title of the observation. This code snippet is a bit complicated; you can just copy and paste for now if you’re not confident about understanding it!

master_age = round(mean(trainData$Age[trainData$Name == "Master"], na.rm = TRUE), digits = 2)
miss_age = round(mean(trainData$Age[trainData$Name == "Miss"], na.rm = TRUE), digits =2)
mrs_age = round(mean(trainData$Age[trainData$Name == "Mrs"], na.rm = TRUE), digits = 2)
mr_age = round(mean(trainData$Age[trainData$Name == "Mr"], na.rm = TRUE), digits = 2)
dr_age = round(mean(trainData$Age[trainData$Name == "Dr"], na.rm = TRUE), digits = 2)

for (i in 1:nrow(trainData)) {
  if (is.na(trainData[i,5])) {
    if (trainData$Name[i] == "Master") {
      trainData$Age[i] = master_age
    } else if (trainData$Name[i] == "Miss") {
      trainData$Age[i] = miss_age
    } else if (trainData$Name[i] == "Mrs") {
      trainData$Age[i] = mrs_age
    } else if (trainData$Name[i] == "Mr") {
      trainData$Age[i] = mr_age
    } else if (trainData$Name[i] == "Dr") {
      trainData$Age[i] = dr_age
    } else {
      print("Uncaught Title")
    }
  }
}
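
For comparison, the same idea can be written without an explicit loop. This is only a sketch of an alternative, not the approach used in this walkthrough; it relies on base R's ave() to compute a per-group mean, and the data frame toy and its values are made up.

```r
toy = data.frame(Name = c("Mr", "Mr", "Mr", "Miss", "Miss"),
                 Age  = c(30, NA, 40, 20, NA))

# ave() returns, for every row, the rounded mean age of that row's title group.
group_mean = ave(toy$Age, toy$Name,
                 FUN = function(x) round(mean(x, na.rm = TRUE), digits = 2))

# Keep known ages; fill missing ones with the group mean.
toy$Age = ifelse(is.na(toy$Age), group_mean, toy$Age)
toy$Age
```

In this toy example the missing Mr age becomes 35 and the missing Miss age becomes 20.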

In the above code snippet, we use an if-statement. If-statements are intended to help with decision making:

If-statement:
It executes a specific block of code, depending on certain true/false conditions.

if (some true/false statement_1) {
  #do this action if it is true
} else if (some other true/false statement_2) {
  #do this if the statement_1 wasn't true, but statement_2 ended up true
} else if (some other true/false statement_3) {
  #do this if the statement_1 and statement_2 were both not true, but statement_3 was true.
} else {
  #do this if none of the statement_* was true. Note that this final "else" is optional.
}

Ultimately if statements allow people to let programs make decisions.
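
Here is a small runnable version of the template. The variable names and the cut-off of 65 are made up for illustration; only the cut-off of 12 appears elsewhere in this walkthrough.

```r
age = 8

if (age <= 12) {
  group = "child"
} else if (age < 65) {
  group = "adult"
} else {
  group = "senior"
}
group
```

Because the first condition is true, the remaining branches are skipped.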

Quick Recap

At this point, we have accomplished the following:
– [x] loaded the data we intend to work with.
– [x] did some preliminary exploration into the data.
– [x] cleaned the data by converting the Sex variable to (0/1) and made inferences on the missing age entries.

Part of curating the data is also creating additional variables that may help with classifying and predicting which Test data passengers survived.

Creating New Variables to Strengthen Our Model

By creating new variables we may be able to predict the survival of the passengers even more closely. This part of the walkthrough specifically includes three variables which we found to help our model. Think about what the added variables mean; do they make intuitive sense? How might these variables affect the survival rate?

Variable 1: Child.

This additional variable choice stems from the fact that we suspect that being a child might affect the survival rate of a passenger.

We start by creating a child variable. This is done by appending an empty column to the dataset, titled “Child”.
We then populate the column with the value “1” if the passenger is aged 12 or under, and “2” otherwise.

trainData["Child"] = NA
for (i in 1:nrow(trainData)) {
  if (trainData$Age[i] <= 12) {
    trainData$Child[i] = 1
  } else {
    trainData$Child[i] = 2
  }
}
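
The loop above can also be expressed with the vectorized ifelse() function, which evaluates the condition for every row at once. A sketch on a made-up data frame (toy is a hypothetical name):

```r
toy = data.frame(Age = c(4, 12, 30))

# 1 where Age is 12 or under, 2 everywhere else.
toy$Child = ifelse(toy$Age <= 12, 1, 2)
toy$Child
```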

Variable 2: Family

This variable is meant to represent the family size of each passenger by adding the number of Siblings/Spouses and Parents/Children (we add 1 so that the minimum is 1). We’re guessing that larger families are less likely to survive, or perhaps it is the other way around. The beautiful part is that it doesn’t matter! The model we build will optimize for the problem. All we’re indicating is that there might be a relationship between family size and survival rate.

trainData["Family"] = NA

for(i in 1:nrow(trainData)) {
  x = trainData$SibSp[i]
  y = trainData$Parch[i]
  trainData$Family[i] = x + y + 1
}
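
Because R adds whole columns element by element, the same result can be had in one line. A sketch with made-up values (toy is a hypothetical name):

```r
toy = data.frame(SibSp = c(1, 0, 3),
                 Parch = c(0, 0, 2))

# Column arithmetic is applied row by row.
toy$Family = toy$SibSp + toy$Parch + 1
toy$Family
```

A passenger travelling alone gets a family size of 1.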

Variable 3: Mother

We add another variable indicating whether the passenger is a mother.
This is done by going through the passengers and checking whether the title is Mrs and the Parch value is greater than 0. Because Parch counts parents as well as children, this also flags any Mrs who is travelling with a parent rather than a child.

trainData["Mother"] = NA
for(i in 1:nrow(trainData)) {
  if(trainData$Name[i] == "Mrs" & trainData$Parch[i] > 0) {
    trainData$Mother[i] = 1
  } else {
    trainData$Mother[i] = 2
  }
}
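
The condition combines two tests with &, which compares element by element. That is why the same expression also works on whole columns, as in this sketch on made-up rows (toy is a hypothetical name):

```r
toy = data.frame(Name  = c("Mrs", "Mrs", "Miss", "Mr"),
                 Parch = c(2, 0, 1, 1))

# Both conditions must hold for a row to be flagged with 1.
toy$Mother = ifelse(toy$Name == "Mrs" & toy$Parch > 0, 1, 2)
toy$Mother
```

Only the first row is flagged, since it is the only Mrs with a Parch value above 0.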

Now, we have a fully equipped training dataset!

Cleaning the TEST Data

Now that we have a cleaned and bolstered trainData, we repeat the exact process on the testData. The idea is to conduct the same steps (in terms of subsetting, cleaning, inference, adding more variables), so that both datasets are in the same state.

The only difference is the following: The test dataset doesn’t have the “Survived” variable (which is what we’re trying to predict), therefore the subsetting indexes are slightly different when cleaning the data. You should copy and paste the code below. Notice how similar the code is to what we used in trainData.

R Code to Clean the Test Data

PassengerId = testData[1]
testData = testData[-c(1, 8:11)]

testData$Sex = gsub("female", 1, testData$Sex)
testData$Sex = gsub("^male", 0, testData$Sex)

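# fixed=TRUE matches the titles literally, so "Mr." does not also match "Mrs"
test_master_vector = grep("Master.", testData$Name, fixed=TRUE)
test_miss_vector = grep("Miss.", testData$Name, fixed=TRUE)
test_mrs_vector = grep("Mrs.", testData$Name, fixed=TRUE)
test_mr_vector = grep("Mr.", testData$Name, fixed=TRUE)
test_dr_vector = grep("Dr.", testData$Name, fixed=TRUE)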

for(i in test_master_vector) {
  testData[i, 2] = "Master"
}
for(i in test_miss_vector) {
  testData[i, 2] = "Miss"
}
for(i in test_mrs_vector) {
  testData[i, 2] = "Mrs"
}
for(i in test_mr_vector) {
  testData[i, 2] = "Mr"
}
for(i in test_dr_vector) {
  testData[i, 2] = "Dr"
}

test_master_age = round(mean(testData$Age[testData$Name == "Master"], na.rm = TRUE), digits = 2)
test_miss_age = round(mean(testData$Age[testData$Name == "Miss"], na.rm = TRUE), digits =2)
test_mrs_age = round(mean(testData$Age[testData$Name == "Mrs"], na.rm = TRUE), digits = 2)
test_mr_age = round(mean(testData$Age[testData$Name == "Mr"], na.rm = TRUE), digits = 2)
test_dr_age = round(mean(testData$Age[testData$Name == "Dr"], na.rm = TRUE), digits = 2)

for (i in 1:nrow(testData)) {
  if (is.na(testData[i,4])) {
    if (testData[i, 2] == "Master") {
      testData[i, 4] = test_master_age
    } else if (testData[i, 2] == "Miss") {
      testData[i, 4] = test_miss_age
    } else if (testData[i, 2] == "Mrs") {
      testData[i, 4] = test_mrs_age
    } else if (testData[i, 2] == "Mr") {
      testData[i, 4] = test_mr_age
    } else if (testData[i, 2] == "Dr") {
      testData[i, 4] = test_dr_age
    } else {
      print(paste("Uncaught title at: ", i, sep=""))
      print(paste("The title unrecognized was: ", testData[i,2], sep=""))
    }
  }
}

#We do a manual replacement here, because we weren't able to programmatically figure out the title.
#We figured out it was 89 because the above print statement should have warned us.
testData[89, 4] = test_miss_age

testData["Child"] = NA

for (i in 1:nrow(testData)) {
  if (testData[i, 4] <= 12) {
    testData[i, 7] = 1
  } else {
    testData[i, 7] = 2
  }
}

testData["Family"] = NA

for(i in 1:nrow(testData)) {
  testData[i, 8] = testData[i, 5] + testData[i, 6] + 1
}

testData["Mother"] = NA

for(i in 1:nrow(testData)) {
  if(testData[i, 2] == "Mrs" & testData[i, 6] > 0) {
    testData[i, 9] = 1
  } else {
    testData[i, 9] = 2
  }
}
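
A quick way to confirm that both datasets ended up in the same state is to compare their column names. The sketch below uses two made-up data frames, toy_train and toy_test; with the real data you would pass trainData and testData instead, and the only difference should be Survived.

```r
toy_train = data.frame(Survived = 1, Pclass = 3, Age = 22)
toy_test  = data.frame(Pclass = 3, Age = 30)

# Columns present in the training data but missing from the test data.
only_in_train = setdiff(names(toy_train), names(toy_test))
only_in_train
```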

Conclusion

At this point, we’ve finished preparing the data. testData and trainData will look very similar (after all, they both underwent very similar processes). Believe it or not, the hard part is over! Now that we have clean Train and Test datasets, we will simply plug the Train data into a model (thus training the model). We then use the trained model to create predictions from the Test data. The mathematically hardest part (solving some complex optimization problem within the model) is entirely done by R!

Go on to Part 3 Here

If you want more practice data projects, be sure to check out http://www.teamleada.com