Data Analytics for Beginners: Part 2

11 Saturday Jan 2014

Recap

In the last post, we prepared our working environment: we got R and RStudio working, loaded our data, and followed up with a bit of visualization and exploration. Now that we have a better understanding of the data, we’re ready to move on to the next part: manipulating the data so that we can plug it into a model.

Cleaning the TRAIN Data

After doing some exploratory analysis of the data, we now need to clean it before we can build our model. It is important to explore the data first so that you understand which elements need to be cleaned. For example, you might have noticed that there are missing values in the data set, especially in the Age column.

Removing Variables Not Used for the Model

At this point, we remove the variables that we do not want to use in the training data for the model: PassengerId, Ticket, Fare, Cabin, and Embarked. To do so, we index our data set trainData with [ ]. The c() function combines its arguments into a vector of numbers, here the column positions 1 and 9 through 12. By putting a negative sign in front of this vector, we tell R to drop those columns and keep the rest.

trainData = trainData[-c(1,9:12)]
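
To see negative indexing in isolation, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and are not part of the Titanic data. The sketch also shows dropping columns by name, an alternative we did not use above, which has the advantage of not depending on column order.

```r
# Hypothetical toy data frame; the column names mimic a few Titanic columns
toy = data.frame(PassengerId = 1:3,
                 Survived = c(0, 1, 1),
                 Fare = c(7.25, 71.28, 7.92))

# Drop columns 1 and 3 by position, the same way the tutorial does
by_position = toy[-c(1, 3)]

# Drop the same columns by name instead (alternative, not used in the tutorial)
by_name = toy[!(names(toy) %in% c("PassengerId", "Fare"))]

print(names(by_position))
print(names(by_name))
```

Both approaches leave only the Survived column here. Dropping by position is shorter, but it silently removes the wrong columns if the file layout ever changes.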

Additionally, we need to convert qualitative variables (such as gender) into quantitative variables (0 for male, 1 for female, etc.) in order to fit our model. Note that some models can work with qualitative variables directly. We use the R function gsub(), which replaces any text matching a pattern with a value of our choosing.

Replacing Gender variable (Male/Female) with a Dummy Variable (0/1)

trainData$Sex = gsub("female", 1, trainData$Sex)
trainData$Sex = gsub("^male", 0, trainData$Sex)
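
The order of the two calls and the "^" in the second pattern matter, because the word "female" contains "male". The sketch below runs the same two replacements on a small hypothetical vector (sex is an illustrative name, not the real column). The final as.integer() step is an optional extra that is not part of this tutorial; gsub() always returns text, so the column holds the strings "0" and "1" unless you convert it.

```r
# Hypothetical sample standing in for trainData$Sex
sex = c("male", "female", "female", "male")

# Replace "female" first, then "male" anchored to the start of the string
sex = gsub("female", 1, sex)
sex = gsub("^male", 0, sex)
print(sex)       # character strings, not numbers

# Optional extra step (an assumption, not in the tutorial): convert to integers
sex_int = as.integer(sex)
print(sex_int)
```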

Making Inferences on Missing Age Values

Lastly, upon examining our dataset, we see that many entries for “Age” are missing. Because age could be an important variable, we try to infer the missing values based on a relationship between title and age; we’re essentially assuming that Mrs. X will be older than Ms. X. Moreover, we’re (naively) assuming that people with the same title are closer together in age.

So first, we put the row indexes of people with a given title into a vector for further processing. In R we use the grep() function, which returns a vector of the row numbers whose name contains the specified title.

master_vector = grep("Master.",trainData$Name, fixed=TRUE)
miss_vector = grep("Miss.", trainData$Name, fixed=TRUE)
mrs_vector = grep("Mrs.", trainData$Name, fixed=TRUE)
mr_vector = grep("Mr.", trainData$Name, fixed=TRUE)
dr_vector = grep("Dr.", trainData$Name, fixed=TRUE)

You might have noticed that there are other less frequent titles such as Reverend or Colonel which we are ignoring for now.
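
The fixed=TRUE argument in the grep() calls above is easy to overlook, so here is a small sketch of what it changes. The names in names_sample are hypothetical examples in the same "Surname, Title. Given names" format as the data.

```r
# Hypothetical names in the same format as trainData$Name
names_sample = c("Braund, Mr. Owen Harris",
                 "Cumings, Mrs. John Bradley",
                 "Heikkinen, Miss. Laina")

# Without fixed = TRUE the "." is a regular-expression wildcard,
# so the pattern "Mr." also matches the "Mrs" in the second name
regex_hits = grep("Mr.", names_sample)

# With fixed = TRUE the "." has to be a literal period
fixed_hits = grep("Mr.", names_sample, fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

Without fixed=TRUE, every Mrs would later be relabelled as Mr, because the Mr replacement runs after the Mrs one.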

Next, we replace each name with a shortened tag. This means that a full name such as “Allison, Master. Hudson Trevor” is shortened to just “Master”, which gives us a standardized column. This is done in the following code:

for(i in master_vector) {
  trainData$Name[i] = "Master"
}
for(i in miss_vector) {
  trainData$Name[i] = "Miss"
}
for(i in mrs_vector) {
  trainData$Name[i] = "Mrs"
}
for(i in mr_vector) {
  trainData$Name[i] = "Mr"
}
for(i in dr_vector) {
  trainData$Name[i] = "Dr"
}

Note that we utilized a for loop, which we explain below.

FOR LOOP
A for loop runs the same block of code once for each element of a vector. Above, the block runs once for every row index that grep() returned.
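
As a sketch of how the loop behaves, the snippet below applies it to a small hypothetical vector (name_col and master_idx are illustrative stand-ins for trainData$Name and master_vector). It also shows the vectorized form that readers suggested in the comments, which assigns to all matching rows in one step and gives the same result.

```r
# Hypothetical stand-ins for trainData$Name and master_vector
name_col = c("Allison, Master. Hudson Trevor",
             "Braund, Mr. Owen Harris",
             "Rice, Master. Eugene")
master_idx = grep("Master.", name_col, fixed = TRUE)

# Loop version: one assignment per matching row
loop_result = name_col
for (i in master_idx) {
  loop_result[i] = "Master"
}

# Vectorized version: a single assignment for all matching rows
vec_result = name_col
vec_result[master_idx] = "Master"

print(loop_result)
print(identical(loop_result, vec_result))
```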

Now that we have a series of standardized titles, we calculate the average age of each title.

Making Inference on Missing Age Values: Inputting Title-group averages

We replace the missing ages with their respective title-group average. This means that if we have a missing age entry for a man named Mr. Bond, we substitute the average age of all passengers with the title Mr. The same goes for Master, Miss, Mrs, and Dr. We first compute the average age for each title, then write a for loop that goes through the entire Train data set and checks whether the age value is missing. If it is, we assign the average that matches the title of the observation. This code snippet is a bit complicated; you can just copy and paste for now if you’re not confident about understanding it!

master_age = round(mean(trainData$Age[trainData$Name == "Master"], na.rm = TRUE), digits = 2)
miss_age = round(mean(trainData$Age[trainData$Name == "Miss"], na.rm = TRUE), digits = 2)
mrs_age = round(mean(trainData$Age[trainData$Name == "Mrs"], na.rm = TRUE), digits = 2)
mr_age = round(mean(trainData$Age[trainData$Name == "Mr"], na.rm = TRUE), digits = 2)
dr_age = round(mean(trainData$Age[trainData$Name == "Dr"], na.rm = TRUE), digits = 2)

for (i in 1:nrow(trainData)) {
  if (is.na(trainData[i,5])) {
    if (trainData$Name[i] == "Master") {
      trainData$Age[i] = master_age
    } else if (trainData$Name[i] == "Miss") {
      trainData$Age[i] = miss_age
    } else if (trainData$Name[i] == "Mrs") {
      trainData$Age[i] = mrs_age
    } else if (trainData$Name[i] == "Mr") {
      trainData$Age[i] = mr_age
    } else if (trainData$Name[i] == "Dr") {
      trainData$Age[i] = dr_age
    } else {
      print("Uncaught Title")
    }
  }
}

In the above code snippet, we use an if-statement. If-statements are intended to help with decision making:

If-statement:
It executes a specific line of code, depending on certain true/false conditions.

if (some true/false statement_1) {
  #do this action if it is true
} else if (some other true/false statement_2) {
  #do this if the statement_1 wasn't true, but statement_2 ended up true
} else if (some other true/false statement_3) {
  #do this if the statement_1 and statement_2 were both not true, but statement_3 was true.
} else {
  #do this if none of the statement_* was true. Note that this last bit of "else" doesn't always have to happen.
}

Ultimately, if-statements let programs make decisions.
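
For comparison, the same title-based fill can be written without checking one row at a time. The following is only a sketch on a hypothetical mini data set (df and its values are made up); it still loops over the five titles, but each assignment fills every missing age in a title group at once.

```r
# Hypothetical mini data set whose Name column already holds shortened titles
df = data.frame(Name = c("Mr", "Mr", "Mr", "Miss", "Miss"),
                Age = c(30, NA, 40, 20, NA),
                stringsAsFactors = FALSE)

for (title in c("Master", "Miss", "Mrs", "Mr", "Dr")) {
  in_group = df$Name == title
  # Average age of the known ages in this title group
  group_mean = round(mean(df$Age[in_group], na.rm = TRUE), digits = 2)
  # Fill only the rows in this group whose age is missing
  df$Age[in_group & is.na(df$Age)] = group_mean
}

print(df)
```

Titles that do not occur in the data (Master, Mrs, and Dr in this toy example) select no rows, so nothing is assigned for them.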

Quick Recap

At this point, we have accomplished the following:
– [x] loaded the data we intend to work with.
– [x] did some preliminary exploration of the data.
– [x] cleaned the data by converting the Sex variable to (0/1) and inferring the missing age entries.

Part of curating the data is also creating additional variables that may help with classifying and predicting which Test data passengers survived.

Creating New Variables to Strengthen Our Model

By creating new variables we may be able to predict the survival of the passengers even more closely. This part of the walkthrough specifically includes three variables which we found to help our model. Think about what the added variables mean; do they make intuitive sense? How might these variables affect the survival rate?

Variable 1: Child.

This additional variable choice stems from the fact that we suspect that being a child might affect the survival rate of a passenger.

We start by creating a child variable. This is done by appending an empty column to the dataset, titled “Child”.
We then populate the column with the value “1” if the passenger is 12 years old or younger, and “2” otherwise.

trainData["Child"] = NA
for (i in 1:nrow(trainData)) {
  if (trainData$Age[i] <= 12) {
    trainData$Child[i] = 1
  } else {
    trainData$Child[i] = 2
  }
}
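
If any Age is still missing when this loop runs, if() stops with the error “missing value where TRUE/FALSE needed”, because comparing NA to 12 gives NA rather than TRUE or FALSE. The sketch below uses a hypothetical age vector to show the vectorized ifelse() form, which applies the same rule to every element and simply returns NA for a missing age instead of stopping.

```r
# Hypothetical ages, including one that is still missing
age = c(4, 12, 35, NA)

# 1 for ages up to and including 12, 2 otherwise; NA stays NA
child = ifelse(age <= 12, 1, 2)
print(child)
```

An NA in the result is a sign that the age-filling step above was skipped or missed a row.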

Variable 2: Family

This variable is meant to represent the family size of each passenger by adding the number of Siblings/Spouses and Parents/Children (we add 1 so minimum becomes 1). We’re guessing that larger families are less likely to survive, or perhaps it is the other way around. The beautiful part is that it doesn’t matter! The model we build will optimize for the problem. All we’re indicating is that there might be a relationship between family size and survival rate.

trainData["Family"] = NA

for(i in 1:nrow(trainData)) {
  x = trainData$SibSp[i]
  y = trainData$Parch[i]
  trainData$Family[i] = x + y + 1
}
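
Because + works element by element on whole columns, the same calculation can be sketched in a single line. The vectors sibsp and parch below are hypothetical stand-ins for trainData$SibSp and trainData$Parch.

```r
# Hypothetical stand-ins for trainData$SibSp and trainData$Parch
sibsp = c(1, 0, 3)
parch = c(0, 0, 2)

# Family size: siblings/spouses + parents/children + the passenger
family = sibsp + parch + 1
print(family)
```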

Variable 3: Mother

We add another variable indicating whether the passenger is a mother.
This is done by going through the passengers and checking whether the title is Mrs and whether the Parch value (the number of parents and children aboard) is greater than 0. Because Parch counts parents as well as children, a Mrs travelling with a parent but no children is also flagged, so this is only an approximation.

trainData["Mother"] = NA
for(i in 1:nrow(trainData)) {
  if(trainData$Name[i] == "Mrs" & trainData$Parch[i] > 0) {
    trainData$Mother[i] = 1
  } else {
    trainData$Mother[i] = 2
  }
}
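
The same rule can be sketched with ifelse() on a hypothetical mini data set (df and its values are made up). The single & compares the two conditions element by element, which is what we want when working on whole columns.

```r
# Hypothetical mini data set with shortened titles and Parch counts
df = data.frame(Name = c("Mrs", "Mrs", "Mr", "Miss"),
                Parch = c(2, 0, 1, 1),
                stringsAsFactors = FALSE)

# 1 if the title is Mrs and Parch is above 0, 2 otherwise
df$Mother = ifelse(df$Name == "Mrs" & df$Parch > 0, 1, 2)
print(df)
```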

Now, we have a fully equipped training dataset!

Cleaning the TEST Data

Now that we have a cleaned and bolstered trainData, we repeat the exact process on the testData. The idea is to conduct the same steps (in terms of subsetting, cleaning, inference, adding more variables), so that both datasets are in the same state.

The only difference is the following: The test dataset doesn’t have the “Survived” variable (which is what we’re trying to predict), therefore the subsetting indexes are slightly different when cleaning the data. You should copy and paste the code below. Notice how similar the code is to what we used in trainData.

R Code to Clean the Test Data

PassengerId = testData[1]
testData = testData[-c(1, 8:11)]

testData$Sex = gsub("female", 1, testData$Sex)
testData$Sex = gsub("^male", 0, testData$Sex)

test_master_vector = grep("Master.", testData$Name, fixed=TRUE)
test_miss_vector = grep("Miss.", testData$Name, fixed=TRUE)
test_mrs_vector = grep("Mrs.", testData$Name, fixed=TRUE)
test_mr_vector = grep("Mr.", testData$Name, fixed=TRUE)
test_dr_vector = grep("Dr.", testData$Name, fixed=TRUE)

for(i in test_master_vector) {
  testData[i, 2] = "Master"
}
for(i in test_miss_vector) {
  testData[i, 2] = "Miss"
}
for(i in test_mrs_vector) {
  testData[i, 2] = "Mrs"
}
for(i in test_mr_vector) {
  testData[i, 2] = "Mr"
}
for(i in test_dr_vector) {
  testData[i, 2] = "Dr"
}

test_master_age = round(mean(testData$Age[testData$Name == "Master"], na.rm = TRUE), digits = 2)
test_miss_age = round(mean(testData$Age[testData$Name == "Miss"], na.rm = TRUE), digits = 2)
test_mrs_age = round(mean(testData$Age[testData$Name == "Mrs"], na.rm = TRUE), digits = 2)
test_mr_age = round(mean(testData$Age[testData$Name == "Mr"], na.rm = TRUE), digits = 2)
test_dr_age = round(mean(testData$Age[testData$Name == "Dr"], na.rm = TRUE), digits = 2)

for (i in 1:nrow(testData)) {
  if (is.na(testData[i,4])) {
    if (testData[i, 2] == "Master") {
      testData[i, 4] = test_master_age
    } else if (testData[i, 2] == "Miss") {
      testData[i, 4] = test_miss_age
    } else if (testData[i, 2] == "Mrs") {
      testData[i, 4] = test_mrs_age
    } else if (testData[i, 2] == "Mr") {
      testData[i, 4] = test_mr_age
    } else if (testData[i, 2] == "Dr") {
      testData[i, 4] = test_dr_age
    } else {
      print(paste("Uncaught title at: ", i, sep=""))
      print(paste("The title unrecognized was: ", testData[i,2], sep=""))
    }
  }
}

# We do a manual replacement here, because the title in this row ("Ms.") is not one of the five we searched for.
# We know it is row 89 because the print statements above report it.
testData[89, 4] = test_miss_age

testData["Child"] = NA

for (i in 1:nrow(testData)) {
  if (testData[i, 4] <= 12) {
    testData[i, 7] = 1
  } else {
    testData[i, 7] = 2
  }
}

testData["Family"] = NA

for(i in 1:nrow(testData)) {
  testData[i, 8] = testData[i, 5] + testData[i, 6] + 1
}

testData["Mother"] = NA

for(i in 1:nrow(testData)) {
  if(testData[i, 2] == "Mrs" & testData[i, 6] > 0) {
    testData[i, 9] = 1
  } else {
    testData[i, 9] = 2
  }
}
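
Before moving on, it can be worth checking that no missing values are left, since a single leftover NA in Age is enough to stop the Child loop. This is a minimal sketch on a hypothetical data frame (cleaned stands in for the cleaned testData); colSums() and is.na() are base R functions.

```r
# Hypothetical stand-in for the cleaned testData
cleaned = data.frame(Age = c(22, NA, 35),
                     SibSp = c(1, 0, 0))

# Number of missing values in each column
na_counts = colSums(is.na(cleaned))
print(na_counts)

# Row numbers whose Age still needs attention
missing_rows = which(is.na(cleaned$Age))
print(missing_rows)
```

On the real data, every count should be 0 once the manual fix for row 89 has been applied.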

Conclusion

At this point, we’ve finished preparing the data. testData and trainData will look very similar (after all, they both underwent very similar processes). Believe it or not, the hard part is over! Now that we have clean Train and Test datasets, we will simply plug the Train data into a model (thus training the model). We then use the trained model to create predictions from the Test data. The mathematically hardest part (solving some complex optimization problem within the model) is done entirely by R!

Go on to Part 3 Here

If you want more practice data projects, be sure to check out http://www.teamleada.com

29 thoughts on “Data Analytics for Beginners: Part 2”

josephschmoe said:

January 23, 2014 at 12:21 am

Thanks for your blog. How can we contact you re working with you?

- josephschmoe said:
  
  January 23, 2014 at 12:42 am
  
  peter.morgan@persontyle.com
  
- statsguys said:
  
  January 25, 2014 at 10:23 am
  
  Hi Joseph!
  
  Appreciate the kind words. I’d go to our landing page at the end of part 3 and we will reach out 🙂
  
luis said:

February 8, 2014 at 8:55 am

Thanks for your work, I’ve learned a lot

Brian said:

February 10, 2014 at 2:04 pm

Guys, awesome work so far. I’ve encountered an error and cannot determine a resolution. After copying + pasting the logic for the ‘Child’, I receive the following error. Any suggestions? Thanks!

Error in if (trainData$Age[i] <= 12) { :
missing value where TRUE/FALSE needed

- statsguys said:
  
  February 10, 2014 at 2:20 pm
  
  Hi Brian,
  
  The error is indicating that R is confused, because it doesn’t see a true/false value inside the if() statement. if(true/false) statements operate on the premise that we give it an expression that evaluates to true/false inside it, so that it can properly decide what to do next.
  
  Can you enter the following in the console, and let me know what you get returned?
  
  trainData$Age[1:10] <= 12
  length(trainData$Age)
  dim(trainDage)
  
  I'm trying to figure out if there is a problem with the trainData, specifically missing the "Age" column.
  In the first line of code, I ask to return the first 10 true/false values, each corresponding to whether that passenger's "Age" is 12 or younger.
  In the second line of code, I'm asking to check for the length of the "Age" column.
  The last line of code is asking to return the dimension of the trainData. We're trying to verify that the data we have is indeed what we think we have.
  
  Let me know what kind of outputs you get.
  
  – TheGuys
  
  - Brian said:
    
    February 10, 2014 at 2:46 pm
    
    I ran the code you provided, and it executed without error. However, nothing was returned. in regards to “dim(trainDage)”, what does the ‘D’ within the parentheses indicate?
    
    From what I can tell, the ‘Child’ logic fails once it hits the row with the first ‘Age’ of NA. PassengerID = 6, I believe.
  - statsguys said:
    
    February 10, 2014 at 3:01 pm
    
    Oh, I see.
    
    First, I have incorrectly typed “trainDage”. I meant to write “dim(trainData)”. Sorry for the confusion!
    
    Taking into account that you’re seeing NA in age, I’m assuming you did not go through the steps that cleaned the trainData$Age variable. Initially, there is a good amount of NA’s in the data. As data wranglers, we often see incomplete data. To overcome this issue, we make a simple proxy for the missing age using Title based averages.
    
    Make sure that you’ve gone through the title reassignment correctly. Moreover, ensure that you’ve gone through the steps that replaces the missing age variables. Namely the part “Making Inference on Missing Age Values: Inputting Title-group averages”.
    
    The above steps should fill all NA values. By the time you’re working with additional variables, you shouldn’t have any missing $Age variable.
    
    Please let us know if it fixed the error.
    
    – TheGuys
Brian said:

February 10, 2014 at 3:01 pm

I’m thinking that may be a typo?

“dim(trainData)”,

- statsguys said:
  
  February 10, 2014 at 3:58 pm
  
  Please see above reply!
  
Ram said:

March 11, 2014 at 4:22 pm

TheGuys,

I ran into the same issue that Brian ran into, namely, when I source the code, I get the following error:

Error in if (testData[i, 4] <= 12) { :
missing value where TRUE/FALSE needed

So I did some debugging, and basically inserted the following line just before testData["Child"] <- NA:

mTable <- table(is.na(testData[, 4]))

Basically to check if testData[,4] has any NA values. Lo & behold, I did find one, because mTable is as follows:

FALSE TRUE
417 1

I then realized that just after the code to replace missing data in test[,4] with averages, there is supposed to be this line

testData[89, 4] <- test_miss_age

I thought it was a typo and didn't include this line at first. But in fact removing that line caused the NA value. Once I included it, mTable is as follows

FALSE
418

I don't understand the rationale for that line. Because the code above that is supposed to remove all the NA value. Why does this one row have an NA value in column 4?

Cheers
Ram

- statsguys said:
  
  March 16, 2014 at 1:22 pm
  Hi Ram,
  
  Sorry for the delayed response.
  
  As for the question you had, replace the age-replacement code with the following code:
  
  for (i in 1:nrow(testData)) {
    if (is.na(testData[i,4])) {
      if (testData[i, 2] == "Master") {
        testData[i, 4] <- test_master_age
      } else if (testData[i, 2] == "Miss") {
        testData[i, 4] <- test_miss_age
      } else if (testData[i, 2] == "Mrs") {
        testData[i, 4] <- test_mrs_age
      } else if (testData[i, 2] == "Mr") {
        testData[i, 4] <- test_mr_age
      } else if (testData[i, 2] == "Dr") {
        testData[i, 4] <- test_dr_age
      } else {
        print(paste("Uncaught Title at: ", i, sep=""))
        print(paste("The Title unrecognized was: ", testData[i,2], sep=""))
      }
    }
  }
  
  This will show you that when going through the data, we were unable to figure out what title was for one specific person (namely row 89). We utilized the grep() function earlier to figure out the title of each passengers; unfortunately, we were unable to systematically grab this particular person's title. Upon manually inspecting the name, we see "O'Donoghue, Ms. Bridget". Therefore, we manually add the missing age prediction, utilizing test_miss_age as a proxy. This happens at:
  
  testData[89, 4] <- test_miss_age
  
  We should have explained the manual substitution better, and we'll edit the blog to reflect this change!
  
  Thanks for bringing this up.
  
  TheGuys
  
Evan Van Ness said:

April 5, 2014 at 12:43 pm

FYI there is a small problem in your code as posted. Right now your code does not classify anyone as a Mrs (and thus the additional Mother variable is useless) because the mr_vector immediately overwrites those “Mrs” with “Mr”

Per help(grep), the R function grep defaults to “If a character vector of length 2 or more is supplied, the first element is used with a warning.”

Thus, in order for your code to do what you want it to, you need to add fixed=TRUE, ie the lines should look like:

master_vector <- grep("Master.",trainData$Name, fixed=TRUE)
miss_vector <- grep("Miss.", trainData$Name, fixed=TRUE)
mrs_vector <- grep("Mrs.", trainData$Name, fixed=TRUE)
mr_vector <- grep("Mr.", trainData$Name, fixed=TRUE)
dr_vector <- grep("Dr.", trainData$Name, fixed=TRUE)

- statsguys said:
  
  April 5, 2014 at 1:09 pm
  
  Thanks for pointing that out! We’re learning something new everyday. The post should now reflect your update 🙂
  
alejandrodumas said:

April 16, 2014 at 7:01 pm

Thanks for your post! As an alternative to using a for loop when changing Name field you could use: trainData[master_vector,]$Name <- "Master"

- statsguys said:
  
  April 16, 2014 at 7:42 pm
  
  That’s a great idea! That should definitely work. We should have favored the vector operation.
  
sandip said:

April 16, 2014 at 7:46 pm

Hi,
Thank you so much for your work. I am a newbie in R and this helped me very much.

I did not understand part of the blog for age replacement. Code snippet given is

master_age miss_age mrs_age mr_age dr_age

for (i in 1:nrow(trainData)) {
if (is.na(trainData[i,5])) {
if (trainData$Name[i] == “Master”) {
trainData$Age[i]
} else if (trainData$Name[i] == “Miss”) {
trainData$Age[i]
} else if (trainData$Name[i] == “Mrs”) {
trainData$Age[i]
} else if (trainData$Name[i] == “Mr”) {
trainData$Age[i]
} else if (trainData$Name[i] == “Dr”) {
trainData$Age[i]
} else {
print(“Uncaught Title”)
}
}
}

Shouldn’t trainData$Age[i] be assigned to some value when it is ? Also the variables master_age, etc are not defined, and R throws an error. Is this a typo?

- statsguys said:
  
  April 17, 2014 at 9:46 am
  
  Hi Sandip,
  
  Thanks for pointing that out. We actually realized that a large portion of our tutorial code has been erased (namely all assignments via ‘<-'). We're not sure why this happened, but we'll be fixing the code (this time using '=' instead of '<-').
  
  Sorry for the inconvenience, we're just as confused as to why this happened.
  
  – TheGuys
  
  - sandip said:
    
    April 18, 2014 at 4:18 am
    
    Thank you for quick response!
projectramowp said:

June 17, 2014 at 11:21 am

I think “gsub(“female”, 1, train$Sex)” should be “gsub(“1”, “female”, train$Sex)”. i.e. you may have got the order of the arguments wrong. ditto for men.

- statsguys said:
  
  June 17, 2014 at 12:19 pm
  
  Hello!
  
  Actually, I think it is correct as it is. Please see the R documentation at: http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html
  
  In this section, we’re looking to replace the word “female” with the integer (numerical) of 1.
  
  The order might be flipped in another language (with similar functionalities).
  
  Cheers!
  
  - projectramowp said:
    
    June 17, 2014 at 12:41 pm
    
    Oh, for some reason I thought it came in as 1 and 0, and you were labeling it male and female. okay, thanks.
Charles Bordet said:

June 30, 2014 at 1:20 pm

Hi,

I really like your tutorial. I’m a junior statistician and am used to R so this isn’t complicated at all, but I don’t have the reflex nor the habit to look at the data (as in Part 1) and to clean it like we did in this article.
I think this is something that we can get only with experience, so this tutorial is great for that!

This has been briefly mentioned in the comments above, but I think you really should have favored vector operations over loops. I noticed there are quite a lot of R beginners in the comments, and it is really not a good idea to introduce R with so many loops. It gives very bad habits from the beginning. Just a few examples to illustrate what I mean:

test[is.na(test$Age) & test$Name == "Master", "Age"] <- mean(test$Age[test$Name == "Master"], na.rm = T)
to change the name of the people for "Master" and so on… Even if we don't have to condense it the way I did (we can introduce new variables), it's really shorter and also more explicit.

To declare the new variables:
test["Child"] <- ifelse(test$Age <= 12, 1, 2)
or
test["Family"] <- test$SibSp + test$Parch
For the Family variable, it's really not a good idea to use a loop. For bigger datasets, we will encounter severe computation time complications.

My two cents…

—
Charles

- statsguys said:
  
  June 30, 2014 at 3:09 pm
  
  Hi Charles,
  
  I absolutely agree! Our original intention was to first introduce for-loop (as we think it is easier to grasp/understand for-loop than vector operations), and later introduce vector-ops as a better substitute. However, in speaking with some other people, we’ve realized that it might have been better to introduce it from the beginning.
  
  On a different note, we’ve since moved to teamleada.com. Here, in our newer projects, we favor vector operations instead of for-loops (even for beginners). If you’re interested in looking at it more, please send me an email (tristan@[teamleada website domain here]) and I can throw an Early Access Code your way!
  
  Thanks for the feedback,
  Tristan
  
Trevor Allen said:

November 15, 2014 at 2:26 pm

Many thanks for the tutorial! It’s a great intro to R. I was walking through it and just wanted to mention a couple of minor typos/errors in case they trip somebody else up like they did me, specifically in the “RCode to Clean the Test Data” section.

–> lines 7-11: grep() function needs fixed=TRUE argument passed (same as Evan Van Ness’s above comment regarding the train data).
–> line 64: replace 1 with 2; as-is this typo makes everyone a Child

Finally, I’m not sure this is a mistake per se, but it caused a problem for me: you converted sex to 1/2 in the train data, but used 1/0 in the test data. I think this is what caused the following error for me when trying to estimate test-data values:

> p.hats <- predict.glm(train.glm, newdata=testData, type='response')
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Sex has new levels 0

But setting sex to 1/2 in the testData munging process got rid of the error.

Thanks again for the tutorial!

Trevor

Saikrishna said:

February 23, 2015 at 12:13 am

I am new to R and got a problem in applying this part of code where the name fields are changed, it shows
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(NA, NA, NA, NA, NA, NA, NA, :
invalid factor level, NA generated

and every name is changed to NA
I even tried to do by vector operation as suggested in comments
trainData[master_vector,]$Name <- "Master"
but this one also giving me same warning messages and the name fields are changed to NA
Someone please give me a solution for this.
Thank you

Saikrishna said:

February 24, 2015 at 8:50 am

Got the answer
