Recap
You have now cleaned the data by doing the following:
1. Converted categorical variables to dummy variables
2. Added missing age values
3. Created new variables to better fit a model
You are now ready to build a model which will make predictions!
Training a Model
We first feed the training data into a model, and the model optimizes itself to best explain the relationship between your variables and the outcome. The idea is that we build a model for predicting survival using the Train dataset. Then we input the observations from the Test dataset to predict their survival.
Fitting logistic regression model
R will take care of solving and optimizing the model, so we don’t have to worry about any complicated math! A logistic regression model is a generalized linear model used when you’re trying to predict something binary. Since whether a passenger survived or not is binary, we use logistic regression. The parameters we choose to predict survival are Passenger Class, Sex, Age, Child, an interaction variable of Sex and Passenger Class, Family, and Mother.
train.glm <- glm(Survived ~ Pclass + Sex + Age + Child + Sex*Pclass + Family + Mother, family = binomial, data = trainData)
To see a summary of the model, and specifically the coefficients that are calculated to predict survival, you can type:
summary(train.glm)
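As a rough sketch of how those coefficients can be read (using R’s built-in mtcars data so the snippet runs on its own; the same idea applies to train.glm), exponentiating a logistic regression coefficient gives an odds ratio:

```r
# Fit a small logistic regression on built-in data (a stand-in for train.glm)
demo.glm <- glm(am ~ wt + hp, family = binomial, data = mtcars)

summary(demo.glm)     # coefficients, standard errors, z-values, p-values
exp(coef(demo.glm))   # exponentiated coefficients are odds ratios: the
                      # multiplicative change in the odds of the response
                      # per one-unit increase in that predictor
```

In summary(train.glm), a positive coefficient means the variable increases the predicted odds of survival, and a negative one decreases them.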
Making Predictions
Now that the Test dataset is ready, we use an R function that calculates survival predictions for the passengers in the Test dataset. The prediction for each observation comes in the form of a probability score for the response being 0 or 1. Therefore we must apply a cutoff value to determine which probability scores translate to a 1 and which translate to a 0. For simplicity we choose a cutoff of .5, which generally minimizes errors.
Here, R takes the coefficients calculated in the train.glm model and uses the variables Passenger Class, Sex, Age, Child, the Sex and Passenger Class interaction, Family, and Mother in the Test dataset to calculate survival predictions for the Test dataset observations.
p.hats <- predict.glm(train.glm, newdata = testData, type = "response")
survival <- vector()
for(i in 1:length(p.hats)) {
  if(p.hats[i] > .5) {
    survival[i] <- 1
  } else {
    survival[i] <- 0
  }
}
Creating a CSV to Submit to Kaggle
We now output the data into a csv file, which can be submitted on Kaggle for grading here
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
A file titled kaggle.csv should now be in the same folder in which you saved the original Test and Train datasets. Use this file to make a submission on the Kaggle website and see where you rank!
Note: Make sure the CSV you submit has only two columns: one labeled “PassengerId” and the other labeled “Survived”.
If you liked our tutorial keep in touch at our landing page
Email us at: statsguys@gmail.com
Github Repo for entire RCode: https://github.com/tristantao/kaggle_survivor
sbtrct said:
Nice set of tutorials, thank you.
I’d still class myself as a beginner, but I do have a bit of R experience and I’m an ok Python programmer.
One minor thing I was wondering about – it’s not clear to me how you came up with the exact logistic regression formula used here. I understand how it works, but how did you decide to use “Sex*Pclass”? Was it a case of trying a few different models? I’m assuming that the performance of this particular model was better than that of a Naive Bayes model (for example).
I’m also not sure how to go about interpreting the summary of the model “summary(train.glm)” – any pointers for some recommended reading I could do on the subject?
Thanks!
statsguys said:
So we created the variable “Sex*Pclass” from hypothesizing that the interaction between these two variables would have an effect on predicting the response.
The presence of an interaction effect indicates that the effect of one predictor variable (Sex) on the response variable (Survival) is different at different values of the other predictor variable (Pclass).
We tried different models (random forest, lm, etc.), but yes, we found logistic regression to be the most effective.
Try this link out to learn more about the GLM:
Click to access R11.pdf
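As a side note on the formula syntax (a small self-contained illustration, not from the tutorial code): in an R formula, Sex*Pclass expands to the two main effects plus their interaction term, i.e. Sex + Pclass + Sex:Pclass.

```r
# Toy data frame to show how R expands the Sex*Pclass formula term
d <- data.frame(Sex = factor(c("male", "female", "male", "female")),
                Pclass = c(1, 2, 3, 1))

# model.matrix shows the columns the formula generates: an intercept,
# the Sex main effect, the Pclass main effect, and the interaction
colnames(model.matrix(~ Sex * Pclass, data = d))
# "(Intercept)" "Sexmale" "Pclass" "Sexmale:Pclass"
```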
sbtrct said:
Super, thanks for taking the time to reply.
wubr2000 said:
I just tried Random Forest (using the “randomForest” package) with the same exact methodology in cleaning the training and test data sets you guys used.
The ranking on Kaggle improved quite a bit over linear regression…
wubr2000 said:
Boosting did even better than Random Forest!
statsguys said:
Hi Bruno!
Thank you for following our blog posts! We’d love to speak to you via email/phone about your interest in learning more data analytics! Please shoot me an email at brian.liou91@gmail.com. Thx!
raminlag (@raminlag) said:
I got the best result with these features using cforest from the “party” package. Code sample is below:
library(party)
set.seed(415)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Child + Sex*Pclass + Family + Mother, data = trainData, controls=cforest_unbiased(ntree=2000, mtry=3))
Prediction_cf <- predict(fit, testData, OOB=TRUE, type = "response")
submission_cf <- data.frame(PassengerId = PassengerId, Survived = Prediction_cf)
write.csv(submission_cf, file = "kaggle_cforest.csv", row.names = FALSE)
Doc said:
First, thank you for the tutorial. You struck a good balance by making it short and understandable.
I downloaded the complete script from Github and pasted it into a new R-script window. When I ran the script, it ran fine until it hit these commands:
p.hats
> survival for(i in 1:length(p.hats)) {
+ if(p.hats[i] > .5) {
+ survival[i] <- 1
+ } else {
+ survival[i]
> kaggle.sub colnames(kaggle.sub) <- c("PassengerId", "Survived")
Error in colnames(kaggle.sub) write.csv(kaggle.sub, file = “kaggle.csv”, row.names = FALSE)
Error in is.data.frame(x) : object ‘kaggle.sub’ not found
I’m running RStudio v 0.98.501 on OS X 10.9.1
statsguys said:
Thanks for doing the tutorial!
The code snippet you pasted is from near the end of the tutorial, where we’re trying to output the result into a csv file, for submission to kaggle. Try running the following code snippet instead:
kaggle.sub <- cbind(PassengerId,survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
You're getting the error because the line "kaggle.sub <- cbind(PassengerId, survival)" was omitted from the code, so the kaggle.sub object was never created. Please check that the code is copied properly (and that that particular line isn't missing).
Please let us know if this works!
Cheers,
-StatsGuys
Andrew Adams said:
Great tutorial! I am a complete beginner when it comes to data analysis and programming, but this was very easy to follow and understand. I’m looking forward to more!
Stefan said:
Hi, thanks for the great tutorial!
Just curious, why did you choose to use the values of 1 and 2 for the new child and mother variables (as opposed to 0 and 1)?
Cheers,
Stefan
statsguys said:
Hi Stefan,
We’re glad you’re enjoying the blog!
As for the data you mentioned, the correct approach would have been to use the factor() function and introduce them as the factor data type in R. That would have added a new layer of complexity, so we opted to substitute dummy variables instead (after all, everyone has heard of male/female being substituted as 0/1).
The reason we chose 1/2 is simply that 1/2 gave us the best result. Generally, the coding of dummy variables shouldn’t have too much effect on the model (as long as it is reasonable). However, we actually tried the 0/1 combination and it performed horribly! We attribute that to the dataset becoming a bit too sparse (with lots of 0/1s). After playing with the data a tiny bit, we decided on 1/2.
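A sketch of the factor() approach mentioned above (using a toy data frame in place of trainData; the column names follow the tutorial):

```r
# Toy stand-in for the tutorial's trainData
trainData <- data.frame(Child = c(1, 2, 2, 1), Mother = c(2, 2, 1, 1))

# Convert the dummy-coded columns to R factors, so glm() treats them
# as categorical and the particular codes (1/2 vs 0/1) no longer matter
trainData$Child  <- factor(trainData$Child)
trainData$Mother <- factor(trainData$Mother)

str(trainData)  # both columns now show up as factors
```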
I know this isn’t the direct answer you were looking for, but sometimes working with data requires a bit of intuition and workaround.
TheGuys
Stefan Zvonar said:
Thanks for the reply.
That’s what I love about this stuff – it’s a science and an art-form.
Cheers!
Pingback: Data Analytics for Beginners: Part 2 | statsguys
Ram Sastry said:
The error was missing from my previous comment. Here it is:
Gentlemen,
When I go to submit my code through the link above, I have to sign in. I did so using my Google account, and then I get the following error:
Oops
Something went wrong. The error has been logged for site administrators to review. Please feel free to contact us if this error keeps happening to you.
Back to the main site.
Kindly advise.
Cheers
Ram,
statsguys said:
Hi Ram,
Are you logging in through your google account? We’re not 100% sure about that particular route; we’ve created our own Kaggle accounts, and we haven’t experienced any issues with the submission process.
Moreover, this sounds like an issue with Kaggle; can you try submitting again in a day or two, and let us know if you’re successful?
TheGuys
Anne Walter said:
This could not be a more rudimentary problem, but when I use the summary(train.glm) script, I don’t actually see anything. Where in RStudio should I see the info on the model?
statsguys said:
Try writing the code:
“`
summary(train.glm)
“`
in the console area, rather than the RScript area. Once you write the code in the console area, you should be able to hit the enter/return key to run the command. Even if the command wasn’t successful, you should see a print out of red words telling us about the issue with the command.
Please try that and let us know what happens. If you have any questions, fire away!
– Tristan
Syed said:
Fantastic tutorial. I am eager to learn more and looking forward to some guidance from your side.
water damage cleanup said:
You really make it seem so easy with your presentation but I find this topic
to be actually something which I think I would never understand.
It seems too complex and very broad for me. I’m looking forward for your
next post, I will try to get the hang of it!
NBA Trends said:
Hey, thanks for this tutorial. I am new to Kaggle and I wanted to get a feel for how people use it. One thing I would like to note is that you can replace the for loop for the glm predictions with one line:
survival <- as.numeric(p.hats > .5)
which would give you the 1s and 0s automatically
NBA Trends said:
as.numeric(p.hats>.5)
statsguys said:
Thanks for the pointer!
sumit said:
I wanted to know: here survival was predicted as a probability, giving values between 0 and 1, on which a threshold was applied. But if those probability values came out > 1 or < 0, would it suggest that I have taken the wrong factors for the prediction?
statsguys said:
Well, there was no reason for the result to be less than 0 or greater than 1, since all of our y’s were either 1 or 0. By having 1.1 for example, you can be wrong by either 0.1 (if it was 1), or 1.1 (if the answer was 0). Therefore guessing between 0-1 would yield less error.
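For what it’s worth, predictions from predict.glm(..., type = "response") are passed through the logistic (inverse-logit) function, so they fall between 0 and 1 by construction. A quick check on simulated data (not the Titanic data):

```r
set.seed(1)
# Simulated binary outcome, purely for illustration
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
m <- glm(y ~ x, family = binomial)

# Fitted probabilities stay inside (0, 1)
p <- predict(m, type = "response")
all(p > 0 & p < 1)  # TRUE
```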
AC said:
I want to know how does the general regression model works and what do the coefficients mean actually. Any reading you can point to for the same?
PS: Loved the tutorial. 🙂
David said:
Thank you so much for the tutorial. I would have never known how to parse out the Miss, Mrs, etc titles and never would have thought of it. Did you deliberately use Pclass, Mother, and Child as a numeric values rather than factors? If so, what was the thought process?
Pingback: First Kaggle submission : Predict survival on the Titanic - Analytics Khoj