### Recap

You have now cleaned the data by doing the following:

1. Converted categorical variables to dummy variables

2. Added missing age values

3. Created new variables to better fit a model

You are now ready to build a model which will make predictions!

### Training a Model

We first feed the training data into a model, and the model optimizes itself to best explain the relationship between your variables and the outcome. The idea is to build a model for predicting survival using the Train dataset, then input the observations from the Test dataset to predict their survival.

R will take care of solving and optimizing the model, so we don't have to worry about any complicated math! A logistic regression model is a generalized linear model used when you're trying to predict something that is binary. Since whether a passenger survived or not is binary, we use logistic regression. The parameters we choose to predict survival are Passenger Class, Sex, Age, Child, an interaction variable of Sex AND Passenger Class, Family, and Mother.

```r
# Fitting the logistic regression model
train.glm <- glm(Survived ~ Pclass + Sex + Age + Child + Sex*Pclass + Family + Mother,
                 family = binomial, data = trainData)
```

To see a summary of the model, including the coefficients estimated to predict survival, you can type:

```r
summary(train.glm)
```
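The coefficients in that summary are on the log-odds scale, which can be hard to read directly. Exponentiating them turns them into odds ratios. Here is a minimal sketch using a small made-up dataset (the names `toy` and `fit` are illustrative, not part of the tutorial):

```r
# Toy fit to illustrate (made-up data, not the Titanic set)
toy <- data.frame(y = c(0, 0, 1, 0, 1, 1, 0, 1),
                  x = c(1, 2, 3, 4, 5, 6, 7, 8))
fit <- glm(y ~ x, family = binomial, data = toy)

# glm coefficients are on the log-odds scale; exponentiating
# turns them into odds ratios, which are easier to interpret
odds.ratios <- exp(coef(fit))
```

An odds ratio above 1 means the predictor increases the odds of the response being 1; below 1 means it decreases them.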

### Making Predictions

Now that the Test dataset is ready, we use an R function that calculates survival predictions for the passengers in the Test dataset. The prediction for each observation comes in the form of a probability score for the response being 0 or 1. Therefore we must apply a cutoff value to determine which probability scores translate to a 1 and which translate to a 0. For simplicity we choose a cutoff of .5, a common default that balances the two kinds of misclassification.
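The cutoff step can be tried on a toy vector of probability scores (made-up numbers, just to illustrate):

```r
# Hypothetical probability scores, of the kind
# predict.glm(..., type = "response") would return
p.toy <- c(0.12, 0.80, 0.50, 0.93)

# Apply the .5 cutoff: scores strictly above .5 become 1, the rest 0
class.toy <- ifelse(p.toy > .5, 1, 0)
```

Note that a score of exactly 0.50 is not strictly greater than .5, so it maps to 0 here.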

Here R takes the coefficients calculated in the `train.glm` model and uses the variables Passenger Class, Sex, Age, Child, the interaction of Sex and Passenger Class, Family, and Mother in the **Test dataset** to calculate survival predictions for the Test dataset observations.

```r
p.hats <- predict.glm(train.glm, newdata = testData, type = "response")

survival <- vector()
for (i in 1:length(p.hats)) {
  if (p.hats[i] > .5) {
    survival[i] <- 1
  } else {
    survival[i] <- 0
  }
}
```

### Creating a CSV to Submit to Kaggle

We now output the data into a CSV file, which can be submitted on Kaggle for grading.

```r
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
```

A file titled kaggle.csv should now be in the same folder in which you saved the original Test and Train datasets. Use this file to make a submission on the Kaggle website and see where you rank!

**Note:** Make sure the CSV you submit has only two columns: one labeled "PassengerId" and another labeled "Survived".
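Before uploading, it is easy to check that the file really has exactly those two columns. A quick sketch using a temporary file and made-up passenger IDs (not the real Test set):

```r
# Write a toy two-column submission and read it back to verify
toy.sub <- data.frame(PassengerId = c(892, 893, 894),
                      Survived = c(0, 1, 0))
csv.path <- file.path(tempdir(), "toy_submission.csv")
write.csv(toy.sub, file = csv.path, row.names = FALSE)

checked <- read.csv(csv.path)
names(checked)  # should be exactly "PassengerId" and "Survived"
```

The `row.names = FALSE` argument matters: without it, write.csv adds a third, unnamed column of row numbers, which Kaggle would reject.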

If you liked our tutorial, keep in touch at our landing page.

Email us at: statsguys@gmail.com

GitHub repo for the entire R code: https://github.com/tristantao/kaggle_survivor

sbtrct

said: Nice set of tutorials, thank you.

I’d still class myself as a beginner, but I do have a bit of R experience and I’m an ok Python programmer.

One minor thing I was wondering about – it’s not clear to me how you came up with the exact logistic regression formula used here. I understand how it works, but how did you decide to use “Sex*Pclass”? Was it a case of trying a few different models? I’m assuming that the performance of this particular model was better than that of a Naive Bayes model (for example).

I’m also not sure how to go about interpreting the summary of the model “summary(train.glm)” – any pointers for some recommended reading I could do on the subject?

Thanks!

statsguys

said: We created the variable “Sex*Pclass” from hypothesizing that the interaction between these two variables would have an effect on predicting the response.

The presence of an interaction effect indicates that the effect of one predictor variable (Sex) on the response variable (Survival) differs at different values of the other predictor variable (Pclass).

We tried different models (random forest, lm, etc.), but yes, we found logistic regression to be the most effective.

Try this link out to learn more about the GLM:

Click to access R11.pdf

sbtrct

said: Super, thanks for taking the time to reply.

wubr2000

said: I just tried Random Forest (using the “randomForest” package) with the exact same methodology in cleaning the training and test data sets you guys used.

The ranking on Kaggle improved quite a bit over linear regression…

wubr2000

said: Boosting did even better than Random Forest!

statsguys

said:Hi Bruno!

Thank you for following our blog posts! We’d love to speak to you via email/phone about your interest in learning more data analytics! Please shoot me an email at brian.liou91@gmail.com. Thx!

raminlag (@raminlag)

said: I got the best result with these features using cforest from the “party” package. Code sample is below:

```r
library(party)

set.seed(415)

fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Child + Sex*Pclass + Family + Mother,
               data = trainData, controls = cforest_unbiased(ntree = 2000, mtry = 3))

Prediction_cf <- predict(fit, testData, OOB = TRUE, type = "response")

submission_cf <- data.frame(PassengerId = PassengerId, Survived = Prediction_cf)

write.csv(submission_cf, file = "kaggle_cforest.csv", row.names = FALSE)
```

Doc

said: First, thank you for the tutorial. You struck a good balance by making it short and understandable.

I downloaded the complete script from GitHub and pasted it into a new R script window. When I ran it, it ran fine until these commands:

```
p.hats
> survival for(i in 1:length(p.hats)) {
+ if(p.hats[i] > .5) {
+ survival[i] <- 1
+ } else {
+ survival[i]
> kaggle.sub colnames(kaggle.sub) <- c("PassengerId", "Survived")
Error in colnames(kaggle.sub) write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
Error in is.data.frame(x) : object 'kaggle.sub' not found
```

I’m running RStudio v0.98.501 on OS X 10.9.1.

statsguys

said: Thanks for doing the tutorial!

The code snippet you pasted is from near the end of the tutorial, where we output the result into a CSV file for submission to Kaggle. Try running the following code snippet instead:

```r
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
```

You're getting the error because the line `kaggle.sub <- cbind(PassengerId, survival)` was omitted, so the kaggle.sub object was never created. Please check that the code was copied completely (and that this particular line isn't missing).

Please let us know if this works!

Cheers,

-StatsGuys

Andrew Adams

said: Great tutorial! I am a complete beginner when it comes to data analysis and programming, but this was very easy to follow and understand. I’m looking forward to more!

Stefan

said: Hi, thanks for the great tutorial!

Just curious, why did you choose to use the values of 1 and 2 for the new child and mother variables (as opposed to 0 and 1)?

Cheers,

Stefan

statsguys

said: Hi Stefan,

We’re glad you’re enjoying the blog!

As for the variables you mentioned, the more correct approach would have been to use the factor() method and introduce them as the factor data type in R. That would have added a new layer of complexity, so we opted to substitute dummy variables instead (after all, everyone has heard of male/female being substituted as 0/1).

The reason we chose 1/2 is simply that 1/2 gave us the best result. Generally, the particular dummy values shouldn’t have too much effect on the model (as long as they are reasonable). However, we actually tried the 0/1 combination and it performed horribly! We attribute that to the dataset becoming a bit too sparse (with lots of 0s and 1s). After playing with the data a bit, we decided on 1/2.
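For readers curious what the factor() route mentioned above would look like, here is a minimal sketch on a hypothetical Child column (illustrative, not the tutorial's actual code):

```r
# Hypothetical sketch: the factor() alternative to numeric dummies
child.dummy <- c(1, 2, 2, 1, 2)   # 1 = adult, 2 = child, as in the tutorial
child.factor <- factor(child.dummy, levels = c(1, 2),
                       labels = c("Adult", "Child"))
# glm() would then estimate one coefficient for the "Child" level,
# so the particular numeric coding (1/2 vs 0/1) would no longer matter
```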

I know this isn’t the direct answer you were looking for, but sometimes working with data requires a bit of intuition and workaround.

TheGuys

Stefan Zvonar

said: Thanks for the reply.

That’s what I love about this stuff – it’s a science and an art-form.

Cheers!

Pingback: Data Analytics for Beginners: Part 2 | statsguys

Ram Sastry

said: The error message was missing from my comment. Here it is:

Gentleman,

When I go to submit my code through the link above, I have to sign in. I did so using my Google account, and then I get the following error:

Oops

Something went wrong. The error has been logged for site administrators to review. Please feel free to contact us if this error keeps happening to you.

Back to the main site.

Kindly advise.

Cheers

Ram,

statsguys

said: Hi Ram,

Are you logging in through your Google account? We’re not 100% sure about that particular route; we’ve created our own Kaggle accounts and haven’t experienced any issues with the submission process.

Moreover, this sounds like an issue with Kaggle; can you try submitting again in a day or two, and let us know if you’re successful?

TheGuys

Anne Walter

said: This could not be a more rudimentary problem, but when I run the summary(train.glm) command, I don’t actually see anything. Where in RStudio should I see the info on the model?

statsguys

said: Try writing the code:

```r
summary(train.glm)
```

in the console area, rather than the R script area. Once you type the command in the console, hit the enter/return key to run it. Even if the command isn’t successful, you should see red text describing the issue with the command.

Please try that and let us know what happens. If you have any questions, fire away!

– Tristan

Syed

said: Fantastic tutorial. I am eager to learn more and looking forward to some guidance from your side.


NBA Trends

said: Hey, thanks for this tutorial. I am new to Kaggle and I wanted to get a feel for how people use it. One thing I would like to note is that you can replace the for loop for the glm predictions with one line:

survival <- as.numeric(p.hats > .5)

which would give you 1s and 0s automatically.

NBA Trends

said: as.numeric(p.hats > .5)

statsguys

said: Thanks for the pointer!

sumit

said: I wanted to ask: survival here was predicted as a probability, giving values between 0 and 1, on which a threshold was applied. But if those probability values came out greater than 1 or less than 0, would that suggest I have chosen the wrong factors for the prediction?

statsguys

said: Well, there was no reason for the result to be less than 0 or greater than 1, since all of our y’s were either 1 or 0. If you predicted 1.1, for example, you could be wrong by either 0.1 (if the answer was 1) or 1.1 (if the answer was 0), so guessing between 0 and 1 yields less error.
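For what it's worth, when predict.glm is called with type = "response", the predictions pass through the inverse logit function, which maps any value on the log-odds scale into the interval (0, 1), so scores above 1 or below 0 cannot occur. A quick sketch with R's built-in plogis:

```r
# plogis is the inverse logit: it squashes any value on the
# log-odds scale into the open interval (0, 1)
lin.pred <- c(-10, -1, 0, 1, 10)   # arbitrary linear-predictor values
probs <- plogis(lin.pred)
# every element of probs lies strictly between 0 and 1
```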

AC

said: I want to know how the generalized linear model works and what the coefficients actually mean. Any reading you can point to on the subject?

PS: Loved the tutorial. 🙂

David

said: Thank you so much for the tutorial. I would never have known how to parse out the Miss, Mrs, etc. titles and never would have thought of it. Did you deliberately use Pclass, Mother, and Child as numeric values rather than factors? If so, what was the thought process?

Pingback: First Kaggle submission : Predict survival on the Titanic - Analytics Khoj