Recap

You have now cleaned the data by doing the following:
1. Converted categorical variables to dummy variables
2. Imputed the missing age values
3. Created new variables to better fit a model
You are now ready to build a model that will make predictions!
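
For reference, a minimal sketch of what those cleaning steps might look like is below. This is not necessarily the exact code from the earlier sections (the imputation method and the Child definition, for example, are just one possibility), but the variable names mirror the ones used in the model formula later on:

# Rough recap sketch, assuming trainData holds the Train dataset
trainData$Sex <- as.factor(trainData$Sex)                                    # factors become dummy variables inside glm()
trainData$Age[is.na(trainData$Age)] <- median(trainData$Age, na.rm = TRUE)   # one simple way to fill in missing ages
trainData$Child <- ifelse(trainData$Age < 18, 1, 0)                          # example of an engineered variable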

Training a Model

We first feed the training data into the model, and R optimizes the model's coefficients to best explain the relationship between the predictor variables and the outcome. The idea is that we build a model for predicting survival using the Train dataset, and then feed the observations from the Test dataset into that fitted model to predict their survival.
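
If you are starting from this point, the two datasets can be loaded with read.csv. The file names below are the standard Kaggle downloads, and the trainData and testData names match the code that follows; adjust the paths if yours differ:

# Assumes the Kaggle CSVs are in your working directory
trainData <- read.csv("train.csv")
testData  <- read.csv("test.csv")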

Fitting a Logistic Regression Model

R will take care of solving/optimizing the model, so we don’t have to worry about any complicated math! A logistic regression model is a generalized linear model used when you’re trying to predict a binary outcome. Since whether a passenger survived or not is binary, we use logistic regression. The predictors we choose are Passenger Class, Sex, Age, Child, an interaction between Sex and Passenger Class, Family, and Mother.

train.glm <- glm(Survived ~ Pclass + Sex + Age + Child + Sex*Pclass + Family + Mother, family = binomial, data = trainData)
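
As a side note, in R formula syntax Sex*Pclass already expands to Sex + Pclass + Sex:Pclass, so the main effects are listed twice above (R drops the duplicates). The following equivalent formula uses Sex:Pclass to add only the interaction term:

# Equivalent model: Sex:Pclass adds just the interaction,
# since the Sex and Pclass main effects are already listed
train.glm <- glm(Survived ~ Pclass + Sex + Age + Child + Sex:Pclass + Family + Mother, family = binomial, data = trainData)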

To see a summary of the model, and specifically the coefficients calculated to predict survival, you can type:

summary(train.glm)
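
Optionally, since logistic regression models the log-odds of survival, the fitted coefficients can be easier to interpret as odds ratios by exponentiating them:

# Exponentiate the coefficients to get odds ratios
exp(coef(train.glm))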

Making Predictions on the Test Dataset
Now that the Test dataset is ready, we use an R function that calculates survival predictions for the passengers in the Test dataset. The prediction for each observation comes in the form of a probability that the response is 1. We therefore apply a cutoff value to determine which probabilities translate to a 1 and which translate to a 0. For simplicity we choose a cutoff of 0.5, the natural default when both kinds of misclassification are treated equally.

Here R takes the coefficients estimated in the train.glm model and applies them to the same variables in the Test dataset (Passenger Class, Sex, Age, Child, the interaction of Sex and Passenger Class, Family, and Mother) to calculate a survival probability for each Test observation.

p.hats <- predict.glm(train.glm, newdata = testData, type = "response")

survival <- vector()
for(i in 1:length(p.hats)) {
  if(p.hats[i] > .5) {
    survival[i] <- 1
  } else {
    survival[i] <- 0
  }
}
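
The loop above simply turns each probability into a 0 or 1. If you prefer, R's vectorized ifelse produces the same result in a single line:

# Vectorized equivalent of the loop above
survival <- ifelse(p.hats > .5, 1, 0)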

Creating a CSV to Submit to Kaggle

We now output the predictions into a CSV file, which can be submitted on the Kaggle competition page for scoring.

kaggle.sub <- cbind(testData$PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "kaggle.csv", row.names = FALSE)
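
Before uploading, it is worth a quick sanity check that the file has the expected two columns and one prediction per Test passenger:

# Quick check of the submission file
head(read.csv("kaggle.csv"))
nrow(kaggle.sub) == nrow(testData)   # should be TRUE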

A file titled kaggle.csv should now be in your R working directory (by default, the folder R is running from). Use this file to make a submission on the Kaggle website and see where you rank!

Note: Make sure the CSV you submit has only two columns: one labeled “PassengerId” and another labeled “Survived”.

If you liked our tutorial, keep in touch at our landing page.

Email us at: statsguys@gmail.com
GitHub repo for the entire R code: https://github.com/tristantao/kaggle_survivor