Mar 9, 2017

Disaster Relief - Titanic Survival

Scenario

You are a data scientist with a research firm that specializes in emergency management. In advance of client work, you’ve been asked to create and train a logistic regression model that can show off the firm’s capabilities in disaster analysis.

After a disaster, researchers and firms are frequently brought in to give an independent review of the incident. While your firm doesn’t have any current client data that it can share with you to test and deploy your model, it does have data from the 1912 Titanic disaster stored in a remote database.

The goal of this study is to train a model that predicts survival in future disasters and identifies the factors associated with the likelihood of survival.

Risks & Assumptions:

VARIABLE DESCRIPTIONS:

survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

1. Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

2. Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

3. With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored.

The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

Parent: Mother or Father of Passenger Aboard Titanic

Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so parch=0 for them. Others travelled with very close friends or neighbors in a village, but the definitions do not cover such relations.

Using PostgreSQL

Acquiring the Titanic data set required connecting to a database hosted on AWS (Amazon Web Services) and loading the data locally to perform the analysis and predict Titanic survival based on various features.

Data Cleaning / Munging

Since the dataset is a famous one, used throughout the education world and beyond, the data was already clean and no further steps were necessary to prepare it for analysis.

Data Analysis / EDA

In analyzing the Titanic data set, SQLAlchemy, a Python SQL toolkit, was used to run SQL queries from within a Jupyter notebook. Initially, a simple SQL query was used to check the features and get a glimpse of the data to be modeled. Using SQL aggregate functions, the maximum fare was found to be $512.33 and the average fare $32.20. The average age of passengers on the Titanic was around 30. SQL was also used to count the passengers in each class: class 3 had 491 passengers, class 2 had 184, and class 1 had 216.
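The aggregate queries described above might look like the following sketch. It uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the remote PostgreSQL database, which is not reachable from this post. The table name `train`, the column names, and the four rows are illustrative assumptions, not the real data.

```python
import sqlite3

# In-memory stand-in for the remote PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (pclass INTEGER, age REAL, fare REAL)")

# Toy rows for illustration only; these are not Titanic records.
rows = [(1, 38.0, 71.28), (3, 22.0, 7.25), (3, 26.0, 7.92), (2, 30.0, 13.00)]
conn.executemany("INSERT INTO train VALUES (?, ?, ?)", rows)

# Aggregate functions: maximum fare, average fare, average age.
max_fare, avg_fare, avg_age = conn.execute(
    "SELECT MAX(fare), AVG(fare), AVG(age) FROM train"
).fetchone()

# Passenger count per class.
class_counts = dict(
    conn.execute(
        "SELECT pclass, COUNT(*) FROM train GROUP BY pclass ORDER BY pclass"
    ).fetchall()
)

print(max_fare, round(avg_fare, 2), avg_age)
print(class_counts)
```

Against the real database, the same SQL strings could be passed to `pandas.read_sql` together with a SQLAlchemy engine.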

The histogram of the age distribution shows that the majority of passengers were in their 20s and 30s, with passenger counts declining as age increases beyond 20.

The scatter plot shows the relationship between fare and age. The two most expensive fares, around $500, were bought by two people in their late 30s. The vast majority of Titanic fares were below $100. There was also a group of people in their 20s who bought tickets priced between $200 and $300.

Data Mining / Wrangling

From there, the data was converted to a dataframe. After looking at the data, it was determined that the cabin column was irrelevant in determining whether a passenger would survive a disaster because of the random allotment of cabin numbers to passengers. Therefore, the cabin column was dropped. Dummy variables were created for the sex of the passenger, one for male and one for female. For the remaining categorical columns (embarked, passenger class, number of siblings/spouses, and number of parents/children), Patsy was used to create dummy variables.
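A minimal sketch of this wrangling step is shown here. It uses `pandas.get_dummies` in place of Patsy for brevity, so it is not the notebook's exact method. The DataFrame holds toy values, and the column names are assumed to follow the variable descriptions above.

```python
import pandas as pd

# Toy passengers (illustrative values, not real records).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 13.00],
    "cabin": [None, "C85", None, None],
    "embarked": ["S", "C", "S", "Q"],
})

# Drop the cabin column, as described in the text.
df = df.drop(columns=["cabin"])

# One indicator column per category level; continuous columns pass through.
X = pd.get_dummies(
    df.drop(columns=["survived"]),
    columns=["sex", "pclass", "embarked"],
)
y = df["survived"]

print(sorted(X.columns))
```

Keeping every level of each category, as here, mirrors the text's "one for male and one for female"; dropping one level per category is a common alternative when perfectly collinear columns are a concern.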

Logistic Regression and Model Validation

Since some variables were continuous and others discrete, each column was treated differently. Continuous variables were left as they were, while discrete variables were dummified. These columns were then concatenated to form the X matrix for modeling. The “Survived” column was set as y, with 1 meaning “survived” and 0 meaning “not survived”. Next, a train/test split was done to randomly divide the data into training and testing sets. From there, a standard scaler was used to normalize the continuous variables, centering each on its mean and scaling it to unit variance.

Next, the logistic regression model was created using scikit-learn. The model was fitted on the training data and used to predict survival. By examining the coefficient of each feature, the impact each feature has on the model could be determined. In this particular model, male sex, having 3 siblings or spouses, and passenger class 3 were found to have the largest impact on survival aboard the Titanic. A negative coefficient indicates a higher chance of dying, while a positive coefficient indicates a higher chance of survival.
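The modeling steps above can be sketched with scikit-learn as follows. The data is synthetic (`make_classification`), so the coefficients and score are illustrative only. The 70/30 split and the `random_state` values are assumptions rather than settings taken from the notebook, and for simplicity the sketch scales every feature instead of only the continuous ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic design matrix (assumption).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)

# Hold out part of the data for testing (30% here is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fitting the scaler inside the pipeline means it only sees training data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# One coefficient per feature; the sign gives the direction of the effect
# on the log-odds of the positive class.
coefs = model.named_steps["logisticregression"].coef_[0]
accuracy = model.score(X_test, y_test)

print(coefs.round(2), round(accuracy, 2))
```

Because the features are standardized before fitting, the magnitudes of the coefficients are roughly comparable across features, which is what makes ranking them by impact reasonable.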

Classification Report and Benchmarks

The model was then scored to test its accuracy. The accuracy score came out to .78 and the mean cross-validation score came out to .77. Next, a classification report gave the precision (TP/(TP + FP)), recall (TP/(TP + FN)), and F1 score (the harmonic mean of precision and recall). The average precision, recall, and F1 scores all came out to .78.

Confusion Matrix

The confusion matrix summarizes the performance of the classification model by comparing actual outcomes with predictions. It gave 69 true positives, 115 true negatives, 24 false positives, and 27 false negatives.
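As a check, the reported scores can be recomputed from these four counts alone. The sketch below assumes the averages in the classification report are support-weighted across the two classes (the helper `prf` and the variable names are illustrative).

```python
# Counts reported in the text for the logistic regression model.
tp, tn, fp, fn = 69, 115, 24, 27
total = tp + tn + fp + fn

accuracy = (tp + tn) / total


def prf(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 (harmonic mean) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (survived): positives are survivors.
survived = prf(tp, fp, fn)
# Class 0 (not survived): the roles of the counts are swapped.
died = prf(tn, fn, fp)

# Support = number of actual members of each class.
support_survived, support_died = tp + fn, tn + fp

# Support-weighted average of precision, recall, and F1.
weighted = [
    (s * support_survived + d * support_died) / total
    for s, d in zip(survived, died)
]

print(round(accuracy, 2), [round(w, 2) for w in weighted])
```

Under that assumption, the accuracy and all three weighted averages round to .78, which agrees with the figures reported above.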

ROC Curve

The ROC curve plots the true positive rate against the false positive rate for survived/not survived at different classification thresholds. The greater the area under the ROC curve, the better the model separates the two classes. The curve shows the tradeoff between sensitivity (the true positive rate, the benefit) and the false positive rate (1 - specificity, the cost). This means that the higher you go on the y-axis, the more actual survivors are correctly predicted, but the more false predictions of survival are made as well, and vice versa. In a perfect world, the true positive rate would equal 1 and the false positive rate would equal 0. The area under the ROC curve was .86, which indicates that the model discriminates fairly well between survivors and non-survivors.
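To make the construction of the curve concrete, here is a small pure-Python sketch that sweeps the threshold over toy scores and integrates the result with the trapezoidal rule. The labels and scores are made up for illustration, and the helper names are hypothetical; in practice scikit-learn's `roc_curve` and `roc_auc_score` do this work.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs as the threshold sweeps from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one observation at a time (assumes no tied scores).
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points, auc(points))
```

Each step to the right on the curve is a false positive and each step up is a true positive, which is why a curve that hugs the top-left corner has an area close to 1.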

Gridsearch

To search for the optimal parameters of the logistic regression, a grid search was used to determine which penalty to use, L1 or L2, and which C value, between -5 and 50, gives the best score for the model. The grid search found that the L1 penalty and a C of 5.69 were the best parameters, reaching a score of .80. Compared to the vanilla logistic regression (.78), the GridSearchCV score (.80) is 2 percentage points better. However, there was no meaningful difference in the precision, recall, and F1 scores compared with the vanilla logistic regression. Grid searching in this instance didn’t have a major impact on the model score, and the vanilla logistic regression was just as good at predicting Titanic survival.
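A grid search of this kind might be set up as in the sketch below. The notebook's actual grid is not shown in this post. The sketch assumes `np.logspace(-5, 1, 50)`, which is one reading of "between -5 and 50" (C must be positive in scikit-learn) and which happens to contain values that round to the reported 5.69 and 1.05. The data is synthetic, and the `liblinear` solver and 5-fold cross-validation are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Assumed grid: 50 values of C spaced logarithmically from 1e-5 to 10.
c_grid = np.logspace(-5, 1, 50)

param_grid = {"penalty": ["l1", "l2"], "C": c_grid}

# liblinear supports both the L1 and L2 penalties.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 2))
```

Smaller values of C mean stronger regularization, so the search is effectively choosing both the type and the strength of the penalty.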

Gridsearch & KNN

Next, another grid search was performed, this time with a KNeighborsClassifier instead of logistic regression, to see whether a different classification method would provide better scores and results. The best parameters for the model were 21 neighbors with uniform weights. The best score for the grid-searched KNN was .79. The classification report shows an average precision of .78, an average recall of .78, and an F1 score of .77. The confusion matrix showed 62 true positives, 121 true negatives, 34 false negatives, and 18 false positives.
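The KNN search could be sketched in the same way. The search space here (odd neighbor counts from 1 to 31 and both weighting schemes) is hypothetical, chosen only so that it contains the reported best setting of 21 neighbors with uniform weights; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical search space: odd neighbor counts and both weighting schemes.
param_grid = {
    "n_neighbors": list(range(1, 32, 2)),
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 2))
```

Since KNN is distance-based, it is usually run on scaled features, which the standard scaling described earlier would provide.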

The chart shows an ROC AUC score of .85. The ROC AUC score is very close to the score from the gridsearch KNN model. The plots show the two ROC curves and the area under each. It is interesting to note that the logistic gridsearch curve is more jagged than the smoother KNN curve.

Precision-Recall

The “average precision” scoring method optimizes parameters for the area under the precision-recall curve rather than for accuracy. Grid searching with this scoring method gives a best score of .82 with a C of 1.05 and an L1 penalty.

This chart shows the relationship between precision and recall: as recall increases, precision tends to decrease.
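The average precision score can be illustrated with a short pure-Python sketch. It walks down the observations from the highest score to the lowest, records recall and precision at each step, and sums the precision values weighted by each increase in recall. The labels, scores, and helper names are illustrative assumptions, not output from the models above.

```python
def precision_recall_points(y_true, scores):
    """(recall, precision) after each observation, highest score first."""
    total_pos = sum(y_true)
    tp = 0
    points = []
    # Assumes no tied scores, to keep the sketch short.
    ranked = sorted(zip(scores, y_true), reverse=True)
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / k))
    return points


def average_precision(points):
    """Sum of precision values weighted by the increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = precision_recall_points(y_true, scores)
print(points, round(average_precision(points), 2))
```

In this toy case, reaching full recall requires accepting a false positive along the way, so precision falls from 1.0 to about 0.67 as recall goes from 0.5 to 1.0.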

Conclusion:

After running various classification models (logistic regression, grid-searched logistic regression, and grid-searched KNN), the models showed little variation in accuracy. During exploratory data analysis, SQL and SQLAlchemy were the tools used to analyze and manipulate the data.

The coefficients of the logistic regression model show that male sex, having 3 siblings or spouses, and being in passenger class 3 had the largest negative impact on survival on the Titanic, meaning passengers with these characteristics had the highest chance of not surviving. Fare, along with having one parent/child or one sibling/spouse, was associated with a higher chance of survival.

Classification reports and confusion matrices were run to score the models and to count the true positives, true negatives, false positives, and false negatives. The ROC AUC scores provided insight into the relationship between the true positive and false positive rates. Lastly, a precision-recall curve was used to check the relationship between the precision and recall of the models.

Ultimately, through the various classification models and the different ways of improving their accuracy, precision, recall, and F1 scores, some insight was gained into a passenger’s survival probability and which factors were important to survival.

Link to Jupyter Notebook
