The Titanic Dataset (EDA & ML)
Pandas, NumPy & Matplotlib

Preparing the Titanic Dataset for Machine Learning
GitHub: Repository.
Introduction
In this guide, we'll walk through preparing the Titanic dataset for machine learning analysis with a Random Forest model. The Titanic dataset, a compilation of data on passengers aboard the ill-fated RMS Titanic, has become a classic for data science projects, particularly for those new to the field. Our goal was to clean and encode the dataset, ensuring it's ready for predictive modeling. Here's a step-by-step walkthrough of what we did and why.
Ⅰ. Data Cleaning and Preparation
Reading and Renaming Columns
We began by importing the necessary Python library, pandas, and loading the dataset from a CSV file named titanic3.csv. The first step in cleaning involved renaming the columns to ensure consistency and ease of access: spaces were replaced with underscores, and all column names were converted to lowercase. This standardization helps avoid errors and confusion in later analysis.
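A minimal sketch of this step with pandas (the file name comes from the article; the exact column names in your copy of titanic3.csv may differ):

```python
import pandas as pd

# Load the raw Titanic data.
df = pd.read_csv("titanic3.csv")

# Standardize column names: strip whitespace, lowercase, spaces -> underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
```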
Handling Missing Values
The age column had missing values, which were imputed with the median age of the dataset. Using the median is a common practice for dealing with missing numeric data because it is robust to outliers. We then rounded the ages to integers for uniformity and easier categorization.
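Assuming the column is named age after the renaming step, the imputation and rounding might look like this:

```python
# Fill missing ages with the dataset-wide median, which is robust to outliers.
df["age"] = df["age"].fillna(df["age"].median())

# Round to whole years for uniformity and easier categorization.
df["age"] = df["age"].round().astype(int)
```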
Fares equal to zero were replaced with NaN so they could be imputed along with genuinely missing values. For passengers with null fares, we imputed values based on the median fare of similar passengers, considering their class, embarkation point, and age. This approach provides a more accurate estimate than a broader measure such as the overall median fare.
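One way to sketch the fare imputation, grouping here only by class and embarkation point (the article also factors in age), with pclass, embarked, and fare assumed as column names:

```python
import numpy as np

# Treat zero fares as missing so they are imputed along with the true NaNs.
df["fare"] = df["fare"].replace(0, np.nan)

# Impute each missing fare with the median fare of comparable passengers.
df["fare"] = df.groupby(["pclass", "embarked"])["fare"].transform(
    lambda s: s.fillna(s.median())
)

# Fall back to the overall median for any group with no known fares at all.
df["fare"] = df["fare"].fillna(df["fare"].median())
```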
Dropping Columns
Columns with a high percentage of missing values, or those unlikely to influence the outcome (survival), were dropped. The cabin column was removed due to its sparsity and questionable impact on survival predictions.
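A short sketch of this check-and-drop step; cabin is the column named in the article, and any further drops would depend on your own missingness threshold:

```python
# Inspect the share of missing values per column before deciding what to drop.
print(df.isna().mean().sort_values(ascending=False))

# Drop the sparse cabin column.
df = df.drop(columns=["cabin"], errors="ignore")
```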
Feature Engineering
We made several adjustments to the dataset to better capture the nuances of the data; a code sketch of these transformations follows the list:
- The boat column, indicating lifeboat assignment, was simplified to retain only the first part of any entry. This simplification reduced the feature's complexity without losing significant information. We then applied one-hot encoding to transform this categorical variable into a format suitable for machine learning algorithms.
- The body column, indicating body identification numbers for those who didn't survive, was converted into a binary feature. This transformation simplifies the model's task by focusing on the presence or absence of a body ID rather than the specific ID number.
- A new feature, fare_per_person, was created by adjusting the fare based on family size. This adjustment provides a more accurate representation of the fare's impact on survival by considering economic status on a per-person basis.
- The dataset was further enriched by categorizing ages into meaningful groups and applying one-hot encoding. This process allows the model to recognize patterns across different age groups more effectively.
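A sketch of these four transformations, assuming the standard titanic3 column names (boat, body, sibsp, parch); the age bin edges and labels below are illustrative choices, not the article's exact cut points:

```python
import pandas as pd

# Keep only the first lifeboat listed when several are recorded (e.g. "13 15 B" -> "13"),
# then one-hot encode the simplified categorical column.
df["boat"] = df["boat"].astype("string").str.split().str[0]
df = pd.get_dummies(df, columns=["boat"], dummy_na=True)

# Reduce the body column to a binary flag: was a body ID recorded at all?
df["body"] = df["body"].notna().astype(int)

# Fare per person: divide the fare by family size (siblings/spouses + parents/children + self).
family_size = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / family_size

# Bin ages into coarse groups and one-hot encode the result.
age_bins = [0, 12, 18, 35, 60, 120]          # illustrative cut points
age_labels = ["child", "teen", "adult", "middle_aged", "senior"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)
df = pd.get_dummies(df, columns=["age_group"], prefix="age")
```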
Final Adjustments and Encoding
We applied one-hot encoding to the sex column, transforming it into binary sex_male and sex_female columns. This step is crucial for models that require numeric input. Finally, unnecessary columns were dropped, and the cleaned dataset was saved to a new CSV file, titanicEncodedv2.csv, ready for machine learning analysis.
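A final sketch, assuming pandas' get_dummies for the encoding; the list of text columns dropped at the end is a hypothetical example, not the article's exact list:

```python
# One-hot encode sex into sex_female / sex_male columns.
df = pd.get_dummies(df, columns=["sex"])

# Drop remaining free-text columns that the model cannot use (hypothetical list).
df = df.drop(columns=["name", "ticket", "home.dest"], errors="ignore")

# Persist the cleaned, encoded dataset for the modeling step.
df.to_csv("titanicEncodedv2.csv", index=False)
```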
Ⅱ. Conclusion
The preparation of the Titanic dataset involved careful consideration of each feature's relevance and representation. Through cleaning, imputing missing values, feature engineering, and encoding, we've transformed the raw data into a form that's not only cleaner but also more conducive to uncovering insights with machine learning. The next step involves building and training a Random Forest model to predict survival on the Titanic, with the dataset now fully prepped for this task. Our methodology ensures that the data fed into the model is of high quality, which is crucial for developing accurate and reliable predictive models.