
Exploratory Data Analysis (EDA) on Titanic

Goal

Perform EDA on the Titanic dataset and produce:

  • Data quality findings (missing values, types)
  • A handful of clear plots
  • Insights about survival patterns

Dataset

Common sources:

  • Kaggle: Titanic - Machine Learning from Disaster

Typical columns:

  • Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked

Step 1: Load data

Load Titanic CSV
import pandas as pd
 
df = pd.read_csv("data/titanic.csv")
print(df.shape)
print(df.head())
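Before going further, it can help to confirm that the file actually contains the columns listed under "Typical columns". The sketch below is one way to do that; the helper name `missing_columns` and the toy frame are made up for illustration, and the expected list is simply copied from the column list above.

```python
import pandas as pd

# Columns listed under "Typical columns"; adjust if your CSV differs.
EXPECTED_COLUMNS = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for the real CSV (made-up rows, not Titanic data).
toy = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Sex": ["male", "female"]})
print(missing_columns(toy))
```

On the real data you would call `missing_columns(df)` right after `pd.read_csv`; an empty list means every expected column is present.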

Step 2: Schema and missingness

Info + missing
df.info()  # info() prints its summary itself and returns None, so no print() is needed
 
missing = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing)

Focus on missing values in:

  • Age
  • Cabin
  • Embarked
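To look at just these three columns, a small summary with both counts and percentages is often easier to read than the full percentage listing. This is a minimal sketch: `missing_summary` is a hypothetical helper, and the toy frame uses made-up values purely to show the shape of the output.

```python
import numpy as np
import pandas as pd


def missing_summary(df, columns):
    """Count and percentage of missing values for the columns that exist in df."""
    present = [c for c in columns if c in df.columns]  # skip names the file lacks
    counts = df[present].isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})
print(missing_summary(toy, ["Age", "Cabin", "Embarked"]))
```

On the real data, `missing_summary(df, ["Age", "Cabin", "Embarked"])` gives the numbers that decide how each column is handled in the next step.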

Step 3: Clean minimal issues

Handle missing Embarked (only a few rows, so fill with the most common port)

Embarked fill
if "Embarked" in df.columns:
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode().iloc[0])

Keep Cabin as a “has cabin” flag

Cabin flag
if "Cabin" in df.columns:
    df["has_cabin"] = df["Cabin"].notna()
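Age is also flagged as missing in Step 2, but the steps above leave it untouched (the histogram later simply drops the gaps). If you want a filled Age column, one common option, which is an assumption here and not part of the original walkthrough, is to fill with the median age of each Pclass/Sex group and keep a flag for the rows that were imputed. The helper name `impute_age`, the `age_was_missing` column, and the toy rows are all made up for illustration.

```python
import numpy as np
import pandas as pd


def impute_age(df):
    """Return a copy with Age filled by the median of its (Pclass, Sex) group."""
    out = df.copy()
    out["age_was_missing"] = out["Age"].isna()  # remember which rows were imputed
    group_median = out.groupby(["Pclass", "Sex"])["Age"].transform("median")
    # Fall back to the overall median if an entire group has no known ages.
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    return out


# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})
print(impute_age(toy))
```

Working on a copy keeps the raw Age column available, so you can still compare distributions with and without imputation.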

Step 4: Univariate plots

Survival distribution

Survival count
import seaborn as sns
import matplotlib.pyplot as plt
 
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Survived")
plt.title("Survival counts")
plt.tight_layout()
plt.show()

Age distribution

Age distribution
import seaborn as sns
import matplotlib.pyplot as plt
 
if "Age" in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(df["Age"].dropna(), bins=30, kde=True)
    plt.title("Age distribution")
    plt.tight_layout()
    plt.show()
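The two plots can be backed with numbers that are easy to quote later. A minimal sketch, using a toy frame with made-up values in place of the real `df`: `value_counts(normalize=True)` gives the share of each Survived value, and `describe()` summarises Age while skipping missing entries.

```python
import pandas as pd

# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, None, 35.0],
})

# Share of passengers per Survived value (0 = did not survive, 1 = survived).
survival_share = toy["Survived"].value_counts(normalize=True).sort_index()
print(survival_share)

# Count, mean, quartiles, min and max of the known ages.
age_stats = toy["Age"].describe()
print(age_stats)
```

The share for `Survived == 1` is the overall survival rate, which is a useful baseline when comparing groups in the next step.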

Step 5: Survival by category

Survival by sex

Survival by sex
import seaborn as sns
import matplotlib.pyplot as plt
 
plt.figure(figsize=(7, 4))
sns.barplot(data=df, x="Sex", y="Survived")
plt.title("Survival rate by sex")
plt.tight_layout()
plt.show()

Survival by passenger class

Survival by class
import seaborn as sns
import matplotlib.pyplot as plt
 
plt.figure(figsize=(7, 4))
sns.barplot(data=df, x="Pclass", y="Survived")
plt.title("Survival rate by class")
plt.tight_layout()
plt.show()
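Because Survived is coded 0/1, the bar heights in these plots are group means, so the same rates can be tabulated directly. The sketch below shows one way to do it, with group sizes alongside the rates; `survival_by` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def survival_by(df, by):
    """Group size and survival rate for each level of `by`."""
    return (
        df.groupby(by)["Survived"]
        .agg(n="count", survival_rate="mean")
        .reset_index()
    )


# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

print(survival_by(toy, "Sex"))
print(survival_by(toy, "Pclass"))

# Both categories at once: rows are Sex, columns are Pclass, cells are survival rates.
pivot = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(pivot)
```

Keeping `n` next to each rate makes it obvious when a striking rate comes from only a handful of passengers.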

Step 6: Numeric relationships

Fare vs survival (boxplot)

Fare vs survival
import seaborn as sns
import matplotlib.pyplot as plt
 
plt.figure(figsize=(7, 4))
sns.boxplot(data=df, x="Survived", y="Fare")
plt.title("Fare vs survival")
plt.tight_layout()
plt.show()
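Fare is usually heavily skewed, so a few very expensive tickets can dominate the boxplot. As a complementary view (an addition, not part of the original steps), you could compare median fares per outcome and look at survival rates across fare quartiles. The `fare_bin` column and the toy values below are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "Fare": [7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 80.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Median fare for non-survivors (0) and survivors (1); medians resist outliers.
median_fare = toy.groupby("Survived")["Fare"].median()
print(median_fare)

# Split fares into four equal-sized bins, then compute the survival rate per bin.
toy["fare_bin"] = pd.qcut(toy["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
rates = toy.groupby("fare_bin", observed=True)["Survived"].mean()
print(rates)
```

If the rate rises steadily from Q1 to Q4 on the real data, that supports a fare-related insight more directly than the boxplot alone.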

Step 7: Write insights (example)

Write 5–10 bullet insights such as:

  • Survival rate is higher for females than for males.
  • Passengers in higher classes (lower Pclass number) survived at higher rates.
  • Passengers who paid higher fares tended to survive more often.
  • Missingness is high in Cabin, so treat its presence as a feature (“has_cabin”) instead of imputing it.
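Insights read better when each one carries a number. A small helper along these lines could turn a grouped survival rate into a sentence you can paste into the write-up; `describe_gap` is a hypothetical name and the toy frame holds made-up values, so treat it as a sketch to adapt.

```python
import pandas as pd


def describe_gap(df, column):
    """One-line comparison of the highest and lowest survival rate within `column`."""
    rates = df.groupby(column)["Survived"].mean().sort_values(ascending=False)
    top, bottom = rates.index[0], rates.index[-1]
    return (
        f"{column}={top}: {rates.iloc[0]:.0%} survived vs "
        f"{column}={bottom}: {rates.iloc[-1]:.0%}"
    )


# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({
    "has_cabin": [True, True, False, False, False],
    "Survived": [1, 1, 0, 1, 0],
})
print(describe_gap(toy, "has_cabin"))
```

On the real data you might call it for "Sex", "Pclass", and "has_cabin" and use the results as the backbone of your bullet list.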

Deliverable

Save a cleaned dataset version:

Save output
df.to_csv("output/titanic_cleaned.csv", index=False)
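Note that `to_csv` does not create missing folders, so the call above fails if `output/` does not exist yet. A minimal sketch of a safer save, assuming you are happy to create the folder on the fly: `save_cleaned` is a hypothetical helper, and the example writes to a temporary directory with a toy frame so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(df, path):
    """Write df to CSV, creating the parent folder first if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Toy frame with made-up values (not Titanic data).
toy = pd.DataFrame({"Survived": [0, 1], "has_cabin": [False, True]})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_cleaned(toy, Path(tmp) / "output" / "titanic_cleaned.csv")
    reloaded = pd.read_csv(saved)  # read it back as a quick round-trip check

print(reloaded.shape)
```

For the real deliverable, `save_cleaned(df, "output/titanic_cleaned.csv")` would write to the same path used above.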
