Credit Card Fraud Detection

Goal

Fraud datasets are usually highly imbalanced: fraudulent transactions make up only a small fraction of all records.

You will:

  • Explore class imbalance
  • Check feature distributions
  • Build a baseline model evaluation plan

Step 1: Load

Load
import pandas as pd
 
df = pd.read_csv("data/fraud.csv")
print(df.shape)
print(df.head())

Step 2: Class imbalance

Imbalance
print(df["fraud"].value_counts())
print(df["fraud"].value_counts(normalize=True))
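
The two counts above can be condensed into a fraud rate and an imbalance ratio. The sketch below is only an illustration: it assumes the `fraud` column is encoded as 0 (legitimate) and 1 (fraud), and it uses a small made-up `labels` series in place of `df["fraud"]` so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df["fraud"]: 990 legitimate (0) and 10 fraudulent (1) rows.
labels = pd.Series([0] * 990 + [1] * 10, name="fraud")

counts = labels.value_counts()

# With 0/1 labels, the mean is the fraction of rows marked as fraud.
fraud_rate = labels.mean()

# Number of legitimate rows for every fraudulent row.
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Imbalance ratio: {imbalance_ratio:.0f} : 1")
```

On the real data, replace `labels` with `df["fraud"]`. If the column uses other encodings (for example strings), the mean-based shortcut does not apply as written.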

Step 3: Visualize distributions

Feature distribution
import seaborn as sns
import matplotlib.pyplot as plt
 
# Example: compare one feature by class
feature = "amount"
 
plt.figure(figsize=(7, 4))
sns.kdeplot(data=df, x=feature, hue="fraud", common_norm=False)
plt.title(f"{feature} distribution by class")
plt.tight_layout()
plt.show()
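
A plot can be backed up with per-class summary statistics, which are easier to compare across many features. The following is a rough sketch under stated assumptions: `df_demo` is synthetic data invented for this example (it is not the tutorial's dataset), and the column names `amount` and `fraud` simply mirror the ones used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real DataFrame: 990 legitimate rows, 10 fraud rows.
n_legit, n_fraud = 990, 10
df_demo = pd.DataFrame({
    "amount": np.concatenate([
        rng.lognormal(mean=3.0, sigma=1.0, size=n_legit),
        rng.lognormal(mean=5.0, sigma=1.0, size=n_fraud),
    ]),
    "fraud": [0] * n_legit + [1] * n_fraud,
})

# One row of summary statistics per class for the chosen feature.
summary = df_demo.groupby("fraud")["amount"].describe()
print(summary[["count", "mean", "50%", "std"]])
```

With the real data, `df.groupby("fraud")[feature].describe()` gives the same table for any numeric feature. Keep in mind that statistics for the minority class rest on few rows and can be noisy.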

Step 4: Baseline modeling note

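For fraud, accuracy is misleading: a model that labels every transaction as legitimate can score very high accuracy while detecting no fraud at all.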
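
A small worked example makes the problem concrete. The labels below are made up for illustration (990 legitimate, 10 fraudulent), and the "model" is a trivial rule that never flags anything.

```python
# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# A trivial "model" that never flags fraud.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.3f}")  # 0.990
print(f"recall   = {recall:.3f}")    # 0.000
```

The rule is right 99% of the time and still misses every fraudulent transaction.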

Prefer:

  • Precision / Recall
  • F1 score
  • ROC-AUC
  • PR-AUC
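
The metrics above could be computed along these lines with scikit-learn. This is a minimal sketch, not a recommended pipeline: the data is synthetic, the logistic regression is just a placeholder baseline, and `average_precision_score` is used as a common estimate of PR-AUC. On the real data, `X` would be the feature columns and `y` would be `df["fraud"]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced stand-in data: 5% of rows are "fraud".
n_legit, n_fraud = 4750, 250
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_legit, 2)),
    rng.normal(2.0, 1.0, size=(n_fraud, 2)),
])
y = np.array([0] * n_legit + [1] * n_fraud)

# stratify=y keeps the fraud rate the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 predictions
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of fraud

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
    "pr_auc": average_precision_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

Precision, recall, and F1 depend on the decision threshold (0.5 by default here), while ROC-AUC and PR-AUC are computed from the scores and summarize performance across all thresholds.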

Deliverable

Write short answers to the following:

  • How imbalanced is the dataset?
  • Which features differ between classes?
  • What metric will you optimize?
