Understanding Data Quality
Why data quality matters
Poor data quality leads to:
- Wrong decisions
- Broken dashboards
- Unreliable ML models
- Loss of trust
Data preprocessing is the discipline of making data usable and trustworthy.
Key dimensions of data quality
1) Completeness
Are required values missing?
- Missing counts per column: df.isna().sum()
- Missing percentage per column
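As a sketch, missing counts and percentages can be computed together (the frame and column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps; the columns are illustrative
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Delhi", "Mumbai", None, "Pune"],
})

missing_counts = df.isna().sum()       # absolute missing count per column
missing_pct = df.isna().mean() * 100   # missing percentage per column

print(missing_counts)
print(missing_pct.round(1))
```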
2) Accuracy
Do values match reality?
- Negative ages
- Impossible dates (future DOB)
- Wrong units (₹ vs paise)
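Checks like these reduce to simple filters. A sketch with planted problems (the values are illustrative):

```python
import pandas as pd

# Toy records with deliberate accuracy problems
df = pd.DataFrame({
    "age": [34, -2, 51],
    "dob": pd.to_datetime(["1990-01-15", "2030-06-01", "1972-11-30"]),
})

# Values that cannot match reality
negative_ages = df[df["age"] < 0]
future_dobs = df[df["dob"] > pd.Timestamp.now()]

print(len(negative_ages), "negative ages")
print(len(future_dobs), "future dates of birth")
```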
3) Consistency
Do values follow the same format across rows?
- “delhi”, “Delhi”, “ DELHI ”
- Mixed currencies
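The casing/whitespace case above can be fixed by normalizing the text, for example:

```python
import pandas as pd

# The same city in three inconsistent spellings
cities = pd.Series(["delhi", "Delhi", " DELHI "])

# Trim whitespace and normalize casing so the variants collapse to one value
normalized = cities.str.strip().str.title()
print(normalized.unique())
```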
4) Validity
Do values fit the allowed set/range?
- status should be one of a fixed set of values (e.g., closed)
- rating must be 1–5
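Validity checks can be expressed with `isin` and `between`. A sketch (the allowed status set here is an assumption; adapt it to your real schema):

```python
import pandas as pd

# Assumed allowed values for illustration only
ALLOWED_STATUS = {"open", "closed"}

df = pd.DataFrame({
    "status": ["open", "closed", "pending"],
    "rating": [4, 6, 1],
})

# Rows whose values fall outside the allowed set/range
invalid_status = df[~df["status"].isin(ALLOWED_STATUS)]
invalid_rating = df[~df["rating"].between(1, 5)]

print(invalid_status)
print(invalid_rating)
```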
5) Uniqueness
Are there duplicates?
- Duplicate rows
- Duplicate IDs
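Both kinds of duplication can be counted separately, since a repeated ID is not always a fully repeated row. A sketch:

```python
import pandas as pd

# One fully duplicated row and one repeated ID
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["a", "b", "b", "c"],
})

n_dup_rows = int(df.duplicated().sum())        # identical across all columns
n_dup_ids = int(df["id"].duplicated().sum())   # repeated key values only
deduped = df.drop_duplicates()

print("duplicate rows:", n_dup_rows)
print("duplicate ids:", n_dup_ids)
```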
Practical quality checks (Pandas)
Quick quality checks:

```python
# shape
print(df.shape)

# schema
print(df.dtypes)

# missing
print(df.isna().sum().sort_values(ascending=False).head(20))

# duplicates
print("Duplicate rows:", df.duplicated().sum())

# basic stats
print(df.describe(include="all"))
```

Typical preprocessing outputs
- Cleaned columns (trimmed text, fixed casing)
- Correct dtypes (numeric/date)
- Handled missing values
- Outlier strategy selected (remove/cap/keep)
- Encoded categories for modeling
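The outputs above can be sketched as a minimal pipeline (the column names, median-fill strategy, and one-hot encoding are all illustrative assumptions, not the only valid choices):

```python
import pandas as pd

# Illustrative raw data: messy text, wrong dtypes, a gap, a duplicate
df = pd.DataFrame({
    "city": [" delhi", "Delhi ", "mumbai", "Delhi "],
    "signup": ["2024-01-05", "2024-02-10", "not a date", "2024-02-10"],
    "amount": ["100", "250", None, "250"],
    "segment": ["a", "b", "a", "b"],
})

# Cleaned columns: trimmed text, fixed casing
df["city"] = df["city"].str.strip().str.title()

# Correct dtypes: coerce unparseable values to NaT/NaN instead of failing
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Handled missing values (median fill chosen here; document the choice)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Duplicates removed (normalization above made two rows identical)
df = df.drop_duplicates()

# Encoded categories for modeling (one-hot)
df = pd.get_dummies(df, columns=["segment"])

print(df.shape)
```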
Rule of thumb
Always document:
- What you changed
- Why you changed it
- What assumptions you made
If this helped you, consider buying me a coffee ☕