Understanding Data Quality

Why data quality matters

Poor data quality leads to:

  • Wrong decisions
  • Broken dashboards
  • Unreliable ML models
  • Loss of trust

Data preprocessing is the discipline of making data usable and trustworthy.

Key dimensions of data quality

1) Completeness

Are required values missing?

  • df.isna().sum()
  • Missing percentage per column

2) Accuracy

Do values match reality?

  • Negative ages
  • Impossible dates (future DOB)
  • Wrong units (₹ vs paise)
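The first two checks can be expressed as simple boolean filters; the data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, -3, 40],
    "dob": pd.to_datetime(["1999-01-01", "2090-05-05", "1984-07-20"]),
})

# Rows that fail basic reality checks
negative_ages = df[df["age"] < 0]
future_dob = df[df["dob"] > pd.Timestamp.now()]

print(len(negative_ages), "negative ages,", len(future_dob), "future DOBs")
```

Unit mismatches (₹ vs paise) are harder to detect automatically; a distribution check (e.g. values 100× larger than the rest) is usually the starting point.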

3) Consistency

Do values follow the same format across rows?

  • “delhi”, “Delhi”, “ DELHI ”
  • Mixed currencies
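Casing and whitespace inconsistencies are usually fixed by normalizing the text once, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({"city": ["delhi", "Delhi", " DELHI "]})

# Trim whitespace, then normalize casing: all three variants collapse to "Delhi"
df["city"] = df["city"].str.strip().str.title()
print(df["city"].nunique())
```

Mixed currencies need a conversion step instead; normalization only helps when the values are the same thing written differently.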

4) Validity

Do values fit the allowed set/range?

  • status should come from an allowed set of values (e.g. closed)
  • rating must be 1–5
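Both rules reduce to membership and range tests. In this sketch the allowed status set and the sample rows are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["open", "closed", "pending??"],
    "rating": [1, 5, 9],
})

# Assumption: illustrative allowed set; use your real domain values
ALLOWED_STATUS = {"open", "closed"}

bad_status = df[~df["status"].isin(ALLOWED_STATUS)]
bad_rating = df[~df["rating"].between(1, 5)]

print(len(bad_status), "invalid statuses,", len(bad_rating), "out-of-range ratings")
```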

5) Uniqueness

Are there duplicates?

  • Duplicate rows
  • Duplicate IDs
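These are two different checks: fully identical rows versus a repeated key. A small sketch (the `id` column is a hypothetical key):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["a", "b", "b", "c"],
})

# Rows where every column repeats an earlier row
dup_rows = df.duplicated().sum()

# Rows where just the key column repeats
dup_ids = df["id"].duplicated().sum()

print(dup_rows, "duplicate rows,", dup_ids, "duplicate IDs")
```

A table can have zero duplicate rows but still have duplicate IDs (same key, different values), which is usually the more dangerous case.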

Practical quality checks (Pandas)

Quick quality checks
# shape
print(df.shape)
 
# schema
print(df.dtypes)
 
# missing
print(df.isna().sum().sort_values(ascending=False).head(20))
 
# duplicates
print("Duplicate rows:", df.duplicated().sum())
 
# basic stats
print(df.describe(include="all"))

Typical preprocessing outputs

  • Cleaned columns (trimmed text, fixed casing)
  • Correct dtypes (numeric/date)
  • Handled missing values
  • Outlier strategy selected (remove/cap/keep)
  • Encoded categories for modeling
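The outputs above can be chained into one pass; this is a sketch under assumed column names and thresholds, not a one-size-fits-all pipeline:

```python
import pandas as pd
import numpy as np

# Hypothetical raw frame
df = pd.DataFrame({
    "city": [" delhi", "Delhi ", "MUMBAI"],
    "price": ["100", "250", "9999"],
    "rating": [4.0, np.nan, 5.0],
})

# 1) Clean text: trim whitespace, fix casing
df["city"] = df["city"].str.strip().str.title()

# 2) Fix dtypes: non-numeric strings become NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# 3) Handle missing values (median fill is one possible strategy)
df["rating"] = df["rating"].fillna(df["rating"].median())

# 4) Cap outliers at the 95th percentile (one possible strategy)
cap = df["price"].quantile(0.95)
df["price"] = df["price"].clip(upper=cap)

# 5) Encode categories for modeling
encoded = pd.get_dummies(df, columns=["city"])
```

Each numbered step matches one of the outputs listed above; the fill and capping strategies should be chosen per column, not applied blindly.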

Rule of thumb

Always document:

  • What you changed
  • Why you changed it
  • What assumptions you made

If this helped you, consider buying me a coffee ☕