Data Type Conversion and Validation

Why dtype conversion matters

Many datasets arrive with wrong dtypes:

  • numbers stored as strings
  • dates stored as strings
  • categories stored inconsistently

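If dtypes are wrong, your stats and charts can be wrong. For example, numbers stored as strings sort alphabetically ("10" comes before "9") instead of numerically.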
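To make the problem concrete, here is a minimal sketch using made-up values (the `raw` Series is hypothetical, and `dtype=object` is set explicitly so the behaviour does not depend on the pandas version's default string type). It compares sorting and summing before and after conversion.

```python
import pandas as pd

# Hypothetical data: numbers that arrived as strings.
raw = pd.Series(["9", "10", "100"], dtype=object)

# As strings, values sort alphabetically and "sum" concatenates them.
string_sorted = raw.sort_values().tolist()
string_sum = raw.sum()

# After conversion, sorting and summing behave numerically.
numeric = pd.to_numeric(raw)
numeric_sorted = numeric.sort_values().tolist()
numeric_sum = numeric.sum()

print(string_sorted, string_sum)    # ['10', '100', '9'] 910100
print(numeric_sorted, numeric_sum)  # [9, 10, 100] 119
```

Neither string result raises an error, which is why wrong dtypes tend to go unnoticed until a chart or total looks odd.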

Convert to numeric safely

to_numeric
import pandas as pd
 
df = pd.DataFrame({"amount": ["1,200", "500", "oops", " 700 "]})
 
df["amount"] = df["amount"].astype(str).str.replace(",", "", regex=False).str.strip()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
 
print(df)
print(df.isna().sum())
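With errors="coerce", bad values silently become NaN, so it can help to see which raw values were lost. The following is a small sketch on the same sample data, not the only way to do it. The names `cleaned`, `converted`, and `failed` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1,200", "500", "oops", " 700 "]})

# Clean into a separate Series so the raw text stays available for inspection.
cleaned = df["amount"].str.replace(",", "", regex=False).str.strip()
converted = pd.to_numeric(cleaned, errors="coerce")

# Rows that had a value before conversion but are NaN afterwards.
failed = df.loc[converted.isna() & df["amount"].notna(), "amount"]
print(failed.tolist())  # ['oops']
```

Reviewing `failed` before overwriting the column tells you whether the losses are genuine junk or a formatting case the cleaning step missed.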

Convert to datetime

to_datetime
import pandas as pd
 
df = pd.DataFrame({"date": ["2025-01-01", "2025/01/02", "invalid"]})
 
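# Recent pandas versions (2.0+) infer one format from the first value, so
# normalize the separators first; otherwise "2025/01/02" is coerced to NaT.
df["date"] = df["date"].str.replace("/", "-", regex=False)
df["date"] = pd.to_datetime(df["date"], errors="coerce")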
print(df)
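When the expected layout is known, passing an explicit `format` makes the parsing strict and predictable. This is a minimal sketch that assumes the dates should be `YYYY-MM-DD`. The sample values, including the impossible date, are invented for illustration.

```python
import pandas as pd

# Hypothetical input: one valid date, one impossible date, one junk value.
raw = pd.Series(["2025-01-01", "2025-02-30", "invalid"])

# Anything that does not match the format (or is not a real date) becomes NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Collect the original strings that failed to parse.
bad = raw[parsed.isna()]
print(bad.tolist())  # ['2025-02-30', 'invalid']
```

Listing the rejected strings shows whether they are true errors or a second date layout that needs its own handling.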

Categories

category dtype
import pandas as pd
 
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"]})
 
df["city"] = df["city"].astype("category")
print(df.dtypes)
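Converting to category does not fix inconsistent labels: "Pune" and " pune" would become two separate categories. A possible sketch, assuming that trimming whitespace and title-casing is the right normalization for your data (the messy values below are invented):

```python
import pandas as pd

# Hypothetical messy labels: stray spaces and mixed case.
df = pd.DataFrame({"city": ["Pune", " pune", "DELHI", "Delhi "]})

# Normalize the text first, then convert to category.
df["city"] = df["city"].str.strip().str.title().astype("category")

print(df["city"].cat.categories.tolist())  # ['Delhi', 'Pune']
```

Checking `cat.categories` after conversion is a quick way to spot labels that still differ only by spelling or spacing.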

Validate assumptions

Typical checks:

  • ID columns contain no duplicates
  • numeric columns are non-negative
  • date columns are within expected range
Validation examples
# no duplicate ids
# assert df["id"].is_unique
 
# non-negative values
# assert (df["amount"] >= 0).all()
 
# date range
# assert df["date"].min() >= pd.Timestamp("2020-01-01")
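The checks above can also be collected into one function that reports every problem instead of stopping at the first failed assert. This is a sketch under stated assumptions: the `validate` helper, the column names `id`, `amount`, and `date`, and the sample rows are all hypothetical.

```python
import pandas as pd

def validate(df, min_date):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not df["id"].is_unique:
        problems.append("duplicate ids")
    # dropna() so missing values (e.g. from errors="coerce") are not
    # reported as negative; NaN >= 0 evaluates to False.
    if not (df["amount"].dropna() >= 0).all():
        problems.append("negative amounts")
    if df["date"].min() < min_date:
        problems.append("dates before expected range")
    return problems

# Hypothetical data that breaks all three rules.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "amount": [10.0, -5.0, 3.0],
    "date": pd.to_datetime(["2021-03-01", "2019-12-31", "2022-07-15"]),
})

problems = validate(df, pd.Timestamp("2020-01-01"))
print(problems)
```

Returning a list rather than raising lets you log all issues in one pass. You can still fail fast with `assert not problems, problems`.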

Tip

Convert and validate early, right after loading the data. Bad values then surface at the source instead of as subtle bugs later in the analysis.
