Merging and Joining Data (merge, join, concat)
Why merges matter
Most real analytics uses multiple tables:
- Customers table
- Orders table
- Products table
You typically need to combine them to answer questions.
Example tables
Customers and orders
import pandas as pd
customers = pd.DataFrame({
"customer_id": [1, 2, 3],
"name": ["Asha", "Ravi", "Meera"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104],
"customer_id": [1, 2, 2, 4],
"amount": [250, 180, 90, 300],
})
print(customers)
print(orders)Customers and orders
import pandas as pd
customers = pd.DataFrame({
"customer_id": [1, 2, 3],
"name": ["Asha", "Ravi", "Meera"],
})
orders = pd.DataFrame({
"order_id": [101, 102, 103, 104],
"customer_id": [1, 2, 2, 4],
"amount": [250, 180, 90, 300],
})
print(customers)
print(orders)mergemerge: SQL-style joins
Inner join (only matching keys)
Inner join
inner = customers.merge(orders, on="customer_id", how="inner")
print(inner)Inner join
inner = customers.merge(orders, on="customer_id", how="inner")
print(inner)Left join (keep all left rows)
Left join
left = customers.merge(orders, on="customer_id", how="left")
print(left)Left join
left = customers.merge(orders, on="customer_id", how="left")
print(left)Outer join (keep everything)
Outer join
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)
print(outer)Outer join
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)
print(outer)indicator=Trueindicator=True adds a _merge_merge column so you can see where each row came from.
joinjoin: align by index
joinjoin is convenient when keys are indices.
join by index
customers_idx = customers.set_index("customer_id")
orders_idx = orders.set_index("customer_id")
joined = customers_idx.join(orders_idx, how="left")
print(joined)join by index
customers_idx = customers.set_index("customer_id")
orders_idx = orders.set_index("customer_id")
joined = customers_idx.join(orders_idx, how="left")
print(joined)concatconcat: stack DataFrames
Append rows (same columns)
concat rows
jan = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 200]})
feb = pd.DataFrame({"customer_id": [2, 3], "amount": [150, 50]})
all_orders = pd.concat([jan, feb], ignore_index=True)
print(all_orders)concat rows
jan = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 200]})
feb = pd.DataFrame({"customer_id": [2, 3], "amount": [150, 50]})
all_orders = pd.concat([jan, feb], ignore_index=True)
print(all_orders)Add columns side-by-side
concat columns
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [10, 20]})
combined = pd.concat([df1, df2], axis=1)
print(combined)concat columns
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [10, 20]})
combined = pd.concat([df1, df2], axis=1)
print(combined)Merge tips
- Use
validate=validate=to enforce expected relationships (e.g., one-to-many). - Watch for duplicate column names; use
suffixes=("_left", "_right")suffixes=("_left", "_right")when needed. - After a left join, missing matches become
NaNNaNβhandle them deliberately.
π§ͺ Try It Yourself
Exercise 1 β Create a DataFrame
Exercise 2 β Select a Column
Exercise 3 β Filter Rows
If this helped you, consider buying me a coffee β
Buy me a coffeeWas this page helpful?
Let us know how we did
