Skip to content

Merging and Joining Data (merge, join, concat)

Why merges matter

Most real analytics uses multiple tables:

  • Customers table
  • Orders table
  • Products table

You typically need to combine them to answer questions.

Example tables

Customers and orders
import pandas as pd
 
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meera"],
})
 
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 2, 4],
    "amount": [250, 180, 90, 300],
})
 
print(customers)
print(orders)
Customers and orders
import pandas as pd
 
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meera"],
})
 
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 2, 4],
    "amount": [250, 180, 90, 300],
})
 
print(customers)
print(orders)

mergemerge: SQL-style joins

Inner join (only matching keys)

Inner join
inner = customers.merge(orders, on="customer_id", how="inner")
print(inner)
Inner join
inner = customers.merge(orders, on="customer_id", how="inner")
print(inner)

Left join (keep all left rows)

Left join
left = customers.merge(orders, on="customer_id", how="left")
print(left)
Left join
left = customers.merge(orders, on="customer_id", how="left")
print(left)

Outer join (keep everything)

Outer join
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)
print(outer)
Outer join
outer = customers.merge(orders, on="customer_id", how="outer", indicator=True)
print(outer)

indicator=Trueindicator=True adds a _merge_merge column so you can see where each row came from.

joinjoin: align by index

joinjoin is convenient when keys are indices.

join by index
customers_idx = customers.set_index("customer_id")
orders_idx = orders.set_index("customer_id")
 
joined = customers_idx.join(orders_idx, how="left")
print(joined)
join by index
customers_idx = customers.set_index("customer_id")
orders_idx = orders.set_index("customer_id")
 
joined = customers_idx.join(orders_idx, how="left")
print(joined)

concatconcat: stack DataFrames

Append rows (same columns)

concat rows
jan = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 200]})
feb = pd.DataFrame({"customer_id": [2, 3], "amount": [150, 50]})
 
all_orders = pd.concat([jan, feb], ignore_index=True)
print(all_orders)
concat rows
jan = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 200]})
feb = pd.DataFrame({"customer_id": [2, 3], "amount": [150, 50]})
 
all_orders = pd.concat([jan, feb], ignore_index=True)
print(all_orders)

Add columns side-by-side

concat columns
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [10, 20]})
 
combined = pd.concat([df1, df2], axis=1)
print(combined)
concat columns
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [10, 20]})
 
combined = pd.concat([df1, df2], axis=1)
print(combined)

Merge tips

  • Use validate=validate= to enforce expected relationships (e.g., one-to-many).
  • Watch for duplicate column names; use suffixes=("_left", "_right")suffixes=("_left", "_right") when needed.
  • After a left join, missing matches become NaNNaNβ€”handle them deliberately.

πŸ§ͺ Try It Yourself

Exercise 1 – Create a DataFrame

Exercise 2 – Select a Column

Exercise 3 – Filter Rows

If this helped you, consider buying me a coffee β˜•

Buy me a coffee

Was this page helpful?

Let us know how we did