Merging two csv files

Questions

What is the best way to merge two csv files?

Objectives

Use Pandas to merge two csv files?

Merging two files with tabular data in each is one of the most common data processing tasks that occurs that at the start of a data science project. Below, we merge two files:

One with geochemistry data, which shows at different depths the concentration of specific minerals
One with lithology data, which provides annotation around which lithology group is observed at different depths at specific sites

# load in pandas 
import pandas as pd

#Load in the two files as dataframes
dfgeo = pd.read_csv("../userdata/Geochemistry.csv")
dflit = pd.read_csv("../userdata/Lithology.csv")

dfgeo

	CollarId	FromDepth	ToDepth	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM
0	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5
1	224202	4.5	4.6	3	93000	-9999	-9999	-9999	3.5
2	224202	4.6	5.0	2	94800	-9999	-9999	-9999	4.5
3	224202	5.0	6.0	3	93000	-9999	-9999	-9999	3.5
4	224202	6.0	7.0	2	94800	-9999	-9999	-9999	4.5
...	...	...	...	...	...	...	...	...	...
72	975622	37.0	38.0	2	111000	-9999	-9999	-9999	13.5
73	975622	38.0	39.0	2	111000	-9999	-9999	-9999	13.5
74	975622	39.0	40.0	2	111000	-9999	-9999	-9999	13.5
75	975622	40.0	41.0	2	111000	-9999	-9999	-9999	13.5
76	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5

77 rows × 9 columns

dflit

	CollarId	FromDepth	ToDepth	lithology_group
0	224202	3.0	3.1	sediment
1	224202	3.1	4.0	sediment
2	224202	4.0	10.0	clay
3	224202	10.0	15.0	basalt
4	224202	15.0	20.0	granite
5	224202	20.0	22.0	pegmatite
6	224202	22.0	35.0	basalt
7	224202	35.0	37.0	pegmatite
8	975622	0.0	5.0	sediment
9	975622	5.0	10.0	clay
10	975622	10.0	15.0	diorite
11	975622	15.0	30.0	basalt
12	975622	30.0	40.0	pegmatite
13	975622	40.0	41.0	quartz
14	975622	41.0	42.0	granite

The CollarId is the column that is common between the two datasets, so can be used as a so-called “join key”.

The FromDepth and ToDepth are also common, but if we merge using the default settings, only columns where both values perfectly match will be joined - this isn’t quite what we want:

dfbadmerge = pd.merge(dfgeo,dflit)
dfbadmerge

	CollarId	FromDepth	ToDepth	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	lithology_group
0	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	sediment
1	975622	40.0	41.0	2	111000	-9999	-9999	-9999	13.5	quartz
2	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	granite

The next step merges the two data frames, joining all columns based only on the CollarID. The _x and _y denote the values that were in the original columns FromDepth and ToDepth in dfgeo and dflit, respectively.

#Merge them based on CollarId
dfmerge=pd.merge(dfgeo,dflit,how="outer",on=["CollarId"])
dfmerge

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group
0	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	3.0	3.1	sediment
1	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	3.1	4.0	sediment
2	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	4.0	10.0	clay
3	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	10.0	15.0	basalt
4	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	15.0	20.0	granite
...	...	...	...	...	...	...	...	...	...	...	...	...
569	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	10.0	15.0	diorite
570	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	15.0	30.0	basalt
571	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite
572	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	40.0	41.0	quartz
573	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	41.0	42.0	granite

574 rows × 12 columns

Now drop any rows outside the From/To depth of the Lithology:

df = dfmerge[(dfmerge.FromDepth_x >= dfmerge.FromDepth_y) & (dfmerge.ToDepth_x <= dfmerge.ToDepth_y)].reset_index(drop=True)
df

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group
0	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	3.0	3.1	sediment
1	224202	4.5	4.6	3	93000	-9999	-9999	-9999	3.5	4.0	10.0	clay
2	224202	4.6	5.0	2	94800	-9999	-9999	-9999	4.5	4.0	10.0	clay
3	224202	5.0	6.0	3	93000	-9999	-9999	-9999	3.5	4.0	10.0	clay
4	224202	6.0	7.0	2	94800	-9999	-9999	-9999	4.5	4.0	10.0	clay
...	...	...	...	...	...	...	...	...	...	...	...	...
71	975622	37.0	38.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite
72	975622	38.0	39.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite
73	975622	39.0	40.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite
74	975622	40.0	41.0	2	111000	-9999	-9999	-9999	13.5	40.0	41.0	quartz
75	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	41.0	42.0	granite

76 rows × 12 columns

I would expect that to equal 77 (the original number of Geochemistry points)

Why is one missing? (TODO show how you found out which one was missing!)

#After some digging the missing value is line 50 from the dfgeo dataframe
dfgeo.loc[49:51]

	CollarId	FromDepth	ToDepth	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM
49	975622	14.2	14.8	2	111000	-9999	-9999	-9999	13.5
50	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5
51	975622	16.0	17.0	2	111000	-9999	-9999	-9999	13.5

#Compared with the final dataframe
df.loc[49:51]

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group
49	975622	14.2	14.8	2	111000	-9999	-9999	-9999	13.5	10.0	15.0	diorite
50	975622	16.0	17.0	2	111000	-9999	-9999	-9999	13.5	15.0	30.0	basalt
51	975622	17.0	18.0	2	111000	-9999	-9999	-9999	13.5	15.0	30.0	basalt

#Finding that row in the complete merged dataframe
dfmerge.loc[(dfmerge['FromDepth_x'] == 14.8) & (dfmerge['ToDepth_x'] == 16.0)]

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group
385	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	0.0	5.0	sediment
386	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	5.0	10.0	clay
387	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	10.0	15.0	diorite
388	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	15.0	30.0	basalt
389	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite
390	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	40.0	41.0	quartz
391	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	41.0	42.0	granite

We see that the ACTUAL lithology group should be diorite AND basalt. So based on our criteria it actually chose neither.

It will take a bit more hacking to deal with an edge cases like this… but I will leave that up to you.

Next, we do TODO FIXME.

dfmerge['depthInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_x, dfmerge.ToDepth_x,closed='left')
dfmerge['lithInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_y, dfmerge.ToDepth_y,closed='left')

And then we do TODO FIXME.

df = dfmerge[dfmerge.apply(lambda row: row['depthInterval'].overlaps(row['lithInterval']),axis=1)].reset_index(drop=True)
df

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group	depthInterval	lithInterval
0	224202	3.0	3.1	2	94800	-9999	-9999	-9999	4.5	3.0	3.1	sediment	[3.0, 3.1)	[3.0, 3.1)
1	224202	4.5	4.6	3	93000	-9999	-9999	-9999	3.5	4.0	10.0	clay	[4.5, 4.6)	[4.0, 10.0)
2	224202	4.6	5.0	2	94800	-9999	-9999	-9999	4.5	4.0	10.0	clay	[4.6, 5.0)	[4.0, 10.0)
3	224202	5.0	6.0	3	93000	-9999	-9999	-9999	3.5	4.0	10.0	clay	[5.0, 6.0)	[4.0, 10.0)
4	224202	6.0	7.0	2	94800	-9999	-9999	-9999	4.5	4.0	10.0	clay	[6.0, 7.0)	[4.0, 10.0)
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
73	975622	37.0	38.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite	[37.0, 38.0)	[30.0, 40.0)
74	975622	38.0	39.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite	[38.0, 39.0)	[30.0, 40.0)
75	975622	39.0	40.0	2	111000	-9999	-9999	-9999	13.5	30.0	40.0	pegmatite	[39.0, 40.0)	[30.0, 40.0)
76	975622	40.0	41.0	2	111000	-9999	-9999	-9999	13.5	40.0	41.0	quartz	[40.0, 41.0)	[40.0, 41.0)
77	975622	41.0	42.0	2	111000	-9999	-9999	-9999	13.5	41.0	42.0	granite	[41.0, 42.0)	[41.0, 42.0)

78 rows × 14 columns

df.loc[49:51]

	CollarId	FromDepth_x	ToDepth_x	Ag_PPM	Al_PPM	Al2O_PPM	Al2O3_PPM	AlO_PPM	As_PPM	FromDepth_y	ToDepth_y	lithology_group	depthInterval	lithInterval
49	975622	14.2	14.8	2	111000	-9999	-9999	-9999	13.5	10.0	15.0	diorite	[14.2, 14.8)	[10.0, 15.0)
50	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	10.0	15.0	diorite	[14.8, 16.0)	[10.0, 15.0)
51	975622	14.8	16.0	2	111000	-9999	-9999	-9999	13.5	15.0	30.0	basalt	[14.8, 16.0)	[15.0, 30.0)

Key points

Pandas default merge command will force all columns names that match to join on all of the common columns.
We demonstrated how to merge based on a human-sensible range.

All materials copyright Sydney Informatics Hub, University of Sydney