Merging two csv files

Questions

  • What is the best way to merge two csv files?

Objectives

  • Use Pandas to merge two csv files?

Merging two files with tabular data in each is one of the most common data processing tasks that occurs that at the start of a data science project. Below, we merge two files:

# load in pandas 
import pandas as pd

#Load in the two files as dataframes
dfgeo = pd.read_csv("../userdata/Geochemistry.csv")
dflit = pd.read_csv("../userdata/Lithology.csv")
dfgeo
CollarId FromDepth ToDepth Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM
0 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5
1 224202 4.5 4.6 3 93000 -9999 -9999 -9999 3.5
2 224202 4.6 5.0 2 94800 -9999 -9999 -9999 4.5
3 224202 5.0 6.0 3 93000 -9999 -9999 -9999 3.5
4 224202 6.0 7.0 2 94800 -9999 -9999 -9999 4.5
... ... ... ... ... ... ... ... ... ...
72 975622 37.0 38.0 2 111000 -9999 -9999 -9999 13.5
73 975622 38.0 39.0 2 111000 -9999 -9999 -9999 13.5
74 975622 39.0 40.0 2 111000 -9999 -9999 -9999 13.5
75 975622 40.0 41.0 2 111000 -9999 -9999 -9999 13.5
76 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5

77 rows × 9 columns

dflit
CollarId FromDepth ToDepth lithology_group
0 224202 3.0 3.1 sediment
1 224202 3.1 4.0 sediment
2 224202 4.0 10.0 clay
3 224202 10.0 15.0 basalt
4 224202 15.0 20.0 granite
5 224202 20.0 22.0 pegmatite
6 224202 22.0 35.0 basalt
7 224202 35.0 37.0 pegmatite
8 975622 0.0 5.0 sediment
9 975622 5.0 10.0 clay
10 975622 10.0 15.0 diorite
11 975622 15.0 30.0 basalt
12 975622 30.0 40.0 pegmatite
13 975622 40.0 41.0 quartz
14 975622 41.0 42.0 granite

The CollarId is the column that is common between the two datasets, so can be used as a so-called “join key”.

The FromDepth and ToDepth are also common, but if we merge using the default settings, only columns where both values perfectly match will be joined - this isn’t quite what we want:

dfbadmerge = pd.merge(dfgeo,dflit)
dfbadmerge
CollarId FromDepth ToDepth Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM lithology_group
0 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 sediment
1 975622 40.0 41.0 2 111000 -9999 -9999 -9999 13.5 quartz
2 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 granite

The next step merges the two data frames, joining all columns based only on the CollarID. The _x and _y denote the values that were in the original columns FromDepth and ToDepth in dfgeo and dflit, respectively.

#Merge them based on CollarId
dfmerge=pd.merge(dfgeo,dflit,how="outer",on=["CollarId"])
dfmerge
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group
0 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 3.0 3.1 sediment
1 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 3.1 4.0 sediment
2 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 4.0 10.0 clay
3 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 10.0 15.0 basalt
4 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 15.0 20.0 granite
... ... ... ... ... ... ... ... ... ... ... ... ...
569 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 10.0 15.0 diorite
570 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 15.0 30.0 basalt
571 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite
572 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 40.0 41.0 quartz
573 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 41.0 42.0 granite

574 rows × 12 columns

Now drop any rows outside the From/To depth of the Lithology:

df = dfmerge[(dfmerge.FromDepth_x >= dfmerge.FromDepth_y) & (dfmerge.ToDepth_x <= dfmerge.ToDepth_y)].reset_index(drop=True)
df
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group
0 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 3.0 3.1 sediment
1 224202 4.5 4.6 3 93000 -9999 -9999 -9999 3.5 4.0 10.0 clay
2 224202 4.6 5.0 2 94800 -9999 -9999 -9999 4.5 4.0 10.0 clay
3 224202 5.0 6.0 3 93000 -9999 -9999 -9999 3.5 4.0 10.0 clay
4 224202 6.0 7.0 2 94800 -9999 -9999 -9999 4.5 4.0 10.0 clay
... ... ... ... ... ... ... ... ... ... ... ... ...
71 975622 37.0 38.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite
72 975622 38.0 39.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite
73 975622 39.0 40.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite
74 975622 40.0 41.0 2 111000 -9999 -9999 -9999 13.5 40.0 41.0 quartz
75 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 41.0 42.0 granite

76 rows × 12 columns

I would expect that to equal 77 (the original number of Geochemistry points)

Why is one missing? (TODO show how you found out which one was missing!)

#After some digging the missing value is line 50 from the dfgeo dataframe
dfgeo.loc[49:51]
CollarId FromDepth ToDepth Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM
49 975622 14.2 14.8 2 111000 -9999 -9999 -9999 13.5
50 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5
51 975622 16.0 17.0 2 111000 -9999 -9999 -9999 13.5
#Compared with the final dataframe
df.loc[49:51]
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group
49 975622 14.2 14.8 2 111000 -9999 -9999 -9999 13.5 10.0 15.0 diorite
50 975622 16.0 17.0 2 111000 -9999 -9999 -9999 13.5 15.0 30.0 basalt
51 975622 17.0 18.0 2 111000 -9999 -9999 -9999 13.5 15.0 30.0 basalt
#Finding that row in the complete merged dataframe
dfmerge.loc[(dfmerge['FromDepth_x'] == 14.8) & (dfmerge['ToDepth_x'] == 16.0)]
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group
385 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 0.0 5.0 sediment
386 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 5.0 10.0 clay
387 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 10.0 15.0 diorite
388 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 15.0 30.0 basalt
389 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite
390 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 40.0 41.0 quartz
391 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 41.0 42.0 granite

We see that the ACTUAL lithology group should be diorite AND basalt. So based on our criteria it actually chose neither.

It will take a bit more hacking to deal with an edge cases like this… but I will leave that up to you.

Next, we do TODO FIXME.

dfmerge['depthInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_x, dfmerge.ToDepth_x,closed='left')
dfmerge['lithInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_y, dfmerge.ToDepth_y,closed='left')

And then we do TODO FIXME.

df = dfmerge[dfmerge.apply(lambda row: row['depthInterval'].overlaps(row['lithInterval']),axis=1)].reset_index(drop=True)
df
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group depthInterval lithInterval
0 224202 3.0 3.1 2 94800 -9999 -9999 -9999 4.5 3.0 3.1 sediment [3.0, 3.1) [3.0, 3.1)
1 224202 4.5 4.6 3 93000 -9999 -9999 -9999 3.5 4.0 10.0 clay [4.5, 4.6) [4.0, 10.0)
2 224202 4.6 5.0 2 94800 -9999 -9999 -9999 4.5 4.0 10.0 clay [4.6, 5.0) [4.0, 10.0)
3 224202 5.0 6.0 3 93000 -9999 -9999 -9999 3.5 4.0 10.0 clay [5.0, 6.0) [4.0, 10.0)
4 224202 6.0 7.0 2 94800 -9999 -9999 -9999 4.5 4.0 10.0 clay [6.0, 7.0) [4.0, 10.0)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 975622 37.0 38.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite [37.0, 38.0) [30.0, 40.0)
74 975622 38.0 39.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite [38.0, 39.0) [30.0, 40.0)
75 975622 39.0 40.0 2 111000 -9999 -9999 -9999 13.5 30.0 40.0 pegmatite [39.0, 40.0) [30.0, 40.0)
76 975622 40.0 41.0 2 111000 -9999 -9999 -9999 13.5 40.0 41.0 quartz [40.0, 41.0) [40.0, 41.0)
77 975622 41.0 42.0 2 111000 -9999 -9999 -9999 13.5 41.0 42.0 granite [41.0, 42.0) [41.0, 42.0)

78 rows × 14 columns

df.loc[49:51]
CollarId FromDepth_x ToDepth_x Ag_PPM Al_PPM Al2O_PPM Al2O3_PPM AlO_PPM As_PPM FromDepth_y ToDepth_y lithology_group depthInterval lithInterval
49 975622 14.2 14.8 2 111000 -9999 -9999 -9999 13.5 10.0 15.0 diorite [14.2, 14.8) [10.0, 15.0)
50 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 10.0 15.0 diorite [14.8, 16.0) [10.0, 15.0)
51 975622 14.8 16.0 2 111000 -9999 -9999 -9999 13.5 15.0 30.0 basalt [14.8, 16.0) [15.0, 30.0)

Key points

  • Pandas default merge command will force all columns names that match to join on all of the common columns.
  • We demonstrated how to merge based on a human-sensible range.

All materials copyright Sydney Informatics Hub, University of Sydney