# load in pandas
import pandas as pd
#Load in the two files as dataframes
= pd.read_csv("../userdata/Geochemistry.csv")
dfgeo = pd.read_csv("../userdata/Lithology.csv") dflit
Merging two csv files
Questions
- What is the best way to merge two csv files?
Objectives
- Use Pandas to merge two csv files?
Merging two files with tabular data in each is one of the most common data processing tasks that occurs that at the start of a data science project. Below, we merge two files:
- One with geochemistry data, which shows at different depths the concentration of specific minerals
- One with lithology data, which provides annotation around which lithology group is observed at different depths at specific sites
dfgeo
CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | |
---|---|---|---|---|---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 |
2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 |
4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
72 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
73 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
74 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
75 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
76 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
77 rows × 9 columns
dflit
CollarId | FromDepth | ToDepth | lithology_group | |
---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | sediment |
1 | 224202 | 3.1 | 4.0 | sediment |
2 | 224202 | 4.0 | 10.0 | clay |
3 | 224202 | 10.0 | 15.0 | basalt |
4 | 224202 | 15.0 | 20.0 | granite |
5 | 224202 | 20.0 | 22.0 | pegmatite |
6 | 224202 | 22.0 | 35.0 | basalt |
7 | 224202 | 35.0 | 37.0 | pegmatite |
8 | 975622 | 0.0 | 5.0 | sediment |
9 | 975622 | 5.0 | 10.0 | clay |
10 | 975622 | 10.0 | 15.0 | diorite |
11 | 975622 | 15.0 | 30.0 | basalt |
12 | 975622 | 30.0 | 40.0 | pegmatite |
13 | 975622 | 40.0 | 41.0 | quartz |
14 | 975622 | 41.0 | 42.0 | granite |
The CollarId
is the column that is common between the two datasets, so can be used as a so-called “join key”.
The FromDepth
and ToDepth
are also common, but if we merge using the default settings, only columns where both values perfectly match will be joined - this isn’t quite what we want:
= pd.merge(dfgeo,dflit)
dfbadmerge dfbadmerge
CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | lithology_group | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | sediment |
1 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | quartz |
2 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | granite |
The next step merges the two data frames, joining all columns based only on the CollarID
. The _x
and _y
denote the values that were in the original columns FromDepth
and ToDepth
in dfgeo
and dflit
, respectively.
#Merge them based on CollarId
=pd.merge(dfgeo,dflit,how="outer",on=["CollarId"])
dfmerge dfmerge
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment |
1 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.1 | 4.0 | sediment |
2 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
3 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 10.0 | 15.0 | basalt |
4 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 15.0 | 20.0 | granite |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
569 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
570 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
571 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
572 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
573 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
574 rows × 12 columns
Now drop any rows outside the From/To depth of the Lithology:
= dfmerge[(dfmerge.FromDepth_x >= dfmerge.FromDepth_y) & (dfmerge.ToDepth_x <= dfmerge.ToDepth_y)].reset_index(drop=True)
df df
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment |
1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay |
2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay |
4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
71 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
72 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
73 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
74 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
75 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
76 rows × 12 columns
I would expect that to equal 77 (the original number of Geochemistry points)
Why is one missing? (TODO show how you found out which one was missing!)
#After some digging the missing value is line 50 from the dfgeo dataframe
49:51] dfgeo.loc[
CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | |
---|---|---|---|---|---|---|---|---|---|
49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
50 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
51 | 975622 | 16.0 | 17.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
#Compared with the final dataframe
49:51] df.loc[
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
50 | 975622 | 16.0 | 17.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
51 | 975622 | 17.0 | 18.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
#Finding that row in the complete merged dataframe
'FromDepth_x'] == 14.8) & (dfmerge['ToDepth_x'] == 16.0)] dfmerge.loc[(dfmerge[
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
385 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 0.0 | 5.0 | sediment |
386 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 5.0 | 10.0 | clay |
387 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
388 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
389 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
390 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
391 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
We see that the ACTUAL lithology group should be diorite AND basalt. So based on our criteria it actually chose neither.
It will take a bit more hacking to deal with an edge cases like this… but I will leave that up to you.
Next, we do TODO FIXME.
'depthInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_x, dfmerge.ToDepth_x,closed='left')
dfmerge['lithInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_y, dfmerge.ToDepth_y,closed='left') dfmerge[
And then we do TODO FIXME.
= dfmerge[dfmerge.apply(lambda row: row['depthInterval'].overlaps(row['lithInterval']),axis=1)].reset_index(drop=True)
df df
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | depthInterval | lithInterval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment | [3.0, 3.1) | [3.0, 3.1) |
1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay | [4.5, 4.6) | [4.0, 10.0) |
2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay | [4.6, 5.0) | [4.0, 10.0) |
3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay | [5.0, 6.0) | [4.0, 10.0) |
4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay | [6.0, 7.0) | [4.0, 10.0) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
73 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [37.0, 38.0) | [30.0, 40.0) |
74 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [38.0, 39.0) | [30.0, 40.0) |
75 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [39.0, 40.0) | [30.0, 40.0) |
76 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz | [40.0, 41.0) | [40.0, 41.0) |
77 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite | [41.0, 42.0) | [41.0, 42.0) |
78 rows × 14 columns
49:51] df.loc[
CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | depthInterval | lithInterval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite | [14.2, 14.8) | [10.0, 15.0) |
50 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite | [14.8, 16.0) | [10.0, 15.0) |
51 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt | [14.8, 16.0) | [15.0, 30.0) |
Key points
- Pandas default merge command will force all columns names that match to join on all of the common columns.
- We demonstrated how to merge based on a human-sensible range.
All materials copyright Sydney Informatics Hub, University of Sydney