# load in pandas
import pandas as pd
#Load in the two files as dataframes
dfgeo = pd.read_csv("../userdata/Geochemistry.csv")
dflit = pd.read_csv("../userdata/Lithology.csv")Merging two csv files
Questions
- What is the best way to merge two csv files?
Objectives
- Use Pandas to merge two csv files?
Merging two files with tabular data in each is one of the most common data processing tasks that occurs that at the start of a data science project. Below, we merge two files:
- One with geochemistry data, which shows at different depths the concentration of specific minerals
- One with lithology data, which provides annotation around which lithology group is observed at different depths at specific sites
dfgeo| CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
| 1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 |
| 2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
| 3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 |
| 4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 73 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 74 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 75 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 76 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
77 rows × 9 columns
dflit| CollarId | FromDepth | ToDepth | lithology_group | |
|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | sediment |
| 1 | 224202 | 3.1 | 4.0 | sediment |
| 2 | 224202 | 4.0 | 10.0 | clay |
| 3 | 224202 | 10.0 | 15.0 | basalt |
| 4 | 224202 | 15.0 | 20.0 | granite |
| 5 | 224202 | 20.0 | 22.0 | pegmatite |
| 6 | 224202 | 22.0 | 35.0 | basalt |
| 7 | 224202 | 35.0 | 37.0 | pegmatite |
| 8 | 975622 | 0.0 | 5.0 | sediment |
| 9 | 975622 | 5.0 | 10.0 | clay |
| 10 | 975622 | 10.0 | 15.0 | diorite |
| 11 | 975622 | 15.0 | 30.0 | basalt |
| 12 | 975622 | 30.0 | 40.0 | pegmatite |
| 13 | 975622 | 40.0 | 41.0 | quartz |
| 14 | 975622 | 41.0 | 42.0 | granite |
The CollarId is the column that is common between the two datasets, so can be used as a so-called “join key”.
The FromDepth and ToDepth are also common, but if we merge using the default settings, only columns where both values perfectly match will be joined - this isn’t quite what we want:
dfbadmerge = pd.merge(dfgeo,dflit)
dfbadmerge| CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | lithology_group | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | sediment |
| 1 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | quartz |
| 2 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | granite |
The next step merges the two data frames, joining all columns based only on the CollarID. The _x and _y denote the values that were in the original columns FromDepth and ToDepth in dfgeo and dflit, respectively.
#Merge them based on CollarId
dfmerge=pd.merge(dfgeo,dflit,how="outer",on=["CollarId"])
dfmerge| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment |
| 1 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.1 | 4.0 | sediment |
| 2 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
| 3 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 10.0 | 15.0 | basalt |
| 4 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 15.0 | 20.0 | granite |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 569 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
| 570 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
| 571 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
| 572 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
| 573 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
574 rows × 12 columns
Now drop any rows outside the From/To depth of the Lithology:
df = dfmerge[(dfmerge.FromDepth_x >= dfmerge.FromDepth_y) & (dfmerge.ToDepth_x <= dfmerge.ToDepth_y)].reset_index(drop=True)
df| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment |
| 1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay |
| 2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
| 3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay |
| 4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
| 72 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
| 73 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
| 74 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
| 75 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
76 rows × 12 columns
I would expect that to equal 77 (the original number of Geochemistry points)
Why is one missing? (TODO show how you found out which one was missing!)
#After some digging the missing value is line 50 from the dfgeo dataframe
dfgeo.loc[49:51]| CollarId | FromDepth | ToDepth | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | |
|---|---|---|---|---|---|---|---|---|---|
| 49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 50 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
| 51 | 975622 | 16.0 | 17.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 |
#Compared with the final dataframe
df.loc[49:51]| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
| 50 | 975622 | 16.0 | 17.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
| 51 | 975622 | 17.0 | 18.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
#Finding that row in the complete merged dataframe
dfmerge.loc[(dfmerge['FromDepth_x'] == 14.8) & (dfmerge['ToDepth_x'] == 16.0)]| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 0.0 | 5.0 | sediment |
| 386 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 5.0 | 10.0 | clay |
| 387 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite |
| 388 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt |
| 389 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite |
| 390 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz |
| 391 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite |
We see that the ACTUAL lithology group should be diorite AND basalt. So based on our criteria it actually chose neither.
It will take a bit more hacking to deal with an edge cases like this… but I will leave that up to you.
Next, we do TODO FIXME.
dfmerge['depthInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_x, dfmerge.ToDepth_x,closed='left')
dfmerge['lithInterval'] = pd.arrays.IntervalArray.from_arrays(dfmerge.FromDepth_y, dfmerge.ToDepth_y,closed='left')And then we do TODO FIXME.
df = dfmerge[dfmerge.apply(lambda row: row['depthInterval'].overlaps(row['lithInterval']),axis=1)].reset_index(drop=True)
df| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | depthInterval | lithInterval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 224202 | 3.0 | 3.1 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 3.0 | 3.1 | sediment | [3.0, 3.1) | [3.0, 3.1) |
| 1 | 224202 | 4.5 | 4.6 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay | [4.5, 4.6) | [4.0, 10.0) |
| 2 | 224202 | 4.6 | 5.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay | [4.6, 5.0) | [4.0, 10.0) |
| 3 | 224202 | 5.0 | 6.0 | 3 | 93000 | -9999 | -9999 | -9999 | 3.5 | 4.0 | 10.0 | clay | [5.0, 6.0) | [4.0, 10.0) |
| 4 | 224202 | 6.0 | 7.0 | 2 | 94800 | -9999 | -9999 | -9999 | 4.5 | 4.0 | 10.0 | clay | [6.0, 7.0) | [4.0, 10.0) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73 | 975622 | 37.0 | 38.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [37.0, 38.0) | [30.0, 40.0) |
| 74 | 975622 | 38.0 | 39.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [38.0, 39.0) | [30.0, 40.0) |
| 75 | 975622 | 39.0 | 40.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 30.0 | 40.0 | pegmatite | [39.0, 40.0) | [30.0, 40.0) |
| 76 | 975622 | 40.0 | 41.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 40.0 | 41.0 | quartz | [40.0, 41.0) | [40.0, 41.0) |
| 77 | 975622 | 41.0 | 42.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 41.0 | 42.0 | granite | [41.0, 42.0) | [41.0, 42.0) |
78 rows × 14 columns
df.loc[49:51]| CollarId | FromDepth_x | ToDepth_x | Ag_PPM | Al_PPM | Al2O_PPM | Al2O3_PPM | AlO_PPM | As_PPM | FromDepth_y | ToDepth_y | lithology_group | depthInterval | lithInterval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | 975622 | 14.2 | 14.8 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite | [14.2, 14.8) | [10.0, 15.0) |
| 50 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 10.0 | 15.0 | diorite | [14.8, 16.0) | [10.0, 15.0) |
| 51 | 975622 | 14.8 | 16.0 | 2 | 111000 | -9999 | -9999 | -9999 | 13.5 | 15.0 | 30.0 | basalt | [14.8, 16.0) | [15.0, 30.0) |
Key points
- Pandas default merge command will force all columns names that match to join on all of the common columns.
- We demonstrated how to merge based on a human-sensible range.
All materials copyright Sydney Informatics Hub, University of Sydney