Note: in the time it took me to figure out this code, I could have manually transcribed about 50 of these tables, I reckon! Just because you can does not mean you should.
There seem to be a few approaches to reading PDFs with Python. If the PDF is already searchable and you just want to transcribe it, then this notebook, using the tabula-py library, seems like a good method.
If your PDF is just a plain image, a more versatile approach is to run OCR on your document or to convert it to an image first. Adjust to your needs, but these workflows and libraries may be helpful: - https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052, or - https://pypi.org/project/ocrmypdf/
Note that this requires Java! To install on a Mac via Homebrew, follow the instructions here.
#https://pypi.org/project/tabula-py/
import tabula

# Read pdf into a list of DataFrames
dfs = tabula.read_pdf("../userdata//G32716A3.pdf", pages='all', pandas_options={"header": None})

# Read a remote pdf into a list of DataFrames
#dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# Convert a PDF into a CSV file
#tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# Convert all PDFs in a directory
#tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
Got stderr: Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug 30, 2022 3:04:08 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 948 fonts
#What format has the load returned?
type(dfs)
list
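Since tabula-py returns one DataFrame per table it detects, it is worth checking how many came back before picking one to clean. A minimal sketch of that inspection, using a stand-in list of DataFrames (invented data) rather than a real PDF:

```python
import pandas as pd

# Stand-in for what tabula.read_pdf returns: a list with one
# DataFrame per table detected in the PDF
dfs = [
    pd.DataFrame({0: [214830.0, 53.0], 1: [214831.0, 53.0]}),
    pd.DataFrame({0: [214833.0], 1: [53.0]}),
]

print(len(dfs))    # how many tables were detected
df = dfs[0]        # pick the table to clean up
print(df.shape)    # rows x columns of that table
```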
#Import pandas to do some table manipulation
import pandas as pd
# Take the first extracted table and name its columns (this step was implicit
# in the original notebook; adjust the index and names to suit your table)
df = dfs[0]
df.columns = ["SampleID", "Project", "Season", "OrigGeo", "Lithology",
              "CoreName", "CoreDepth", "Geochronology"]

# Fill "forward" all the appropriate groups
df[["SampleID", "Project", "OrigGeo"]] = df[["SampleID", "Project", "OrigGeo"]].ffill()

# Group by the unique sample id...
# ...then fill all the nan values in that group
df['Season'] = df.groupby('SampleID').Season.transform('first')
df['CoreName'] = df.groupby('SampleID').CoreName.transform('first')
df['Geochronology'] = df.groupby('SampleID').Geochronology.transform('first')

# ...then combine strings if the group has multiple lines of text,
# noting what we want to pad each bit of text with
df['Lithology'] = df.groupby(['SampleID'])['Lithology'].transform(lambda x: ' '.join(x.dropna()))
df['CoreDepth'] = df.groupby(['SampleID'])['CoreDepth'].transform(lambda x: ''.join(x.dropna()))
# Drop all the repeated lines to get the final table
df = df.drop_duplicates(keep='first')
df
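The fill-then-deduplicate steps above can be checked on a toy table (column names from the notebook, data invented for illustration): the grouping key is forward-filled, group-level values are broadcast with `transform('first')`, multi-line text is joined, and the now-identical rows collapse under `drop_duplicates`.

```python
import pandas as pd

# Toy version of the raw tabula output: each logical sample is
# split across several physical rows of the PDF table
df = pd.DataFrame({
    "SampleID":  [214830.0, None, 214831.0, None],
    "Season":    [None, 2014.0, None, None],
    "Lithology": ["Quartz-garnet", "gneiss (PJO)", "Sandstone", None],
})

# Forward-fill the grouping key so every row knows its sample
df["SampleID"] = df["SampleID"].ffill()

# Broadcast the first non-null value to every row in the group
df["Season"] = df.groupby("SampleID").Season.transform("first")

# Join the text fragments of each group into one string
df["Lithology"] = df.groupby("SampleID")["Lithology"].transform(
    lambda x: " ".join(x.dropna()))

# The rows of each group are now identical, so deduplicate
df = df.drop_duplicates(keep="first")
print(df)  # two rows, one per sample
```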
    SampleID  Project  Season   OrigGeo                                   Lithology  CoreName       CoreDepth  Geochronology
0   214830.0     53.0     NaN  Markwitz                  Quartz-garnet gneiss (PJO)      None  1246.75-1246.6            Yes
3   214831.0     53.0     NaN  Markwitz  Cordierite-sillimanite-garnet gneiss (PJO)      None   1244.5-1244.3             No
6   214832.0     53.0     NaN  Markwitz  Cordierite-sillimanite-garnet gneiss (PJO)      None   1241.4-1241.3             No
9   214833.0     53.0  2014.0  Markwitz           Sandstone – Tumblagooda Sandstone   Wendy-1  1209.72-1209.0            Yes
13  214834.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   1134.4-1133.9            Yes
16  214835.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   1055.2-1054.9             No
19  214836.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   932.15-931.75            Yes
22  214837.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None       915.2-915            Yes
25  214839.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   1082.0-1082.3            Yes
28  214840.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None     1072-1071.7            Yes
31  214841.0     53.0  2015.0  Markwitz           Sandstone – Tumblagooda Sandstone  Coburn 1   1065.8-1066.0            Yes
35  214842.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   1048.5-1048.8            Yes
38  214843.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None   1014.7-1015.0            Yes
41  214844.0     53.0     NaN  Markwitz           Sandstone – Tumblagooda Sandstone      None     985.8-986.1            Yes
#df.to_csv()
All materials copyright Sydney Informatics Hub, University of Sydney