Reading PDF files into Python

Questions

How do I read PDF files in Python?

Objectives

Use tabula-py to work with PDF files in Python

Note: in the time it took me to figure out this code, I reckon I could have manually transcribed about 50 of these tables! Just because you can automate something does not mean you should.

There are a few approaches to reading PDFs with Python. If the PDF is already searchable (that is, it contains a text layer) and you just want to extract its tables, then the tabula-py library used in this notebook is a good method.

If your PDF is just a plain image, a more versatile approach is to run OCR on the document, or to convert it to an image first and then run OCR on that. Adjust to your needs, but these workflows and libraries may be helpful:
- https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052, or
- https://pypi.org/project/ocrmypdf/

Note that tabula-py requires Java! To install on a Mac via Homebrew, follow the instructions here.
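Before installing, it can save some confusion to check whether a Java runtime is visible at all. The snippet below is a minimal sketch using only the standard library; the helper name `java_available` is made up for illustration. It only checks that a `java` executable is on the PATH, not that it actually runs (on some Macs a placeholder `java` command exists even when no runtime is installed).

```python
import shutil

def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None

if java_available():
    print("Found java at:", shutil.which("java"))
else:
    print("Java not found on PATH; tabula-py will not be able to run.")
```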

!pip install tabula-py

Collecting tabula-py
  Downloading tabula_py-2.5.1-py3-none-any.whl (12.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 87.4 MB/s eta 0:00:00
Collecting distro
  Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Requirement already satisfied: pandas>=0.25.3 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from tabula-py) (1.4.3)
Requirement already satisfied: numpy in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from tabula-py) (1.23.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from pandas>=0.25.3->tabula-py) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from pandas>=0.25.3->tabula-py) (2022.1)
Requirement already satisfied: six>=1.5 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas>=0.25.3->tabula-py) (1.16.0)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.5.1

#https://pypi.org/project/tabula-py/

import tabula

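# Read the PDF into a list of DataFrames (one per detected table).
# header=None stops pandas treating the first row of each table as column names.
dfs = tabula.read_pdf("../userdata/G32716A3.pdf", pages='all', pandas_options={"header":None})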

# Read a remote PDF into a list of DataFrames
#dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
#tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
#tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

Got stderr: Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug 30, 2022 3:04:08 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 948 fonts

#What format has the load returned?
type(dfs)

list
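Since the result is a plain Python list of DataFrames, it is worth looking at how many tables were found and how big each one is before picking one to work with. The sketch below uses two small made-up DataFrames as a stand-in for `dfs`, so it runs without Java or a PDF; `summarise_tables` is a hypothetical helper, not part of tabula-py.

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf (made-up data)
tables = [
    pd.DataFrame([[1, "a"], [2, "b"]]),
    pd.DataFrame([[3, "c", 0.5]]),
]

def summarise_tables(tables):
    """Return (index, n_rows, n_columns) for each DataFrame in the list."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]

for i, n_rows, n_cols in summarise_tables(tables):
    print(f"table {i}: {n_rows} rows x {n_cols} columns")
```

With the real data you would pass `dfs` instead of `tables`.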

#Import pandas to do some table manipulation
import pandas as pd

colnames = ["SampleID", "Project", "Season", "OrigGeo", "Lithology", "CoreName", "CoreDepth", "Geochronology"]
df = dfs[0]
df.columns=colnames
df

	SampleID	Project	Season	OrigGeo	Lithology	CoreName	CoreDepth	Geochronology
0	214830.0	53.0	NaN	Markwitz	Quartz-garnet gneiss (PJO)	NaN	1246.75-	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
2	NaN	NaN	NaN	NaN	NaN	NaN	1246.6	NaN
3	214831.0	53.0	NaN	Markwitz	Cordierite-sillimanite-garnet	NaN	1244.5-	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	No
5	NaN	NaN	NaN	NaN	gneiss (PJO)	NaN	1244.3	NaN
6	214832.0	53.0	NaN	Markwitz	Cordierite-sillimanite-garnet	NaN	1241.4-	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	No
8	NaN	NaN	NaN	NaN	gneiss (PJO)	NaN	1241.3	NaN
9	214833.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1209.72-	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
11	NaN	NaN	NaN	NaN	Sandstone	NaN	1209.0	NaN
12	NaN	NaN	2014.0	NaN	NaN	Wendy-1	NaN	NaN
13	214834.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1134.4-	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
15	NaN	NaN	NaN	NaN	Sandstone	NaN	1133.9	NaN
16	214835.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1055.2-	NaN
17	NaN	NaN	NaN	NaN	NaN	NaN	NaN	No
18	NaN	NaN	NaN	NaN	Sandstone	NaN	1054.9	NaN
19	214836.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	932.15-	NaN
20	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
21	NaN	NaN	NaN	NaN	Sandstone	NaN	931.75	NaN
22	214837.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	NaN	NaN
23	NaN	NaN	NaN	NaN	NaN	NaN	915.2-915	Yes
24	NaN	NaN	NaN	NaN	Sandstone	NaN	NaN	NaN
25	214839.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1082.0-	NaN
26	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
27	NaN	NaN	NaN	NaN	Sandstone	NaN	1082.3	NaN
28	214840.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	NaN	NaN
29	NaN	NaN	NaN	NaN	NaN	NaN	1072-1071.7	Yes
30	NaN	NaN	NaN	NaN	Sandstone	NaN	NaN	NaN
31	214841.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1065.8-	NaN
32	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
33	NaN	NaN	NaN	NaN	Sandstone	NaN	1066.0	NaN
34	NaN	NaN	2015.0	NaN	NaN	Coburn 1	NaN	NaN
35	214842.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1048.5-	NaN
36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
37	NaN	NaN	NaN	NaN	Sandstone	NaN	1048.8	NaN
38	214843.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	1014.7-	NaN
39	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes
40	NaN	NaN	NaN	NaN	Sandstone	NaN	1015.0	NaN
41	214844.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda	NaN	NaN	NaN
42	NaN	NaN	NaN	NaN	NaN	NaN	985.8-986.1	Yes
43	NaN	NaN	NaN	NaN	Sandstone	NaN	NaN	NaN

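# Forward-fill the columns that identify each sample, so every continuation row carries its sample's ID
# (.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas versions)
df[["SampleID","Project","OrigGeo"]] = df[["SampleID","Project","OrigGeo"]].ffill()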

#Group by the unique sample id...

#...then fill each group with its first non-null value (transform('first') skips NaN)
df['Season'] = df.groupby('SampleID').Season.transform('first')
df['CoreName'] = df.groupby('SampleID').CoreName.transform('first')
df['Geochronology'] = df.groupby('SampleID').Geochronology.transform('first')

#...then combine the strings if the group has multiple lines of text. Note the separator: a space for Lithology, nothing for CoreDepth
df['Lithology'] = df.groupby(['SampleID'])['Lithology'].transform(lambda x: ' '.join(x.dropna()))
df['CoreDepth'] = df.groupby(['SampleID'])['CoreDepth'].transform(lambda x: ''.join(x.dropna()))

#Drop all the repeated lines to get the final table
df = df.drop_duplicates(keep='first')
df

	SampleID	Project	Season	OrigGeo	Lithology	CoreName	CoreDepth	Geochronology
0	214830.0	53.0	NaN	Markwitz	Quartz-garnet gneiss (PJO)	None	1246.75-1246.6	Yes
3	214831.0	53.0	NaN	Markwitz	Cordierite-sillimanite-garnet gneiss (PJO)	None	1244.5-1244.3	No
6	214832.0	53.0	NaN	Markwitz	Cordierite-sillimanite-garnet gneiss (PJO)	None	1241.4-1241.3	No
9	214833.0	53.0	2014.0	Markwitz	Sandstone – Tumblagooda Sandstone	Wendy-1	1209.72-1209.0	Yes
13	214834.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1134.4-1133.9	Yes
16	214835.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1055.2-1054.9	No
19	214836.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	932.15-931.75	Yes
22	214837.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	915.2-915	Yes
25	214839.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1082.0-1082.3	Yes
28	214840.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1072-1071.7	Yes
31	214841.0	53.0	2015.0	Markwitz	Sandstone – Tumblagooda Sandstone	Coburn 1	1065.8-1066.0	Yes
35	214842.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1048.5-1048.8	Yes
38	214843.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	1014.7-1015.0	Yes
41	214844.0	53.0	NaN	Markwitz	Sandstone – Tumblagooda Sandstone	None	985.8-986.1	Yes
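The same fill-then-collapse pattern can be seen more easily on a tiny table. The sketch below uses made-up values laid out like the PDF output above (one logical record spread over several rows), and wraps the steps in a hypothetical `collapse_rows` function; it is an illustration of the idea rather than a general-purpose cleaner.

```python
import numpy as np
import pandas as pd

# Made-up data: each sample is split across several rows, as in the PDF extract
raw = pd.DataFrame({
    "SampleID": [1.0, np.nan, np.nan, 2.0, np.nan],
    "Lithology": ["Quartz", np.nan, "gneiss", "Sandstone", np.nan],
    "CoreDepth": ["10.5-", np.nan, "10.2", np.nan, "8-7.5"],
    "Geochronology": [np.nan, "Yes", np.nan, np.nan, "No"],
})

def collapse_rows(frame):
    """Merge the continuation rows of each sample into a single row."""
    out = frame.copy()
    # 1. Carry the sample ID down onto its continuation rows
    out["SampleID"] = out["SampleID"].ffill()
    # 2. Single-valued column: take the first non-null value per sample
    out["Geochronology"] = out.groupby("SampleID")["Geochronology"].transform("first")
    # 3. Multi-line text columns: join the pieces (space for words, nothing for depths)
    out["Lithology"] = out.groupby("SampleID")["Lithology"].transform(
        lambda s: " ".join(s.dropna()))
    out["CoreDepth"] = out.groupby("SampleID")["CoreDepth"].transform(
        lambda s: "".join(s.dropna()))
    # 4. Every row of a sample is now identical, so keep one of each
    return out.drop_duplicates(keep="first").reset_index(drop=True)

tidy = collapse_rows(raw)
print(tidy)
```

Note that this relies on every row of a sample ending up identical after steps 1 to 3; any column left unfilled would stop `drop_duplicates` from collapsing the rows.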

#df.to_csv()
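To save the cleaned table, `to_csv` is usually called with `index=False` so the leftover row numbers (0, 3, 6, ...) are not written out. A minimal sketch using two rows from the table above; the file name `samples.csv` is just a placeholder.

```python
import io
import pandas as pd

table = pd.DataFrame({"SampleID": [214830, 214831], "Geochronology": ["Yes", "No"]})

# With no path given, to_csv returns the CSV text as a string
csv_text = table.to_csv(index=False)
print(csv_text)

# Writing to disk works the same way ("samples.csv" is a placeholder name)
# table.to_csv("samples.csv", index=False)

# Reading the text back is a quick check that nothing was lost
round_trip = pd.read_csv(io.StringIO(csv_text))
```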

All materials copyright Sydney Informatics Hub, University of Sydney