Reading PDF files into Python

Questions

  • How do I read PDF files in Python?

Objectives

  • Use tabula-py to work with PDF files in Python

Note: in the time it took me to figure out this code, I could have manually transcribed about 50 of these tables I reckon! Just because you can does not mean you should.

There seem to be a few approaches to reading PDFs with Python. If the PDF is already searchable and you just want to transcribe it, then this notebook using the tabula-py library seems like a good method.

If your PDF is just a scanned image, a more versatile approach is to run OCR on the document or to convert it to an image first. Adjust to your needs, but these workflows and libraries may be helpful:

  • https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
  • https://pypi.org/project/ocrmypdf/

Note that tabula-py requires Java! To install on a Mac via Homebrew, follow the instructions here.
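Because tabula-py is a wrapper around the Java tool tabula-java, it can be worth confirming that a Java runtime is visible before installing anything else. The check below is only a sketch using the Python standard library; the helper name `find_java` is made up for this example and is not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("No `java` executable found on PATH; install Java before using tabula-py.")
else:
    # `java -version` usually reports on stderr, so capture both streams.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(result.stderr or result.stdout)
```

If this prints a version string, the `read_pdf` call later in the notebook should at least be able to start Java.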

!pip install tabula-py
Collecting tabula-py
  Downloading tabula_py-2.5.1-py3-none-any.whl (12.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 87.4 MB/s eta 0:00:00
Collecting distro
  Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Requirement already satisfied: pandas>=0.25.3 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from tabula-py) (1.4.3)
Requirement already satisfied: numpy in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from tabula-py) (1.23.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from pandas>=0.25.3->tabula-py) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from pandas>=0.25.3->tabula-py) (2022.1)
Requirement already satisfied: six>=1.5 in /Users/darya/miniconda3/envs/geopy/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas>=0.25.3->tabula-py) (1.16.0)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.5.1
#https://pypi.org/project/tabula-py/

import tabula

# Read every page of the PDF into a list of DataFrames (no header row in the table)
dfs = tabula.read_pdf("../userdata/G32716A3.pdf", pages='all', pandas_options={"header": None})

# Read remote pdf into list of DataFrame
#dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
#tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
#tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
Got stderr: Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug 30, 2022 3:04:07 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug 30, 2022 3:04:08 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 948 fonts
#What format has the load returned?
type(dfs)
list
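Since the result is a plain Python list of DataFrames, it helps to look at the size of each entry before picking one to work with. The sketch below uses a made-up stand-in list called `tables` (not the real `dfs`) and a hypothetical helper, `summarise_tables`, so that it runs without a PDF or Java.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by tabula.read_pdf(...)
tables = [
    pd.DataFrame({0: [214830.0, None], 1: ["Markwitz", None]}),
    pd.DataFrame({0: ["a", "b", "c"]}),
]


def summarise_tables(tables):
    """Return (position, n_rows, n_cols) for each DataFrame in the list."""
    return [(i, *table.shape) for i, table in enumerate(tables)]


for i, n_rows, n_cols in summarise_tables(tables):
    print(f"table {i}: {n_rows} rows x {n_cols} columns")
```

With the real data you would pass `dfs` instead of `tables`, then index the table you want, as the notebook does with `dfs[0]`.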
#Import pandas to do some table manipulation
import pandas as pd
colnames = ["SampleID", "Project", "Season", "OrigGeo", "Lithology", "CoreName", "CoreDepth", "Geochronology"]
df = dfs[0]
df.columns=colnames
df
SampleID Project Season OrigGeo Lithology CoreName CoreDepth Geochronology
0 214830.0 53.0 NaN Markwitz Quartz-garnet gneiss (PJO) NaN 1246.75- NaN
1 NaN NaN NaN NaN NaN NaN NaN Yes
2 NaN NaN NaN NaN NaN NaN 1246.6 NaN
3 214831.0 53.0 NaN Markwitz Cordierite-sillimanite-garnet NaN 1244.5- NaN
4 NaN NaN NaN NaN NaN NaN NaN No
5 NaN NaN NaN NaN gneiss (PJO) NaN 1244.3 NaN
6 214832.0 53.0 NaN Markwitz Cordierite-sillimanite-garnet NaN 1241.4- NaN
7 NaN NaN NaN NaN NaN NaN NaN No
8 NaN NaN NaN NaN gneiss (PJO) NaN 1241.3 NaN
9 214833.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1209.72- NaN
10 NaN NaN NaN NaN NaN NaN NaN Yes
11 NaN NaN NaN NaN Sandstone NaN 1209.0 NaN
12 NaN NaN 2014.0 NaN NaN Wendy-1 NaN NaN
13 214834.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1134.4- NaN
14 NaN NaN NaN NaN NaN NaN NaN Yes
15 NaN NaN NaN NaN Sandstone NaN 1133.9 NaN
16 214835.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1055.2- NaN
17 NaN NaN NaN NaN NaN NaN NaN No
18 NaN NaN NaN NaN Sandstone NaN 1054.9 NaN
19 214836.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 932.15- NaN
20 NaN NaN NaN NaN NaN NaN NaN Yes
21 NaN NaN NaN NaN Sandstone NaN 931.75 NaN
22 214837.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN NaN NaN
23 NaN NaN NaN NaN NaN NaN 915.2-915 Yes
24 NaN NaN NaN NaN Sandstone NaN NaN NaN
25 214839.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1082.0- NaN
26 NaN NaN NaN NaN NaN NaN NaN Yes
27 NaN NaN NaN NaN Sandstone NaN 1082.3 NaN
28 214840.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN 1072-1071.7 Yes
30 NaN NaN NaN NaN Sandstone NaN NaN NaN
31 214841.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1065.8- NaN
32 NaN NaN NaN NaN NaN NaN NaN Yes
33 NaN NaN NaN NaN Sandstone NaN 1066.0 NaN
34 NaN NaN 2015.0 NaN NaN Coburn 1 NaN NaN
35 214842.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1048.5- NaN
36 NaN NaN NaN NaN NaN NaN NaN Yes
37 NaN NaN NaN NaN Sandstone NaN 1048.8 NaN
38 214843.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN 1014.7- NaN
39 NaN NaN NaN NaN NaN NaN NaN Yes
40 NaN NaN NaN NaN Sandstone NaN 1015.0 NaN
41 214844.0 53.0 NaN Markwitz Sandstone – Tumblagooda NaN NaN NaN
42 NaN NaN NaN NaN NaN NaN 985.8-986.1 Yes
43 NaN NaN NaN NaN Sandstone NaN NaN NaN
# Forward-fill the columns that only appear on the first row of each sample
df[["SampleID", "Project", "OrigGeo"]] = df[["SampleID", "Project", "OrigGeo"]].ffill()

# Group by the unique sample ID, then give every row in a group
# the first non-null value found in that group
df['Season'] = df.groupby('SampleID').Season.transform('first')
df['CoreName'] = df.groupby('SampleID').CoreName.transform('first')
df['Geochronology'] = df.groupby('SampleID').Geochronology.transform('first')

# Where a group has several lines of text, join them into one string.
# Note the separator: a space for Lithology, nothing for CoreDepth
df['Lithology'] = df.groupby(['SampleID'])['Lithology'].transform(lambda x: ' '.join(x.dropna()))
df['CoreDepth'] = df.groupby(['SampleID'])['CoreDepth'].transform(lambda x: ''.join(x.dropna()))

# Drop the repeated rows to get the final table
df = df.drop_duplicates(keep='first')
df
SampleID Project Season OrigGeo Lithology CoreName CoreDepth Geochronology
0 214830.0 53.0 NaN Markwitz Quartz-garnet gneiss (PJO) None 1246.75-1246.6 Yes
3 214831.0 53.0 NaN Markwitz Cordierite-sillimanite-garnet gneiss (PJO) None 1244.5-1244.3 No
6 214832.0 53.0 NaN Markwitz Cordierite-sillimanite-garnet gneiss (PJO) None 1241.4-1241.3 No
9 214833.0 53.0 2014.0 Markwitz Sandstone – Tumblagooda Sandstone Wendy-1 1209.72-1209.0 Yes
13 214834.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1134.4-1133.9 Yes
16 214835.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1055.2-1054.9 No
19 214836.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 932.15-931.75 Yes
22 214837.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 915.2-915 Yes
25 214839.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1082.0-1082.3 Yes
28 214840.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1072-1071.7 Yes
31 214841.0 53.0 2015.0 Markwitz Sandstone – Tumblagooda Sandstone Coburn 1 1065.8-1066.0 Yes
35 214842.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1048.5-1048.8 Yes
38 214843.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 1014.7-1015.0 Yes
41 214844.0 53.0 NaN Markwitz Sandstone – Tumblagooda Sandstone None 985.8-986.1 Yes
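The forward-fill and group-wise collapse used above can be tried on a small table without needing the PDF. The sketch below follows the same steps on invented data; the values in `raw` and the function name `collapse_records` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented stand-in for a table whose records were split across several rows
raw = pd.DataFrame({
    "SampleID": [1.0, np.nan, np.nan, 2.0, np.nan],
    "Lithology": ["Sandstone -", np.nan, "Tumblagooda", "Gneiss", np.nan],
    "CoreDepth": ["10.5-", np.nan, "10.2", np.nan, "8.0-7.5"],
    "Geochronology": [np.nan, "Yes", np.nan, np.nan, "No"],
})


def collapse_records(df):
    """Merge the rows belonging to each SampleID into a single row."""
    out = df.copy()
    # The ID only appears on the first row of a record, so carry it downwards
    out["SampleID"] = out["SampleID"].ffill()
    # Broadcast the first non-null value in each group to every row of the group
    out["Geochronology"] = out.groupby("SampleID")["Geochronology"].transform("first")
    # Join multi-line text; space-separated for words, no separator for depth ranges
    out["Lithology"] = out.groupby("SampleID")["Lithology"].transform(
        lambda s: " ".join(s.dropna())
    )
    out["CoreDepth"] = out.groupby("SampleID")["CoreDepth"].transform(
        lambda s: "".join(s.dropna())
    )
    # All rows in a group are now identical, so keep one of each
    return out.drop_duplicates(keep="first").reset_index(drop=True)


tidy = collapse_records(raw)
print(tidy)
```

Resetting the index at the end is a choice made in this sketch; the notebook keeps the original row labels (0, 3, 6, ...), which can be handy for tracing a row back to the raw table.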
#df.to_csv()
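Called with no path, `to_csv()` returns the CSV text as a string; given a filename it writes a file instead. The sketch below uses a two-row frame copied from the final table above and passes `index=False` so that the leftover row labels are not written out. Whether you want to drop the index is a judgment call for your own data.

```python
import io

import pandas as pd

# Two rows copied from the cleaned table, as a small stand-in for `df`
tidy = pd.DataFrame({
    "SampleID": [214830, 214831],
    "CoreDepth": ["1246.75-1246.6", "1244.5-1244.3"],
})

# No path given, so the CSV text is returned rather than written to disk
csv_text = tidy.to_csv(index=False)
print(csv_text)

# Read the text back in to check that the table survives the round trip
round_trip = pd.read_csv(io.StringIO(csv_text))
```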

All materials copyright Sydney Informatics Hub, University of Sydney