Reading in a PDF Table using Python

Often we have to process data in annoying formats. One example is PDF tables. Luckily, Python can help! Here we will read in a table from a PDF file using Python. For more information see this link. We are reading in the tables from the annex of this document. [Picture by Markus Winkler]

!pip install -q tabula-py

import tabula
import pandas as pd

pdf_path = r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\OECD_AIPatents.pdf"

Let’s read in our keyword table on pages 66 and 67.

dfs = tabula.read_pdf(pdf_path, pages=[66,67], columns=None, pandas_options={'header': None})

The command gives us two pandas dataframes, one for each page.

print(len(dfs))
dfs[0]

	0	1	2
0	action recognition	activity recognition	adaboost
1	human action recognition	human activity recognition	NaN
2	adaptive boosting	adversarial network	ambient intelligence
3	NaN	generative adversarial network	NaN
4	ant colony	artificial intelligence	association rule
5	ant colony optimisation	human aware artificial intelligence	NaN
6	autoencoder	autonomic computing	autonomous vehicle
7	autonomous weapon	backpropagation	Bayesian learning
8	bayesian network	bee colony	blind signal separation
9	NaN	artificial bee colony algorithm	NaN
10	bootstrap aggregation	brain computer interface	brownboost
11	chatbot	classification tree	cluster analysis
12	cognitive automation	cognitive computing	cognitive insight system
13	cognitive modelling	collaborative filtering	collision avoidance
14	community detection	computational intelligence	computational pathology
15	computer vision	cyber physical system	data mining
16	decision tree	deep belief network	deep learning
17	dictionary learning	dimensionality reduction	dynamic time warping
18	emotion recognition	ensemble learning	evolutionary algorithm
19	NaN	NaN	differential evolution algorithm
20	NaN	NaN	multi-objective evolutionary algorithm
21	evolutionary computation	face recognition	facial expression recognition
22	factorisation machine	feature engineering	feature extraction
23	feature learning	feature selection	firefly algorithm
24	fuzzy c	gaussian mixture model	gaussian process
25	fuzzy environment	NaN	NaN
26	fuzzy logic	NaN	NaN
27	fuzzy number	NaN	NaN
28	fuzzy set	NaN	NaN
29	intuitionistic fuzzy set	NaN	NaN
30	fuzzy system	NaN	NaN
31	t s fuzzy system	NaN	NaN
32	Takagi-Sugeno fuzzy systems	NaN	NaN
33	genetic algorithm	genetic programming	gesture recognition
34	gradient boosting	graphical model	gravitational search algorithm
35	gradient tree boosting	NaN	NaN
36	hebbian learning	hierarchical clustering	high-dimensional data
37	NaN	NaN	high-dimensional feature
38	NaN	NaN	high-dimensional input
39	NaN	NaN	high-dimensional model
40	NaN	NaN	high-dimensional space
41	NaN	NaN	high-dimensional system
42	image classification	image processing	image recognition
43	image retrieval	image segmentation	independent component analysis
44	inductive monitoring	instance-based learning	intelligence augmentation
45	intelligent agent	intelligent classifier	intelligent geometric computing
46	intelligent software agent	NaN	NaN
47	intelligent infrastructure	Kernel learning	K-means
48	latent dirichlet allocation	latent semantic analysis	latent variable

dfs[1]

	0	1	2
0	layered control system	learning automata	link prediction
1	logitboost	long short term memory (LSTM)	lpboost
2	machine intelligence	machine learning\rextreme machine learning	machine translation
3	machine vision	madaboost	MapReduce
4	Markovian\rhidden Markov model	memetic algorithm	meta learning
5	motion planning	multi task learning	multi-agent system
6	multi-label classification	multi-layer perceptron	multinomial naïve Bayes
7	multi-objective optimisation	naïve Bayes classifier	natural gradient
8	natural language generation	natural language processing	natural language understanding
9	nearest neighbour algorithm	neural network\rartificial neural network\rcon...	neural turing\rneural turing machine
10	neuromorphic computing	non negative matrix factorisation	object detection
11	object recognition	obstacle avoidance	pattern recognition
12	pedestrian detection	policy gradient methods	Q-learning
13	random field	random forest	rankboost
14	recommender system	regression tree	reinforcement learning
15	relational learning\rstatistical relational le...	robot\rbiped robot\rhumanoid robot\rhuman-robo...	rough set
16	rule learning\rrule-based learning	self-organising map	self-organising structure
17	semantic web	semi-supervised learning	sensor fusion\rsensor data fusion\rmulti-senso...
18	sentiment analysis	similarity learning	simultaneous localisation mapping
19	single-linkage clustering	sparse representation	spectral clustering
20	speech recognition	speech to text	stacked generalisation
21	stochastic gradient	supervised learning	support vector regression
22	swarm intelligence	swarm optimisation\rparticle swarm optimisation	temporal difference learning
23	text mining	text to speech	topic model
24	totalboost	trajectory planning	trajectory tracking
25	transfer learning	trust region policy optimisation	unmanned aerial vehicle
26	unsupervised learning	variational inference	vector machine\rsupport vector machine
27	virtual assistant	visual servoing	xgboost
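Some cells on the second page hold several keywords joined by a carriage return (`\r`), and some cells are empty (`NaN`). The `...` in the printed output is most likely just pandas shortening long cells for display. If a flat keyword list is more useful than the three-column layout, the cells could be split along the lines of the sketch below. The helper name `flatten_keywords` and the small `page` dataframe are hypothetical; the values are taken from the output above, plus one empty cell for illustration.

```python
import pandas as pd

# Toy stand-in for one extracted page (hypothetical, values taken from the output above).
page = pd.DataFrame(
    [
        ["machine intelligence", "machine learning\rextreme machine learning", "machine translation"],
        ["Markovian\rhidden Markov model", "memetic algorithm", None],
    ]
)


def flatten_keywords(table):
    """Collect every keyword in the table, row by row, skipping empty cells."""
    keywords = []
    for cell in table.to_numpy().ravel():  # row-major: left to right, top to bottom
        if pd.isna(cell):
            continue  # NaN/None marks an empty cell
        # A single cell can hold several keywords separated by a carriage return.
        for part in str(cell).split("\r"):
            part = part.strip()
            if part:
                keywords.append(part)
    return keywords


print(flatten_keywords(page))
# ['machine intelligence', 'machine learning', 'extreme machine learning',
#  'machine translation', 'Markovian', 'hidden Markov model', 'memetic algorithm']
```

The same function can be applied to the real dataframes, for example `flatten_keywords(dfs[1])`.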

Let’s append both dataframes

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([dfs[0], dfs[1]])

df.head()

	0	1	2
0	action recognition	activity recognition	adaboost
1	human action recognition	human activity recognition	NaN
2	adaptive boosting	adversarial network	ambient intelligence
3	NaN	generative adversarial network	NaN
4	ant colony	artificial intelligence	association rule

df.tail()

	0	1	2
23	text mining	text to speech	topic model
24	totalboost	trajectory planning	trajectory tracking
25	transfer learning	trust region policy optimisation	unmanned aerial vehicle
26	unsupervised learning	variational inference	vector machine\rsupport vector machine
27	virtual assistant	visual servoing	xgboost
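Note that the row labels in the tail still run from 23 to 27: each page keeps its own numbering when the two dataframes are stacked. This does not matter for the Excel export below, which drops the index, but if continuous row labels are wanted, `pd.concat` with `ignore_index=True` is one option. A minimal sketch with two hypothetical toy dataframes:

```python
import pandas as pd

# Two toy dataframes standing in for the two extracted pages (hypothetical values).
first = pd.DataFrame([["a", "b"], ["c", "d"]])
second = pd.DataFrame([["e", "f"]])

kept = pd.concat([first, second])                           # each part keeps its own row labels
renumbered = pd.concat([first, second], ignore_index=True)  # rows are relabelled from 0

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```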

df.to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\AI_Keywords.xlsx", header=False, index=False)

Let’s read in our IPC codes on page 68.

dfs = tabula.read_pdf(pdf_path, pages=[68], columns=None, pandas_options={'header': None})

Again, the 3 tables come back as 3 separate dataframes.

print(len(dfs))
dfs[0]

	0	1	2	3
0	G06N3	G06N5	G06N20	G06F15/18
1	G06T1/40	G16C20/70	G16B40/20	G16B40/30

dfs[0].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withoutKeywords.xlsx", header=False, index=False)

dfs[1]

	0	1	2	3
0	G01R31/367	G06F17/(20-28, 30)	G06F19/24	G06K9/00
1	G06K9/(46-52, 60-82)	G06N7	G06N10	G06N99
2	G06Q	G06T7/00-20	G10L15	G10L21
3	G16B40/(00-10)	G16H50/20-70	H01M8/04992	H04N21/466

dfs[1].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withKeywords.xlsx", header=False, index=False)
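When several tables need saving, the export can also be written as a loop over names and dataframes. The sketch below is only an illustration under a few assumptions: it writes CSV instead of Excel (so no Excel engine is needed), uses a temporary folder as a hypothetical output directory, and works on toy dataframes whose values are copied from the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the extracted IPC tables (values copied from the output above).
tables = {
    "IPC_Codes_withoutKeywords": pd.DataFrame([["G06N3", "G06N5", "G06N20", "G06F15/18"]]),
    "IPC_Codes_withKeywords": pd.DataFrame([["G06Q", "G06T7/00-20", "G10L15", "G10L21"]]),
}

out_dir = Path(tempfile.mkdtemp())  # hypothetical output folder

written = []
for name, table in tables.items():
    target = out_dir / f"{name}.csv"
    # Same options as the to_excel calls: no header row, no index column.
    table.to_csv(target, header=False, index=False)
    written.append(target)

print([path.name for path in written])
# ['IPC_Codes_withoutKeywords.csv', 'IPC_Codes_withKeywords.csv']
```

Swapping `to_csv` back to `to_excel` (and `.csv` to `.xlsx`) gives the same files as the two calls above.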

That’s it! Good luck with making your life easier using Python’s tabula and pandas. :-)