Often we have to process data in annoying formats. One example is tables inside PDF files. Luckily, Python can help! Here we will read in tables from a PDF file using Python. For more information see this link. We are reading in the tables from the annex of this document. [Picture by Markus Winkler]

!pip install -q tabula-py
import tabula
import pandas as pd
pdf_path = r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\OECD_AIPatents.pdf"

Let’s read in our keyword table on pages 66 and 67.

dfs = tabula.read_pdf(pdf_path, pages=[66,67], columns=None, pandas_options={'header': None})

The command gives us two pandas dataframes, one for each page.

print(len(dfs))
dfs[0]
2
0 1 2
0 action recognition activity recognition adaboost
1 human action recognition human activity recognition NaN
2 adaptive boosting adversarial network ambient intelligence
3 NaN generative adversarial network NaN
4 ant colony artificial intelligence association rule
5 ant colony optimisation human aware artificial intelligence NaN
6 autoencoder autonomic computing autonomous vehicle
7 autonomous weapon backpropagation Bayesian learning
8 bayesian network bee colony blind signal separation
9 NaN artificial bee colony algorithm NaN
10 bootstrap aggregation brain computer interface brownboost
11 chatbot classification tree cluster analysis
12 cognitive automation cognitive computing cognitive insight system
13 cognitive modelling collaborative filtering collision avoidance
14 community detection computational intelligence computational pathology
15 computer vision cyber physical system data mining
16 decision tree deep belief network deep learning
17 dictionary learning dimensionality reduction dynamic time warping
18 emotion recognition ensemble learning evolutionary algorithm
19 NaN NaN differential evolution algorithm
20 NaN NaN multi-objective evolutionary algorithm
21 evolutionary computation face recognition facial expression recognition
22 factorisation machine feature engineering feature extraction
23 feature learning feature selection firefly algorithm
24 fuzzy c gaussian mixture model gaussian process
25 fuzzy environment NaN NaN
26 fuzzy logic NaN NaN
27 fuzzy number NaN NaN
28 fuzzy set NaN NaN
29 intuitionistic fuzzy set NaN NaN
30 fuzzy system NaN NaN
31 t s fuzzy system NaN NaN
32 Takagi-Sugeno fuzzy systems NaN NaN
33 genetic algorithm genetic programming gesture recognition
34 gradient boosting graphical model gravitational search algorithm
35 gradient tree boosting NaN NaN
36 hebbian learning hierarchical clustering high-dimensional data
37 NaN NaN high-dimensional feature
38 NaN NaN high-dimensional input
39 NaN NaN high-dimensional model
40 NaN NaN high-dimensional space
41 NaN NaN high-dimensional system
42 image classification image processing image recognition
43 image retrieval image segmentation independent component analysis
44 inductive monitoring instance-based learning intelligence augmentation
45 intelligent agent intelligent classifier intelligent geometric computing
46 intelligent software agent NaN NaN
47 intelligent infrastructure Kernel learning K-means
48 latent dirichlet allocation latent semantic analysis latent variable
dfs[1]
0 1 2
0 layered control system learning automata link prediction
1 logitboost long short term memory (LSTM) lpboost
2 machine intelligence machine learning\rextreme machine learning machine translation
3 machine vision madaboost MapReduce
4 Markovian\rhidden Markov model memetic algorithm meta learning
5 motion planning multi task learning multi-agent system
6 multi-label classification multi-layer perceptron multinomial naïve Bayes
7 multi-objective optimisation naïve Bayes classifier natural gradient
8 natural language generation natural language processing natural language understanding
9 nearest neighbour algorithm neural network\rartificial neural network\rcon... neural turing\rneural turing machine
10 neuromorphic computing non negative matrix factorisation object detection
11 object recognition obstacle avoidance pattern recognition
12 pedestrian detection policy gradient methods Q-learning
13 random field random forest rankboost
14 recommender system regression tree reinforcement learning
15 relational learning\rstatistical relational le... robot\rbiped robot\rhumanoid robot\rhuman-robo... rough set
16 rule learning\rrule-based learning self-organising map self-organising structure
17 semantic web semi-supervised learning sensor fusion\rsensor data fusion\rmulti-senso...
18 sentiment analysis similarity learning simultaneous localisation mapping
19 single-linkage clustering sparse representation spectral clustering
20 speech recognition speech to text stacked generalisation
21 stochastic gradient supervised learning support vector regression
22 swarm intelligence swarm optimisation\rparticle swarm optimisation temporal difference learning
23 text mining text to speech topic model
24 totalboost trajectory planning trajectory tracking
25 transfer learning trust region policy optimisation unmanned aerial vehicle
26 unsupervised learning variational inference vector machine\rsupport vector machine
27 virtual assistant visual servoing xgboost
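
Some cells in the output above hold several keywords joined by a carriage return (`\r`), and others are empty (NaN). If you want one flat keyword list instead of a three-column grid, something along these lines may work. This is only a sketch: `flatten_keywords` is a hypothetical helper, not part of tabula or pandas, and the two small frames are made-up stand-ins for `dfs[0]` and `dfs[1]`.

```python
import pandas as pd


def flatten_keywords(frames):
    """Collect every keyword from a list of keyword tables into one sorted list."""
    keywords = set()
    for frame in frames:
        # Walk over every cell of the grid, row by row.
        for cell in frame.to_numpy().ravel():
            if pd.isna(cell):
                continue  # empty cell, nothing to collect
            # One cell may contain several keywords separated by "\r".
            for part in str(cell).split("\r"):
                part = part.strip()
                if part:
                    keywords.add(part)
    # Sort case-insensitively so "Bayesian" and "bayesian" end up next to each other.
    return sorted(keywords, key=str.lower)


# Tiny stand-ins for the two page tables (values copied from the output above).
page_a = pd.DataFrame([
    ["adaptive boosting", "adversarial network"],
    [float("nan"), "generative adversarial network"],
])
page_b = pd.DataFrame([
    ["machine learning\rextreme machine learning", "machine translation"],
])

keywords = flatten_keywords([page_a, page_b])
print(keywords)
```

With the real data you would call `flatten_keywords(dfs)` right after `tabula.read_pdf`.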

Let’s append both dataframes. We use pd.concat here, because DataFrame.append was removed in pandas 2.0.

df = pd.concat([dfs[0], dfs[1]])
df.head()
0 1 2
0 action recognition activity recognition adaboost
1 human action recognition human activity recognition NaN
2 adaptive boosting adversarial network ambient intelligence
3 NaN generative adversarial network NaN
4 ant colony artificial intelligence association rule
df.tail()
0 1 2
23 text mining text to speech topic model
24 totalboost trajectory planning trajectory tracking
25 transfer learning trust region policy optimisation unmanned aerial vehicle
26 unsupervised learning variational inference vector machine\rsupport vector machine
27 virtual assistant visual servoing xgboost
df.to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\AI_Keywords.xlsx", header=False, index=False)
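
Note that the tail of the combined table still shows the row labels 23 to 27 from the second page, so the labels 0 to 27 appear twice. That does no harm here, because the export uses `index=False`. If you do want one continuous index, `ignore_index=True` is one option. A small sketch with made-up two-row frames in place of the real pages:

```python
import pandas as pd

# Two tiny stand-ins for the page tables.
first = pd.DataFrame([["action recognition"], ["adaptive boosting"]])
second = pd.DataFrame([["layered control system"], ["logitboost"]])

# Default behaviour: each frame keeps its own row labels.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```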

Let’s read in our IPC codes on page 68.

dfs = tabula.read_pdf(pdf_path, pages=[68], columns=None, pandas_options={'header': None})

Again, we read in the 3 tables as 3 different data frames.

print(len(dfs))
dfs[0]
3
0 1 2 3
0 G06N3 G06N5 G06N20 G06F15/18
1 G06T1/40 G16C20/70 G16B40/20 G16B40/30
dfs[0].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withoutKeywords.xlsx", header=False, index=False)
dfs[1]
0 1 2 3
0 G01R31/367 G06F17/(20-28, 30) G06F19/24 G06K9/00
1 G06K9/(46-52, 60-82) G06N7 G06N10 G06N99
2 G06Q G06T7/00-20 G10L15 G10L21
3 G16B40/(00-10) G16H50/20-70 H01M8/04992 H04N21/466
dfs[1].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withKeywords.xlsx", header=False, index=False)
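
When a page yields several tables, saving them one by one gets repetitive. A loop over a name-to-table mapping is one way to tidy this up. The sketch below is an assumption-laden illustration: `save_tables` is a hypothetical helper, the frames are shortened stand-ins for `dfs[0]` and `dfs[1]`, and it writes CSV files into a temporary folder so that it runs without an Excel engine. For Excel output you would call `to_excel` with the same `header=False, index=False` arguments instead.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_tables(tables, out_dir):
    """Write each named table to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    written = []
    for name, table in tables.items():
        target = out_dir / f"{name}.csv"
        # Same options as in the post: no header row, no index column.
        table.to_csv(target, header=False, index=False)
        written.append(target)
    return written


# Shortened stand-ins for the IPC code tables shown above.
tables = {
    "IPC_Codes_withoutKeywords": pd.DataFrame(
        [["G06N3", "G06N5"], ["G06T1/40", "G16C20/70"]]
    ),
    "IPC_Codes_withKeywords": pd.DataFrame([["G01R31/367", "G06F19/24"]]),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_tables(tables, tmp)
    saved_names = [p.name for p in paths]
    first_file_lines = paths[0].read_text().splitlines()

print(saved_names)
print(first_file_lines)
```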

That’s it! Good luck with making your life easier using Python’s tabula and pandas. :-)