Often we have to process data in annoying formats. One example is PDF tables. Luckily, Python can help! Here we will read in a table from a PDF file using Python. For more information see this link. We are reading in the tables from the annex of this document. [Picture by Markus Winkler]
!pip install -q tabula-py
import tabula
import pandas as pd
pdf_path = r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\OECD_AIPatents.pdf"
Let’s read in our keyword table on pages 66 and 67
dfs = tabula.read_pdf(pdf_path, pages=[66,67], columns=None, pandas_options={'header': None})
The command gives us two pandas dataframes, one for each page.
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | action recognition | activity recognition | adaboost |
| 1 | human action recognition | human activity recognition | NaN |
| 2 | adaptive boosting | adversarial network | ambient intelligence |
| 3 | NaN | generative adversarial network | NaN |
| 4 | ant colony | artificial intelligence | association rule |
| 5 | ant colony optimisation | human aware artificial intelligence | NaN |
| 6 | autoencoder | autonomic computing | autonomous vehicle |
| 7 | autonomous weapon | backpropagation | Bayesian learning |
| 8 | bayesian network | bee colony | blind signal separation |
| 9 | NaN | artificial bee colony algorithm | NaN |
| 10 | bootstrap aggregation | brain computer interface | brownboost |
| 11 | chatbot | classification tree | cluster analysis |
| 12 | cognitive automation | cognitive computing | cognitive insight system |
| 13 | cognitive modelling | collaborative filtering | collision avoidance |
| 14 | community detection | computational intelligence | computational pathology |
| 15 | computer vision | cyber physical system | data mining |
| 16 | decision tree | deep belief network | deep learning |
| 17 | dictionary learning | dimensionality reduction | dynamic time warping |
| 18 | emotion recognition | ensemble learning | evolutionary algorithm |
| 19 | NaN | NaN | differential evolution algorithm |
| 20 | NaN | NaN | multi-objective evolutionary algorithm |
| 21 | evolutionary computation | face recognition | facial expression recognition |
| 22 | factorisation machine | feature engineering | feature extraction |
| 23 | feature learning | feature selection | firefly algorithm |
| 24 | fuzzy c | gaussian mixture model | gaussian process |
| 25 | fuzzy environment | NaN | NaN |
| 26 | fuzzy logic | NaN | NaN |
| 27 | fuzzy number | NaN | NaN |
| 28 | fuzzy set | NaN | NaN |
| 29 | intuitionistic fuzzy set | NaN | NaN |
| 30 | fuzzy system | NaN | NaN |
| 31 | t s fuzzy system | NaN | NaN |
| 32 | Takagi-Sugeno fuzzy systems | NaN | NaN |
| 33 | genetic algorithm | genetic programming | gesture recognition |
| 34 | gradient boosting | graphical model | gravitational search algorithm |
| 35 | gradient tree boosting | NaN | NaN |
| 36 | hebbian learning | hierarchical clustering | high-dimensional data |
| 37 | NaN | NaN | high-dimensional feature |
| 38 | NaN | NaN | high-dimensional input |
| 39 | NaN | NaN | high-dimensional model |
| 40 | NaN | NaN | high-dimensional space |
| 41 | NaN | NaN | high-dimensional system |
| 42 | image classification | image processing | image recognition |
| 43 | image retrieval | image segmentation | independent component analysis |
| 44 | inductive monitoring | instance-based learning | intelligence augmentation |
| 45 | intelligent agent | intelligent classifier | intelligent geometric computing |
| 46 | intelligent software agent | NaN | NaN |
| 47 | intelligent infrastructure | Kernel learning | K-means |
| 48 | latent dirichlet allocation | latent semantic analysis | latent variable |

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | layered control system | learning automata | link prediction |
| 1 | logitboost | long short term memory (LSTM) | lpboost |
| 2 | machine intelligence | machine learning\rextreme machine learning | machine translation |
| 3 | machine vision | madaboost | MapReduce |
| 4 | Markovian\rhidden Markov model | memetic algorithm | meta learning |
| 5 | motion planning | multi task learning | multi-agent system |
| 6 | multi-label classification | multi-layer perceptron | multinomial naïve Bayes |
| 7 | multi-objective optimisation | naïve Bayes classifier | natural gradient |
| 8 | natural language generation | natural language processing | natural language understanding |
| 9 | nearest neighbour algorithm | neural network\rartificial neural network\rcon... | neural turing\rneural turing machine |
| 10 | neuromorphic computing | non negative matrix factorisation | object detection |
| 11 | object recognition | obstacle avoidance | pattern recognition |
| 12 | pedestrian detection | policy gradient methods | Q-learning |
| 13 | random field | random forest | rankboost |
| 14 | recommender system | regression tree | reinforcement learning |
| 15 | relational learning\rstatistical relational le... | robot\rbiped robot\rhumanoid robot\rhuman-robo... | rough set |
| 16 | rule learning\rrule-based learning | self-organising map | self-organising structure |
| 17 | semantic web | semi-supervised learning | sensor fusion\rsensor data fusion\rmulti-senso... |
| 18 | sentiment analysis | similarity learning | simultaneous localisation mapping |
| 19 | single-linkage clustering | sparse representation | spectral clustering |
| 20 | speech recognition | speech to text | stacked generalisation |
| 21 | stochastic gradient | supervised learning | support vector regression |
| 22 | swarm intelligence | swarm optimisation\rparticle swarm optimisation | temporal difference learning |
| 23 | text mining | text to speech | topic model |
| 24 | totalboost | trajectory planning | trajectory tracking |
| 25 | transfer learning | trust region policy optimisation | unmanned aerial vehicle |
| 26 | unsupervised learning | variational inference | vector machine\rsupport vector machine |
| 27 | virtual assistant | visual servoing | xgboost |
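The extracted grids are not yet a clean keyword list: many cells are NaN, and some cells hold several keywords separated by a carriage return (`\r`). The `...` endings in the printed output look like pandas display truncation rather than missing data, but that is an assumption worth checking on the real frames. Below is a minimal sketch of one way to flatten such grids into a single list. The helper name `flatten_keywords` and the two miniature frames are hypothetical and only stand in for `dfs[0]` and `dfs[1]`.

```python
import pandas as pd


def flatten_keywords(frames):
    """Collect the keywords from a list of keyword grids into one list."""
    keywords = []
    for frame in frames:
        # Walk the grid row by row, left to right, skipping empty (NaN) cells.
        for cell in frame.to_numpy().ravel():
            if pd.isna(cell):
                continue
            # One cell may hold several keywords separated by carriage returns.
            for part in str(cell).split("\r"):
                part = part.strip()
                if part and part not in keywords:
                    keywords.append(part)
    return keywords


# Hypothetical miniature of the two extracted pages, for illustration only.
page_a = pd.DataFrame([
    ["action recognition", "activity recognition", "adaboost"],
    ["human action recognition", "human activity recognition", None],
])
page_b = pd.DataFrame([
    ["machine intelligence", "machine learning\rextreme machine learning", "machine translation"],
])

keywords = flatten_keywords([page_a, page_b])
print(keywords)
```

With the real data you would call `flatten_keywords(dfs)`. Note that this treats continuation rows such as "human action recognition" as independent keywords, which may or may not match how the source table groups them.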
Let’s append both dataframes
df = pd.concat([dfs[0], dfs[1]])  # DataFrame.append was removed in pandas 2.0, so we use pd.concat
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | action recognition | activity recognition | adaboost |
| 1 | human action recognition | human activity recognition | NaN |
| 2 | adaptive boosting | adversarial network | ambient intelligence |
| 3 | NaN | generative adversarial network | NaN |
| 4 | ant colony | artificial intelligence | association rule |

| | 0 | 1 | 2 |
|---|---|---|---|
| 23 | text mining | text to speech | topic model |
| 24 | totalboost | trajectory planning | trajectory tracking |
| 25 | transfer learning | trust region policy optimisation | unmanned aerial vehicle |
| 26 | unsupervised learning | variational inference | vector machine\rsupport vector machine |
| 27 | virtual assistant | visual servoing | xgboost |
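The last rows of the combined dataframe still carry the row labels of the second page (23 to 27), so labels such as 0 or 1 now occur twice. That does not matter for the Excel export below, which drops the index, but it can be confusing when selecting rows by label. Here is a small sketch of the difference, using two hypothetical miniature frames in place of `dfs[0]` and `dfs[1]`:

```python
import pandas as pd

# Hypothetical stand-ins for the two page-level dataframes.
first = pd.DataFrame([
    ["action recognition", "adaboost"],
    ["adaptive boosting", "ambient intelligence"],
])
second = pd.DataFrame([
    ["layered control system", "link prediction"],
])

# Default behaviour: each frame keeps its own row labels.
kept = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

If you want continuous row labels in the combined keyword table, `pd.concat(dfs, ignore_index=True)` would be one way to get them.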
df.to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\AI_Keywords.xlsx", header=False, index=False)
Let’s read in our IPC codes on page 68
dfs = tabula.read_pdf(pdf_path, pages=[68], columns=None, pandas_options={'header': None})
This time the page holds 3 tables, and tabula returns each of them as its own dataframe.
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | G06N3 | G06N5 | G06N20 | G06F15/18 |
| 1 | G06T1/40 | G16C20/70 | G16B40/20 | G16B40/30 |
dfs[0].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withoutKeywords.xlsx", header=False, index=False)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | G01R31/367 | G06F17/(20-28, 30) | G06F19/24 | G06K9/00 |
| 1 | G06K9/(46-52, 60-82) | G06N7 | G06N10 | G06N99 |
| 2 | G06Q | G06T7/00-20 | G10L15 | G10L21 |
| 3 | G16B40/(00-10) | G16H50/20-70 | H01M8/04992 | H04N21/466 |
dfs[1].to_excel(r"F:\Working with Github\Horizon2020_migration\raw_data_supplements\Patent_Data\IPC_Codes_withKeywords.xlsx", header=False, index=False)
That’s it! Good luck with making your life easier using Python’s tabula and pandas. :-)