LDA Topic Modeling - The example of CEDEFOP Skill Descriptions
Often we would like to identify topics in a large set of text data. This could be the case if we analyze a large collection of newspaper articles, for example. Reading through all newspapers of the last year by hand would be an insane task, but luckily Machine Learning can help! Hello AI! Let’s have a look at our first AI application example in Python. :-)
[Picture by Arie Wubben]
In this post, we will have a look at LDA Topic Modeling. Topic Modeling is essentially a classification problem and, as it does not require any pre-labeling or pre-training, it belongs to Unsupervised Learning. It is also an application of Natural Language Processing.
In this application, we will follow this video as well as this one. As an example, we will use the Skill Descriptions of ESCO Skills developed by CEDEFOP. You can download the data here. Just choose the skill bundle.
Data Preparation and Set-up
Let’s load our packages.
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import en_core_web_sm
nlp = en_core_web_sm.load()
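If the spaCy model is not installed yet, it can be downloaded once with python -m spacy download en_core_web_sm.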
Let’s load our dataset.
data = pd.read_csv(r"C:\Users\Rude\Documents\Horizon2020\skills_en.csv")
data.head()
  | conceptType | conceptUri | skillType | reuseLevel | preferredLabel | altLabels | hiddenLabels | status | modifiedDate | scopeNote | definition | inScheme | description
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/0005c151-5b5a... | skill/competence | sector-specific | manage musical staff | manage staff of music\ncoordinate duties of mu... | NaN | released | 2016-12-20T17:43:43Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Assign and manage staff tasks in areas such as... |
1 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/00064735-8fad... | skill/competence | occupation-specific | supervise correctional procedures | oversee prison procedures\nmanage correctional... | NaN | released | 2016-12-20T20:17:49Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/memb... | Supervise the operations of a correctional fac... |
2 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/000709ed-2be5... | skill/competence | sector-specific | apply anti-oppressive practices | apply non-oppressive practices\napply an anti-... | NaN | released | 2016-12-20T19:18:19Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Identify oppression in societies, economies, c... |
3 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/0007bdc2-dd15... | skill/competence | sector-specific | control compliance of railway vehicles regulat... | monitoring of compliance with railway vehicles... | NaN | released | 2016-12-20T20:02:19Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Inspect rolling stock, components and systems ... |
4 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/00090cc1-1f27... | skill/competence | cross-sector | identify available services | establish available services\ndetermine rehabi... | NaN | released | 2016-12-20T20:15:17Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/memb... | Identify the different services available for ... |
Data preparation
We want to remove stopwords, as they carry little valuable information and only increase the dimensionality of our dataset.
# use a new name so we do not shadow the imported nltk stopwords module
stop_words = stopwords.words('english')
stop_words[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Prepare pandas column
# remove anything but letters and spaces, then collapse repeated spaces
text = (data['description']
        .str.replace('[^A-Za-z ]', '', regex=True)
        .str.replace(' +', ' ', regex=True)
        .str.strip())
text.head()
0 Assign and manage staff tasks in areas such as...
1 Supervise the operations of a correctional fac...
2 Identify oppression in societies economies cul...
3 Inspect rolling stock components and systems t...
4 Identify the different services available for ...
Name: description, dtype: object
#text = text.apply(word_tokenize)
#text.head()
print(text[0][0:90])
Assign and manage staff tasks in areas such as scoring arranging copying music and vocal c
Lemmatization
Let’s further reduce the dimensionality of our dataset by accounting for the morphological analysis of the words. What this means is that a word like “see” and its past tense “saw” will be mapped to the same token. This is why lemmatization is in many cases much better than stemming. For a detailed explanation see here.
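As a quick sanity check, we can run a short sentence through the nlp pipeline we loaded above (the exact lemmas may vary slightly between model versions):
# spaCy maps the past tense "saw" back to the lemma "see"
doc = nlp("She saw the results and sees more every day.")
print([(token.text, token.lemma_) for token in doc if token.pos_ == "VERB"])
# something along the lines of: [('saw', 'see'), ('sees', 'see')]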
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    # parser and ner are not needed for lemmatization, so we disable them for speed
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        # keep only the lemmas of nouns, adjectives, verbs and adverbs
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out
lemmatized_texts = lemmatization(text)
print (lemmatized_texts[0][0:90])
assign manage staff task area such score arrange copy music vocal coaching
Remove stopwords
def gen_words(texts):
    final = []
    for text in texts:
        # tokenize, lowercase and strip accents, then drop the stopwords
        new = [word for word in gensim.utils.simple_preprocess(text, deacc=True)
               if word not in stop_words]
        final.append(new)
    return final
data_words = gen_words(lemmatized_texts)
print (data_words[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
Bigrams and Trigrams
Let’s have a look at word associations. Here we can take advantage of bigrams (pairs of words that often appear together) and trigrams (triples of words that often appear together). For more details see here.
# learn frequent pairs: min_count is the minimum number of co-occurrences,
# threshold controls how aggressively pairs are merged
bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=100)
# learn triples on top of the detected bigrams
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=100)
# freeze the learned phrases into fast, read-only phrasers
bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)
def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[bigram[doc]] for doc in texts]
data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)
print (data_bigrams_trigrams[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
print (data_bigrams_trigrams)
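To see what the phraser has learned, we can feed it a hand-made token list. This is a hypothetical example; whether a pair is merged into a single token depends on the phrase scores learned from our corpus:
# pairs scoring above the threshold come back as one underscore-joined
# token, e.g. "vocal_coaching"; all other tokens pass through unchanged
print(bigram[["provide", "vocal", "coaching"]])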
TF-IDF
Words that appear in almost every description receive very low TF-IDF weights and tell us little about any particular topic, so we remove them from the bag-of-words corpus before training.
#TF-IDF REMOVAL
from gensim.models import TfidfModel
id2word = corpora.Dictionary(data_bigrams_trigrams)
texts = data_bigrams_trigrams
corpus = [id2word.doc2bow(text) for text in texts]
# print (corpus[0][0:20])
tfidf = TfidfModel(corpus, id2word=id2word)
low_value = 0.03
words = []
words_missing_in_tfidf = []
for i in range(0, len(corpus)):
    bow = corpus[i]
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    # words scoring below the cutoff carry little information
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    # words with a tf-idf score of 0 are missing from tfidf[bow] entirely
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(id2word[item])
    # keep only the informative words in this document's bag of words
    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    corpus[i] = new_bow
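As a quick sanity check after the loop, the list words holds every token that was dropped somewhere in the corpus:
# number of dropped tokens and a small sample of them
print(len(words))
print(words[:10])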
Dictionary and Bag of Words
NOTE: We do not use this anymore as we use TF-IDF.
'''
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])
word = id2word[0]  # look up the word behind id 0
print (word)
'''
Create Topic Model
Let’s run our topic model. We have to choose the number of topics we want to extract beforehand (in this case 20).
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,     # number of topics to extract
                                            random_state=100,  # for reproducibility
                                            update_every=1,    # update the model after every chunk
                                            chunksize=100,     # documents per training chunk
                                            passes=10,         # full passes over the corpus
                                            alpha="auto")      # learn the document-topic prior from the data
Visualize Data
Let’s visualize our data.
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis
Save
Let’s collect the top words per topic in a data frame, which we can then inspect and save.
top_words_per_topic = []
for t in range(lda_model.num_topics):
    # (topic, word, probability) triples for the 15 most probable words per topic
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=15)])

df = pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P'])
df.head()
  | Topic | Word | P
---|---|---|---
0 | 0 | provide | 0.151524 |
1 | 0 | create | 0.092743 |
2 | 0 | test | 0.077340 |
3 | 0 | prepare | 0.075709 |
4 | 0 | analyse | 0.068877 |
df.loc[df.Word.str.contains('data', case=False)]
  | Topic | Word | P
---|---|---|---
104 | 6 | database | 0.017024 |
179 | 11 | data | 0.014281 |
df.query('Topic==6')
  | Topic | Word | P
---|---|---|---
90 | 6 | accord | 0.155170 |
91 | 6 | datum | 0.087237 |
92 | 6 | measure | 0.081973 |
93 | 6 | company | 0.068412 |
94 | 6 | management | 0.061046 |
95 | 6 | specification | 0.057844 |
96 | 6 | policy | 0.053550 |
97 | 6 | level | 0.053254 |
98 | 6 | implement | 0.049561 |
99 | 6 | element | 0.035853 |
100 | 6 | waste | 0.033441 |
101 | 6 | site | 0.031845 |
102 | 6 | medium | 0.020775 |
103 | 6 | avoid | 0.020167 |
104 | 6 | database | 0.017024 |
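To actually write the table to disk, one line is enough (the file name is just an example):
# persist the topic-word table for later use
df.to_csv("lda_top_words_per_topic.csv", index=False)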