LDA Topic Modeling - The example of CEDEFOP Skill Descriptions
Often we would like to identify topics in a large set of text data. This could be the case if we analyze a large collection of newspaper articles, for example. Reading through all newspapers of the last year by hand would be an insane task, but luckily Machine Learning can help! Hello AI! Let’s have a look at our first AI application example in Python. :-)
[Picture by Arie Wubben]
In this post, we will have a look at LDA Topic Modeling. Topic Modeling is essentially a classification problem and, as it does not require any pre-labeling or pre-training, it belongs to Unsupervised Learning. It is also an application of Natural Language Processing.
In this application, we will follow this video as well as this one. As an example, we will use the Skill Descriptions of ESCO Skills developed by CEDEFOP. You can download the data here. Just choose the skill bundle.
Data Preparation and Set-up
Let’s load our packages.
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import en_core_web_sm
nlp = en_core_web_sm.load()
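If the spaCy model is not installed yet, it can be downloaded once with python -m spacy download en_core_web_sm.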
Let’s load our dataset.
data = pd.read_csv(r"C:\Users\Rude\Documents\Horizon2020\skills_en.csv")
data.head()
  | conceptType | conceptUri | skillType | reuseLevel | preferredLabel | altLabels | hiddenLabels | status | modifiedDate | scopeNote | definition | inScheme | description
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/0005c151-5b5a... | skill/competence | sector-specific | manage musical staff | manage staff of music\ncoordinate duties of mu... | NaN | released | 2016-12-20T17:43:43Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Assign and manage staff tasks in areas such as... |
1 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/00064735-8fad... | skill/competence | occupation-specific | supervise correctional procedures | oversee prison procedures\nmanage correctional... | NaN | released | 2016-12-20T20:17:49Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/memb... | Supervise the operations of a correctional fac... |
2 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/000709ed-2be5... | skill/competence | sector-specific | apply anti-oppressive practices | apply non-oppressive practices\napply an anti-... | NaN | released | 2016-12-20T19:18:19Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Identify oppression in societies, economies, c... |
3 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/0007bdc2-dd15... | skill/competence | sector-specific | control compliance of railway vehicles regulat... | monitoring of compliance with railway vehicles... | NaN | released | 2016-12-20T20:02:19Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/skil... | Inspect rolling stock, components and systems ... |
4 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/00090cc1-1f27... | skill/competence | cross-sector | identify available services | establish available services\ndetermine rehabi... | NaN | released | 2016-12-20T20:15:17Z | NaN | NaN | http://data.europa.eu/esco/concept-scheme/memb... | Identify the different services available for ... |
Data preparation
We want to remove stopwords, as they carry little valuable information and only increase the dimensionality of our dataset.
# use a new name so we do not shadow the imported nltk stopwords module
stop_words = stopwords.words('english')
stop_words[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
Prepare pandas column
# remove anything but letters and spaces, then collapse repeated spaces
text = (data['description']
        .str.replace('[^A-Za-z ]', '', regex=True)
        .str.replace(' +', ' ', regex=True)
        .str.strip())
text.head()
0 Assign and manage staff tasks in areas such as...
1 Supervise the operations of a correctional fac...
2 Identify oppression in societies economies cul...
3 Inspect rolling stock components and systems t...
4 Identify the different services available for ...
Name: description, dtype: object
#text = text.apply(word_tokenize)
#text.head()
print(text[0][0:90])
Assign and manage staff tasks in areas such as scoring arranging copying music and vocal c
Lemmatization
Let’s further reduce the dimensionality of our dataset by accounting for the morphological analysis of the words. What this means is that a word like “see” and its past tense “saw” will be mapped to the same token. This is why lemmatization is in many cases much better than stemming. For a detailed explanation see here.
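As a quick sanity check, we can run a short sentence through the nlp pipeline we loaded above (the exact lemmas may vary slightly between model versions):
# spaCy maps the past tense "saw" back to the lemma "see"
doc = nlp("She saw the results and sees more every day.")
print([(token.text, token.lemma_) for token in doc if token.pos_ == "VERB"])
# something along the lines of: [('saw', 'see'), ('sees', 'see')]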
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    # parser and ner are not needed for lemmatization, so we disable them for speed
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        # keep only the lemmas of nouns, adjectives, verbs and adverbs
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out
lemmatized_texts = lemmatization(text)
print (lemmatized_texts[0][0:90])
assign manage staff task area such score arrange copy music vocal coaching
Remove stopwords
def gen_words(texts):
    final = []
    for text in texts:
        # tokenize, lowercase and strip accents, then drop the stopwords
        new = [word for word in gensim.utils.simple_preprocess(text, deacc=True)
               if word not in stop_words]
        final.append(new)
    return final
data_words = gen_words(lemmatized_texts)
print (data_words[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
Bigrams and Trigrams
Let’s have a look at word associations. Here we can take advantage of bigrams (pairs of words that often appear together) and trigrams (triples of words that often appear together). For more details see here.
# learn frequent pairs: min_count is the minimum number of co-occurrences,
# threshold controls how aggressively pairs are merged
bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=100)
# learn triples on top of the detected bigrams
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=100)
# freeze the learned phrases into fast, read-only phrasers
bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)
def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[bigram[doc]] for doc in texts]
data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)
print (data_bigrams_trigrams[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
print (data_bigrams_trigrams)
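To see what the phraser has learned, we can feed it a hand-made token list. This is a hypothetical example; whether a pair is merged into a single token depends on the phrase scores learned from our corpus:
# pairs scoring above the threshold come back as one underscore-joined
# token, e.g. "vocal_coaching"; all other tokens pass through unchanged
print(bigram[["provide", "vocal", "coaching"]])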
TF-IDF
Words that appear in almost every description receive very low TF-IDF weights and tell us little about any particular topic, so we remove them from the bag-of-words corpus before training.
#TF-IDF REMOVAL
from gensim.models import TfidfModel
id2word = corpora.Dictionary(data_bigrams_trigrams)
texts = data_bigrams_trigrams
corpus = [id2word.doc2bow(text) for text in texts]
# print (corpus[0][0:20])
tfidf = TfidfModel(corpus, id2word=id2word)
low_value = 0.03
words = []
words_missing_in_tfidf = []
for i in range(0, len(corpus)):
    bow = corpus[i]
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    # words scoring below the cutoff carry little information
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    # words with a tf-idf score of 0 are missing from tfidf[bow] entirely
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(id2word[item])
    # keep only the informative words in this document's bag of words
    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    corpus[i] = new_bow
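As a quick sanity check after the loop, the list words holds every token that was dropped somewhere in the corpus:
# number of dropped tokens and a small sample of them
print(len(words))
print(words[:10])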
Dictionary and Bag of Words
NOTE: We do not use this anymore as we use TF-IDF.
'''
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])
word = id2word[0]  # look up the word behind id 0
print (word)
'''
Create Topic Model
Let’s run our topic model. We have to choose the number of topics we want to extract beforehand (in this case 20).
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,     # number of topics to extract
                                            random_state=100,  # for reproducibility
                                            update_every=1,    # update the model after every chunk
                                            chunksize=100,     # documents per training chunk
                                            passes=10,         # full passes over the corpus
                                            alpha="auto")      # learn the document-topic prior from the data
Visualize Data
Let’s visualize our data.
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis
Save
Let’s collect the top words per topic in a data frame, which we can then inspect and save.
top_words_per_topic = []
for t in range(lda_model.num_topics):
    # (topic, word, probability) triples for the 15 most probable words per topic
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=15)])

df = pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P'])
df.head()
  | Topic | Word | P
---|---|---|---
0 | 0 | provide | 0.151524 |
1 | 0 | create | 0.092743 |
2 | 0 | test | 0.077340 |
3 | 0 | prepare | 0.075709 |
4 | 0 | analyse | 0.068877 |
df.loc[df.Word.str.contains('data', case=False)]
  | Topic | Word | P
---|---|---|---
104 | 6 | database | 0.017024 |
179 | 11 | data | 0.014281 |
df.query('Topic==6')
  | Topic | Word | P
---|---|---|---
90 | 6 | accord | 0.155170 |
91 | 6 | datum | 0.087237 |
92 | 6 | measure | 0.081973 |
93 | 6 | company | 0.068412 |
94 | 6 | management | 0.061046 |
95 | 6 | specification | 0.057844 |
96 | 6 | policy | 0.053550 |
97 | 6 | level | 0.053254 |
98 | 6 | implement | 0.049561 |
99 | 6 | element | 0.035853 |
100 | 6 | waste | 0.033441 |
101 | 6 | site | 0.031845 |
102 | 6 | medium | 0.020775 |
103 | 6 | avoid | 0.020167 |
104 | 6 | database | 0.017024 |
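To actually write the table to disk, one line is enough (the file name is just an example):
# persist the topic-word table for later use
df.to_csv("lda_top_words_per_topic.csv", index=False)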