Often we would like to identify the topics in a large set of documents. This could be the case if we analyze a large set of newspaper articles, for example. Reading through all newspapers of the last year by hand would be an insane task, but luckily Machine Learning can help! Hello AI! Let’s have a look at our first AI application example in Python. :-) [Picture by Arie Wubben]

In this post, we will have a look at LDA Topic Modeling. Topic Modeling is essentially a classification problem without given labels: since it requires no pre-labeling or pre-training, it belongs to Unsupervised Learning. It is also a core application of Natural Language Processing.

In this application, we will follow this video as well as this one. As an example, we will use the Skill Descriptions of ESCO Skills developed by CEDEFOP. You can download the data here. Just choose the skill bundle.

Data Preparation and Set-up

Let’s load our packages.

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy

import pyLDAvis
import pyLDAvis.gensim_models
import en_core_web_sm
nlp = en_core_web_sm.load()

Let’s load our dataset and look at the first rows; the column we will work with later is description.

data = pd.read_csv(r"C:\Users\Rude\Documents\Horizon2020\skills_en.csv")
data.head()
conceptType conceptUri skillType reuseLevel preferredLabel altLabels hiddenLabels status modifiedDate scopeNote definition inScheme description
0 KnowledgeSkillCompetence http://data.europa.eu/esco/skill/0005c151-5b5a... skill/competence sector-specific manage musical staff manage staff of music\ncoordinate duties of mu... NaN released 2016-12-20T17:43:43Z NaN NaN http://data.europa.eu/esco/concept-scheme/skil... Assign and manage staff tasks in areas such as...
1 KnowledgeSkillCompetence http://data.europa.eu/esco/skill/00064735-8fad... skill/competence occupation-specific supervise correctional procedures oversee prison procedures\nmanage correctional... NaN released 2016-12-20T20:17:49Z NaN NaN http://data.europa.eu/esco/concept-scheme/memb... Supervise the operations of a correctional fac...
2 KnowledgeSkillCompetence http://data.europa.eu/esco/skill/000709ed-2be5... skill/competence sector-specific apply anti-oppressive practices apply non-oppressive practices\napply an anti-... NaN released 2016-12-20T19:18:19Z NaN NaN http://data.europa.eu/esco/concept-scheme/skil... Identify oppression in societies, economies, c...
3 KnowledgeSkillCompetence http://data.europa.eu/esco/skill/0007bdc2-dd15... skill/competence sector-specific control compliance of railway vehicles regulat... monitoring of compliance with railway vehicles... NaN released 2016-12-20T20:02:19Z NaN NaN http://data.europa.eu/esco/concept-scheme/skil... Inspect rolling stock, components and systems ...
4 KnowledgeSkillCompetence http://data.europa.eu/esco/skill/00090cc1-1f27... skill/competence cross-sector identify available services establish available services\ndetermine rehabi... NaN released 2016-12-20T20:15:17Z NaN NaN http://data.europa.eu/esco/concept-scheme/memb... Identify the different services available for ...

Data preparation

We want to remove stopwords, as they carry little information and only increase the dimensionality of our dataset.

stop_words = stopwords.words('english')  # renamed to avoid shadowing the imported module
stop_words[:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Prepare pandas column

# remove anything but letters and spaces, then collapse repeated spaces
text = data['description'].str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()
text.head()
0    Assign and manage staff tasks in areas such as...
1    Supervise the operations of a correctional fac...
2    Identify oppression in societies economies cul...
3    Inspect rolling stock components and systems t...
4    Identify the different services available for ...
Name: description, dtype: object
# an alternative would be NLTK's word_tokenize (not used here):
# text = text.apply(word_tokenize)
# text.head()
print(text[0][0:90])
Assign and manage staff tasks in areas such as scoring arranging copying music and vocal c

Lemmatization

Let’s further reduce the dimensionality of our dataset by accounting for the morphology of the words: the verb “see” and its past tense “saw” are mapped to the same lemma. Because it relies on a proper morphological analysis instead of simply chopping off word endings, lemmatization is in many cases much better than stemming. For a detailed explanation see here.

def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    # reload the model without parser and NER; we only need POS tags and lemmas
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        # keep only the lemmas of nouns, adjectives, verbs and adverbs
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out
lemmatized_texts = lemmatization(text)
print (lemmatized_texts[0][0:90])
assign manage staff task area such score arrange copy music vocal coaching
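Calling nlp() on one description at a time works, but it is slow for a dataset of this size. As an optional variant (a sketch, not part of the original post), spaCy's nlp.pipe streams the texts through the pipeline in batches:

def lemmatization_fast(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    # nlp.pipe batches the documents internally, much faster than one nlp() call per text
    for doc in nlp.pipe(texts, batch_size=100):
        texts_out.append(" ".join(token.lemma_ for token in doc if token.pos_ in allowed_postags))
    return texts_out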

Remove stopwords

We tokenize the lemmatized texts with gensim's simple_preprocess, which lowercases, strips punctuation and accents (deacc=True) and drops very short tokens. By itself it does not remove the stop words we loaded earlier (notice 'such' surviving in the output below), so that extra filtering step is sketched right after.

def gen_words(texts):
    final = []
    for text in texts:
        # lowercase, strip punctuation/accents, drop very short tokens
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return final
data_words = gen_words(lemmatized_texts)

print (data_words[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'such', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
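To actually apply the stop word list we loaded earlier, we can filter each tokenized description. A minimal sketch (applying it removes tokens like 'such'; the outputs further down still show the unfiltered version):

# drop NLTK's English stop words from every tokenized description
data_words = [[word for word in doc if word not in stop_words] for doc in data_words]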

Bigrams and Trigrams

Let’s have a look at word associations. Here we can take advantage of bigrams (pairs of words that often appear together) and trigrams (triples of words that often appear together). For more details see here.

bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=100)

bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

def make_bigrams(texts):
    return([bigram[doc] for doc in texts])

def make_trigrams(texts):
    return ([trigram[bigram[doc]] for doc in texts])

data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

print (data_bigrams_trigrams[0][0:20])
['assign', 'manage', 'staff', 'task', 'area', 'such', 'score', 'arrange', 'copy', 'music', 'vocal', 'coaching']
# print(data_bigrams_trigrams)  # careful: this would print the entire corpus

TF-IDF

Words that appear in nearly every description receive a low TF-IDF score and carry little topic information. We build the dictionary and the bag-of-words corpus, then drop from each document every word whose TF-IDF score falls below a small threshold.

#TF-IDF REMOVAL
from gensim.models import TfidfModel

id2word = corpora.Dictionary(data_bigrams_trigrams)

texts = data_bigrams_trigrams

corpus = [id2word.doc2bow(text) for text in texts]
# print (corpus[0][0:20])

tfidf = TfidfModel(corpus, id2word=id2word)

low_value = 0.03
words = []  # bookkeeping: every word we drop, for later inspection
for i in range(0, len(corpus)):
    bow = corpus[i]
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    # words with a TF-IDF score of 0 are missing from tfidf[bow] entirely
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(id2word[item])
    # keep only the informative words in this document's bag of words
    corpus[i] = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

Dictionary and Bag of Words

NOTE: We do not use this anymore, as the TF-IDF section above already builds id2word and corpus. The plain bag-of-words version is kept here for reference.

'''
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])

word = id2word[[0][:1][0]]
print (word)
'''

Create Topic Model

Let’s run our topic model. With LDA, we have to choose the number of topics beforehand; here we go with 20.

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,     # number of topics to extract
                                            random_state=100,  # fixed seed for reproducibility
                                            update_every=1,    # update the model after every chunk
                                            chunksize=100,     # documents per training chunk
                                            passes=10,         # passes over the whole corpus
                                            alpha="auto")      # learn the document-topic prior from the data

Visualize Data

Let’s visualize our data.

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis
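If you want to share the chart outside the notebook, pyLDAvis can also write it to a standalone HTML file (the file name is just a placeholder):

pyLDAvis.save_html(vis, "lda_topics.html")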

Save

Let’s save the top words per topic as a DataFrame; this is the table we use to interpret and label the topics.

top_words_per_topic = []
for t in range(lda_model.num_topics):
    # collect (topic, word, probability) triples for the 15 most probable words per topic
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=15)])

df = pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P'])
df.head()
Topic Word P
0 0 provide 0.151524
1 0 create 0.092743
2 0 test 0.077340
3 0 prepare 0.075709
4 0 analyse 0.068877
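To write the table to disk, a plain CSV export works (the file name is just a placeholder):

# save the topic/word/probability table; path and name are assumptions
df.to_csv("top_words_per_topic.csv", index=False)

We can also query the table, for example to see which topics pick up data-related words: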
df.loc[df.Word.str.contains('data', case=False)]
Topic Word P
104 6 database 0.017024
179 11 data 0.014281
df.query('Topic==6')
Topic Word P
90 6 accord 0.155170
91 6 datum 0.087237
92 6 measure 0.081973
93 6 company 0.068412
94 6 management 0.061046
95 6 specification 0.057844
96 6 policy 0.053550
97 6 level 0.053254
98 6 implement 0.049561
99 6 element 0.035853
100 6 waste 0.033441
101 6 site 0.031845
102 6 medium 0.020775
103 6 avoid 0.020167
104 6 database 0.017024
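Finally, if you want the original dataset tagged with a topic (one per skill description), you can attach each document's most probable topic. A minimal sketch, assuming corpus is still aligned row by row with data; the helper and the column name "topic" are mine:

def dominant_topic(bow):
    # return the id of the most probable topic for one bag-of-words document
    topics = lda_model.get_document_topics(bow)
    return max(topics, key=lambda x: x[1])[0]

data["topic"] = [dominant_topic(bow) for bow in corpus]
data[["preferredLabel", "topic"]].head()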