LDA: A front runner in topic modelling

Bratati Mohapatra
A topic model is a statistical model that captures the latent semantic structure of a collection of documents. It is used as a text-mining tool in machine learning as well as a source of contextual information, and its results can be fed as input to document categorization. Topic models are generated through unsupervised learning: the user does not supply training or example documents to build the model. One of the earliest topic models was probabilistic latent semantic analysis (1999). Latent Dirichlet Allocation (LDA) is the most frequently used topic model and is widely applied to text classification. There are many variations and extensions of LDA that improve the quality of the generated topics; here we will explore the base model to understand how topics are generated from a document corpus.
“LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.”
The above description of LDA is from the paper titled ‘Latent Dirichlet Allocation’ (Blei, Ng and Jordan, 2003)[1]. It states that the model infers topics from the occurrence of words in the document corpus: given the words of a document as observations, the model formulates topics based on those words. A topic is a latent feature of the text that represents a central theme; it is a set of words that together convey a single theme. Two example topics are shown below:
T1 - <communication, wireless, network, signal, mobile, transmission, radio, data, station, access, networks, information, terminal, channel, device, control, service, frequency, carrier, base>
T2 - <dog, cat, camel, loyal, feline>
In the above case T1 represents a topic related to wireless communication channels whereas topic T2 is related to animals. A document can contain more than one topic.
LDA Model building
Let us delve into the algorithm behind the Latent Dirichlet Allocation model with minimal mathematical explanation. The steps for the basic LDA algorithm are as follows:
1. Remove all frequently occurring words like ‘are’, ‘and’, ‘to’ etc. (stop words)
2. Build the list of unique words present in the corpus
3. Decide on the number of topics for the corpus
4. Randomly assign topics to the words in each document
5. Calculate the following probabilities for each word in each document:
- p(t|d): probability of topic t given document d, i.e. the proportion of the words in document d that are currently assigned to topic t
- p(w|t): probability of word w given topic t, i.e. the proportion of documents in which word w is currently assigned to topic t
6. Update the probability of the word belonging to topic t in the following way:
p(w for topic t) = p(t|d) * p(w|t)
7. Repeat steps 5 and 6 over all the documents for each topic, reassigning each word according to the updated probabilities.
8. After multiple iterations the calculated probabilities of the words stop changing, and the top n words with the highest probabilities for a given topic represent the theme of that topic.
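Before working through the example by hand, here is a minimal Python sketch of the steps above. It is a deliberately simplified illustration rather than a production LDA implementation (real implementations use Gibbs sampling or variational inference with Dirichlet priors); the stop-word list and the weighted-random reassignment rule are assumptions made for the sketch.

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is reproducible

# Toy corpus from this post; the stop-word list is an assumption for illustration
documents = [
    "Dogs are loyal animals",
    "Dogs like to chew bones",
    "Oranges, mangoes and jackfruits are seasonal fruits",
    "Mary likes to eat oranges",
    "Some dogs like to eat mangoes",
]
stop_words = {"are", "and", "to", "some"}
num_topics = 2  # step 3: decide on the number of topics up front

# Steps 1-2: remove stop words, build each document's word list and the vocabulary
docs = [[w.strip(",.").lower() for w in d.split() if w.lower() not in stop_words]
        for d in documents]
vocab = sorted({w for doc in docs for w in doc})
print("Vocabulary:", vocab)

# Step 4: randomly assign a topic (0 or 1) to every word occurrence
topics = [[random.randrange(num_topics) for _ in doc] for doc in docs]

for _ in range(50):  # steps 5-7: recompute the probabilities and reassign words
    # p(t|d): fraction of the words in document d currently assigned to topic t
    doc_topic = [Counter(ts) for ts in topics]
    # p(w|t): fraction of documents in which word w is currently assigned to topic t
    word_topic = [Counter() for _ in range(num_topics)]
    for doc, ts in zip(docs, topics):
        for w, t in set(zip(doc, ts)):
            word_topic[t][w] += 1

    for d, (doc, ts) in enumerate(zip(docs, topics)):
        for i, w in enumerate(doc):
            # step 6: p(w for topic t) = p(t|d) * p(w|t)
            scores = [(doc_topic[d][t] / len(doc)) * (word_topic[t][w] / len(docs))
                      for t in range(num_topics)]
            # reassign the word in proportion to its scores (a Gibbs-sampling-style draw)
            ts[i] = random.choices(range(num_topics), weights=scores)[0]

# Step 8: the words assigned to each topic after the iterations describe its theme
for t in range(num_topics):
    members = sorted({w for doc, ts in zip(docs, topics)
                      for w, wt in zip(doc, ts) if wt == t})
    print(f"T{t + 1}: {members}")
```

On a corpus this small the grouping varies with the random seed and will not always match the ideal split shown later in this post; with a larger corpus, the repeated reassignment is what gradually pulls co-occurring words into the same topic.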
Let us now work through these steps by hand on the following document corpus:
D1: Dogs are loyal animals
D2: Dogs like to chew bones
D3: Oranges, mangoes and jackfruits are seasonal fruits
D4: Mary likes to eat oranges
D5: Some dogs like to eat mangoes
As can be seen above, there are predominantly two topics (T1: Dog_related, T2: Fruit_related) in the corpus. Frequently occurring words such as ‘are’, ‘and’, ‘to’ and ‘some’ are stop words that would not be helpful in topic modelling, so they are removed. The final list of unique words is as follows:
[Dogs, loyal, animals, like, chew, bones, oranges, mangoes, jackfruits, seasonal, fruits, Mary, eat, likes]
Random assignment of topics to the documents:
D1: Dogs(T1) are loyal(T2) animals(T1)
D2: Dogs(T1) like(T2) to chew(T1) bones(T2)
D3: Oranges(T1), mangoes(T2) and jackfruits(T1) are seasonal(T2) fruits(T1)
D4: Mary(T1) likes(T2) to eat(T1) oranges(T2)
D5: Some dogs(T1) like(T2) to eat(T1) mangoes(T2)
Calculation of probabilities:
Probability calculation 1 (Document-topic):
p(T1|D1) = 2/3 p(T2|D1) = 1/3
p(T1|D2) = 2/4 p(T2|D2) = 2/4
p(T1|D3) = 3/5 p(T2|D3) = 2/5
p(T1|D4) = 2/4 p(T2|D4) = 2/4
p(T1|D5) = 2/4 p(T2|D5) = 2/4
Probability calculation 2 (Word-topic):
p(dogs|T1) = 3/5 p(dogs|T2) = 0/5
p(loyal|T1) = 0/5 p(loyal|T2) = 1/5
p(animals|T1) = 1/5 p(animals|T2) = 0/5
p(like|T1) = 0/5 p(like|T2) = 2/5
p(chew|T1) = 1/5 p(chew|T2) = 0/5
p(bones|T1) = 0/5 p(bones|T2) = 1/5
p(oranges|T1) = 1/5 p(oranges|T2) = 1/5
p(mangoes|T1) = 0/5 p(mangoes|T2) = 2/5
p(jackfruits|T1) = 1/5 p(jackfruits|T2) = 0/5
p(seasonal|T1) = 0/5 p(seasonal|T2) = 1/5
p(fruits|T1) = 1/5 p(fruits|T2) = 0/5
p(Mary|T1) = 1/5 p(Mary|T2) = 0/5
p(eat|T1) = 2/5 p(eat|T2) = 0/5
p(likes|T1) = 0/5 p(likes|T2) = 1/5
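These two tables can be verified with a few lines of Python; the hard-coded assignment below simply mirrors the random initialization shown above.

```python
from collections import Counter

# Initial random assignment from the walkthrough: (word, topic) pairs per document
assignment = [
    [("dogs", 1), ("loyal", 2), ("animals", 1)],                                          # D1
    [("dogs", 1), ("like", 2), ("chew", 1), ("bones", 2)],                                # D2
    [("oranges", 1), ("mangoes", 2), ("jackfruits", 1), ("seasonal", 2), ("fruits", 1)],  # D3
    [("mary", 1), ("likes", 2), ("eat", 1), ("oranges", 2)],                              # D4
    [("dogs", 1), ("like", 2), ("eat", 1), ("mangoes", 2)],                               # D5
]
num_docs = len(assignment)

# Probability calculation 1: p(t|d) = words assigned to topic t in d / words in d
for i, doc in enumerate(assignment, start=1):
    counts = Counter(t for _, t in doc)
    print(f"p(T1|D{i}) = {counts[1]}/{len(doc)}   p(T2|D{i}) = {counts[2]}/{len(doc)}")

# Probability calculation 2: p(w|t) = documents where w is assigned to t / total documents
vocab = sorted({w for doc in assignment for w, _ in doc})
for w in vocab:
    cells = []
    for t in (1, 2):
        n = sum(any(pair == (w, t) for pair in doc) for doc in assignment)
        cells.append(f"p({w}|T{t}) = {n}/{num_docs}")
    print("   ".join(cells))
```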
Probability calculation 3 (word-topic scores, shown here using document D1):
p(dogs for T1) = p(T1|D1) * p(dogs|T1) = ⅔ * ⅗ = 2/5
p(loyal for T1) = p(T1|D1) * p(loyal|T1) = ⅔ * 0 = 0
p(animals for T1) = p(T1|D1) * p(animals|T1) = ⅔ * ⅕ = 2/15
p(like for T1) = p(T1|D1) * p(like|T1) = ⅔ * 0 = 0
p(chew for T1) = p(T1|D1) * p(chew|T1) = ⅔ * ⅕ =2/15
p(bones for T1) = p(T1|D1) * p(bones|T1) = ⅔ * 0 = 0
p(oranges for T1) = p(T1|D1) * p(oranges|T1) = ⅔ * ⅕ = 2/15
p(mangoes for T1) = p(T1|D1) * p(mangoes|T1) = ⅔ * 0 = 0
p(jackfruits for T1) = p(T1|D1) * p(jackfruits|T1) = ⅔ * ⅕ = 2/15
p(seasonal for T1) = p(T1|D1) * p(seasonal|T1) =⅔ * 0 = 0
p(fruits for T1) = p(T1|D1) * p(fruits|T1) = ⅔ * ⅕ = 2/15
p(Mary for T1) = p(T1|D1) * p(Mary|T1) = ⅔ * ⅕ = 2/15
p(eat for T1) = p(T1|D1) * p(eat|T1) = ⅔ * ⅖ = 4/15
p(likes for T1) = p(T1|D1) * p(likes|T1) = ⅔ * 0 = 0
Let us just consider ‘dogs’ and ‘fruits’ as words for further calculations:
p(dogs for T1) = p(T1|D2) * p(dogs|T1) = 2/4 * 3/5 = 3/10
p(dogs for T1) = p(T1|D3) * p(dogs|T1) = 3/5 * 3/5 = 9/25
p(dogs for T1) = p(T1|D4) * p(dogs|T1) = 2/4 * 3/5 = 3/10
p(dogs for T1) = p(T1|D5) * p(dogs|T1) = 2/4 * 3/5 = 3/10
p(dogs for T2) = p(T2|D1) * p(dogs|T2) = 1/3 * 0 = 0
p(dogs for T2) = p(T2|D2) * p(dogs|T2) = 2/4 * 0 = 0
p(dogs for T2) = p(T2|D3) * p(dogs|T2) = 2/5 * 0 = 0
p(dogs for T2) = p(T2|D4) * p(dogs|T2) = 2/4 * 0 = 0
p(dogs for T2) = p(T2|D5) * p(dogs|T2) = 2/4 * 0 = 0
p(fruits for T1) = p(T1|D2) * p(fruits|T1) = 2/4 * 1/5 = 1/10
p(fruits for T1) = p(T1|D3) * p(fruits|T1) = 3/5 * 1/5 = 3/25
p(fruits for T1) = p(T1|D4) * p(fruits|T1) = 2/4 * 1/5 = 1/10
p(fruits for T1) = p(T1|D5) * p(fruits|T1) = 2/4 * 1/5 = 1/10
p(fruits for T2) = p(T2|D1) * p(fruits|T2) = 1/3 * 0 = 0
p(fruits for T2) = p(T2|D2) * p(fruits|T2) = 2/4 * 0 = 0
p(fruits for T2) = p(T2|D3) * p(fruits|T2) = 2/5 * 0 = 0
p(fruits for T2) = p(T2|D4) * p(fruits|T2) = 2/4 * 0 = 0
p(fruits for T2) = p(T2|D5) * p(fruits|T2) = 2/4 * 0 = 0
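The same products can be computed with Python's Fraction type; the ‘for T1 | D2’-style labels below are just added notation to make explicit which document's p(t|d) is being multiplied in.

```python
from fractions import Fraction

# p(t|d) values carried over from probability calculation 1: (p(T1|d), p(T2|d))
p_t_given_d = {
    "D1": (Fraction(2, 3), Fraction(1, 3)),
    "D2": (Fraction(2, 4), Fraction(2, 4)),
    "D3": (Fraction(3, 5), Fraction(2, 5)),
    "D4": (Fraction(2, 4), Fraction(2, 4)),
    "D5": (Fraction(2, 4), Fraction(2, 4)),
}
# p(w|t) values carried over from probability calculation 2: (p(w|T1), p(w|T2))
p_w_given_t = {
    "dogs":   (Fraction(3, 5), Fraction(0, 5)),
    "fruits": (Fraction(1, 5), Fraction(0, 5)),
}

# p(w for topic t) = p(t|d) * p(w|t), evaluated against each document
for word, (p_w_t1, p_w_t2) in p_w_given_t.items():
    for doc, (p_t1_d, p_t2_d) in p_t_given_d.items():
        print(f"p({word} for T1 | {doc}) = {p_t1_d * p_w_t1}   "
              f"p({word} for T2 | {doc}) = {p_t2_d * p_w_t2}")
```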
At the end of the first iteration, we have:
T1: <Dogs, animals, chew, oranges, jackfruits, fruits, Mary, eat >
T2: <Loyal, like, bones, mangoes, seasonal, likes, oranges>
As seen above, ‘dogs’ and ‘fruits’ belong to the same topic, while ‘oranges’ belongs to both topics. The document-topic proportions after this iteration are:
D1: 67% T1, 33% T2
D2: 50% T1, 50% T2
D3: 60% T1, 40% T2
D4: 50% T1, 50% T2
D5: 50% T1, 50% T2
After the first iteration each word is reassigned to a topic based on the updated probabilities (in practice the new topic is drawn at random in proportion to them), and the steps are repeated until we get a good set of words for each of T1 and T2, i.e. until the calculated probabilities no longer change between iterations. Ideally the following will be the word set for each topic after the system converges:
T1: <Dogs, animals, chew, loyal, bones>
T2: <Oranges, mangoes, jackfruits, fruits, seasonal, eat, Mary, likes, like>
D1: 100% T1
D2: 100% T1
D3: 100% T2
D4: 100% T2
D5: 25% T1, 75% T2
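In practice these steps are not coded by hand. The sketch below uses gensim (one possible library choice; the number of passes and the random seed are arbitrary illustrative values) to fit a two-topic LDA model on the same toy corpus. With a corpus this tiny, the learned topics will not always separate as cleanly as the ideal result above.

```python
from gensim import corpora, models

documents = [
    "Dogs are loyal animals",
    "Dogs like to chew bones",
    "Oranges, mangoes and jackfruits are seasonal fruits",
    "Mary likes to eat oranges",
    "Some dogs like to eat mangoes",
]
stop_words = {"are", "and", "to", "some"}

# Tokenize, strip punctuation, lowercase, and drop the stop words
texts = [[w.strip(",.").lower() for w in doc.split() if w.lower() not in stop_words]
         for doc in documents]

# Build the dictionary of unique words and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=50, random_state=1)

# Top words per topic, analogous to T1 and T2 above
for topic_id, words in lda.print_topics(num_words=5):
    print(f"Topic {topic_id + 1}: {words}")

# Topic mixture per document, analogous to the percentages above
for i, bow in enumerate(corpus, start=1):
    print(f"D{i}:", lda.get_document_topics(bow, minimum_probability=0.0))
```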
Topic models, and LDA in particular, are powerful tools for document classification because we are not required to create training sets. Topic models like LDA are now used extensively in spam filtering and information retrieval, and they are good at distinguishing documents when the classes are clearly separable. On the other hand, LDA's performance suffers when the corpus contains documents from similar classes. LDA does not categorize documents hierarchically, it requires the user to specify the number of classes/topics for a given corpus in advance, and it does not produce a topic name that highlights the theme. Hence research and work are ongoing to make LDA more robust and intuitive for the user, and to increase the accuracy of the current document classification systems through feature engineering.

Relecura has a topic model variant implemented in TechExplorer[2] that helps users classify their documents seamlessly. The system organizes your personal documents as well as patent documents, assigning a label name to each category or class. How the label names are generated for a topic is a story for another blog post.
References
[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
[2] https://explorer.relecura.com/