Author: Ali Arsalan Yaqoob
For this post, I wanted to write about my favorite topic modeling algorithm, Latent Dirichlet Allocation (LDA). Before we jump into LDA, I want to go over what topic modeling is.
Topic Modeling
Topic modeling is an unsupervised machine learning technique which is used for extracting hidden semantic relationships within large corpora of documents. It is unsupervised because there are no labels provided to create these models. A "topic" consists of a group of words that frequently occur together in documents. Each document consists of a distribution of topics (more on this later).
This is a powerful technique: it allows us to discover and search unlabeled text data (basically, everything on the internet), cluster similar documents together, and understand thematic structure, i.e. underlying abstract topics that may not have a single word for them, but whose meaning becomes clear from the words that occur together.
Amazon Reviews (Example)
Topic modeling works by grouping similar word patterns to identify topics within a document. One example of topic models in action is customer reviews on Amazon.
Recurring word patterns or phrases are used to identify the topics that are predominantly mentioned in the reviews, and these allow readers to filter the set of reviews by selecting the phrase that sounds closest to what they are looking for.
LDA
Overview
LDA is one of the most popular models in topic modeling. It stands for Latent Dirichlet Allocation. Say, for example, you have a collection of news articles and you want to sort them by the topics they contain. Let's say that the news articles could cover three different topics: Science, Politics or Sports. LDA organizes the documents in a geometric topic space (in this case, a triangle) so that each document sits closer to its dominant topics in that space. Visually, this would look like a triangle with each corner being one topic.
In the above example, we can see that the documents that predominantly talk about politics are closer to the "Politics" corner of the triangle. Notice also that the document that talks about both science and politics sits in the space between those two corners. If an article covered all three topics, it would sit in the middle of the triangle.
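To make this geometric picture concrete, here is a minimal sketch in Python (the corner coordinates and the example topic mixtures are made-up values for illustration) that places each document inside the triangle by averaging the corners, weighted by the document's topic proportions:

```python
import numpy as np

# Corners of the triangle, one per topic (Science, Politics, Sports).
# The 2-D coordinates are arbitrary; only the relative geometry matters.
corners = np.array([
    [0.0, 0.0],    # Science
    [1.0, 0.0],    # Politics
    [0.5, 0.866],  # Sports
])

# Illustrative topic mixtures for three documents (each row sums to 1).
doc_topic = np.array([
    [0.05, 0.90, 0.05],  # mostly Politics  -> lands near the Politics corner
    [0.45, 0.50, 0.05],  # Science+Politics -> lands between those two corners
    [0.34, 0.33, 0.33],  # all three topics -> lands near the centre
])

# A document's position is the average of the corners, weighted by its
# topic proportions (barycentric coordinates).
positions = doc_topic @ corners
print(positions)
```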
Generative Vs Discriminative Models
LDA is a generative statistical model. To understand that statement we must understand what it means to be a generative statistical model in contrast to a discriminative model. Informally, generative models can generate new data instances while discriminative models discriminate between different kinds of data instances. A generative model could generate photos of people that look like real people, while discriminative models could tell the difference between two people. A more statistically formal definition would define them as follows. Given a set of data instances X and a set of labels Y:
- Generative models capture the joint probability P(X, Y), or just P(X) if there are no labels.
- Discriminative models capture the conditional probability P(Y | X).
Generative models include the distribution of the data itself and tell you how likely a given sample is. A discriminative model ignores the question of whether a given instance is likely, and just tells you how likely a specific label is to apply to the instance.
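As a toy numeric illustration of that distinction (the probabilities below are invented), a generative model stores enough to answer both questions, while a discriminative model only needs the second:

```python
# A toy joint distribution P(X, Y) over a word feature X and a label Y.
# The numbers are invented for illustration and sum to 1.
joint = {
    ("election", "Politics"): 0.30,
    ("election", "Science"):  0.05,
    ("quantum",  "Politics"): 0.05,
    ("quantum",  "Science"):  0.60,
}

# A generative model can answer "how likely is this instance at all?" via P(X).
p_election = sum(p for (x, y), p in joint.items() if x == "election")

# A discriminative model only answers "given X, how likely is each label?" via P(Y | X).
p_politics_given_election = joint[("election", "Politics")] / p_election

print(round(p_election, 3))                 # 0.35
print(round(p_politics_given_election, 3))  # 0.857
```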
Keeping this in mind, LDA assumes that a given document can be recreated from the distribution of topics in that document and the distribution of words within each of those topics. It assumes each document is generated in the following fashion (a toy simulation of this process is sketched in code after the list).
- Decide on the number of words N a document will have.
- Choose a distribution of topics for the document given a fixed set of K topics. For example, suppose there are only two topics: Science and Politics. You might have a news article talking about how a political organization was funding some scientific research. You may say that this article is 70% "Politics" and 30% "Science".
- You then generate every word in the document by:
- First picking a topic according to the probabilities chosen for the document. Taking the previous example, you would pick "Politics" with 70% probability and "Science" with 30% probability.
- Using the topic to generate the word. For example, if we chose the "Science" topic, we might select the word "biological" with 30% probability and "soccer" with 10% probability.
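This generative story can be simulated directly. Below is a minimal sketch, assuming a tiny made-up vocabulary and two hand-written topic-word distributions; in real LDA these distributions are learned, not written down by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy topic-word distributions (assumed for illustration): each topic is a
# probability distribution over the same small vocabulary.
vocab = ["president", "parliament", "funding", "research", "biological", "soccer"]
topic_word = {
    "Politics": np.array([0.35, 0.35, 0.20, 0.05, 0.03, 0.02]),
    "Science":  np.array([0.02, 0.03, 0.15, 0.35, 0.40, 0.05]),
}

def generate_document(n_words, doc_topic_probs):
    """Generate a document word by word, following LDA's generative story."""
    topics = list(doc_topic_probs)
    probs = np.array([doc_topic_probs[t] for t in topics])
    words = []
    for _ in range(n_words):
        # 1. Pick a topic according to the document's topic distribution.
        topic = rng.choice(topics, p=probs)
        # 2. Pick a word according to that topic's word distribution.
        words.append(rng.choice(vocab, p=topic_word[topic]))
    return words

# A document that is 70% Politics and 30% Science, with N = 12 words.
print(generate_document(12, {"Politics": 0.7, "Science": 0.3}))
```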
Therefore, LDA essentially looks at the observed documents and works backwards to find the distribution of topics (and of words within each topic) that is most likely to have generated them.
The Ins and Outs
The input to this algorithm is a collection of text documents. The output is three things:
- Clusters of words, where each cluster is one topic. It is important to note that one word can belong to several clusters.
- Each topic has a distribution over the frequency of its words (for example, the topic of politics might include a higher frequency of words such as "president" or "democrat").
- Every document that is given to the algorithm has a distribution of topics associated with it.
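To see what these inputs and outputs look like in practice, here is a minimal sketch using the gensim library (assuming gensim is installed; the toy corpus and the choice of two topics are purely illustrative):

```python
from gensim import corpora, models

# A tiny toy corpus of pre-tokenised documents (illustrative only).
texts = [
    ["president", "parliament", "election", "vote"],
    ["research", "biological", "experiment", "funding"],
    ["president", "funding", "research", "policy"],
]

# Input: documents converted to bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA with K = 2 topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Outputs 1 and 2: each topic as a weighted cluster of words.
for topic in lda.print_topics(num_words=4):
    print(topic)

# Output 3: the topic distribution of a single document.
print(lda.get_document_topics(corpus[0]))
```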
Plate Notation
Above is the plate notation for LDA. It is used to represent probabilistic graphical models and allows us to visualize the dependencies among the many variables of the model.
This is a complicated illustration, and if you are like me, it will not make a lot of sense at first. I will try to abstract it as much as possible using the previous visualization so it is easier to grasp.
After adding the new labels, the concepts start becoming a lot clearer. However, we still have a lot of work to do to clearly understand LDA. We have M number of documents and each of these documents has N words. These are going to go through the LDA algorithm. After passing these through the algorithm, we get K number of topics. Here each topic has a distribution of words and each document has a distribution of topics. For example, one document can have 10% of topic A, 20% of topic B and then 70% of topic C.
We can use the concentration parameters α (alpha) and β (beta) to tune the model using our domain knowledge or understanding of topics per document and words per topic. Here, α signifies the degree to which the documents are close to the topics, while β denotes the degree to which a particular topic lies in a space of words. For example, the topic "politics" would be closer to the words "parliament" and "president" than to "touchdown" or "black hole". Using these two parameters, we assign each word within a particular document to one of the K topics of interest, and generate a probability that the word belongs to that particular topic.
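The role of a concentration parameter is easy to see by sampling from a Dirichlet distribution directly. In the sketch below (the specific values 0.1 and 10 are just for illustration), a small α yields documents dominated by one or two topics, while a large α spreads the mass evenly; β behaves the same way, but over a topic's vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of topics

# Small alpha: each sampled document concentrates on one or two topics.
sparse_docs = rng.dirichlet([0.1] * K, size=3)

# Large alpha: each sampled document mixes all topics fairly evenly.
smooth_docs = rng.dirichlet([10.0] * K, size=3)

print(np.round(sparse_docs, 2))  # rows look roughly like [0.97, 0.02, 0.01]
print(np.round(smooth_docs, 2))  # rows look roughly like [0.31, 0.36, 0.33]
```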
LDA Algorithm
Given a set of documents, we can use LDA to determine how many words in each of those documents are related to particular topics of our interest. We do this through a series of steps, known as the Gibbs sampling method (a minimal code sketch of the sampler appears at the end of this section):
- Go through each document. Randomly assign each word within that document to one of the K topics of interest that we have chosen.
- This initial random assignment gives us topic distributions of all the documents, and word distributions of all the topics. However, since this is random, these initial distributions are not very good. We need to improve these distributions.
- Therefore, we then go through each word j in each document d and compute two things for each of the K topics:
- p(topic t | document d): the proportion of words in document d that are currently assigned to topic t, smoothed by the concentration parameter alpha.
- p(word j | topic t): the proportion of assignments to topic t, across all the documents, that come from the word j, smoothed by the concentration parameter beta. Together, these tell us how closely related each word is to each topic of interest; for example, the word 'plant' would have a greater probability under the topic 'biology' than under the topic 'politics'.
- We then reassign each word j a new topic. This is called resampling. Here the topic is chosen with the probability p(topic t | document d) * p(word w | topic t). Based on our generative model, this is basically the probability that the topic t generated the word j. Therefore, in this step, we make the assumption that all our topic assignments except for the current word in question are correct, and we then update the assignment of the current word using our model of how the documents are generated.
- We iterate the previous step a chosen number of times, until we reach a steady state where our assignments are much better. Using these assignments, we estimate the topic distribution of each document and the word distribution of each topic.
After the final pass, we obtain a solution with coherent topics. This iterative cycling is a key feature of the LDA algorithm: the topic assignment of every word in every document is re-checked, over the entire collection of documents, multiple times.
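Putting all of these steps together, here is a minimal Gibbs sampler of the kind described above, written from scratch as a sketch (the toy corpus, the choice of K = 2 topics and the values of alpha and beta are assumptions for illustration; a real implementation would be considerably more careful and efficient):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus of tokenised documents (illustrative only).
docs = [
    ["president", "parliament", "vote", "president"],
    ["biological", "research", "experiment", "plant"],
    ["president", "funding", "research", "policy"],
]

K, alpha, beta = 2, 0.1, 0.01          # number of topics and concentration parameters
vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Count matrices: documents x topics, topics x words, and totals per topic.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Step 1: randomly assign every word in every document to a topic.
assignments = []
for d, doc in enumerate(docs):
    topics = rng.integers(K, size=len(doc))
    assignments.append(topics)
    for w, t in zip(doc, topics):
        n_dk[d, t] += 1
        n_kw[t, word_id[w]] += 1
        n_k[t] += 1

# Steps 2-4: resample the topic of each word, holding all other assignments fixed.
for _ in range(200):                   # number of passes over the corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, wid = assignments[d][i], word_id[w]
            # Remove the current assignment from the counts.
            n_dk[d, t] -= 1; n_kw[t, wid] -= 1; n_k[t] -= 1
            # p(topic t | document d) * p(word w | topic t), smoothed by alpha and beta.
            p = (n_dk[d] + alpha) * (n_kw[:, wid] + beta) / (n_k + V * beta)
            t = rng.choice(K, p=p / p.sum())
            # Record the new assignment and restore the counts.
            assignments[d][i] = t
            n_dk[d, t] += 1; n_kw[t, wid] += 1; n_k[t] += 1

# Estimated distributions: topics per document and words per topic.
doc_topic = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
topic_word = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
print(np.round(doc_topic, 2))
```

On this toy corpus the sampler typically separates the political and scientific words into different topics, though with so little data the result is sensitive to the random seed.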
This is the first part in a series of blog posts that will examine Latent Dirichlet Allocation. The next article is going to examine the applications and the validation of LDA.
References and Further Reading
- http://jmlr.org/papers/volume3/blei03a/blei03a.pdf
- https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
- https://www.youtube.com/watch?v=T05t-SqKArY