Netflix App Review Topic Modeling


Hi, all! With the rise of interest in NLP and the advent of ever more accurate algorithms, it is an exciting time to enter the field. Here, I chose Amazon mobile app reviews, which are publicly available in an S3 bucket in the AWS US East region. First, let me briefly introduce the background.

The Amazon Appstore for Android opened on 3/22/2011 and was made available in nearly 200 countries. Developers are paid 70% of the list price of the app or in-app purchase. The potential clients of this project are developers who want to identify consumer needs and maintain quality by debugging and managing app functionality promptly.

The Netflix app has one of the largest numbers of reviews in this dataset: 12,566 reviews were used for topic modeling, with 6,283 held out.


  1. Preprocess: remove punctuation and stop words, create bigrams, keep nouns via part-of-speech tagging, lemmatize, and build a dictionary.
  2. Tune hyperparameters: choose the number of topics and alpha (which controls how mixed the topics are within each document) that yield the highest coherence score.
  3. Select the model based on three criteria: 1) interpretability, 2) distinguishability from other topics, and 3) coherence score.
  4. Label the topics myself based on the relevance of terms (the probability of a word appearing given each topic), reading representative documents for each topic, and visualizing word clouds.
  5. Train a BERT model on the labelled topics and compare its results with the topic model's.
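Step 1 can be sketched in plain Python. This is a minimal illustration only: a real pipeline would use NLTK or spaCy for stop words, POS tagging, and lemmatization, and gensim's `Phrases`/`Dictionary` for the bigrams and the dictionary, and the tiny stop-word list here is a stand-in.

```python
import re

# Hypothetical mini stop-word list; a real pipeline would use NLTK/spaCy's.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "i"}

def preprocess(review: str) -> list[str]:
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens: list[str]) -> list[str]:
    """Join adjacent tokens with '_' (what gensim's Phrases produces)."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def build_dictionary(docs: list[list[str]]) -> dict[str, int]:
    """Map each unique token to an integer id (gensim's Dictionary analogue)."""
    vocab = sorted({t for doc in docs for t in doc})
    return {tok: i for i, tok in enumerate(vocab)}
```

Each review then becomes a bag of token ids that the LDA model consumes.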

After hyperparameter tuning, I chose LDA-Mallet (which uses Gibbs sampling instead of variational inference), which met the three criteria best. Most of all, the intertopic distance map from pyLDAvis convinced me to go with this model.
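The tuning in step 2 boils down to a grid search over (number of topics, alpha) pairs, keeping the combination with the highest coherence. Here is a minimal sketch, assuming a hypothetical `score(num_topics, alpha)` helper that trains a model and returns its coherence (in practice via gensim's `CoherenceModel`); the toy scoring function below stands in for that expensive call.

```python
from itertools import product

def best_hyperparams(score, num_topics_grid, alpha_grid):
    """Return the (num_topics, alpha) pair with the highest coherence score."""
    return max(product(num_topics_grid, alpha_grid),
               key=lambda pair: score(*pair))

# Toy coherence surface peaking at 8 topics, alpha=0.1 (illustration only).
def toy_score(num_topics, alpha):
    return -(num_topics - 8) ** 2 - (alpha - 0.1) ** 2
```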

[Image: pyLDAvis intertopic distance map]

The topics are scattered across the plot and fairly distant from each other, and the circles, whose sizes represent each topic's share of tokens, are all evenly large. These are ideal conditions for a topic model.

Word clouds are not the most helpful for labelling itself, but they are helpful for spotting the dominant keywords in each topic. For example, the reviews under 'User Experience' have 'kid' as the most frequent word: many reviewers describing their experience expressed concerns about unrestricted access to content that their kids could see.
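Under the hood, a word cloud simply scales words by frequency, so the dominance of 'kid' comes down to a count like this (a minimal sketch on made-up toy reviews, not the actual dataset):

```python
from collections import Counter

def top_terms(reviews, k=3):
    """Most frequent tokens across a topic's reviews: exactly what a
    word cloud scales by font size."""
    counts = Counter(tok for review in reviews for tok in review.lower().split())
    return [word for word, _ in counts.most_common(k)]
```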

Using the labels predicted by the LDA model, I was curious how BERT would classify the same reviews.

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art NLP pre-training technique developed by Google and published in 2018. The model uses transformers, which consider different aspects of words (semantics, syntax, vocabulary, etc.) as well as their positions within each sentence. With attention layers running in parallel on GPUs, it is much faster than RNNs/LSTMs, which learn sequentially. There's an extremely helpful YouTube video by Leo Dirac that explains the strengths of BERT in a nutshell at this link.

After running 5 epochs of stochastic gradient descent, I noticed that the model was overfitting more and more. I couldn't optimize further this time because I hadn't set up a cloud instance or GPU to speed up training.
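One cheap guard against this kind of overfitting, even without a GPU, is early stopping: halt training once validation loss stops improving. A minimal sketch of the stopping rule (the loss values in the test are illustrative, not from my runs):

```python
def should_stop(val_losses, patience=2):
    """True once validation loss has failed to improve for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    # Stop if none of the last `patience` epochs beat the earlier best.
    return all(loss >= best_so_far for loss in val_losses[-patience:])
```

The training loop would call this after each epoch and break as soon as it returns True.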

LDA-Mallet and BERT agree on only 43% of the 6,283 unseen reviews. Let's compare their most representative reviews in a 2D plot, where each color represents a review.
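The 43% figure is just the fraction of held-out reviews where the two models assign the same label, computed like this (the topic labels in the test are made up for illustration):

```python
def class_agreement(preds_a, preds_b):
    """Fraction of reviews for which both models predict the same topic."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must align")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)
```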

[Image] UMAP 2D projection of the most representative documents' word embeddings: LDA-Mallet (left) and BERT (right)

On the left, each review tends to contain words that cluster together, except for the trouble-shooting-related and platform/device-related reviews, which are quite polarized. This reflects the bag-of-words scheme of topic modeling, where the words themselves determine the topic.

In contrast, words alone don't determine the topic in the right panel. For example, the words 'past' and 'present' are semantically close to each other, yet they end up in different topics depending on which other words appear in the review, that is, on its context.

To compare their classifications in a friendlier way, I built a small user interface that takes a review and outputs the topic distribution.

The first review is clearly about trouble-shooting. Both models predict it as trouble-shooting, but BERT does so with higher probability.

The second review has mixed feelings about the app. LDA predicts 'Value' (a topic labelled for reviews about the value of subscribing to Netflix over other vendors) with 32.1% and 'Shows' with 29%. On the other hand, BERT thinks it is 53% likely 'Trouble-shooting'.

The third review is a sarcastic positive review that starts with the rather negative word 'Warning'. LDA predicts 'Shows', while BERT predicts 35% likely 'Trouble-shooting'.

The last review is a sarcastic negative review. LDA predicts it strongly as ‘Service’ while BERT predicts it as ‘Trouble-shooting’.

What conclusions can we make from these differences?

The LDA model, and topic models in general, work well with reviews whose words are coherent with their context. Sarcasm is one pattern that topic models will often miss.

The BERT model could not detect the sarcasm in the first sarcastic review, but it picked up the negative sentiment in the second. I was surprised that BERT was not confused by a strong positive word such as 'amazing'.

In the upcoming project, I will probably revisit BERT embeddings and use them for detecting sentiment. Better yet, with more sarcastic reviews.

