Netflix App review Topic Modeling


Hi, all! With rising interest in NLP and the advent of ever more accurate algorithms, it is an exciting time to enter the field. For this project, I chose the Amazon Mobile App reviews dataset, which is publicly available in an S3 bucket in the AWS US East Region. First, let me briefly introduce the background.

Amazon Appstore for Android opened on 3/22/2011 and was made available in nearly 200 countries. Developers are paid 70% of the list price of the app or in-app purchase. The potential clients of this project are developers who want to identify consumers' needs and maintain quality by debugging and managing functionality promptly.

The Netflix app has one of the highest review counts in this dataset. There are 12,566 reviews used for topic modeling and 6,283 hold-out reviews.


  1. Preprocessing: remove punctuation and stop words, create bigrams, keep nouns via part-of-speech tagging, lemmatize, and create a dictionary.
  2. Tune hyperparameters: choose the number of topics and the alpha (which determines how strongly topics are mixed within each document) that give the highest coherence score.
  3. Select the model based on three criteria: 1) interpretability, 2) distinguishability from other topics, and 3) coherence score.
  4. Label the topics myself based on the relevance of terms (the probability of a word appearing given each topic), on reading representative documents of each topic, and on visualizing word clouds.
  5. Train a BERT model with the labelled topics and compare the results with the topic model.
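
As a rough illustration of step 1, here is a minimal, standard-library-only sketch of stop-word removal, bigram creation, and dictionary building. It is not the code used for this project: the stop-word list, the `min_count` threshold, the helper names, and the two sample reviews are all assumptions made for the example, and part-of-speech tagging and lemmatizing are left out because they need an NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"the", "a", "an", "is", "are", "it", "to", "and", "of", "on", "my", "i", "this"}

def tokenize(review):
    """Lowercase, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z']+", review.lower())
    return [w for w in words if w not in STOP_WORDS]

def add_bigrams(docs, min_count=2):
    """Join adjacent word pairs that occur at least min_count times in the corpus."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

def build_dictionary(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

# Made-up reviews, only to exercise the functions.
reviews = [
    "The fire stick keeps freezing on my tablet.",
    "Fire stick is great, and the shows are great.",
]
docs = add_bigrams([tokenize(r) for r in reviews])
dictionary = build_dictionary(docs)
# Bag-of-words: token id -> count, one Counter per review.
bows = [Counter(dictionary[t] for t in doc) for doc in docs]
```

The bag-of-words counts in `bows` are the kind of input an LDA implementation consumes; the actual project would hand this stage to a topic-modeling library.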

After hyperparameter tuning, I chose LDA-Mallet (which uses Gibbs sampling instead of variational inference), which best met the three criteria. Above all, the intertopic distance map from pyLDAvis convinced me to go with this model.


The topics are scattered across the plot and fairly distant from each other. The circles, whose size represents the share of tokens belonging to each topic, are all similarly large. These are ideal conditions for a topic model.

Word clouds are not the most helpful for labelling itself, but they are helpful in detecting the dominance of keywords in each topic. For example, the reviews under ‘User Experience’ have ‘kid’ as the most frequent word. This is because many reviewers describing their user experience expressed concerns about unrestricted access to content that their kids could see.

With these labels predicted by the LDA model, I was curious how BERT would classify the reviews when trained on them.

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art NLP pre-training technique developed by Google and published in 2018. The model uses transformers, which consider different aspects of words such as semantics, syntax, and vocabulary, while also considering their positions in each sentence. Because these attention layers run in parallel on GPUs, training is much faster than with an RNN/LSTM, which learns sequentially. There’s an extremely helpful YouTube video by Leo Dirac that explains the strengths of BERT in a nutshell at this link.

After running 5 epochs of stochastic gradient descent, I noticed that the model was overfitting more and more. I couldn’t tune it further this time because I hadn’t set up a cloud instance or GPU to speed up the training.

LDA-Mallet and BERT agree on the class for only 43% of the 6,283 unseen reviews. Let’s compare their most representative reviews in a 2D plot. Each color represents a review.
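
For readers who want to reproduce a class-agreement figure like this one, a minimal sketch follows. It assumes each model's predictions are available as a list of topic labels in the same review order; the function names and the five sample labels are hypothetical, not the project's real predictions.

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Fraction of reviews for which both models predict the same topic."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same reviews")
    if not labels_a:
        raise ValueError("need at least one review")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def disagreements(labels_a, labels_b):
    """Count each (model A label, model B label) pair where the models differ."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)

# Hypothetical predictions for five hold-out reviews.
lda_labels = ["Trouble-shooting", "Value", "Shows", "Service", "Shows"]
bert_labels = ["Trouble-shooting", "Trouble-shooting", "Trouble-shooting",
               "Trouble-shooting", "Shows"]

rate = agreement_rate(lda_labels, bert_labels)
confusions = disagreements(lda_labels, bert_labels)
print(f"agreement: {rate:.0%}")
print(confusions.most_common())
```

Looking at the disagreement pairs, rather than only the single rate, shows which LDA topics the second model tends to relabel.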

UMAP 2D projection of the word embeddings of the most representative documents: LDA-Mallet (left) and BERT (right)

On the left, each review tends to contain words that are close to each other, except for the trouble-shooting and platform/device reviews, which are quite polarized. This reflects the bag-of-words scheme of topic modeling, where the words alone determine the topic.

In contrast, words alone don’t determine the topic in the right panel. For example, the words ‘past’ and ‘present’ are semantically close to each other, but they end up in different topics depending on what other words appear in the review, that is, on the context.

To compare their classifications in a friendlier way, I built a small user interface that takes a review and outputs the topic distribution.

The first review is clearly about trouble-shooting. Both models predict it as trouble-shooting, but BERT predicts with higher probability.

The second review has mixed feelings about the app. LDA predicts ‘Value’ (a label given to reviews about the value of subscribing to Netflix over other vendors) with 32.1% and ‘Shows’ with 29%. BERT, on the other hand, considers it 53% likely to be ‘Trouble-shooting’.

The third review is a sarcastic positive review which starts with a rather negative word, ‘Warning’. LDA predicts ‘Shows’ while BERT predicts ‘Trouble-shooting’ with 35% probability.

The last review is a sarcastic negative review. LDA predicts it strongly as ‘Service’ while BERT predicts it as ‘Trouble-shooting’.

What conclusions can we draw from these differences?

LDA and other topic models work well on reviews whose words are consistent with the context. Sarcasm is one case that topic models will often miss.

The BERT model could not detect the first sarcastic review, but it picked up the negative sentiment in the second sarcastic review. I was surprised that BERT was not confused by a strong positive word such as ‘amazing’.

In the upcoming project, I will probably revisit BERT embeddings and use them for detecting sentiment. Better yet, with more sarcastic reviews.


AI Machine Learning Efforts Encounter A Carbon Footprint Blemish

By Lance Eliot, the AI Trends Insider

Green AI is arising.

Recent news about the benefits of Machine Learning (ML) and Deep Learning (DL) has taken a slightly downbeat turn toward pointing out that there is a potential ecological cost associated with these systems. In particular, AI developers and AI researchers need to be mindful of the adverse and damaging carbon footprint that they are generating while crafting ML/DL capabilities.

It is a so-called “green” or environmental wake-up call for AI that is worth hearing.

Let’s first review the nature of carbon footprints (CFPs) that are already quite familiar to all of us, such as that of the carbon-belching transportation industry.

A carbon footprint is usually expressed as the amount of carbon dioxide emissions spewed forth, including for example when you fly in a commercial plane from Los Angeles to New York, or when you drive your gasoline-powered car from Silicon Valley to Silicon Beach.

Carbon accounting is used to figure out how much a machine or system produces in terms of its carbon footprint when being utilized and can be calculated for planes, cars, washing machines, refrigerators, and just about anything that emits carbon fumes.

We all seem to now know that our cars are emitting various greenhouse gases, including the dreaded carbon dioxide vapors that have numerous adverse environmental impacts. Some are quick to point out that hybrid cars that use both gasoline and electrical power tend to have a lower carbon footprint than conventional cars, while Electric Vehicles (EVs) essentially have zero carbon emissions at the tailpipe.

Calculating Carbon Footprints For A Car

When ascertaining the carbon footprint of a machine or device, it is easy to fall into the mental trap of only considering the emissions that occur when the apparatus is in use. According to the EPA and the Department of Energy, a gasoline car might emit 200 grams of carbon dioxide per kilometer traveled, while a hybrid-electric might produce about half of that at 92 grams, and an EV presumably 0 grams.

See this U.S. government website for detailed estimates about carbon emissions of cars:

Though the direct carbon footprint does indeed involve what happens while a machine or device is being used, there is also the indirect carbon footprint that requires our equal attention, involving both upstream and downstream elements that contribute to a fuller picture of the true carbon footprint involved. For example, a conventional gasoline-powered car might generate perhaps 28 percent of its total lifetime carbon dioxide emissions while it is being manufactured and shipped to the point of sale.

You might at first think of it like this:

  • Total CFP of a car = CFP while burning gasoline

But it should be more like this:

  • Total CFP of a car = CFP when the car is made + CFP while burning gasoline

Let’s define “CFP Made” as a factor about the carbon footprint when a car is manufactured and shipped, and another factor we’ll call “CFP FuelUse” that represents the carbon footprint while the car is operating.

For the full lifecycle of a car, we need to add more factors into the equation.

There is a carbon footprint when the gasoline itself is being generated, I’ll call it “CFP FuelGen,” and thus we should include not just the CFP when the fuel is consumed but also when the fuel was originally processed or generated. Furthermore, once a car has seen its day and will be put aside and no longer used, there is a carbon footprint associated with disposing or scrapping of the car (“CFP Disposal”).

This also brings up a facet about EVs. The attention on EVs as having zero CFP at the tailpipe is somewhat misleading when considering the total lifecycle CFP, since you should also be including the carbon footprint required to generate the electrical power that gets charged into the EV and then is consumed while the EV is driving around. We’ll assign that amount to the CFP FuelGen factor.

The expanded formula is:

  • Total CFP of a car = CFP Made + CFP FuelUse + CFP FuelGen + CFP Disposal

Let’s rearrange the factors to group together the one-time carbon footprint amounts, which would be the CFP Made and CFP Disposal, and group together the ongoing usage carbon footprint amounts, which would be the CFP FuelUse and CFP FuelGen. This makes sense since the fuel used and the fuel generated factors are going to vary depending upon how much a particular car is being driven. Presumably, a low mileage driven car that mainly sits in your garage would have a smaller grand-total over its lifetime of the CFP consumption amount than would a car that’s being driven all the time and racking up tons of miles.

The rearranged overall formula is:

  • Total CFP of a car = (CFP Made + CFP Disposal) + (CFP FuelUse + CFP FuelGen)

Next, I’d like to add a twist that very few are considering when it comes to the emergence of self-driving autonomous cars, namely the carbon footprint associated with the AI Machine Learning for driverless cars.

Let’s call that amount “CFP ML” and add it to the equation.

  • Total CFP of a car = (CFP Made + CFP Disposal) + (CFP FuelUse + CFP FuelGen) + CFP ML
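
To make the grouping concrete, here is a minimal sketch of the formula in code, under the assumption that the ongoing factors scale with distance driven. The class and field names are invented for the example, and every number other than the 200 grams per kilometer quoted earlier is a placeholder, not a measured value.

```python
from dataclasses import dataclass

@dataclass
class CarFootprint:
    """Carbon footprint factors for one car, in kilograms of CO2."""
    made_kg: float             # CFP Made: manufacturing and shipping
    disposal_kg: float         # CFP Disposal: scrapping at end of life
    fuel_use_kg_per_km: float  # CFP FuelUse, per kilometer driven
    fuel_gen_kg_per_km: float  # CFP FuelGen, per kilometer driven
    ml_kg: float = 0.0         # CFP ML: zero for a conventional car

    def one_time(self) -> float:
        """(CFP Made + CFP Disposal): does not depend on mileage."""
        return self.made_kg + self.disposal_kg

    def ongoing(self, km: float) -> float:
        """(CFP FuelUse + CFP FuelGen): grows with distance driven."""
        return (self.fuel_use_kg_per_km + self.fuel_gen_kg_per_km) * km

    def total(self, km: float) -> float:
        """Total CFP = one-time group + ongoing group + CFP ML."""
        return self.one_time() + self.ongoing(km) + self.ml_kg

# Placeholder values; only 0.2 kg/km (200 g/km) comes from the text above.
gas_car = CarFootprint(made_kg=8_000, disposal_kg=500,
                       fuel_use_kg_per_km=0.2, fuel_gen_kg_per_km=0.05)

garage_queen = gas_car.total(km=20_000)
daily_driver = gas_car.total(km=200_000)
print(f"low mileage:  {garage_queen:,.0f} kg CO2")
print(f"high mileage: {daily_driver:,.0f} kg CO2")
```

The one-time group is identical for both cars; only the ongoing group separates the garage-bound car from the one racking up miles.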

You might be puzzled as to what this new factor consists of and why it is being included. Allow me to elaborate.

AI Machine Learning As A Carbon Footprint

In a recent study done at the University of Massachusetts, researchers examined several AI Machine Learning or Deep Learning systems that are being used for Natural Language Processing (NLP) and tried to estimate how much of a carbon footprint was expended in developing those NLP systems (see the study at this link here).

You likely already know something about NLP if you’ve ever had a dialogue with Alexa or Siri. Those popular voice interactive systems are trained via a large-scale or deep Artificial Neural Network (ANN), a kind of computer-based model that simplistically mimics brain-like neurons and neural networks. ANNs are a vital area of AI for building systems that can “learn” based on datasets provided to them.

Those of you versed in computers might be perplexed that the development of an AI Machine Learning system would somehow produce CFP since it is merely software running on computer hardware, and it is not a plane or a car.

Well, if you consider that there is electrical energy used to power the computer hardware, which is used to be able to run the software that then produces the ML model, you could then assert that the crafting of the AI Machine Learning system has caused some amount of CFP via however the electricity itself was generated to power the ML training operation.

According to the calculations done by the researchers, training a somewhat minor or modest NLP ML model produced an estimated 78,468 pounds of carbon dioxide emissions, while training a larger NLP ML model produced an estimated 626,155 pounds. As a basis for comparison, they report that an average car over its lifetime might produce 126,000 pounds of carbon dioxide emissions.

A key means of calculating the carbon dioxide produced was the EPA’s formula: the total electrical power consumed, in kilowatt-hours, is multiplied by a factor of 0.954 pounds of carbon dioxide per kilowatt-hour, an average based on assumptions about power generation plants in the United States.
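
The conversion is simple enough to sketch in a few lines. The 0.954 factor and the pound figures below are the ones quoted in this article; the function names are invented for illustration, and running the factor backward to get implied energy use is my own back-of-the-envelope step, not a figure reported by the researchers.

```python
LB_CO2_PER_KWH = 0.954  # average U.S. factor quoted above

def training_emissions_lb(kwh: float) -> float:
    """Pounds of CO2 attributed to consuming `kwh` of electricity."""
    return kwh * LB_CO2_PER_KWH

def implied_kwh(emissions_lb: float) -> float:
    """Run the factor backward: electricity implied by an emissions figure."""
    return emissions_lb / LB_CO2_PER_KWH

LARGE_MODEL_LB = 626_155   # larger NLP model, as quoted above
CAR_LIFETIME_LB = 126_000  # average car over its lifetime, as quoted above

car_lifetimes = LARGE_MODEL_LB / CAR_LIFETIME_LB
print(f"1,000 kWh -> {training_emissions_lb(1_000):,.0f} lb CO2")
print(f"larger model ~ {car_lifetimes:.1f} car lifetimes of CO2")
print(f"implied energy ~ {implied_kwh(LARGE_MODEL_LB):,.0f} kWh")
```

Because the factor is a national average, the same training run would come out differently on a grid with a different mix of power plants.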

Significance Of The CFP For Machine Learning

Why should you care about the CFP of the AI Machine Learning for an autonomous car?

Presumably, conventional cars don’t have to include the CFP ML factor since a conventional car does not encompass such a capability, therefore the factor would have a value of zero in the case of a conventional car. Meanwhile, for a driverless car, the CFP ML would have some determinable value and would need to be added into the total CFP calculation for driverless cars.

Essentially, it burdens the carbon footprint of a driverless car and tends to heighten the CFP in comparison to a conventional car.

For those of you that might react instantly to this aspect, I don’t think this means that the sky is falling or that we should somehow put the brakes on developing autonomous cars. You ought to consider these salient topics:

  • If the AI ML is being deployed across a fleet of driverless cars, perhaps in the hundreds, thousands, or eventually millions of autonomous cars, and if the AI ML is the same instance for each of those driverless cars, the amount of CFP for the AI ML production is divided across all of those driverless cars and therefore likely a relatively small fractional addition of CFP on a per driverless car basis.
  • Autonomous cars are more than likely to be EVs, partially due to the handy aspect that an EV is adept at storing electrical power, which the driverless car sensors and computer processors slurp up profusely. Thus, the platform for the autonomous car is already going to be significantly cutting down on CFP due to using an EV.
  • Ongoing algorithmic improvements in producing AI ML are bound to make it more efficient to create such models, and therefore either decrease the amount of time required to produce the models (accordingly likely reducing the electrical power consumed) or make better use of the electrical power via faster processing by the hardware or software.
  • For semi-autonomous cars, you can expect that we’ll see AI ML being used there too, in addition to the fully autonomous cars, and therefore the reality will be that the CFP of the AI ML will apply to eventually all cars since conventional cars will gradually be usurped by semi-autonomous and fully autonomous cars.
  • Some might argue that the CFP of the AI ML ought to be tossed into the CFP Made bucket, meaning that it is just another CFP component within the effort to manufacture the autonomous car. And, if so, based on preliminary analyses, it would seem like the CFP AI ML is rather inconsequential in comparison to the rest of the CFP for making and shipping a car.
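
The fleet-amortization point in the first bullet can be sketched as simple division. As a stand-in for the unknown training footprint of a driving model, this example borrows the 626,155-pound figure reported above for the larger NLP model, which is purely an assumption for illustration; the function name is also invented.

```python
def ml_footprint_per_car(training_lb: float, fleet_size: int) -> float:
    """Spread a one-time ML training footprint evenly across a fleet."""
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return training_lb / fleet_size

TRAINING_LB = 626_155      # stand-in value borrowed from the NLP study
CAR_LIFETIME_LB = 126_000  # average car lifetime figure quoted earlier

for fleet in (1, 1_000, 1_000_000):
    per_car = ml_footprint_per_car(TRAINING_LB, fleet)
    share = per_car / CAR_LIFETIME_LB
    print(f"{fleet:>9,} cars: {per_car:,.3f} lb per car "
          f"({share:.4%} of one car's lifetime CFP)")
```

Under these assumed numbers, the per-car share shrinks quickly as the fleet grows, which is the gist of the argument that the same ML instance deployed widely adds little CFP per vehicle.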

For those of you interested in trying out an experimental impact tracker in your AI ML developments, there are various tools becoming available, including for example this one posted at GitHub that was developed jointly by Stanford University, Facebook AI Research, and McGill University:

As they say, your mileage may vary in terms of using any of these emerging tracking tools and you should proceed mindfully and with appropriate due diligence for applicability and soundness.



There’s an additional consideration for the CFP of AI ML.

You could claim that there is a CFP AI ML for originating the Machine Learning model that will be driving the autonomous car, and then there is the ongoing updating and upgrading involved too.

Therefore, the CFP AI ML is more than just a one-time CFP; it is part of the ongoing grouping too.

Let’s split it across the two groupings:

  • Total CFP of a car = (CFP Made + CFP Disposal + CFP ML1) + (CFP FuelUse + CFP FuelGen + CFP ML2)

You can go even deeper and point out that some of the AI ML will be taking place in-the-cloud of the automaker or tech firm and then be pushed down into the driverless car (via Over-The-Air or OTA electronic communications), while some of the AI ML might be also occurring in the on-board systems of the autonomous car. In that case, there’s the CFP to be calculated for the cloud-based AI ML and then a different calculation to determine the CFP of the onboard AI ML.

There are some that point out that you can burden a lot of things in our society if you are going to be considering the amount of electrical power that they use, and perhaps it is unfair to suddenly bring up the CFP of AI ML, doing so in isolation of the myriad of other ways in which CFP arises due to any kind of computer-based system.

In the case of autonomous cars, it is also pertinent to consider not just the “costs” side of things, which includes the carbon footprint factor, but also the benefits side of things.

Even if there is some attributable amount of CFP for driverless cars, it would be prudent to consider what kinds of benefits we’ll derive as a society and weigh that against the CFP aspects. By taking into account the hoped-for benefits, including the potential of human lives saved, the potential for mobility access for all, including the mobility marginalized, and other societal transformations, you get a much more robust picture.

In that sense, we need to figure out this equation:

  • Societal ROI of autonomous cars = Societal benefits – Societal costs

We don’t yet know how it is going to pan out, but most are hoping that the societal benefits will readily outweigh the societal costs, and therefore the ROI for self-driving driverless autonomous cars will be hefty and leave us all nearly breathless as such.