IMDb sentiment analysis using NLP

Royce Dcunha
6 min read · Jan 27, 2021

Knowing the problem:

The main objective of the project is to predict the sentiment of a number of movie reviews obtained from the Internet Movie Database (IMDb). The dataset contains 50,000 movie reviews that have been pre-labeled with “positive” and “negative” sentiment class labels based on the review content. Besides these, there are additional movie reviews that are unlabeled.

The dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/, courtesy of Stanford University, and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. They have datasets in the form of raw text as well as an already processed bag of words formats.

Hence our task will be to predict the sentiment of 15,000 labeled movie reviews and use the remaining 35,000 reviews for training our supervised models.

Step 1: Text Pre-Processing and Normalization

One of the key steps before diving into the process of feature engineering and modeling involves cleaning, pre-processing, and normalizing text to bring text components like phrases and words to some standard format. This enables standardization across a document corpus, which helps build meaningful features and helps reduce noise that can be introduced due to many factors like irrelevant symbols, special characters, XML and HTML tags, and so on.

• Cleaning text:

Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing sentiment. Hence we need to make sure we remove them before extracting features. The BeautifulSoup library does an excellent job of providing the necessary functions for this.
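A minimal sketch of this step, assuming the bs4 package is installed (the function name is illustrative):

from bs4 import BeautifulSoup

def strip_html_tags(text):
    # Parse the markup and keep only the visible text, dropping tags like <br />
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ")

print(strip_html_tags("A great movie.<br /><br />Highly recommended!"))
# -> A great movie. Highly recommended! (modulo extra whitespace, which later steps can collapse)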

• Removing accented characters:

In our dataset, we are dealing with reviews in the English language, so we need to make sure that characters in any other format, especially accented characters, are converted and standardized into ASCII characters. A simple example would be converting é to e.
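A small sketch using Python's built-in unicodedata module:

import unicodedata

def remove_accented_chars(text):
    # Decompose accented characters (NFKD) and drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

print(remove_accented_chars("Sómě Áccěntěd těxt"))   # Some Accented text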

• Expanding contractions:

In the English language, contractions are basically shortened versions of words or syllables. These shortened versions of existing words or phrases are created by removing specific letters and sounds. Examples would be expanding don’t to do not and I’d to I would. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe, and we also have to convert each contraction to its expanded, original form.
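A rough illustration with a tiny contraction map (a real pipeline would use a much larger dictionary with a few hundred entries):

import re

CONTRACTION_MAP = {"don't": "do not", "i'd": "i would", "can't": "cannot", "it's": "it is"}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    # Build one regex that matches any contraction key, case-insensitively
    pattern = re.compile("|".join(re.escape(k) for k in contraction_map), flags=re.IGNORECASE)
    # Replace each match with its expanded form (lowercased lookup)
    return pattern.sub(lambda m: contraction_map.get(m.group(0).lower(), m.group(0)), text)

print(expand_contractions("I'd say don't miss it"))   # i would say do not miss it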

• Removing special characters:

Another important task in text cleaning and normalization is to remove special characters and symbols that often add extra noise to unstructured text. Simple regexes can be used to achieve this. It is your choice whether to retain numbers or remove them, depending on whether you want them in your normalized corpus.
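A minimal regex-based sketch:

import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; drop everything else
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

print(remove_special_characters("Well this was fun! What do you think? 123#@!"))
# Well this was fun What do you think 123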

• Removing stop words:

Words that have little or no significance, especially when constructing meaningful features from text, are known as stop words. Words like a, an, the, and so on are considered to be stop words. There is no universal stop word list, but we use a standard English stop word list from nltk. You can also add your own domain-specific stop words if needed.
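A brief sketch using nltk's English stop word list (note that for sentiment analysis it is common to keep negation words such as no and not, since they flip polarity):

import nltk

nltk.download("stopwords")

stopword_list = set(nltk.corpus.stopwords.words("english"))
# Keeping negation words is a common tweak for sentiment tasks
stopword_list -= {"no", "not"}

def remove_stopwords(text):
    # A simple whitespace tokenizer; nltk.word_tokenize is a more robust option
    tokens = text.split()
    return " ".join(t for t in tokens if t.lower() not in stopword_list)

print(remove_stopwords("The movie was not all that good"))   # movie not good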

• Stemming and Lemmatization:

Word stems are usually the base form of possible words that can be created by attaching affixes like prefixes and suffixes to the stem to create new words. This is known as inflection. The reverse process of obtaining the base form of a word is known as stemming. A simple example is the words WATCHES, WATCHING, and WATCHED, which have the word root stem WATCH as the base form. The nltk package offers a wide range of stemmers like the PorterStemmer and LancasterStemmer. Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word rather than the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), whereas the root stem may not be. We will be using only lemmatization in our normalization pipeline, to retain lexicographically correct words.
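A quick comparison of the two using nltk's PorterStemmer and WordNetLemmatizer (the lemmatizer needs a part-of-speech hint to map verbs to their dictionary form):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

ps = PorterStemmer()
print([ps.stem(w) for w in ["watches", "watching", "watched"]])
# ['watch', 'watch', 'watch']

wnl = WordNetLemmatizer()
print([wnl.lemmatize(w, pos="v") for w in ["watches", "watching", "watched"]])
# ['watch', 'watch', 'watch']

# Without the POS hint the lemmatizer treats the word as a noun and leaves it unchanged
print(wnl.lemmatize("watching"))   # watching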

Approach 1: Doing Sentiment Analysis using Unsupervised Lexicon-Based Models:

Unsupervised sentiment analysis models use well-curated knowledge bases, ontologies, lexicons, and databases that have detailed information about subjective words and phrases, including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words, specifically aligned toward sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of the negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. You can use these lexicons to compute the sentiment of a text document by matching the presence of specific words from the lexicon, looking at additional factors like negation, surrounding words, overall context, and phrases, and then aggregating the sentiment polarity scores to decide the final sentiment score. There are several popular lexicon models used for sentiment analysis. Some of them are listed below.

• Bing Liu’s Lexicon

• MPQA Subjectivity Lexicon

• Pattern Lexicon

• AFINN Lexicon

• SentiWordNet Lexicon

• VADER Lexicon

We will be using the last three lexicon models (AFINN, SentiWordNet, and VADER) for sentiment analysis.
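As a rough sketch of how two of these lexicons can be scored in Python, assuming the afinn package is installed and nltk's VADER lexicon has been downloaded (the threshold used to turn a score into a positive/negative label is a design choice):

import nltk
from afinn import Afinn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

review = "The movie was surprisingly good, I really enjoyed it!"

# AFINN returns a signed score; scores >= 0 can be treated as positive sentiment
afn = Afinn()
print(afn.score(review))

# VADER returns normalized polarity scores; 'compound' is the overall score in [-1, 1]
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(review))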

Model Evaluation of AFINN:

Model Performance metrics:
------------------------------
Accuracy: 0.71
Precision: 0.73
Recall: 0.71
F1 Score: 0.71
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.67 0.85 0.75 7510
negative 0.79 0.57 0.67 7490
avg / total 0.73 0.71 0.71 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6376 1134
negative 3189 4301

Model Evaluation of SentiWordNet:

Model Performance metrics:
------------------------------
Accuracy: 0.69
Precision: 0.69
Recall: 0.69
F1 Score: 0.68
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.66 0.76 0.71 7510
negative 0.72 0.61 0.66 7490
avg / total 0.69 0.69 0.68 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 5742 1768
negative 2932 4558

Model Evaluation of VADER:

Model Performance metrics:
------------------------------
Accuracy: 0.71
Precision: 0.72
Recall: 0.71
F1 Score: 0.71
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.67 0.83 0.74 7510
negative 0.78 0.59 0.67 7490
avg / total 0.72 0.71 0.71 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6235 1275
negative 3068 4422

Approach 2: Classifying Sentiment with Supervised Learning:

Another way to build a model that understands the text content and predicts the sentiment of the text-based reviews is to use supervised machine learning. To be more specific, we will be using classification models to solve this problem.

The major steps to achieve this are as follows:

1. Prepare train and test datasets (optionally a validation dataset)

2. Pre-process and normalize text documents

3. Feature engineering

4. Model training

5. Model prediction and evaluation
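As a minimal sketch of steps 3–5 with scikit-learn (the toy reviews below stand in for the actual 35,000 training and 15,000 test IMDb reviews):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy data standing in for the normalized IMDb reviews and their labels
train_reviews = ["a wonderful heartfelt film", "boring and far too long",
                 "brilliant acting and a great story", "a terrible waste of time"]
train_labels = ["positive", "negative", "positive", "negative"]
test_reviews = ["great story and wonderful acting", "a boring terrible film"]
test_labels = ["positive", "negative"]

# Step 3: feature engineering - bag-of-words and TF-IDF features
cv = CountVectorizer(ngram_range=(1, 2))
tv = TfidfVectorizer(ngram_range=(1, 2))
cv_train, cv_test = cv.fit_transform(train_reviews), cv.transform(test_reviews)
tv_train, tv_test = tv.fit_transform(train_reviews), tv.transform(test_reviews)

# Step 4: model training - Logistic Regression on BOW, linear SVM on TF-IDF
lr = LogisticRegression(max_iter=1000).fit(cv_train, train_labels)
svm = LinearSVC().fit(tv_train, train_labels)

# Step 5: prediction and evaluation
for name, model, features in [("LR on BOW", lr, cv_test), ("SVM on TF-IDF", svm, tv_test)]:
    predictions = model.predict(features)
    print(name, "accuracy:", accuracy_score(test_labels, predictions))
    print(classification_report(test_labels, predictions))
    print(confusion_matrix(test_labels, predictions))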

Logistic Regression model on BOW features:

Model Performance metrics:
------------------------------
Accuracy: 0.91
Precision: 0.91
Recall: 0.91
F1 Score: 0.91
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.90 0.91 0.91 7510
negative 0.91 0.90 0.90 7490
avg / total 0.91 0.91 0.91 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6817 693
negative 731 6759

Logistic Regression model on TF-IDF features:

Model Performance metrics:
------------------------------
Accuracy: 0.9
Precision: 0.9
Recall: 0.9
F1 Score: 0.9
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.89 0.90 0.90 7510
negative 0.90 0.89 0.90 7490
avg / total 0.90 0.90 0.90 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6780 730
negative 828 6662

SVM model on BOW features:

Model Performance metrics:
------------------------------
Accuracy: 0.9
Precision: 0.9
Recall: 0.9
F1 Score: 0.9
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.90 0.89 0.90 7510
negative 0.90 0.91 0.90 7490
avg / total 0.90 0.90 0.90 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6721 789
negative 711 6779

SVM model on TF-IDF features:

Model Performance metrics:
------------------------------
Accuracy: 0.9
Precision: 0.9
Recall: 0.9
F1 Score: 0.9
Model Classification report:
------------------------------
precision recall f1-score support
positive 0.89 0.91 0.90 7510
negative 0.91 0.88 0.90 7490
avg / total 0.90 0.90 0.90 15000
Prediction Confusion Matrix:
------------------------------
Predicted:
positive negative
Actual: positive 6839 671
negative 871 6619

Therefore, we can conclude that the Logistic Regression model on BOW features works better than the AFINN model. Here is the comparison:

AFINN                      LOGISTIC REGRESSION (BOW)
Accuracy:  0.71            Accuracy:  0.91
Precision: 0.73     VS     Precision: 0.91
Recall:    0.71            Recall:    0.91
F1 Score:  0.71            F1 Score:  0.91

Thank You and Happy Learning…
