Naive Bayes' Classifier: How to Build a Sentiment Analysis Program

August 10, 2018 in Blogs

...

In a previous blog post, Intro to NLP: TF-IDF from Scratch, we explored the workings behind TF-IDF, a method that quantifies how important a word is to the document in which it is found. We also built a TF-IDF program from scratch in Python. Click here to check it out.


Natural Language Processing and Machine Learning

NLP, short for natural language processing, describes how computers process human language. There are various tools that can be used to perform NLP and they amount to exploiting certain features of language. For text specifically, there are tools like TF-IDF that take text and represent its features numerically, a process called text mining. Text mining is extremely topical in the context of machine learning and artificial intelligence because machine learning algorithms are really mathematical models, and mathematical models can only take in numbers and arrays of numbers, called vectors.
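
To make this concrete, here is a tiny sketch (ours, not tied to any particular library) that turns two sentences into bag-of-words count vectors, the kind of numeric representation a model can actually consume:

  #Toy bag-of-words representation: each sentence becomes a vector of word counts.
  sentences = ["the cow jumped over the moon", "the moon is bright"]

  #Build a shared vocabulary of unique words
  vocab = sorted({word for s in sentences for word in s.split()})

  #Represent each sentence as a count vector over that vocabulary
  vectors = [[s.split().count(word) for word in vocab] for s in sentences]

  print(vocab)    #['bright', 'cow', 'is', 'jumped', 'moon', 'over', 'the']
  print(vectors)  #[[0, 1, 0, 1, 1, 1, 2], [1, 0, 1, 0, 1, 0, 1]]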

The converse—applying machine learning to develop and improve text mining tools—has been a trending focus. One instance where machine learning is being applied is to a task in NLP called text classification. Text classification is the process of assigning text to different categories based on certain features like content, genre, language, or sentiment. Text classification is used in places like spam filtering, web page tagging (which plays into SEO, or search engine optimization), and email categorization (e.g. how Gmail organizes email into primary, social, and promotion categories).

In this article, we zoom in on a subset of text classification called sentiment analysis, the categorizing of text based on positive or negative sentiments. We’ll be exploring a statistical modeling technique called multinomial Naive Bayes classifier which can be used to classify text. Then, we’ll demonstrate how to build a sentiment classifier from scratch in Python.


Sentiment Analysis

Text sentiment refers not to the cut-and-dried meaning of text, but rather to the feeling, attitude, and opinion behind it: “Is this movie review positive, negative, or neutral? Is this customer praising or criticizing their purchase?” Extracting sentiment from a body of text to determine the writer’s attitude is commonly known as sentiment analysis or opinion mining.

Although sentiment analysis appears in various places, it is particularly useful in social media. To users, social media is a platform for thought-sharing, which in turn breeds discussion within a community. To businesses, social media is a platform for user feedback. The sheer amount of user feedback available allows businesses to perform data analysis, like sentiment analysis, to measure overall customer satisfaction (and dissatisfaction) with their products and services, quantify how effective their marketing campaigns are, and help adjust and drive future product and marketing goals. Thus, sentiment analysis on social media not only benefits customers and businesses, but also shifts business models in the long run. To read more about how sentiment analysis can play a role in business, click here.

Nonetheless, there are some limitations to sentiment analysis. Because language is often expressed with tone, sarcasm, and irony, or with images, videos, and emojis in juxtaposition and contradiction, sentiment analysis is not perfect. However, it remains very useful and will only improve as machine learning techniques advance.


The Multinomial Naive Bayes’ Classifier

The mechanism behind sentiment analysis is a text classification algorithm. The statistical model we’ll be using is the multinomial Naive Bayes’ classifier, a member of the Naive Bayes' classifier family.

NB classifiers are probabilistic classifiers, meaning that they use the probabilities of observed outcomes to return a reasonable estimate of an unknown outcome. At a high level, NB classifiers apply Bayes' rule with one naive (or simplifying) assumption: that features are independent of one another. (For text classification, an example of a feature is the occurrence of a word. In the general case, the word feature is used loosely, since choosing what constitutes a feature in ML is a topic of its own.) Although the independence assumption is unrealistic, NB classifiers perform well in practice. For this reason, and because they are fast and simple, they are used frequently in NLP. (Check out this blog post to learn more about them.)

At a high level, Bayes’ rule says that if we know the probability of the effect given the cause, we can calculate the probability of the cause given the effect.

  P(cause | effect) = P(effect | cause) · P(cause) / P(effect)

Here is a good resource for the full derivation of Bayes’ rule, which uses the definition of conditional probability and the law of total probability.
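
As a quick worked example with made-up numbers: suppose 10% of reviews are negative, the word “harsh” shows up in 40% of negative reviews, and it shows up in 5% of positive reviews. Knowing the effect given the cause (how often the word appears in each kind of review), Bayes’ rule gives us the cause given the effect (how likely a review containing “harsh” is to be negative):

  #A toy Bayes' rule calculation with made-up numbers
  p_neg = 0.1                #prior: 10% of reviews are negative
  p_harsh_given_neg = 0.4    #"harsh" appears in 40% of negative reviews
  p_harsh_given_pos = 0.05   #"harsh" appears in 5% of positive reviews

  #Law of total probability for the denominator P("harsh")
  p_harsh = p_harsh_given_neg * p_neg + p_harsh_given_pos * (1 - p_neg)

  #Bayes' rule: P(negative | "harsh")
  print(p_harsh_given_neg * p_neg / p_harsh)   #about 0.47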

The multinomial term indicates that our features follow a multinomial distribution, which means that we have multiple features and we will count the number of occurrences or count the relative frequency of each feature. For text classification specifically, where words are features, a multinomial NB classifier classifies a document based on the count or on the relative frequency with which a word appears in a document.

In sentiment analysis, we might prefer the binary multinomial variant, which counts each word at most once per document (1 if the word appears, 0 otherwise), since we mostly care about whether a word appears in, say, a positive review rather than how many times it appears. And, rather than using counts or relative frequencies, we might choose to use TF-IDF weights as features.
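
As a minimal sketch of that binary idea (the helper name and tokens below are made up for illustration; the classifier we build later sticks with raw counts), binarizing a tokenized review just means dropping duplicate words before counting:

  #Count each word at most once per document by removing duplicates.
  def binarize(tokens):
      """Return the unique words of a tokenized document (sorted for readability)."""
      return sorted(set(tokens))

  print(binarize(["great", "fruit", "great", "acidity", "great"]))
  #=> ['acidity', 'fruit', 'great']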


Intuition & Derivation for Model

To begin, we denote

  • d: document
  • c: class (or category). For instance, positive and negative are two classes in sentiment analysis.

The multinomial naive Bayes’ classifier works as follows: first, for each class ci in C, we find P(ci|d), the probability that the class is ci given that our observation is d. Then we return ĉ, the class with the largest of {P(c1|d), P(c2|d), ..., P(cn|d)}. ĉ is our prediction of the correct class.

  ĉ = argmax over c ∈ C of P(c|d)

ĉ captures the idea that we want to find the most probable class from which a document might have come: ĉ is the class with the highest probability of having generated document d, so we predict that d came from ĉ.

Now to find P(ci|d) for each class ci, we use Bayes’ rule. P(c|d) is called the posterior probability, P(c) the prior probability, and P(d|c) the likelihood probability.

  P(c|d) = P(d|c) · P(c) / P(d)

Since we are only comparing the values P(ci|d) against one another, we can ignore the denominator P(d), which is the same for every class.

  ĉ = argmax over c ∈ C of P(d|c) · P(c)

To proceed, we make some simplifying assumptions:

  • Treat text as a bag of words, where order of the words does not matter. For instance, “the cow jumped over the moon.” is the same as “jumped cow the moon over the.” Then a document d is simply characterized by the words it contains.
  • (Multinomial): Treat the relative frequency with which a word appears in a document as a feature. From above, “the” is a feature, “cow” is a feature, etc. Now we can represent document d as a set of features f1, f2, ..., fn.
    P(d|c) = P(f1, f2, ..., fn | c)
  • (Naive): The naive Bayes assumption, that each feature is independent from the others.
    P(f1, f2, ..., fn | c) = P(f1|c) · P(f2|c) · ... · P(fn|c)

So the general Bayes' model looks like

  ĉ = argmax over c ∈ C of P(c) · P(f1|c) · P(f2|c) · ... · P(fn|c)

For text classification specifically, we assume a feature is just the existence of a word in a document, so we can find P(wi|c) by iterating through every word in d.

  ĉ = argmax over c ∈ C of P(c) · P(w1|c) · P(w2|c) · ... · P(wn|c),  where w1, w2, ..., wn are the words of d

To find P(c) and P(fi|c) for each class, we estimate them from data; this is where the machine learning aspect comes into play. To estimate these probabilities, we train our model using a set of documents with labeled classes (d1, c1), ..., (dn, cn).

P(ci) is the probability a training set document is in class ci. To calculate P(ci):

  P(ci) = n_ci / n_docs,  where n_ci is the number of training documents in class ci and n_docs is the total number of training documents

P(wi|ci) is the fraction of word occurrences in documents of class ci that are the word wi. First, we create a vocabulary V of the unique words in our training set. Then, we collect all the documents of class ci as Dci. To calculate P(wi|ci):

  P(wi|ci) = count(wi, Dci) / Σ over w ∈ V of count(w, Dci),  where count(w, Dci) is the number of times w appears in the documents of Dci

For the purposes of sentiment analysis, how often a word appears is often less important than whether it appears at all. A common tweak is binary multinomial naive Bayes, where each word is counted at most once per document when calculating P(wi|ci).


There is one thing we must account for when using maximum likelihood training. Because we calculate P(fi|c) for each feature, we run into a problem when one of the terms in the product P(f1|c)·P(f2|c)· ... ·P(fn|c) is 0. This happens when fi does not appear in any training document of a particular class ci but does appear in documents of some other class cj. If a single likelihood term is 0, the estimated probability for the entire class becomes 0 regardless of the other features, which isn’t quite what we want.

To resolve zero probabilities, we employ add-one Laplace smoothing.

  P(wi|ci) = (count(wi, Dci) + 1) / (Σ over w ∈ V of count(w, Dci) + |V|)
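
For a toy illustration with made-up counts: suppose “brisk” never appears among 1,000 word occurrences in the negative documents, and our vocabulary holds 5,000 unique words. The unsmoothed estimate is 0 and would zero out the entire product, while the add-one smoothed estimate is small but nonzero:

  #Toy add-one smoothing calculation with made-up counts
  count_brisk_in_neg = 0     #"brisk" never appears in negative documents
  total_words_in_neg = 1000  #total word occurrences in negative documents
  vocab_size = 5000          #|V|, the number of unique words

  print(count_brisk_in_neg / total_words_in_neg)                       #0.0 -- zeroes out the product
  print((count_brisk_in_neg + 1) / (total_words_in_neg + vocab_size))  #about 0.00017 -- small but nonzero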

Another thing we must account for is floating-point underflow. Floating-point numbers have limited precision, and the product of many probabilities between 0 and 1 quickly becomes so small that it rounds to 0. To resolve underflow, we take the log of both sides and use the property log(x · y) = log(x) + log(y).

  ĉ = argmax over c ∈ C of [ log P(c) + Σ over words wi in d of log P(wi|c) ]
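
To see the underflow problem concretely, here is a small sketch with illustrative numbers: multiplying a few hundred word likelihoods directly collapses to 0.0 in floating point, while summing their logs stays comfortably representable.

  import numpy as np

  #300 word likelihoods, each around 1/5000 -- plausible magnitudes for a large vocabulary
  probs = np.full(300, 1/5000)

  print(np.prod(probs))         #0.0 -- underflows
  print(np.sum(np.log(probs)))  #about -2555.2 -- no underflow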

Lastly, if a word in the document we are predicting for does not show up anywhere in our training data, we simply ignore it. For instance, if the word “dish” never appears in the training data, it contributes nothing to the posterior of any class.

So to summarize:

  • In text classification, we often treat text as a bag of words where order of the words does not matter. Each document d is simply characterized by the words it contains.
  • We treat each word as a feature and a document as a set of features f1, f2, ..., fn. Since a document is defined by its features, and classes are defined by documents, each class is defined by certain features.
  • We make the naive assumption that each feature is independent from the others.
  • To predict a class for a document dnew, we want to find ĉ, the class with the largest of {P(c1|d), P(c2|d), ..., P(cn|d)}. We can think of it as finding a new labeled point (dnew, ĉ). That is, for a given observation, we calculate the probability that each class generated the sample and return the class with the highest probability. This is the class whose features resemble dnew best and which is therefore most likely to have generated the observation.
  • We use a set of documents labeled (d1, c1), ..., (dn, cn) by classes to train our model. From the training dataset, we find P(c), the prior, and P(f1|c)·P(f2|c)· ... ·P(fn|c), the likelihood, for each class. In our implementation, we store these values in data structures.
  • To account for zero probabilities, we use add-one Laplace smoothing.
  • To account for floating-point underflow, we work with log probabilities.

Now, we will apply all these concepts by implementing a multinomial naive Bayes classifier in Python.


Implementing a Sentiment Analyzer: Wine Reviews

To make our implementation more concrete, we’ll be using a cool wine dataset we found on Kaggle. (This is the same dataset we used to create our TF-IDF program. This way, we can contrast the two NLP techniques’ methods and outputs.)

Suppose we have a set of wine reviews. For each review, we’d like to predict whether the review is positive or negative. This is equivalent to saying we’d like to build a model that takes in a review and predicts whether it is one of two classes, positive or negative. (More formally, we’d like to build a model that takes a document d as input and predicts a class.)

First, we read in and organize our data. To read the file, we import csv and find the following methods useful:

  file = open("winedata.csv")
  reader = csv.reader(file)

To view the csv file on terminal:

  • head filename.csv #displays first 10 lines of csv
  • head -n 7 filename.csv #displays first 7 lines of csv


Viewed in Excel, our dataset looks as follows:

country description ... points price province ... title variety winery
0 Italy "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity." ... 87 Sicily & Sardinia ... Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal "This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016." ... 87 15.0 Douro ... Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US "Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented." ... 87 14.0 Oregon ... Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm

From the dataset, we extract just the reviews and the ratings. Next, we clean our data so that our reviews, which are of type string, are tokenized into individual words—lowercased and punctuation-removed. Now, our document follows the bag-of-words assumption.

To generate a training set, we randomize our observations and choose the first 10000 observations. We also define the number of classes/categories; in our case there are two, 0 for the negative class and 1 for positive.

  import csv
  import math
  import numpy as np

  file = open("winedata.csv")
  reader = csv.reader(file)

  #Cleans the descriptions. Makes all words lowercase and
  #removes certain punctuations.
  data = [[row[2].lower()
           .replace(",", "").replace(".", "").replace("!", "").replace("?", "")
           .replace(";", "").replace(":", "").replace("*", "")
           .replace("(", "").replace(")", "")
           .replace("/", ""), row[4]]
           for row in reader]

  #Removes header
  data = data[1:]

  #Tokenizes the str description
  data = [[obs[0].split(), int(obs[1])] for obs in data]

  #Shuffles our rows. This allows us to construct (roughly) a training dataset
  np.random.shuffle(data)

  #Creates a training dataset e.g. [ [review 1, 92], [review 2, 87], ... ]
  training = data[:10000]

  #Defines classes: 0 is negative, 1 is positive
  classes = [0, 1]

The first few observations of our randomized training data set look like:

  >>> training[0:10]
  [[['this', 'is', 'easy', 'to', 'like', 'for', 'its', 'clean', 'brisk', 'mouthfeel', 'creaminess', 'and', 'slightly', 'sweet', 'flavors', 'of', 'lime', 'and', 'strawberry', 'made', 'from', 'chardonnay', 'and', 'pinot', 'noir', 'it', 'shows', 'real', 'finesse'], 90],
  [['a', 'blend', 'of', 'malvasia', 'greco', 'and', 'grechetto', 'this', 'fresh', 'white', 'delivers', 'zesty', 'acidity', 'and', 'lively', 'notes', 'of', 'bitter', 'almond', 'white', 'flower', 'citrus', 'and', 'crushed', 'mineral', 'it', 'would', 'pair', 'beautifully', 'with', 'a', 'heaping', 'plate', 'of', 'spaghetti', 'con', 'le', 'vongole'], 88],
  [['there', 'is', 'tough', 'concentration', 'here', 'that', 'suggests', 'a', 'wine', 'that', 'needs', 'aging', "there's", 'also', 'fruit', 'that', 'hints', 'at', 'richness', 'to', 'come', 'a', 'wine', 'that', 'has', 'a', 'good', 'future'], 89],
  [['made', 'with', '60%', 'sangiovese', '25%', 'cabernet', 'sauvignon', 'and', '15%', 'merlot', 'this', 'opens', 'with', 'delicate', 'aromas', 'of', 'black', 'currant', 'leather', 'and', 'toast', 'the', 'accessible', 'palate', 'doles', 'out', 'mature', 'black', 'cherry', 'vanilla', 'and', 'tobacco', 'alongside', 'chewy', 'tannins', 'and', 'modest', 'acidity'], 88],
  [['damp', 'sagebrush', 'dark', 'cranberry', 'buttered', 'black', 'cherries', 'fresh', 'plucked', 'marjoram', 'and', 'wet', 'stone', 'show', 'on', 'the', 'nose', 'of', 'this', 'bottling', 'the', 'tip', 'of', 'the', 'sip', 'shows', 'great', 'thyme', 'bay', 'leaf', 'and', 'anise', 'character', 'then', 'settles', 'into', 'rich', 'cranberry', 'fruit', 'all', 'held', 'together', 'by', 'a', 'decent', 'tannic', 'grip', 'the', 'dark', 'fruit', 'flavors', 'linger', 'deep', 'in', 'the', 'finish'], 90], 
  [['first', 'to', 'arrive', 'straight', 'to', 'your', 'awaiting', 'nose', 'are', 'complex', 'aromatics', 'with', 'a', 'mix', 'of', 'floral', 'and', 'spicy', 'highlights', 'that', 'surround', 'the', 'black', 'and', 'purple', 'fruits', 'with', 'exotic', 'nuances', 'the', 'fruit', 'is', 'nigh', 'perfect—ripe', 'round', 'forward', 'and', 'loaded', 'with', 'plummy', 'sweet', 'berries', 'it', 'gathers', 'strength', 'in', 'the', 'core', 'holds', 'and', 'then', 'expands', 'into', 'a', 'finish', 'dusted', 'with', 'cocoa', 'and', 'coffee'], 91],
  [['scratchy', 'berry', 'and', 'cassis', 'aromas', 'are', 'earthy', 'but', 'perfectly', 'presentable', 'this', 'feels', 'full', 'and', 'chewy', 'on', 'the', 'palate', 'with', 'firm', 'tannins', 'flavors', 'of', 'plum', 'and', 'berry', 'come', 'with', 'dried', 'spice', 'notes', 'and', 'raw', 'oak', 'while', 'the', 'finish', 'is', 'spicy', 'dry', 'and', 'mildly', 'herbal'], 86],
  [['grapefruit', 'spray', 'and', 'gooseberry', 'aromas', 'jump', 'from', 'the', 'glass', 'initially', 'with', 'an', 'herbal', 'tomato-leaf', 'tone', 'lingering', 'behind', 'lemon-lime', 'acidity', 'cuts', 'through', 'the', 'medium-bodied', 'palate', 'of', 'underripe', 'peach', 'and', 'tropical', 'fruit', 'leading', 'to', 'a', 'talc-dusted', 'finish', 'this', 'is', 'zesty', 'lively', 'and', 'perfectly', 'enjoyable', 'as', 'an', 'apéritif', 'or', 'paired', 'alongside', 'tangy', 'cheese'], 87],
  [['88-90', 'barrel', 'sample', 'light', 'ripe', 'fresh', 'with', 'pleasant', 'acidity', 'and', 'the', 'freshest', 'of', 'blackcurrants'], 89],
  [['a', 'very', 'nice', 'pinot', 'noir', "that's", 'dry', 'and', 'silky', 'with', 'satisfying', 'cherry', 'cola', 'oak', 'and', 'spice', 'flavors', 'not', 'one', 'to', 'age', 'but', 'fine', 'for', 'drinking', 'now', 'and', 'easily', 'as', 'good', 'as', 'many', 'pinots', 'costing', 'more'], 87]]
  

Now, we write the skeleton code for a function that trains our model. The function takes in a training set and the number of classes to categorize. The following lines initialize some data structures used to compute or store log(P(c)) and log(P(d|c)).

  def train_naive_bayes(training, classes):
    """Given a training dataset and the classes that categorize
    each observation, return V: a vocabulary of unique words,
    logprior: a list of P(c), and loglikelihood: a list of P(fi|c)s
    for each word
    """
    #Initialize D_c[ci]: a list of all documents of class i
    #E.g. D_c[1] is a list of [review, rating] pairs of class 1
    #(A list comprehension gives each class its own list rather than shared references)
    D_c = [[] for _ in classes]

    #Initialize n_c[ci]: number of documents of class i
    n_c = [None] * len(classes)

    #Initialize logprior[ci]: stores the prior probability for class i
    logprior = [None] * len(classes)

    #Initialize loglikelihood: loglikelihood[ci][wi] stores the likelihood probability for wi given class i
    loglikelihood = [None] * len(classes)

Next, we find all documents of class ci and store them in D_c[ci].

    #Partition documents into classes. D_c[0]: negative docs, D_c[1]: positive docs
    for obs in training:    #obs: a [review, rating] pair
        #if rating >= 90, classify the review as positive
        if obs[1] >= 90:
            D_c[1] = D_c[1] + [obs]    #Equivalently: D_c[1].append(obs)
        #else, classify review as negative
        elif obs[1] < 90:
            D_c[0] = D_c[0] + [obs]

Calling D_c[0]:

  >>> D_c[0]
  [[['a', 'blend', 'of', 'malvasia', 'greco', 'and', 'grechetto', 'this', 'fresh', 'white', 'delivers', 'zesty', 'acidity', 'and', 'lively', 'notes', 'of', 'bitter', 'almond', 'white', 'flower', 'citrus', 'and', 'crushed', 'mineral', 'it', 'would', 'pair', 'beautifully', 'with', 'a', 'heaping', 'plate', 'of', 'spaghetti', 'con', 'le', 'vongole'], 88],
  [['there', 'is', 'tough', 'concentration', 'here', 'that', 'suggests', 'a', 'wine', 'that', 'needs', 'aging', "there's", 'also', 'fruit', 'that', 'hints', 'at', 'richness', 'to', 'come', 'a', 'wine', 'that', 'has', 'a', 'good', 'future'], 89],
  [['made', 'with', '60%', 'sangiovese', '25%', 'cabernet', 'sauvignon', 'and', '15%', 'merlot', 'this', 'opens', 'with', 'delicate', 'aromas', 'of', 'black', 'currant', 'leather', 'and', 'toast', 'the', 'accessible', 'palate', 'doles', 'out', 'mature', 'black', 'cherry', 'vanilla', 'and', 'tobacco', 'alongside', 'chewy', 'tannins', 'and', 'modest', 'acidity'], 88],
  [['scratchy', 'berry', 'and', 'cassis', 'aromas', 'are', 'earthy', 'but', 'perfectly', 'presentable', 'this', 'feels', 'full', 'and', 'chewy', 'on', 'the', 'palate', 'with', 'firm', 'tannins', 'flavors', 'of', 'plum', 'and', 'berry', 'come', 'with', 'dried', 'spice', 'notes', 'and', 'raw', 'oak', 'while', 'the', 'finish', 'is', 'spicy', 'dry', 'and', 'mildly', 'herbal'], 86],
  [['grapefruit', 'spray', 'and', 'gooseberry', 'aromas', 'jump', 'from', 'the', 'glass', 'initially', 'with', 'an', 'herbal', 'tomato-leaf', 'tone', 'lingering', 'behind', 'lemon-lime', 'acidity', 'cuts', 'through', 'the', 'medium-bodied', 'palate', 'of', 'underripe', 'peach', 'and', 'tropical', 'fruit', 'leading', 'to', 'a', 'talc-dusted', 'finish', 'this', 'is', 'zesty', 'lively', 'and', 'perfectly', 'enjoyable', 'as', 'an', 'apéritif', 'or', 'paired', 'alongside', 'tangy', 'cheese'], 87],
  [['88-90', 'barrel', 'sample', 'light', 'ripe', 'fresh', 'with', 'pleasant', 'acidity', 'and', 'the', 'freshest', 'of', 'blackcurrants'], 89],
  [['a', 'very', 'nice', 'pinot', 'noir', "that's", 'dry', 'and', 'silky', 'with', 'satisfying', 'cherry', 'cola', 'oak', 'and', 'spice', 'flavors', 'not', 'one', 'to', 'age', 'but', 'fine', 'for', 'drinking', 'now', 'and', 'easily', 'as', 'good', 'as', 'many', 'pinots', 'costing', 'more'], 87]]]

We need to create a vocabulary list V, where V is a list of unique words in the dataset. We also find |V|, the size of V.

    #Creates a vocabulary list. For large datasets, this code becomes slow.
    #In our post about TF-IDF, we constructed a vocab list that runs much faster.
    V = []
    for obs in training:
        for word in obs[0]:
            if word in V:
                continue
            else:
                V.append(word)
  
    V_size = len(V)
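
(Aside: the check "if word in V" scans a list, which is what makes this step slow. A quicker alternative, sketched below with a hypothetical helper that is not part of the function we are building, uses a set for constant-time lookups while preserving word order.)

  #Sketch of a faster vocabulary builder: a set makes the "have we seen this
  #word?" check O(1) instead of scanning the whole list each time.
  def build_vocabulary(training):
      seen = set()
      V = []
      for obs in training:
          for word in obs[0]:
              if word not in seen:
                  seen.add(word)
                  V.append(word)
      return V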

Lastly, we find the logprior, log P(c), and the loglikelihood, log P(wi|c), for each word wi. Recall:

  • P(ci) = n_ci / n_docs
  • P(wi|ci) = (count(wi, Dci) + 1) / (Σ over w ∈ V of count(w, Dci) + |V|), the add-one smoothed likelihood from above

    #n_docs: total number of documents in training set
    n_docs = len(training)

    for ci in range(len(classes)):
        #Store n_c value for each class
        n_c[ci] = len(D_c[ci])
        
        #Compute log P(c) (here with an extra add-one on the class count)
        logprior[ci] = np.log((n_c[ci] + 1) / n_docs)


        #Counts total number of words in class c
        count_w_in_V = 0
        for d in D_c[ci]:
            count_w_in_V = count_w_in_V + len(d[0])
        denom = count_w_in_V + V_size

        dic = {}
        #Compute P(w|c)
        for wi in V:
            #Count number of times wi appears in D_c[ci]
            count_wi_in_D_c = 0
            for d in D_c[ci]:
                for word in d[0]:
                    if word == wi:
                        count_wi_in_D_c = count_wi_in_D_c + 1
            numer = count_wi_in_D_c + 1
            dic[wi] = np.log((numer) / (denom))
        loglikelihood[ci] = dic
        
    return (V, logprior, loglikelihood)

We return V, logprior, and loglikelihood, the data structures containing all the information we need.

Calling train_naive_bayes(training[0:10], classes):

  >>> train_naive_bayes(training[0:10], classes)
  (#V
  ['this', 'is', 'easy', 'to', 'like', 'for', 'its', 'clean', 'brisk', 'mouthfeel', 'creaminess', 'and', 'slightly', 'sweet', 'flavors', 'of', 'lime', 'strawberry', 'made', 'from', 'chardonnay', 'pinot', 'noir', 'it', 'shows', 'real', 'finesse', 'a', 'blend', 'malvasia', 'greco', 'grechetto', 'fresh', 'white', 'delivers', 'zesty', 'acidity', 'lively', 'notes', 'bitter', 'almond', 'flower', 'citrus', 'crushed', 'mineral', 'would', 'pair', 'beautifully', 'with', 'heaping', 'plate', 'spaghetti', 'con', 'le', 'vongole', 'there', 'tough', 'concentration', 'here', 'that', 'suggests', 'wine', 'needs', 'aging', "there's", 'also', 'fruit', 'hints', 'at', 'richness', 'come', 'has', 'good', 'future', '60%', 'sangiovese', '25%', 'cabernet', 'sauvignon', '15%', 'merlot', 'opens', 'delicate', 'aromas', 'black', 'currant', 'leather', 'toast', 'the', 'accessible', 'palate', 'doles', 'out', 'mature', 'cherry', 'vanilla', 'tobacco', 'alongside', 'chewy', 'tannins', 'modest', 'damp', 'sagebrush', 'dark', 'cranberry', 'buttered', 'cherries', 'plucked', 'marjoram', 'wet', 'stone', 'show', 'on', 'nose', 'bottling', 'tip', 'sip', 'great', 'thyme', 'bay', 'leaf', 'anise', 'character', 'then', 'settles', 'into', 'rich', 'all', 'held', 'together', 'by', 'decent', 'tannic', 'grip', 'linger', 'deep', 'in', 'finish', 'first', 'arrive', 'straight', 'your', 'awaiting', 'are', 'complex', 'aromatics', 'mix', 'floral', 'spicy', 'highlights', 'surround', 'purple', 'fruits', 'exotic', 'nuances', 'nigh', 'perfect—ripe', 'round', 'forward', 'loaded', 'plummy', 'berries', 'gathers', 'strength', 'core', 'holds', 'expands', 'dusted', 'cocoa', 'coffee', 'scratchy', 'berry', 'cassis', 'earthy', 'but', 'perfectly', 'presentable', 'feels', 'full', 'firm', 'plum', 'dried', 'spice', 'raw', 'oak', 'while', 'dry', 'mildly', 'herbal', 'grapefruit', 'spray', 'gooseberry', 'jump', 'glass', 'initially', 'an', 'tomato-leaf', 'tone', 'lingering', 'behind', 'lemon-lime', 'cuts', 'through', 'medium-bodied', 'underripe', 'peach', 'tropical', 'leading', 'talc-dusted', 'enjoyable', 'as', 'apéritif', 'or', 'paired', 'tangy', 'cheese', '88-90', 'barrel', 'sample', 'light', 'ripe', 'pleasant', 'freshest', 'blackcurrants', 'very', 'nice', "that's", 'silky', 'satisfying', 'cola', 'not', 'one', 'age', 'fine', 'drinking', 'now', 'easily', 'many', 'pinots', 'costing', 'more'],
  #logprior
  [-0.2231435513142097, -0.916290731874155],
  #loglikelihood
  [{'this': -4.578826210648489, 'is': -4.801969761962699, 'easy': -6.18826412308259, 'to': -4.801969761962699, 'like': -6.18826412308259, 'for': -5.495116942522644, 'its': -6.18826412308259, 'clean': -6.18826412308259, 'brisk': -6.18826412308259, 'mouthfeel': -6.18826412308259, 'creaminess': -6.18826412308259, 'and': -3.192531849528599, 'slightly': -6.18826412308259, 'sweet': -6.18826412308259, 'flavors': -5.08965183441448, 'of': -4.1088225814027535, 'lime': -6.18826412308259, 'strawberry': -6.18826412308259, 'made': -5.495116942522644, 'from': -5.495116942522644, 'chardonnay': -6.18826412308259, 'pinot': -5.495116942522644, 'noir': -5.495116942522644, 'it': -5.495116942522644, 'shows': -6.18826412308259, 'real': -6.18826412308259, 'finesse': -6.18826412308259, 'a': -4.1088225814027535, 'blend': -5.495116942522644, 'malvasia': -5.495116942522644, 'greco': -5.495116942522644, 'grechetto': -5.495116942522644, 'fresh': -5.08965183441448, 'white': -5.08965183441448, 'delivers': -5.495116942522644, 'zesty': -5.08965183441448, 'acidity': -4.578826210648489, 'lively': -5.08965183441448, 'notes': -5.08965183441448, 'bitter': -5.495116942522644, 'almond': -5.495116942522644, 'flower': -5.495116942522644, 'citrus': -5.495116942522644, 'crushed': -5.495116942522644, 'mineral': -5.495116942522644, 'would': -5.495116942522644, 'pair': -5.495116942522644, 'beautifully': -5.495116942522644, 'with': -3.9910395457463705, 'heaping': -5.495116942522644, 'plate': -5.495116942522644, 'spaghetti': -5.495116942522644, 'con': -5.495116942522644, 'le': -5.495116942522644, 'vongole': -5.495116942522644, 'there': -5.495116942522644, 'tough': -5.495116942522644, 'concentration': -5.495116942522644, 'here': -5.495116942522644, 'that': -4.578826210648489, 'suggests': -5.495116942522644, 'wine': -5.08965183441448, 'needs': -5.495116942522644, 'aging': -5.495116942522644, "there's": -5.495116942522644, 'also': -5.495116942522644, 'fruit': -5.08965183441448, 'hints': -5.495116942522644, 'at': -5.495116942522644, 'richness': -5.495116942522644, 'come': -5.08965183441448, 'has': -5.495116942522644, 'good': -5.08965183441448, 'future': -5.495116942522644, '60%': -5.495116942522644, 'sangiovese': -5.495116942522644, '25%': -5.495116942522644, 'cabernet': -5.495116942522644, 'sauvignon': -5.495116942522644, '15%': -5.495116942522644, 'merlot': -5.495116942522644, 'opens': -5.495116942522644, 'delicate': -5.495116942522644, 'aromas': -4.801969761962699, 'black': -5.08965183441448, 'currant': -5.495116942522644, 'leather': -5.495116942522644, 'toast': -5.495116942522644, 'the': -4.242353974027276, 'accessible': -5.495116942522644, 'palate': -4.801969761962699, 'doles': -5.495116942522644, 'out': -5.495116942522644, 'mature': -5.495116942522644, 'cherry': -5.08965183441448, 'vanilla': -5.495116942522644, 'tobacco': -5.495116942522644, 'alongside': -5.08965183441448, 'chewy': -5.08965183441448, 'tannins': -5.08965183441448, 'modest': -5.495116942522644, 'damp': -6.18826412308259, 'sagebrush': -6.18826412308259, 'dark': -6.18826412308259, 'cranberry': -6.18826412308259, 'buttered': -6.18826412308259, 'cherries': -6.18826412308259, 'plucked': -6.18826412308259, 'marjoram': -6.18826412308259, 'wet': -6.18826412308259, 'stone': -6.18826412308259, 'show': -6.18826412308259, 'on': -5.495116942522644, 'nose': -6.18826412308259, 'bottling': -6.18826412308259, 'tip': -6.18826412308259, 'sip': -6.18826412308259, 'great': -6.18826412308259, 'thyme': -6.18826412308259, 'bay': -6.18826412308259, 'leaf': -6.18826412308259, 'anise': 
-6.18826412308259, 'character': -6.18826412308259, 'then': -6.18826412308259, 'settles': -6.18826412308259, 'into': -6.18826412308259, 'rich': -6.18826412308259, 'all': -6.18826412308259, 'held': -6.18826412308259, 'together': -6.18826412308259, 'by': -6.18826412308259, 'decent': -6.18826412308259, 'tannic': -6.18826412308259, 'grip': -6.18826412308259, 'linger': -6.18826412308259, 'deep': -6.18826412308259, 'in': -6.18826412308259, 'finish': -5.08965183441448, 'first': -6.18826412308259, 'arrive': -6.18826412308259, 'straight': -6.18826412308259, 'your': -6.18826412308259, 'awaiting': -6.18826412308259, 'are': -5.495116942522644, 'complex': -6.18826412308259, 'aromatics': -6.18826412308259, 'mix': -6.18826412308259, 'floral': -6.18826412308259, 'spicy': -5.495116942522644, 'highlights': -6.18826412308259, 'surround': -6.18826412308259, 'purple': -6.18826412308259, 'fruits': -6.18826412308259, 'exotic': -6.18826412308259, 'nuances': -6.18826412308259, 'nigh': -6.18826412308259, 'perfect—ripe': -6.18826412308259, 'round': -6.18826412308259, 'forward': -6.18826412308259, 'loaded': -6.18826412308259, 'plummy': -6.18826412308259, 'berries': -6.18826412308259, 'gathers': -6.18826412308259, 'strength': -6.18826412308259, 'core': -6.18826412308259, 'holds': -6.18826412308259, 'expands': -6.18826412308259, 'dusted': -6.18826412308259, 'cocoa': -6.18826412308259, 'coffee': -6.18826412308259, 'scratchy': -5.495116942522644, 'berry': -5.08965183441448, 'cassis': -5.495116942522644, 'earthy': -5.495116942522644, 'but': -5.08965183441448, 'perfectly': -5.08965183441448, 'presentable': -5.495116942522644, 'feels': -5.495116942522644, 'full': -5.495116942522644, 'firm': -5.495116942522644, 'plum': -5.495116942522644, 'dried': -5.495116942522644, 'spice': -5.08965183441448, 'raw': -5.495116942522644, 'oak': -5.08965183441448, 'while': -5.495116942522644, 'dry': -5.08965183441448, 'mildly': -5.495116942522644, 'herbal': -5.08965183441448, 'grapefruit': -5.495116942522644, 'spray': -5.495116942522644, 'gooseberry': -5.495116942522644, 'jump': -5.495116942522644, 'glass': -5.495116942522644, 'initially': -5.495116942522644, 'an': -5.08965183441448, 'tomato-leaf': -5.495116942522644, 'tone': -5.495116942522644, 'lingering': -5.495116942522644, 'behind': -5.495116942522644, 'lemon-lime': -5.495116942522644, 'cuts': -5.495116942522644, 'through': -5.495116942522644, 'medium-bodied': -5.495116942522644, 'underripe': -5.495116942522644, 'peach': -5.495116942522644, 'tropical': -5.495116942522644, 'leading': -5.495116942522644, 'talc-dusted': -5.495116942522644, 'enjoyable': -5.495116942522644, 'as': -4.801969761962699, 'apéritif': -5.495116942522644, 'or': -5.495116942522644, 'paired': -5.495116942522644, 'tangy': -5.495116942522644, 'cheese': -5.495116942522644, '88-90': -5.495116942522644, 'barrel': -5.495116942522644, 'sample': -5.495116942522644, 'light': -5.495116942522644, 'ripe': -5.495116942522644, 'pleasant': -5.495116942522644, 'freshest': -5.495116942522644, 'blackcurrants': -5.495116942522644, 'very': -5.495116942522644, 'nice': -5.495116942522644, "that's": -5.495116942522644, 'silky': -5.495116942522644, 'satisfying': -5.495116942522644, 'cola': -5.495116942522644, 'not': -5.495116942522644, 'one': -5.495116942522644, 'age': -5.495116942522644, 'fine': -5.495116942522644, 'drinking': -5.495116942522644, 'now': -5.495116942522644, 'easily': -5.495116942522644, 'many': -5.495116942522644, 'pinots': -5.495116942522644, 'costing': -5.495116942522644, 'more': -5.495116942522644},
  {'this': -4.857225080796721, 'is': -4.857225080796721, 'easy': -5.262690188904886, 'to': -4.56954300834494, 'like': -5.262690188904886, 'for': -5.262690188904886, 'its': -5.262690188904886, 'clean': -5.262690188904886, 'brisk': -5.262690188904886, 'mouthfeel': -5.262690188904886, 'creaminess': -5.262690188904886, 'and': -3.5579420966664603, 'slightly': -5.262690188904886, 'sweet': -4.857225080796721, 'flavors': -4.857225080796721, 'of': -4.3463994570307305, 'lime': -5.262690188904886, 'strawberry': -5.262690188904886, 'made': -5.262690188904886, 'from': -5.262690188904886, 'chardonnay': -5.262690188904886, 'pinot': -5.262690188904886, 'noir': -5.262690188904886, 'it': -4.857225080796721, 'shows': -4.857225080796721, 'real': -5.262690188904886, 'finesse': -5.262690188904886, 'a': -4.56954300834494, 'blend': -5.955837369464831, 'malvasia': -5.955837369464831, 'greco': -5.955837369464831, 'grechetto': -5.955837369464831, 'fresh': -5.262690188904886, 'white': -5.955837369464831, 'delivers': -5.955837369464831, 'zesty': -5.955837369464831, 'acidity': -5.955837369464831, 'lively': -5.955837369464831, 'notes': -5.955837369464831, 'bitter': -5.955837369464831, 'almond': -5.955837369464831, 'flower': -5.955837369464831, 'citrus': -5.955837369464831, 'crushed': -5.955837369464831, 'mineral': -5.955837369464831, 'would': -5.955837369464831, 'pair': -5.955837369464831, 'beautifully': -5.955837369464831, 'with': -4.3463994570307305, 'heaping': -5.955837369464831, 'plate': -5.955837369464831, 'spaghetti': -5.955837369464831, 'con': -5.955837369464831, 'le': -5.955837369464831, 'vongole': -5.955837369464831, 'there': -5.955837369464831, 'tough': -5.955837369464831, 'concentration': -5.955837369464831, 'here': -5.955837369464831, 'that': -5.262690188904886, 'suggests': -5.955837369464831, 'wine': -5.955837369464831, 'needs': -5.955837369464831, 'aging': -5.955837369464831, "there's": -5.955837369464831, 'also': -5.955837369464831, 'fruit': -4.56954300834494, 'hints': -5.955837369464831, 'at': -5.955837369464831, 'richness': -5.955837369464831, 'come': -5.955837369464831, 'has': -5.955837369464831, 'good': -5.955837369464831, 'future': -5.955837369464831, '60%': -5.955837369464831, 'sangiovese': -5.955837369464831, '25%': -5.955837369464831, 'cabernet': -5.955837369464831, 'sauvignon': -5.955837369464831, '15%': -5.955837369464831, 'merlot': -5.955837369464831, 'opens': -5.955837369464831, 'delicate': -5.955837369464831, 'aromas': -5.955837369464831, 'black': -4.857225080796721, 'currant': -5.955837369464831, 'leather': -5.955837369464831, 'toast': -5.955837369464831, 'the': -3.7586127921286114, 'accessible': -5.955837369464831, 'palate': -5.955837369464831, 'doles': -5.955837369464831, 'out': -5.955837369464831, 'mature': -5.955837369464831, 'cherry': -5.955837369464831, 'vanilla': -5.955837369464831, 'tobacco': -5.955837369464831, 'alongside': -5.955837369464831, 'chewy': -5.955837369464831, 'tannins': -5.955837369464831, 'modest': -5.955837369464831, 'damp': -5.262690188904886, 'sagebrush': -5.262690188904886, 'dark': -4.857225080796721, 'cranberry': -4.857225080796721, 'buttered': -5.262690188904886, 'cherries': -5.262690188904886, 'plucked': -5.262690188904886, 'marjoram': -5.262690188904886, 'wet': -5.262690188904886, 'stone': -5.262690188904886, 'show': -5.262690188904886, 'on': -5.262690188904886, 'nose': -4.857225080796721, 'bottling': -5.262690188904886, 'tip': -5.262690188904886, 'sip': -5.262690188904886, 'great': -5.262690188904886, 'thyme': -5.262690188904886, 'bay': -5.262690188904886, 
'leaf': -5.262690188904886, 'anise': -5.262690188904886, 'character': -5.262690188904886, 'then': -4.857225080796721, 'settles': -5.262690188904886, 'into': -4.857225080796721, 'rich': -5.262690188904886, 'all': -5.262690188904886, 'held': -5.262690188904886, 'together': -5.262690188904886, 'by': -5.262690188904886, 'decent': -5.262690188904886, 'tannic': -5.262690188904886, 'grip': -5.262690188904886, 'linger': -5.262690188904886, 'deep': -5.262690188904886, 'in': -4.857225080796721, 'finish': -4.857225080796721, 'first': -5.262690188904886, 'arrive': -5.262690188904886, 'straight': -5.262690188904886, 'your': -5.262690188904886, 'awaiting': -5.262690188904886, 'are': -5.262690188904886, 'complex': -5.262690188904886, 'aromatics': -5.262690188904886, 'mix': -5.262690188904886, 'floral': -5.262690188904886, 'spicy': -5.262690188904886, 'highlights': -5.262690188904886, 'surround': -5.262690188904886, 'purple': -5.262690188904886, 'fruits': -5.262690188904886, 'exotic': -5.262690188904886, 'nuances': -5.262690188904886, 'nigh': -5.262690188904886, 'perfect—ripe': -5.262690188904886, 'round': -5.262690188904886, 'forward': -5.262690188904886, 'loaded': -5.262690188904886, 'plummy': -5.262690188904886, 'berries': -5.262690188904886, 'gathers': -5.262690188904886, 'strength': -5.262690188904886, 'core': -5.262690188904886, 'holds': -5.262690188904886, 'expands': -5.262690188904886, 'dusted': -5.262690188904886, 'cocoa': -5.262690188904886, 'coffee': -5.262690188904886, 'scratchy': -5.955837369464831, 'berry': -5.955837369464831, 'cassis': -5.955837369464831, 'earthy': -5.955837369464831, 'but': -5.955837369464831, 'perfectly': -5.955837369464831, 'presentable': -5.955837369464831, 'feels': -5.955837369464831, 'full': -5.955837369464831, 'firm': -5.955837369464831, 'plum': -5.955837369464831, 'dried': -5.955837369464831, 'spice': -5.955837369464831, 'raw': -5.955837369464831, 'oak': -5.955837369464831, 'while': -5.955837369464831, 'dry': -5.955837369464831, 'mildly': -5.955837369464831, 'herbal': -5.955837369464831, 'grapefruit': -5.955837369464831, 'spray': -5.955837369464831, 'gooseberry': -5.955837369464831, 'jump': -5.955837369464831, 'glass': -5.955837369464831, 'initially': -5.955837369464831, 'an': -5.955837369464831, 'tomato-leaf': -5.955837369464831, 'tone': -5.955837369464831, 'lingering': -5.955837369464831, 'behind': -5.955837369464831, 'lemon-lime': -5.955837369464831, 'cuts': -5.955837369464831, 'through': -5.955837369464831, 'medium-bodied': -5.955837369464831, 'underripe': -5.955837369464831, 'peach': -5.955837369464831, 'tropical': -5.955837369464831, 'leading': -5.955837369464831, 'talc-dusted': -5.955837369464831, 'enjoyable': -5.955837369464831, 'as': -5.955837369464831, 'apéritif': -5.955837369464831, 'or': -5.955837369464831, 'paired': -5.955837369464831, 'tangy': -5.955837369464831, 'cheese': -5.955837369464831, '88-90': -5.955837369464831, 'barrel': -5.955837369464831, 'sample': -5.955837369464831, 'light': -5.955837369464831, 'ripe': -5.955837369464831, 'pleasant': -5.955837369464831, 'freshest': -5.955837369464831, 'blackcurrants': -5.955837369464831, 'very': -5.955837369464831, 'nice': -5.955837369464831, "that's": -5.955837369464831, 'silky': -5.955837369464831, 'satisfying': -5.955837369464831, 'cola': -5.955837369464831, 'not': -5.955837369464831, 'one': -5.955837369464831, 'age': -5.955837369464831, 'fine': -5.955837369464831, 'drinking': -5.955837369464831, 'now': -5.955837369464831, 'easily': -5.955837369464831, 'many': -5.955837369464831, 'pinots': 
-5.955837369464831, 'costing': -5.955837369464831, 'more': -5.955837369464831}])

To predict a class for a test document, we pass in the document, the logprior (log P(c)), the loglikelihood (log P(wi|c)), and V. We compute the logpost, log P(c|d) up to a constant, for each class and return the class with the maximum.

  def test_naive_bayes(testdoc, logprior, loglikelihood, V):
    #Initialize logpost[ci]: stores the posterior probability for class ci
    logpost = [None] * len(classes)
    
    for ci in classes:
        sumloglikelihoods = 0
        for word in testdoc:
            if word in V:
                #This sum builds up log(P(d|c)) = log(P(w1|c)) + ... + log(P(wn|c))
                sumloglikelihoods += loglikelihood[ci][word]
        
        #Computes log P(c|d) (up to the constant log P(d))
        logpost[ci] = logprior[ci] + sumloglikelihoods

    #Return ĉ, the class with the highest log-posterior
    return logpost.index(max(logpost))
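
To tie everything together, here is a small usage sketch. The review text is made up, and because the data was shuffled randomly, the exact numbers (and possibly the prediction) will vary from run to run:

  #Train on a small slice of the training set (the nested counting loops above
  #are slow for the full 10,000 reviews), then classify a new tokenized review.
  V, logprior, loglikelihood = train_naive_bayes(training[0:500], classes)

  new_review = "bright acidity and rich black cherry with a long elegant finish".split()
  prediction = test_naive_bayes(new_review, logprior, loglikelihood, V)

  print("positive" if prediction == 1 else "negative")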

Closing Thoughts

Bayes' rule is a powerful probability theorem that, coupled with a naive assumption, forms the basis of a simple, fast, and practical machine learning algorithm. In this article, we saw how a naive Bayes' classifier could be used in NLP for text classification. We also built a text classification program in Python specifically for sentiment analysis. Our sentiment analysis program is merely a foundation upon which one can expand to analyze larger and more complex datasets.

Now that we have more NLP tools to work with, we have more ways to extract data that describes our reviews in different ways. All these tools are useful for our ultimate task, building a wine recommender. We will explore how to build a recommender system in upcoming blog posts, so stay tuned!


References

Multinomial Naive Bayes’ Classifier