...

Machine learning is the process of designing algorithms that make predictions. Behind the scenes, these predictive algorithms are really just mathematical models that evolve as they see more data.

Recall that a mathematical model has a set of inputs, coefficients, and outputs. In machine learning, we turn our attention to a model’s coefficients, which we refer to as “variables.” (We call them variables because they are randomly initialized, and we would like to determine their optimal values.) The idea is that we feed our model existing data and adjust these variables so that the model’s output better matches the observed output. This is what we mean by training a model: fine-tuning its coefficients based on observed data.

Mathematical models are powerful tools, but using them has a caveat—they only take numbers and vectors (arrays of numbers) as inputs. But in reality, our observed data is both quantitative and qualitative. The question is, how can we handle a body of text as our input?

The solution lies at the intersection of computer science and computational linguistics. In this article, we will explore a method called TF-IDF that turns text into numbers, and we will learn how to create a TF-IDF program in Python.


Natural Language Processing

Natural language processing, abbreviated NLP, is a field of computer science concerned with how computers process, understand, and interpret human language. The core of NLP is writing programs that analyze and model language mathematically. Although NLP techniques can transform data for machine learning purposes, the converse is also true: machine learning, neural networks, and deep learning are actively being applied to NLP to facilitate language processing, and these applications have driven significant advancements in NLP itself.

Once computers are equipped with language processing tools, they can be used for a variety of tasks. Deep learning for NLP has helped machines perform tasks such as speech recognition, machine translation, caption generation, and spam detection, and is also the technology behind well-known voice assistants Siri and Alexa.

To see NLP advancements in AI in action, check out this sweet AI demo Google showed at Google I/O 2018. To learn more about NLP basics, check out this blog post.

So in short, we can use NLP techniques to turn text into a vector of numbers. Such a process—extracting information, like numbers or sentiment, from text—is called text mining.


What is TF-IDF?

In this blog post, we’ll be exploring a text mining method called TF-IDF. TF-IDF, which stands for term frequency-inverse document frequency, is a statistic that measures how important a term is relative to a document and to a corpus, a collection of documents. The TF-IDF of a term is given by the equation:

TF-IDF(term) = TF(term in a document) * IDF(term)

  • TF(term) = # of times the term appears in document / total # of terms in document
  • IDF(term) = log(total # of documents / # of documents with term in it)

To explain TF-IDF, let’s walk through a concrete example. Say you are sifting through some blog posts about building games in Python.

  • In post #1, the word “Python” appears once in five pages
  • In post #2, “Python” appears dozens of times in two pages

The term frequency (TF) measures how often a term shows up in a single document. If a term appears relatively frequently in a document, it is most likely an important term. In our example, the word “Python” appears quite frequently in post #2, so it would have a high term frequency. We may also infer that it is an important word in post #2. On the other hand, “Python” appears only once in post #1, so it would have a low term frequency. Observe that dividing by the length of the document allows us to fairly compare documents of different lengths.

Now suppose the word Python doesn’t appear in any of our other blog posts. We draw the conclusion that the word Python isn’t very relevant to most of our blog posts, but very relevant to post #2, where it appeared dozens of times. The inverse document frequency (IDF) tells us how important a term is to a collection of documents. A good example of how IDF comes into play is for the word “the.” We know that just about every document contains “the,” so the term isn’t really special anymore, thereby producing a very low IDF. Now let’s contrast “the” with “Python” in our example. “Python” appears rarely in the other posts, so its IDF should be high. In fact, “Python” now carries a weight signaling that in any document in which it appears, it is important to that document.

When we multiply TF and IDF, we observe that the larger the number, the more important a term in a document is to that document. We can then compute the TF-IDF for each word in each document and create a vector.
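To make the arithmetic concrete, here is a minimal sketch that evaluates the TF and IDF formulas above on a tiny made-up corpus (the two “posts” below are hypothetical stand-ins for the blog posts in the example):

  import math

  # A tiny hypothetical corpus: two short "posts", already split into words
  posts = [
      "python is fun and python is powerful".split(),   # post #1
      "building games is fun".split(),                  # post #2
  ]

  def tf(term, document):
      # TF = (# of times the term appears in the document) / (total # of terms in the document)
      return document.count(term) / len(document)

  def idf(term, corpus):
      # IDF = log(total # of documents / # of documents containing the term)
      docs_with_term = sum(1 for doc in corpus if term in doc)
      return math.log(len(corpus) / docs_with_term)

  for i, post in enumerate(posts, start=1):
      score = tf("python", post) * idf("python", posts)
      print(f"TF-IDF of 'python' in post #{i}: {score:.3f}")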


Some Additional Notes:
When parsing through a body of text, we observe several features of the English language that we may want to account for.

  • First, we pepper our language with function words such as “a(n),” “and,” “but,” “in,” “of,” “that,” “the,” and “which.” We refer to these words as stopwords: words that are ubiquitous but contribute little semantic significance. Stopwords generate a low IDF score, so they don't generally affect vectors much. However, they do produce some noise, so they are usually left out of the analysis.
  • Second, words in English can share the same stem, like “include,” “includes,” and “including.” Such words not only share the same stem, but also roughly the same meaning. The process of reducing these related words to their common stem is called stemming. Stemming helps clean our data. (The sketch after this list shows stopword removal and stemming in action.)
  • Third, a “word” in English can be made up of two words (like “fire extinguisher” or “does not”). Depending on the nature of the text, we may want to account for these special cases.
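If you want to experiment with the first two points, the NLTK library provides stopword lists and stemmers. The snippet below is a rough sketch, assuming NLTK is installed and its stopword corpus has been downloaded; it is not part of the TF-IDF implementation later in this post.

  # Rough sketch: stopword removal and stemming with NLTK
  # (assumes: pip install nltk, then nltk.download('stopwords') has been run)
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer

  stop_words = set(stopwords.words("english"))
  stemmer = PorterStemmer()

  tokens = "the post includes a walkthrough of building games in python".split()

  # Drop stopwords, then reduce the remaining words to their stems
  cleaned = [stemmer.stem(word) for word in tokens if word not in stop_words]
  print(cleaned)  # something like ['post', 'includ', 'walkthrough', 'build', 'game', 'python']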


Text Mining: Why use TF-IDF?

TF-IDF is useful when we are concerned with whole words in our text. A cool usage of TF-IDF is automated blog post tagging, where running TF-IDF on every blog post helps single out words that can be used as tags. We can also feed TF-IDF vectors into machine learning models, for example to power recommendations. (Read on for our TF-IDF implementation on wine reviews.)

In addition to TF-IDF, two other text mining methods, word2vec and GloVe, are commonly used. While TF-IDF assigns a score to each word in each document based on frequency (giving each document a vector), both word2vec and GloVe return a vector per word based on co-occurrence information, that is, how frequently words appear alongside other words in a corpus (a word’s context).

The advantage of TF-IDF is that the process is intuitive and its vectors are interpretable: the TF-IDF of a word can be calculated with a simple formula, and each vector is just a list of weighted word frequencies. However, language is far more complex than the individual words that compose it. Because TF-IDF acts solely on words, it is neither as informative nor as powerful as a method like word2vec, which learns from the contexts in which words appear. Likewise, word2vec has its own advantages and disadvantages -- it provides a richer representation of text at the cost of our ability to interpret the data as humans.


How to Implement TF-IDF: Wine Reviews

In this post we’ll be constructing a text mining program based on TF-IDF. Our goal is to take in a set of documents and output the TF-IDF score of every word in every document. For our purposes, we want a matrix that looks something like this:

  [[0, 0, 0, 0, 0.47, 0, 0.23, 0]
   [0, 0.68, 0, 0, 0.32, 0, 0, 0]
   [0.11, 0, 0.19, 0, 0, 0, 0, 0]]

to correspond with

  ["Python", "game", "goal", "success", "pass", "code", "walk", "move"]

Each row in the matrix is a review, each index corresponds to a word in our dataset, and each word that appears in the review has a non-zero TF-IDF score at its index.


To make our implementation more concrete, we’ll be using a really cool wine reviews dataset we found on Kaggle. Here is how the csv file looks in Excel:


country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy "Aromas include tropical fruit, broom, brimstone and dried..." Vulkà Bianco 87 Sicily & Sardinia Etna Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal "This is ripe and fruity, a wine that is smooth while still structured..." Avidagos 87 15.0 Douro Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US "Tart and snappy, the flavors of lime flesh and rind dominate..." 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm

And in our terminal:

  ,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco  (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016.",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine ,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm

We see that each row is a wine observation and each column is a variable. Of all the variables—location, province, region, pricing, grape variety, etc.—we are most interested in description, the column of wine reviews.

From a cursory glance at the wine reviews, we notice that each wine has its own character, from its color, texture, and fruitiness to its aroma, acidity, and dryness. More importantly, we observe that each characteristic is captured by several descriptive words. Wines that are “fruitier” or “fresh” will contain those words in their reviews, whereas “bright” or “spicy” wines “infused with berries” will contain these words in theirs. If we wanted to gain some insight into whether certain wines share similarities in “acidity” or in “aroma,” we could run TF-IDF on each review and compare their word similarities. Taking it a step further, we can use this data to power machine learning models that predict similar wines, like a wine recommender system!


To view the csv file in the terminal:

  • head filename.csv #displays first 10 lines of csv
  • head -n 7 filename.csv #displays first 7 lines of csv
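Alternatively, you can take a quick peek from Python itself. Here’s a small sketch, assuming pandas is installed and the file is saved as winedata.csv (the filename used in the next section):

  import pandas as pd

  # Load the wine reviews csv and peek at the first few rows
  reviews = pd.read_csv("winedata.csv")
  print(reviews.head())               # first 5 rows
  print(reviews["description"][0])    # the first wine review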


Reading, Tokenizing, Parsing

Our goal is to take our wine dataset and make it useable, so the first thing we need to do is read, extract, and clean our data.

To read a csv file, we found the following approach useful. (We import the standard-library csv module here, along with math, which we’ll need later for the logarithm in the IDF formula.)

  import csv
  import math

  file = open("winedata.csv")
  reader = csv.reader(file)

Next, we’d like to extract the reviews. (For ease of comprehension, we will say “word” instead of “term,” and “review” instead of “document.”) Since the description is the third column of the dataset, a list comprehension like row[2] for row in reader will do the trick.

Observe that each review is a string. In order to run TF-IDF, we need to convert each string into a list of individual words, a process called tokenization. This can be done with the .split() method. We can also clean up our data by lowercasing all characters and removing certain punctuation (we used .replace() for simplicity).

  data = [
    [(word.replace(",", "")
          .replace(".", "")
          .replace("(", "")
          .replace(")", ""))
    for word in row[2].lower().split()]
    for row in reader]
    
  #Removes header
  data = data[1:]

Calling data[0],

  >>> data[0]
  ['aromas', 'include', 'tropical', 'fruit', 'broom', 'brimstone',
  'and', 'dried', 'herb', 'the', 'palate', "isn't", 'overly',
  'expressive', 'offering', 'unripened', 'apple', 'citrus',
  'and', 'dried', 'sage', 'alongside', 'brisk', 'acidity']

Computing a TF Map

TF(term) = # of times the term appears in document / total # of terms in document
Now that our data is usable, we’d like to start computing the TF and the IDF. Recall that computing the TF of a word in a document requires us to calculate the total number of words in a review and the number of times each word appears in that review. We can store each (word, word count) pair in a dictionary, so the keys of the dictionary are just the unique words in the review. The following function takes in a review and outputs a tf dictionary for that review.

  def computeReviewTFDict(review):
    """ Returns a tf dictionary for each review whose keys are all 
    the unique words in the review and whose values are their 
    corresponding tf.
    """
    #Counts the number of times the word appears in review
    reviewTFDict = {}
    for word in review:
        if word in reviewTFDict:
            reviewTFDict[word] += 1
        else:
            reviewTFDict[word] = 1
    #Computes tf for each word           
    for word in reviewTFDict:
        reviewTFDict[word] = reviewTFDict[word] / len(review)
    return reviewTFDict

We then call our function on every review and store all these tf dictionaries. We do this using a list comprehension, where each index maps to a review’s tf dictionary.
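Concretely, that looks like:

  #Stores the tf dictionary of each review
  tfDict = [computeReviewTFDict(review) for review in data]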

  >>> tfDict[0]
  {'aromas': 0.041666666666666664, 'include': 0.041666666666666664, 
   'tropical': 0.041666666666666664, 'fruit': 0.041666666666666664, 
   'broom': 0.041666666666666664, 'brimstone': 0.041666666666666664, 
   'and': 0.08333333333333333, 'dried': 0.08333333333333333, 
   'herb': 0.041666666666666664, 'the': 0.041666666666666664, 
   'palate': 0.041666666666666664, "isn't": 0.041666666666666664, 
   'overly': 0.041666666666666664, 'expressive': 0.041666666666666664, 
   'offering': 0.041666666666666664, 'unripened': 0.041666666666666664, 
   'apple': 0.041666666666666664, 'citrus': 0.041666666666666664, 
   'sage': 0.041666666666666664, 'alongside': 0.041666666666666664, 
   'brisk': 0.041666666666666664, 'acidity': 0.041666666666666664}

Computing an IDF Map

IDF(term) = log(total # of documents / # of documents with term in it)
Computing the idf of a word requires us to know the total number of documents and the number of documents that contain the word. In our case, the total number of documents is just len(data), the number of wine reviews. For each review, we increment the document count of each unique word it contains, using the keys of the tf dictionaries we calculated in the TF step to get each review’s unique set of words. The resulting IDF dictionary’s keys will be the set of all unique words across every document.

  def computeCountDict():
    """ Returns a dictionary whose keys are all the unique words in
    the dataset and whose values count the number of reviews in which
    the word appears.
    """
    countDict = {}
    # Run through each review's tf dictionary and, for each word, count the number of reviews containing it
    for review in tfDict:
        for word in review:
            if word in countDict:
                countDict[word] += 1
            else:
                countDict[word] = 1
    return countDict

  #Stores the review count dictionary
  countDict = computeCountDict()

We can index into countDict using a word.

  >>> countDict["aromas"]
  39089

Finally, we can compute an idfDict, using countDict and some math, and store it.

  def computeIDFDict():
    """ Returns a dictionary whose keys are all the unique words in the
    dataset and whose values are their corresponding idf.
    """
    idfDict = {}
    for word in countDict:
        idfDict[word] = math.log(len(data) / countDict[word])
    return idfDict
  
  #Stores the idf dictionary
  idfDict = computeIDFDict()

  >>> idfDict["aromas"]
  1.2013935061705585

Computing the TF-IDF Map

TF-IDF(term) = TF(term) * IDF(term)
The last step is to compute the TF-IDF. We take our existing tf dictionaries and simply multiply each value by the word’s idf. Since idfDict contains every unique word in the dataset, each word in a review’s tf dictionary has a corresponding idf.

  def computeReviewTFIDFDict(reviewTFDict):
    """ Returns a dictionary whose keys are all the unique words in the
    review and whose values are their corresponding tfidf.
    """
    reviewTFIDFDict = {}
    #For each word in the review, we multiply its tf and its idf.
    for word in reviewTFDict:
        reviewTFIDFDict[word] = reviewTFDict[word] * idfDict[word]
    return reviewTFIDFDict

  #Stores the TF-IDF dictionaries
  tfidfDict = [computeReviewTFIDFDict(review) for review in tfDict]

  >>> tfidfDict[0]
  {'aromas': 0.0500580627571066, 'include': 0.219553440312544,
   'tropical': 0.14996823412144533, 'fruit': 0.048533342887594366, 
   'broom': 0.2713490182801833, 'brimstone': 0.3160588285667858,
   'and': 0.0017742747855027446, 'dried': 0.23995308337440016,
   'herb': 0.1266341645904876, 'the': 0.008924560213578504, 
   'palate': 0.05209040004433979, "isn't": 0.22238812407887013, 
   'overly': 0.22941543099679262, 'expressive': 0.2227900564560287, 
   'offering': 0.16452597884055295, 'unripened': 0.4448522641233823, 
   'apple': 0.10024349710803793, 'citrus': 0.10337619569298064, 
   'sage': 0.18008684575682601, 'alongside': 0.1326848799224944, 
   'brisk': 0.1718509198413495, 'acidity': 0.055808050591909825}

Constructing A Vector

Now that we have our TF-IDF dictionaries, we can create our matrix. Our matrix will be an array of vectors, where each vector represents a review. Each vector is a list with one entry per unique word in the dataset: the word’s TF-IDF value if it appears in the review, or 0.0 otherwise.

  # Create a list of unique words
  wordDict = sorted(countDict.keys())

  def computeTFIDFVector(review):
      tfidfVector = [0.0] * len(wordDict)
     
      # For each unique word, if it is in the review, store its TF-IDF value.
      for i, word in enumerate(wordDict):
          if word in review:
              tfidfVector[i] = review[word]
      return tfidfVector

  tfidfVector = [computeTFIDFVector(review) for review in tfidfDict]

The first row of our matrix, the vector for the first review, looks something like this:


  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0.055808050591909825, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0, 0, 0.1326848799224944, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0, 0.0017742747855027446, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0.10024349710803793, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0500580627571066, 0, 0, ... ]

Here the non-zero values correspond to the words [..., 'acidity', ..., 'alongside', ..., 'and', ..., 'apple', ..., 'aromas', ...,] respectively in the first review.

The list comprehension above runs computeTFIDFVector on every review and stores the results, giving us our full matrix.


Comparing TF-IDF Vectors

We can calculate the similarity between two TF-IDF vectors. This translates into how similar two reviews are and can be built upon to do more specific tasks. We will use cosine similarity to do so. It is defined as follows:

COS-SIM(vector_x, vector_y) = (vector_x dot vector_y) / (magnitude(vector_x) * magnitude(vector_y))

To calculate the dot product of two vectors, you multiply their elements pairwise, then add the products together.

  def dot_product(vector_x, vector_y):
    dot = 0.0
    for e_x, e_y in zip(vector_x, vector_y):
       dot += e_x * e_y
    return dot

To calculate the magnitude of a vector, we sum the squares of its elements, then take the square root of the total. In Python it looks like the following:

  def magnitude(vector):
    mag = 0.0
    for element in vector:
      mag += math.pow(element, 2)
    return math.sqrt(mag)

So concretely, the similarity of the first two reviews can be calculated as follows:

  review_similarity = (dot_product(tfidfVector[0], tfidfVector[1])
    / (magnitude(tfidfVector[0]) * magnitude(tfidfVector[1])))
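If you plan to compare many pairs of reviews, it can help to wrap these pieces into a single helper. Here’s a small sketch; the cosine_similarity name and the zero-vector guard are our own additions (an all-zero vector would otherwise cause a division by zero):

  def cosine_similarity(vector_x, vector_y):
    """ Returns the cosine similarity of two TF-IDF vectors,
    or 0.0 if either vector is all zeros.
    """
    denominator = magnitude(vector_x) * magnitude(vector_y)
    if denominator == 0:
        return 0.0
    return dot_product(vector_x, vector_y) / denominator

  # Sanity check: a review compared with itself should score roughly 1.0
  print(cosine_similarity(tfidfVector[0], tfidfVector[0]))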

Closing Thoughts

Now that we know how to build a simple yet powerful tool for transforming text into numbers, the next step is to put our data to use. Our TF-IDF vectors will be the backbone of more complex and interesting tasks. For example, we can build a wine recommender, predict a wine’s vineyard from its review, or calculate tags for each wine based on this matrix. We will do all of the above and more in upcoming blog posts, so stay tuned!

