TF-IDF using R

This article introduces TF-IDF, a concept widely used in natural language processing.

First, let us see what TF-IDF is. To understand the concepts of TF and IDF, we need to know what a corpus and a document are.

Document: A piece of writing, which is a collection of sentences and facts (denoted by d).

Corpus: A collection of documents (denoted by D).

For example,

  1. A website is a document and a collection of websites is a corpus.
  2. A chapter of a book is a document and the book (a collection of chapters) is a corpus.

TF (Term Frequency):

It is a numerical statistic that measures how often a particular term (t) occurs in a document (d), i.e. it is based on the number of times the term t occurs in document d [denoted by tf(t,d)].

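The formula used to measure tf is the relative frequency of the term in the document:

$$\mathrm{tf}(t,d) = \frac{f(t,d)}{\sum_{t' \in d} f(t',d)}$$

where f(t,d) = number of occurrences of the term t in document d, and the denominator is the total number of terms in d.

We can also use the log-scaled form $\log[1 + f(t,d)]$, but the relative frequency formula is the one most commonly used.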
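The two variants of term frequency can be compared on a small example. The following is a minimal base R sketch; the toy sentence and the variable names (`doc`, `terms`, `f`) are assumptions made for illustration and are not part of the article's data.

```r
# Toy document (hypothetical example text)
doc <- "the cat sat on the mat"

# Split on single spaces to get the terms of the document
terms <- strsplit(doc, " ", fixed = TRUE)[[1]]

# f(t, d): raw count of each term, stored as a named vector
counts <- table(terms)
f <- setNames(as.vector(counts), names(counts))

# Relative-frequency tf: count divided by the total number of terms in d
tf_relative <- f / sum(f)

# Log-scaled alternative: log(1 + f(t, d))
tf_log <- log(1 + f)

tf_relative[["the"]]  # 2 occurrences out of 6 terms
tf_log[["the"]]       # log(1 + 2)
```

The relative frequencies of all terms in a document sum to 1, which makes documents of different lengths comparable; the log-scaled form does not have this property.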

IDF (Inverse Document Frequency):

It is also a numerical statistic, which measures the amount of information the word provides. It is based on the number of documents (in corpus D) in which the term (t) occurs: the fewer documents contain the term, the higher its IDF.

The formula is as follows:

$$\mathrm{idf}(t,D) = \log\frac{N}{n_t}$$

where

n_t = number of documents in the corpus D in which the term t occurs

N = total number of documents in the corpus D.

Now, n_t can be zero, i.e. the term may not be present in the corpus, which would mean dividing by zero. So we replace the denominator of the above formula by 1 + n_t:

$$\mathrm{idf}(t,D) = \log\frac{N}{1 + n_t}$$
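To see how the two denominators behave, here is a small base R sketch. The three-document corpus and the names `corpus`, `vocabulary`, `idf_plain` and `idf_smoothed` are hypothetical and chosen only for this illustration.

```r
# Toy corpus: three documents, each already split into terms (made-up data)
corpus <- list(
  d1 = c("the", "cat", "sat"),
  d2 = c("the", "dog", "barked"),
  d3 = c("the", "cat", "slept")
)

N <- length(corpus)  # total number of documents in the corpus

# n_t: number of documents that contain each term at least once
vocabulary <- sort(unique(unlist(corpus)))
n_t <- vapply(
  vocabulary,
  function(term) sum(vapply(corpus, function(d) term %in% d, logical(1))),
  numeric(1)
)

idf_plain    <- log(N / n_t)        # breaks down if n_t is 0
idf_smoothed <- log(N / (1 + n_t))  # defined even for unseen terms

idf_plain[c("the", "cat", "dog")]
idf_smoothed[c("the", "cat", "dog")]
```

One consequence worth noticing: with the 1 + n_t denominator, a term that occurs in every document gets a slightly negative IDF (here log(3/4)) instead of exactly zero.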

Origin of Concept:

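Let X be a random variable where X ∈ (collection of all the words in a corpus).

Now, from the language model derived by Shannon, we can say that the probability model of the random variable X is:

$$P(X = \text{word}_i) = p_i, \quad i = 1, \dots, n, \quad \sum_{i=1}^{n} p_i = 1$$

where P(X = word_i) = p_i is the probability of occurrence of the i-th word in the corpus.

Now consider the information value of a word x in a corpus. By information value I mean its entropy, the expected value of the self-information $I(x) = -\log P(x)$:

$$H(X) = E_X[I(x)] = -\sum_{i=1}^{n} p_i \log p_i$$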

This information value of a word can be measured by the TF-IDF statistic:

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)$$
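The product of the two statistics can be worked through by hand on a tiny corpus. This is a minimal base R sketch under stated assumptions: the corpus is made up, the relative-frequency form of tf is used, and idf is computed as log(N / n_t) without the 1 + n_t adjustment.

```r
# Toy corpus: three documents, each already split into terms (made-up data)
corpus <- list(
  d1 = c("the", "cat", "sat", "the"),
  d2 = c("the", "dog", "barked"),
  d3 = c("the", "cat", "slept")
)
vocabulary <- sort(unique(unlist(corpus)))

# tf matrix: one row per term, one column per document (relative frequency)
tf <- vapply(
  corpus,
  function(d) as.vector(table(factor(d, levels = vocabulary))) / length(d),
  numeric(length(vocabulary))
)
rownames(tf) <- vocabulary

# idf per term: log(N / n_t), where n_t counts the documents containing the term
n_t <- rowSums(tf > 0)
idf <- log(length(corpus) / n_t)

# tf-idf: each row of tf (one term) is multiplied by that term's idf
tf_idf <- tf * idf
round(tf_idf, 3)
```

In this sketch "the" occurs in every document, so its tf-idf is 0 everywhere, while "dog", which occurs in only one document, gets the largest weight in d2.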

TF-IDF in R:

Before applying TF-IDF in R, we have to pre-process the given data so that it is converted into the tidy text format.

Tidy data has a specific structure:

• Each variable is a column.

• Each observation is a row.

• Each type of observational unit is a table.

The tidy text format is thus defined as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

For tidy text mining, the token that is stored in each row is most often a single word, but it can also be an n-gram, sentence, or paragraph. The tidytext package provides functionality to tokenize by commonly used units of text like these and to convert to a one-term-per-row format.
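As a small illustration of the one-token-per-row format, the sketch below tokenizes two made-up lines of text with `unnest_tokens()`. The sample sentences and the column names `line` and `word` are assumptions for this example; the dplyr and tidytext packages are assumed to be installed.

```r
library(dplyr)
library(tidytext)

# Two short lines of text (hypothetical sample data)
text_df <- tibble(
  line = 1:2,
  text = c("The quick brown fox.", "It jumps over the lazy dog!")
)

# One word per row; by default unnest_tokens() lowercases the text
# and strips punctuation
tidy_words <- text_df %>%
  unnest_tokens(word, text)

tidy_words
```

The other columns of the input (here `line`) are kept, so each word can still be traced back to the row it came from.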

First, let us see which libraries are required:

  1. dplyr
  2. tidytext
  3. janeaustenr (this library contains the text of Jane Austen’s six completed, published novels)

The following flowchart describes the workflow done in R.

R code:

library(dplyr)
library(janeaustenr)
library(tidytext)

# Within the tidy text framework, we need to both break the text into
# individual tokens (a process called tokenization) and transform it into a
# tidy data structure. To do this, we use the unnest_tokens() function.

# "%>%" is the pipe operator: x %>% f(y) is equivalent to f(x, y),
# so g(x) %>% f() is the same as f(g(x)).

# Step 1: tokenize each novel into words and count every word per book.
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

# Step 2: compute the total number of words in each book and attach it
# to the per-word counts to create the table.
total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words, by = "book")
book_words

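# Step 3: rank the words within each book (rows are already sorted by n)
# and compute the relative term frequency.
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         `term frequency` = n / total)
freq_by_rank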

# Step 4: bind_tf_idf() calculates the tf, idf and tf-idf of each word,
# given the term column (word), the document column (book) and the counts (n).
book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

R Output:

Step 1 & 2:

Step 3:

Step 4:

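From the above output you can see that common, unimportant words (like 'the', 'a', 'to', 'of') have a tf-idf value of 0. These words occur in all six novels, so bind_tf_idf(), which computes idf as ln(N / n_t) without the 1 + n_t adjustment, gives them idf = ln(6/6) = 0.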

A few more illustrative examples are shown below.

Paulo Coelho’s Quotations:

library(dplyr)
library(tidytext)

text <- c("And, when you want something, all the universe conspires in helping you to achieve it",
          "It's the possibility of having a dream come true that makes life interesting.",
          "There is only one thing that makes a dream impossible to achieve: the fear of failure.",
          "The secret of life, though, is to fall seven times and to get up eight times.",
          "Don't give in to your fear.",
          "If you do, you wouldn't be able to talk to your heart.",
          "Remember that wherever your heart is there you will find your treasure.")

# data_frame() is deprecated; tibble() is its replacement
text_df <- tibble(text = text)

# One word per row, then count each word across all the quotations
tidy_text_df <- text_df %>% unnest_tokens(words, text)
tidy_text_df <- tidy_text_df %>% count(words, sort = TRUE) %>% ungroup()

# Drop stop words; the stop_words lexicon keeps its terms in the column "word"
stop_tidy_text_df <- tidy_text_df %>% anti_join(stop_words, by = c("words" = "word"))

# Total number of words, attached as a column
total_words_1 <- tidy_text_df %>% summarize(total = sum(n))
tidy_text_df <- tidy_text_df %>% mutate(total = total_words_1$total)

freq_rank <- tidy_text_df %>% mutate(rank = row_number(),
                                     `term frequency` = n / total)

# bind_tf_idf() needs a document column. All quotations are pooled here,
# so every word belongs to the same single document.
tidy_text_df <- tidy_text_df %>%
  mutate(document = "quotations") %>%
  bind_tf_idf(words, document, n)

# As there is only one document in the corpus, idf = log(1/1) = 0 for every
# word, so the tf-idf is not informative here.
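One way around the single-document limitation is to treat each quotation as its own document, so that IDF has something to compare across. The following is only a sketch of that idea with tidytext: it uses three of the quotations, and the `document` id column and the object names are assumptions introduced for this example.

```r
library(dplyr)
library(tidytext)

quotes <- c(
  "Don't give in to your fear.",
  "If you do, you wouldn't be able to talk to your heart.",
  "Remember that wherever your heart is there you will find your treasure."
)

# Give every quotation its own document id (an assumption of this sketch)
quote_df <- tibble(document = seq_along(quotes), text = quotes)

quote_tf_idf <- quote_df %>%
  unnest_tokens(word, text) %>%              # one word per row
  count(document, word, sort = TRUE) %>%     # n = count of word in document
  bind_tf_idf(word, document, n) %>%         # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

quote_tf_idf
```

With this setup, "your" appears in all three quotations and therefore gets a tf-idf of 0, while a word such as "fear", which appears in only one of them, gets a positive weight.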

Crude Way Out (treating each quotation as a separate document):

library(tm)

doc <- c("And, when you want something, all the universe conspires in helping you to achieve it",
         "It's the possibility of having a dream come true that makes life interesting.",
         "There is only one thing that makes a dream impossible to achieve: the fear of failure.",
         "The secret of life, though, is to fall seven times and to get up eight times.",
         "Don't give in to your fear.",
         "If you do, you wouldn't be able to talk to your heart.",
         "Remember that wherever your heart is there you will find your treasure.")

# create term frequency matrix using functions from tm library
doc_corpus <- Corpus(VectorSource(doc))
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)

# convert to an ordinary matrix (rows = terms, columns = documents)
tf <- as.matrix(tdm)

# idf with the smoothed denominator 1 + n_t
idf <- log(ncol(tf) / (1 + rowSums(tf != 0)))
idf

# multiply each term's counts by its idf;
# the result has documents as rows and terms as columns
idf <- diag(idf)
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)
tf_idf

# scale each document (row) to unit Euclidean length
tf_idf / sqrt(rowSums(tf_idf^2))

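For a given data file:

install.packages(c('tm', 'SnowballC', 'wordcloud', 'topicmodels'))

library(tm)
library(SnowballC)
library(wordcloud)

# "file name" is a placeholder for the path to your text file;
# reading it with readLines() is an assumption about the intended input
edited_corpus <- Corpus(VectorSource(readLines("file name")))

# clean the text: lower case, then drop numbers, punctuation,
# stop words and extra whitespace
edited_corpus <- tm_map(edited_corpus, content_transformer(tolower))
edited_corpus <- tm_map(edited_corpus, removeNumbers)
edited_corpus <- tm_map(edited_corpus, removePunctuation)
edited_corpus <- tm_map(edited_corpus, removeWords, c("the", "and", stopwords("english")))
edited_corpus <- tm_map(edited_corpus, stripWhitespace)
inspect(edited_corpus[1])

# document-term matrix and the terms that occur at least 1000 times
review <- DocumentTermMatrix(edited_corpus)
findFreqTerms(review, 1000)

# word cloud of the 50 most frequent terms
# (brewer.pal() needs at least 3 colours; "Dark2" offers up to 8)
freq <- data.frame(sort(colSums(as.matrix(review)), decreasing = TRUE))
wordcloud(rownames(freq), freq[, 1], max.words = 50, colors = brewer.pal(8, "Dark2"))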


Kounteyo Roy Chowdhury

MSc in Applied Statistics.
Data Scientist specializing in AI-NLP

Mathematica-City

Mathematica-city is an online Education forum for Science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different Statistical Software. We also provide freelancing services on the aforementioned topics. Feel free to reach out to us for any kind of discussion on any of the related topics.
