Text Mining

Natural languages (English, Hindi, Mandarin, etc.) are different from programming languages. The semantics, or meaning, of a statement depends on the context, tone, and a lot of other factors. Unlike programming languages, natural languages are ambiguous.
Text mining deals with helping computers understand the “meaning” of text. Some common text mining applications include sentiment analysis, e.g. determining whether a tweet about a movie is positive or negative, and text classification, e.g. classifying the mail you receive as spam or ham.

Packages in R:
  1. RSQLite, ‘SQLite’ interface for R
  2. tm, a framework for text mining applications
  3. SnowballC, a text stemming library
  4. wordcloud, for making word cloud visualizations
  5. syuzhet, for text sentiment analysis
  6. ggplot2, one of the best data visualization libraries
  7. quanteda, for N-grams
You can install the aforementioned packages using the following command:
install.packages("package name")
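For example, all seven packages listed above can be installed in one call (this only needs to be done once per R installation):

# Install every package used in this tutorial
install.packages(c("RSQLite", "tm", "SnowballC", "wordcloud",
                   "syuzhet", "ggplot2", "quanteda"))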

Text preprocessing
Before we dive into analyzing text, we need to preprocess it. Text data contains white space, punctuation, stop words, etc. These characters do not convey much information and are hard to process. For example, English stop words like “the” and “is” tell you little about the sentiment of the text, the entities mentioned in it, or the relationships between those entities. Depending on the task at hand, we deal with such characters differently; removing them helps focus the text mining in R on the important words.
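For instance, you can peek at the English stop word list that the tm package (used throughout this tutorial) removes; the exact entries depend on your tm version, but the list starts with very common function words:

library(tm)
head(stopwords("english"))
# typically something like: "i" "me" "my" "myself" "we" "our"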

Word cloud
A word cloud is a simple yet informative way to understand textual data and to do text analysis. In this example, we will visualize Hillary Clinton’s emails. This will help us quantify the content of the emails, derive insights, and better communicate our results. Along the way, we’ll also learn about some data preprocessing steps that will be immensely helpful in other text mining tasks as well. Let’s start by getting the data. You can head over to Kaggle to download the dataset.

Let’s read the data and learn to implement the preprocessing steps.

library(RSQLite)
db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")

# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON e.SenderPersonId=P.Id WHERE p.Name='Hillary Clinton'  AND e.ExtractedBodyText != '' ORDER BY RANDOM()")
emailRaw <- paste(emailHillary$EmailBody, collapse=" // ")

The above code reads the “database.sqlite” file into R. SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process; it reads and writes directly to ordinary disk files. So you can read an SQLite file much as you would read a CSV or text file, and the same overall workflow applies to any CSV, text, or other input file you can load into R; only the reading step changes.
Here, we’ll use the RSQLite package to read in a SQLite file containing all of Hillary Clinton’s emails, and then query the column containing the email body text. With that done, we’ll be ready to analyze the Clinton emails that shaped this political season.
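As a sketch of that idea, if the emails lived in a hypothetical CSV export (say, emails.csv with the same ExtractedBodyText column), only the reading step would change:

# Hypothetical alternative: start the same pipeline from a CSV file
emails_df <- read.csv("emails.csv", stringsAsFactors = FALSE)
emailRaw  <- paste(emails_df$ExtractedBodyText, collapse = " // ")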

We’ll perform the following steps to make sure that the text we’re mining in R is clean:
  • Convert the text to lower case, so that words like “write” and “Write” are treated as the same word for analysis
  • Remove numbers
  • Remove English stop words, e.g. “the”, “is”, “of”, etc.
  • Remove punctuation, e.g. “,”, “?”, etc.
  • Eliminate extra white spaces
  • Stem the text

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, e.g. changing “car”, “cars”, “car’s”, and “cars’” to “car”. This also collapses different verb forms with the same semantic meaning, such as “digs”, “digging”, and “dig”.
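As a quick illustration, you can try the Porter stemmer from the SnowballC package (which powers tm’s stemDocument() used below) directly; the exact output may vary slightly by version:

library(SnowballC)
wordStem(c("car", "cars", "digging", "digs"), language = "english")
# roughly: "car" "car" "dig" "dig"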
One very useful library for performing the aforementioned steps of text mining in R is the “tm” package. The main structure for managing documents in tm is called a Corpus, which represents a collection of text documents.

#Cleaning text in R
# Transform and clean the text
library("tm")
docs <- Corpus(VectorSource(emailRaw))

Once we have our email corpus (all of Hillary’s emails) stored in the variable “docs”, we want to modify the words within the emails using the techniques we discussed above, such as stemming and stop word removal. With the tm library this is straightforward: transformations are applied via the tm_map() function, which maps a function over all elements of the corpus. In other words, each transformation works on a single text document, and tm_map() simply applies it to every document in the corpus. For example, to convert all the text of Hillary’s emails to lower case at once, you would use tm_map() with the tolower transformer, as shown below.

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

To stem text, we will need another library, known as SnowballC. 

# Text stemming (reduces words to their root form)
library("SnowballC")
docs <- tm_map(docs, stemDocument)
# Remove additional stopwords
docs <- tm_map(docs, removeWords, c("clintonemailcom", "stategov", "hrod"))

A document-term matrix is an important representation for text mining tasks in R and an important concept in text analytics. Each row of the matrix is a document vector, with one column for every term in the entire corpus.
Naturally, some documents may not contain a given term, so this matrix is sparse. The value in each cell of the matrix is the term frequency. The ‘tm’ package makes it very easy to create such a matrix; below we actually build its transpose, a term-document matrix, in which each row is a term and each column a document. With the matrix built, we can proceed to create a word cloud of Hillary’s emails, highlighting which words are used most frequently.
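As a minimal sketch of the idea on a tiny made-up corpus (not the email data), you can see the document-term structure directly:

# Two toy documents, just to illustrate the matrix layout
toy <- Corpus(VectorSource(c("emails about benghazi", "emails about emails")))
as.matrix(DocumentTermMatrix(toy))
# Rows are documents, columns are terms, cells are term frequencies, e.g.:
#      about benghazi emails
# 1        1        1      1
# 2        1        0      2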

Building the term-document matrix

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
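Besides inspecting the top of the frequency table with head(), tm also offers helpers for querying the matrix directly; for example (the threshold of 50 is an arbitrary choice):

# Terms that appear at least 50 times across the corpus
findFreqTerms(dtm, lowfreq = 50)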

Generating a wordcloud of Hillary's emails

# Generate the WordCloud
library("wordcloud")
library("RColorBrewer")
png(file="WordCloud.png", width=1000, height=700, bg="grey30")
par(bg="grey30")
wordcloud(d$word, d$freq, col=terrain.colors(length(d$word), alpha=0.9), random.order=FALSE, rot.per=0.3 )
title(main = "Hillary Clinton's Most Used Words in the Emails", font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
dev.off()

Sentiment Analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. Here, we’ll work with the package “syuzhet”. Just as in the previous example, we’ll read the emails from the database.

Read emails into syuzhet
Emails <- data.frame(dbGetQuery(db,"SELECT * FROM Emails"))
library('syuzhet')

“syuzhet” uses the NRC Emotion Lexicon, a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
The get_nrc_sentiment function returns a data frame in which each row represents one element of the input text (here, one email). The columns include one for each emotion type as well as the positive and negative sentiment valence. It allows us to take a body of text and see which emotions it expresses, and also whether the overall sentiment is positive or negative.
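As a quick illustration on a single made-up sentence (the exact counts depend on the lexicon shipped with your syuzhet version):

library(syuzhet)
get_nrc_sentiment("I am delighted with this wonderful result")
# Returns a one-row data frame with counts for the eight emotions
# plus the negative and positive columns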

##Do sentiment analysis of Hillary's emails
d<-get_nrc_sentiment(Emails$RawText)
td<-data.frame(t(d))

td_new <- data.frame(rowSums(td[2:7945]))
#rowSums adds up, for each emotion (each row of td), the counts across the email columns

#Transformation and  cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2 <- td_new[1:8, ]  # keep only the eight emotions (drop negative/positive)
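An equivalent, slightly more direct way to get the same emotion totals (a sketch reusing d from above) is to sum each emotion column instead of transposing first:

# Sum each of the eight emotion columns across all emails
emotion_totals <- colSums(d[, 1:8])
td_alt <- data.frame(sentiment = names(emotion_totals),
                     count = as.numeric(emotion_totals))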

Now, we’ll use “ggplot2” to create a bar graph. Each bar represents how prominent each emotion is in the text.

##Graph the sentiment analysis in ggplot2
#Visualisation
library("ggplot2")
qplot(sentiment, data=td_new2, weight=count, geom="bar",fill=sentiment)+ggtitle("Email sentiments")
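Note that qplot() is deprecated in recent ggplot2 releases; an equivalent plot using the standard ggplot() interface would look roughly like this:

ggplot(td_new2, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  ggtitle("Email sentiments")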

N-grams
You must have noticed YouTube’s auto-captioning feature. Auto-captioning is a speech recognition problem, and one useful component of generating captions from audio is predicting which word comes next after a given sequence of words.
E.g.     I’d like to make a …
Hopefully, you concluded that the next word in the sequence is “call”. We do this by first analyzing which words frequently co-occur, and we formalize that idea with N-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. In other words, we’ll be finding collocations: a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. An example of this would be the phrase “very much”.
In this section, we’ll use the R library “quanteda” to compute trigrams, i.e. commonly occurring sequences of three words.

##Calculating trigrams in quanteda
library(tm)
library(RSQLite)
library(quanteda)

db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")

# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON e.SenderPersonId=P.Id WHERE p.Name='Hillary Clinton' AND e.ExtractedBodyText != '' ORDER BY RANDOM()")
emails <- paste(emailHillary$EmailBody, collapse=" // ")

We will use quanteda’s collocations function to do so. Finally, we’ll remove stop words from the collocations so that we get a clear view of the most frequently used three-word sequences in Hillary’s emails.

## Compute collocations and remove stopwords
collocations(emails, size = 2:3)

print(removeFeatures(collocations(emails, size = 2:3), stopwords("english")))
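Note that collocations() and removeFeatures() come from older quanteda releases. In current versions (roughly 2.0 onwards, together with the companion quanteda.textstats package), the equivalent would be along these lines:

# Modern quanteda sketch: tokenise, drop punctuation/numbers and stop words,
# then score two- and three-word collocations
library(quanteda)
library(quanteda.textstats)
toks <- tokens(emails, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
head(textstat_collocations(toks, size = 2:3), 20)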

