Natural languages (English, Hindi, Mandarin etc.) are different from programming
languages. The semantics, or meaning, of a statement depends on the context,
tone and a lot of other factors. Unlike programming languages, natural
languages are ambiguous.
Text mining deals with helping computers understand the “meaning” of the text. Some
of the common text mining applications include sentiment analysis, e.g. deciding
whether a Tweet about a movie says something positive or not, and text
classification, e.g. classifying the mails you get as spam or ham.
Packages in R:
- RSQLite: ‘SQLite’ interface for R
- tm: framework for text mining applications
- SnowballC: text stemming library
- wordcloud: for making word cloud visualizations
- syuzhet: text sentiment analysis
- ggplot2: one of the best data visualization libraries
- quanteda: quantitative analysis of textual data
You can install the aforementioned
packages using the following command:
install.packages("package name")
Text preprocessing
Before we dive into analyzing text,
we need to preprocess it. Text data contains white space, punctuation, stop
words etc. These tokens do not convey much information and are hard to
process. For example, English stop words like “the”, “is” etc. do not tell you
much about the sentiment of the text, the entities mentioned in the
text, or the relationships between those entities. Depending upon the task at hand,
we deal with such tokens differently. Removing them helps the analysis focus
on the important words.
Word cloud
A word cloud is a simple yet
informative way to understand textual data and to do text analysis. In this
example, we will try to visualize Hillary Clinton’s emails. This will help us
quantify the content of the emails, derive insights and better
communicate our results. Along the way, we’ll also learn about some data
preprocessing steps that will be immensely helpful in other text mining tasks
as well. Let’s start with getting the data. You can head over to Kaggle to download
the dataset.
Let’s read
the data and learn to implement the preprocessing steps.
library(RSQLite)
# Connect to the SQLite database file (adjust the path to where you saved it)
db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")
# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT
ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON
e.SenderPersonId=p.Id WHERE p.Name='Hillary Clinton' AND
e.ExtractedBodyText != '' ORDER BY RANDOM()")
# Combine all email bodies into a single string
emailRaw <- paste(emailHillary$EmailBody, collapse=" // ")
The above code reads the “database.sqlite” file into R.
SQLite is an embedded SQL database engine. Unlike most other SQL databases,
SQLite does not have a separate server process. SQLite reads and writes
directly to ordinary disk files. So, you can read an SQLite file much as you
would read a CSV or a text file. The rest of the workflow applies equally to
a CSV, a text file or any other input that you can load into R, though
you would read it in with a different function.
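As a rough illustration of that last point, here is a minimal sketch of loading email-like text from a CSV file instead of SQLite. The temporary file, its EmailBody column and the two sample rows are made up for the example; only base R functions are used.

```r
# Hypothetical sample data written to a temporary CSV (not the Kaggle dataset)
csv_path <- tempfile(fileext = ".csv")
sample_emails <- data.frame(
  EmailBody = c("Please call me tomorrow.", "Thanks, see you at the meeting."),
  stringsAsFactors = FALSE
)
write.csv(sample_emails, csv_path, row.names = FALSE)

# Read the CSV back and combine the bodies into one string,
# mirroring what the SQLite code does with paste()
emails_from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
email_text <- paste(emails_from_csv$EmailBody, collapse = " // ")
print(email_text)
```

From this point on, email_text could be handed to the same cleaning steps as the text pulled from SQLite.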
Here, we’ll use the package RSQLite to read in a SQLite file containing all of
Hillary Clinton’s emails. Next, we query the column containing the
email text body. Then we’ll be ready to do an analysis of the Clinton
emails that shaped this political season.
We’ll perform the following steps to make sure that the text we’re
dealing with is clean:
- Convert the text to lower case, so that words like “write” and “Write” are considered the same word for analysis
- Remove numbers
- Remove English stopwords, e.g. “the”, “is”, “of”, etc.
- Remove punctuation, e.g. “,”, “?”, etc.
- Eliminate extra white spaces
- Stem the text
Stemming is
the process of reducing inflected (or sometimes derived) words to their word
stem, base or root form, e.g. changing “car”, “cars”, “car’s”, “cars’” to
“car”. This can also help with different verb forms that share the same
meaning, such as “digs”, “digging” and “dig”.
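To make these steps concrete before bringing in any packages, here is a minimal base R sketch on a single toy sentence. The sentence and the tiny stop word list are assumptions for illustration, and for simplicity the sketch strips punctuation before dropping stop words. Stemming is only attempted if the SnowballC package happens to be installed.

```r
# Toy input and a tiny hand-picked stop word list (both assumptions for illustration)
text <- "The 2 cars are parked, and the Car's owner IS digging   in the garden?"
stop_words <- c("the", "is", "are", "and", "in", "of")

clean <- tolower(text)                          # convert to lower case
clean <- gsub("[0-9]+", "", clean)              # remove numbers
clean <- gsub("[[:punct:]]", "", clean)         # remove punctuation
words <- strsplit(clean, "[[:space:]]+")[[1]]   # split on runs of white space
words <- words[words != "" & !(words %in% stop_words)]  # drop empties and stop words
print(words)

# Stemming needs a real stemmer; use SnowballC only if it is installed
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(SnowballC::wordStem(words, language = "english"))
}
```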
A very useful library to perform the aforementioned steps in
R is the “tm” package. The main structure for managing documents in tm is
called a Corpus, which represents a collection of text documents.
Transforming text in R
library(tm)
# Transform and clean the text
docs <- Corpus(VectorSource(emailRaw))
Once we have our email corpus (all
of Hillary’s emails) stored in the variable “docs”, we’ll want to modify the
words within it with the techniques we discussed above, such
as stemming, stopword removal and more. With the tm library, this can be done
easily. Transformations are done via the tm_map() function, which applies a
function to all elements of the corpus. Basically, all transformations work on
single text documents and tm_map() just applies them to all documents in a
corpus. For example, to convert all the text of Hillary’s emails to
lowercase at once, you’d call tm_map() as in the first step below:
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs,
removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
To stem text, we will need another library, known as SnowballC.
library(SnowballC)
# Text stemming (reduces words to their root form)
docs <- tm_map(docs, stemDocument)
# Remove additional stopwords
docs <- tm_map(docs,
removeWords, c("clintonemailcom", "stategov", "hrod"))
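To see what tm_map() is doing conceptually, the following base R sketch mimics the idea with a plain list standing in for a corpus. The helper apply_to_corpus() is hypothetical and is not part of tm; it only illustrates "apply one single-document transformation to every document".

```r
# A stand-in "corpus": a plain list of documents (not a real tm Corpus)
toy_corpus <- list("Hello   WORLD", "Text Mining in R")

# Hypothetical helper that mimics the idea behind tm_map():
# apply one single-document transformation to every document
apply_to_corpus <- function(corpus, transform) {
  lapply(corpus, transform)
}

# Each step takes the previous result, just like the chained tm_map() calls
lowered <- apply_to_corpus(toy_corpus, tolower)
squeezed <- apply_to_corpus(lowered, function(doc) gsub("[[:space:]]+", " ", doc))
print(squeezed)
```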
A document term matrix is an important representation for text mining
tasks and an important concept in text analytics. Each row of the matrix
is a document vector, with one column for every term in the entire corpus.
Since some documents may not contain a given term, this matrix is sparse. The
value in each cell of the matrix is the term frequency. ‘tm’ makes it very easy
to create this matrix. The code below builds the transposed form, a
term-document matrix, in which each row is a term and each column is a
document, so that the row sums give the total frequency of each term. With the
matrix made, we can then proceed to build a word cloud for Hillary’s emails,
highlighting which words are used most frequently.
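For intuition, here is a small base R sketch that builds both forms of the matrix by hand from three toy documents. The documents and variable names are made up for illustration; tm does the equivalent work (and stores the result sparsely) for a real corpus.

```r
# Three toy documents, already cleaned (assumed for illustration)
toy_docs <- c(doc1 = "call me call", doc2 = "send email", doc3 = "call email me")

doc_tokens <- strsplit(toy_docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(doc_tokens)))

# Term-document matrix: one row per term, one column per document
tdm_toy <- sapply(doc_tokens, function(tok) {
  as.vector(table(factor(tok, levels = terms)))
})
rownames(tdm_toy) <- terms
print(tdm_toy)

# The document-term matrix is simply the transpose
dtm_toy <- t(tdm_toy)

# Total frequency of each term across all documents, most frequent first
term_freq <- sort(rowSums(tdm_toy), decreasing = TRUE)
print(term_freq)
```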
Building the term-document matrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
Generating a wordcloud of Hillary's emails
library(wordcloud)
# Generate the WordCloud
wordcloud(d$word, d$freq, colors=terrain.colors(length(d$word),
alpha=0.9), random.order=FALSE, rot.per=0.3)
title(main = "Hillary Clinton's Most Used Words in the Emails",
font.main = 1, col.main = "cornsilk3", cex.main = 1.5)
Sentiment Analysis
Sentiment analysis is the process of determining whether a piece of writing is positive,
negative or neutral. Here, we’ll work with the package “syuzhet”. Just as in the
previous example, we’ll read the emails from the database.
Read emails into syuzhet
Emails <- data.frame(dbGetQuery(db,"SELECT *
FROM Emails"))
“syuzhet” uses the NRC Emotion lexicon. The NRC emotion lexicon is a
list of words and their associations with eight emotions (anger, fear, anticipation,
trust, surprise, sadness, joy, and disgust) and two sentiments (negative and
positive).
The get_nrc_sentiment function
returns a data frame in which each row represents a sentence from the original
file. The columns include one for each emotion type as well as the positive or
negative sentiment valence. It allows us to take a body of text and return
which emotions it represents, and also whether the sentiment is positive or
negative.
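The idea behind lexicon-based scoring can be sketched in a few lines of base R. The mini lexicon below is hypothetical (three made-up entries and only two emotions, not the real NRC lexicon), and score_sentence() is an illustrative helper, not a syuzhet function. The result has one row per sentence and one column per emotion, which is the same shape described above.

```r
# Hypothetical mini lexicon: NOT the real NRC lexicon, just three made-up entries
toy_lexicon <- data.frame(
  word = c("happy", "afraid", "thanks"),
  joy  = c(1, 0, 1),
  fear = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Count, per emotion, how many lexicon words occur in one sentence
score_sentence <- function(sentence, lexicon) {
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  hits <- lexicon[match(words, lexicon$word, nomatch = 0), c("joy", "fear")]
  colSums(hits)
}

sentences <- c("Thanks, I am happy to help.", "I am afraid we are late.")
scores <- t(sapply(sentences, score_sentence, lexicon = toy_lexicon))
rownames(scores) <- NULL
print(scores)
```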
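library(syuzhet)
library(ggplot2)
##Do sentiment analysis of Hillary's emails
# Assumption: score the ExtractedBodyText column (the listing does not show which column was used)
nrc <- get_nrc_sentiment(Emails$ExtractedBodyText)
# Transpose so that each row is an emotion or sentiment and each column is an email
td <- data.frame(t(nrc))
#The function rowSums adds up the values in each row, giving one total per emotion
td_new <- data.frame(rowSums(td[2:7945]))
#Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
# Assumption: keep only the eight emotion rows and drop "negative" and "positive"
td_new2 <- td_new[1:8,]

##Graph the sentiment analysis in ggplot2
qplot(sentiment, data=td_new2, weight=count,
geom="bar",fill=sentiment)+ggtitle("Email sentiments")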
You must have noticed YouTube’s
auto-captioning feature. Auto-captioning is a speech recognition problem. One
of the steps in generating captions automatically from audio is
to predict what word comes after a given sequence of words.
E.g. I’d
like to make a …
Hopefully, you concluded that the
next word in the sequence is “call”. We do this by first analyzing what words
frequently co-occur. We formalize this by introducing N-grams. An n-gram is a
contiguous sequence of n items from a given sequence of text or speech. In
other words, we’ll be finding collocations. A collocation is a sequence of
words or terms that co-occur more often than would be expected by chance. An
example of this would be the term “very much.”
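The definition of an n-gram translates directly into code. The sketch below uses only base R; make_ngrams() is a hypothetical helper written for this illustration and the toy sentence is made up. It counts raw trigram frequencies, which is a simpler measure than the "more often than chance" test that defines a collocation.

```r
# Hypothetical helper: build all n-grams from a vector of tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_text <- "i would like to make a call i would like to make a plan"
toy_tokens <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Slide a window of three words over the tokens and count each trigram
trigrams <- make_ngrams(toy_tokens, 3)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
print(head(trigram_counts, 5))
```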
In this section, we’ll use the R library “quanteda” to compute trigrams, i.e. to find
commonly occurring sequences of three words.
##Calculating trigrams in quanteda
library(quanteda)
db <- dbConnect(dbDriver("SQLite"), "/Users/shubham/Documents/hillary-clinton-emails/database.sqlite")
# Get all the emails sent by Hillary
emailHillary <- dbGetQuery(db, "SELECT
ExtractedBodyText EmailBody FROM Emails e INNER JOIN Persons p ON e.SenderPersonId=p.Id
WHERE p.Name='Hillary Clinton' AND
e.ExtractedBodyText != '' ORDER BY RANDOM()")
# Combine all email bodies into a single string
emails <- paste(emailHillary$EmailBody,
collapse=" // ")
We will use quanteda’s function collocations to do
so. And finally, we’ll remove stopwords from the collocations so we can get a
clear view of which are the most frequently used sequences of two or three words in
Hillary’s emails.
##Remove stopwords
# Note: collocations() and removeFeatures() come from older quanteda releases
collocations(emails, size = 2:3)
print(removeFeatures(collocations(emails, size = 2:3), stopwords("english")))
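On recent quanteda releases, where collocations() and removeFeatures() are no longer available, a roughly equivalent sketch looks like the following. It assumes the quanteda and quanteda.textstats packages are installed and uses a made-up sample string in place of the emails; it swaps in tokens_remove() and textstat_collocations(), so it is an alternative to the listing above rather than the original author's code.

```r
# Toy text standing in for the combined email string (assumption for illustration)
sample_text <- "state department staff met today and state department staff met again"

# Run only when both packages are available
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("quanteda.textstats", quietly = TRUE)) {
  toks <- quanteda::tokens(sample_text, remove_punct = TRUE)
  # padding = TRUE leaves a gap where each stop word was, so words that were
  # not adjacent in the original text are not joined into a collocation
  toks <- quanteda::tokens_remove(toks, quanteda::stopwords("english"),
                                  padding = TRUE)
  # Score two- and three-word sequences that occur at least twice
  colls <- quanteda.textstats::textstat_collocations(toks, size = 2:3,
                                                     min_count = 2)
  print(colls)
}
```

Removing stop words from the tokens before scoring, rather than from the finished collocation list, is a design choice of this sketch.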