We need the 'tm' and 'wordcloud' packages for this tutorial. If they are not available on your machine, install them with the commands below.
> install.packages("tm")
> install.packages("wordcloud")
Then, we load them.
> library(tm)
> library(wordcloud)
As input text, I use the help text of the tm::Corpus() function. You can view it with the 'help("Corpus")' command in RStudio. You may use your own text as input instead.
> text<-"Corpora are collections of documents containing (natural language) text. In packages which employ the infrastructure provided by package tm, such corpora are represented via the virtual S3 class Corpus: such packages then provide S3 corpus classes extending the virtual base class (such as VCorpus provided by package tm itself).\n\nAll extension classes must provide accessors to extract subsets ([), individual documents ([[), and metadata (meta). The function length must return the number of documents, and as.list must construct a list holding the documents.\n\nA corpus can have two types of metadata (accessible via meta). Corpus metadata contains corpus specific metadata in form of tag-value pairs. Document level metadata contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an own entity due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document)."
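Next, we create a vector source. VectorSource treats each element of a character vector as a separate document, and Corpus builds a corpus object (documents plus metadata) from that source. Our input is a single string, so the corpus contains one document.
> vs_text<-VectorSource(text)
> corpus_text<-Corpus(vs_text)
We use the TermDocumentMatrix function to build a term-document matrix, in which each row is a term and each column is a document.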
> tdm<-TermDocumentMatrix(corpus_text)
We convert the object into a matrix, sum each row to get the total count per term, and sort the counts in decreasing order.
> m<-as.matrix(tdm)
> v<-sort(rowSums(m),decreasing = TRUE)
We collect the result into a data frame.
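To make the counting and sorting steps concrete, here is a minimal base-R sketch of the same idea on a tiny string. The input `toy` and the names `tokens`, `counts`, and `v_toy` are made up for illustration. This is not how tm works internally: tm uses its own tokenizer and defaults, so its counts can differ from this simplified version.

```r
# Hypothetical toy input, not the tutorial's text
toy <- "The corpus holds documents. The documents hold text."

# Lowercase, split on anything that is not a letter, drop empty tokens
tokens <- strsplit(tolower(toy), "[^a-z]+")[[1]]
tokens <- tokens[nchar(tokens) > 0]

# Count each distinct word, then sort by decreasing frequency
counts <- table(tokens)
v_toy <- sort(counts, decreasing = TRUE)
print(v_toy)
```

With several documents, the term-document matrix keeps one such count column per document, and rowSums adds them up per term.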
> df_text<-data.frame(word=names(v),freq=v)
> head(df_text)
               word freq
metadata   metadata    8
the             the    8
corpus       corpus    6
documents documents    5
document   document    4
level         level    3
Finally, we visualize df_text in a word cloud.
> wordcloud(words = df_text$word,
            freq = df_text$freq,
            min.freq = 1,
            max.words = 200,
            random.order = FALSE,
            rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
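The word placement in wordcloud is random, so the picture changes between runs. A minimal sketch of one way to get a repeatable image and save it to a file is shown below. The data frame `df_demo`, the file name, the image size, and the seed value are assumptions chosen for illustration; substitute df_text and your own path.

```r
library(wordcloud)

# Hypothetical frequencies standing in for df_text
df_demo <- data.frame(word = c("metadata", "corpus", "documents", "level"),
                      freq = c(8, 6, 5, 3))

# Write the plot to a PNG file in the session's temporary directory
out_file <- file.path(tempdir(), "wordcloud_demo.png")
png(out_file, width = 800, height = 800)

# Fixing the seed makes the random layout repeatable across runs
set.seed(1234)
wordcloud(words = df_demo$word,
          freq = df_demo$freq,
          min.freq = 1,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

# Close the device so the file is actually written
dev.off()
```

Calling set.seed with the same value before each wordcloud call should reproduce the same layout for the same input.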
This is a simple way to build a word cloud. Both packages offer more options to improve the appearance of the cloud and to filter out unwanted words, such as "the" in the output above.
You may refer to the documentation of the packages above, or to other resources, for these options and for more advanced usage of the functions.
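As one example of filtering, the sketch below removes common English stop words before counting, using tm's tm_map together with its built-in transformations. The short input `sample_text` and the names `corpus_clean`, `tdm_clean`, `v_clean`, and `df_clean` are hypothetical; in practice you would start from the `text` variable used earlier. Treat this as one possible cleaning pipeline, not the only one.

```r
library(tm)

# Hypothetical short input; replace with the tutorial's `text`
sample_text <- "The corpus contains the documents and the metadata of the documents."
corpus_clean <- Corpus(VectorSource(sample_text))

# Lowercase first so that "The" matches the lowercase stop word list
corpus_clean <- tm_map(corpus_clean, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
# Drop common English words such as "the", "and", "of"
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
# Collapse the extra spaces left behind by the removals
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Same counting and sorting steps as in the tutorial
tdm_clean <- TermDocumentMatrix(corpus_clean)
v_clean <- sort(rowSums(as.matrix(tdm_clean)), decreasing = TRUE)
df_clean <- data.frame(word = names(v_clean), freq = v_clean)
print(df_clean)
```

The resulting data frame can be passed to wordcloud in the same way as df_text.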