Text mining for exploratory data analysis


What is text mining and/or text analytics?

Text mining, or text analysis, is the automated process by which large volumes of unstructured, natural‐language text are analysed in order to pinpoint and extract user‐specified information.

“A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.” Marti Hearst, What Is Text Mining?


Text is easy for human beings to understand but hard for computer programs to process. Natural language often contains ambiguous terms, and its meaning depends on context, common-sense reasoning and background knowledge that computers find hard to handle.
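A small sketch can make this difficulty concrete. The example below scores a sentence by counting words from two tiny word lists, which is an assumption made purely for illustration (the lists and the function name `naive_sentiment` are hypothetical, not part of any library). Because it ignores context, it gives the same score to a plain statement, its negation, and a sarcastic remark.

```python
# Hypothetical word lists chosen for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def naive_sentiment(text):
    """Count positive words minus negative words, ignoring all context."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

print(naive_sentiment("The film was good."))        # 1
print(naive_sentiment("The film was not good."))    # 1, the negation is missed
print(naive_sentiment("Oh great, another delay."))  # 1, the sarcasm is missed
```

A human reader separates these three sentences at once; a program needs extra machinery (negation handling, context, world knowledge) to do the same.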


What is unstructured data?

We may be more familiar with structured data. Structured data consists of numbers and values, stored in tabular formats, organized and ready for computer programs to process. “Unstructured” data, on the other hand,

  • has no easily identifiable structure
  • cannot be stored in the form of rows and columns
  • does not conform to a predefined data model
  • does not follow any semantics or rules
  • lacks any particular format or sequence

Unstructured data can be created by people. Examples include text files (e.g. word processing documents, spreadsheets, presentations, email, log files), media (e.g. digital photos, audio, and video files), mobile and communications data (e.g. text messages, phone recordings, chats), and social media (e.g. data from Twitter, LinkedIn, Facebook, Instagram, YouTube). Unstructured data can also be generated by machines. Examples include scientific data, GPS sensor readings, online forms, network logs and web server logs.
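To illustrate the contrast with structured data, here is a minimal sketch of one common first step in text mining: turning free text into rows and columns by counting words per document (often called a document-term matrix). The sample messages and the helper names `tokenize` and `document_term_matrix` are hypothetical, and the tokenizer is deliberately crude (lowercase alphabetic words only).

```python
import re
from collections import Counter

# Hypothetical example messages; any short free-text strings would do.
documents = [
    "Meeting moved to Monday. Please confirm.",
    "Please send the report before the meeting.",
    "Lunch on Monday?",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, rows) with one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, rows

vocabulary, rows = document_term_matrix(documents)
print(vocabulary)  # column headers: one column per distinct word
for row in rows:   # one row per document
    print(row)
```

Once the text is in this tabular form, conventional tools for structured data (sorting, filtering, plotting) can be applied to it, at the cost of discarding word order and context.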



Text mining/analysis has its own limitations


“The fundamental limitations of text mining are first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If I tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, I would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Similarly, if I tried to write a program that processes published documents in order to guess what will happen to a bill in Washington DC, I would fail because most of the action still happens in negotiations behind closed doors.” Marti Hearst, What Is Text Mining?


Applications of text mining/analysis

Text mining/analysis can be used for a variety of purposes. It can be used for exploratory data analysis that “leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known” Marti Hearst, Untangling Text Data Mining. That is one of the focuses of this workshop. Text mining/analysis can also be used to develop tools, such as spam filters. Some examples, grouped by type of text:

  • Books
    • classifying the themes
  • Social media content
    • predicting stock market prices based on social media posts
    • recommender systems
  • Emails
    • spam filtering
  • Administrative documents
    • fraud detection by insurance companies
    • improving customer service
    • supporting decision making
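
As one concrete illustration of the applications listed above, the following is a minimal sketch of a spam filter. It assumes a naive Bayes classifier with add-one smoothing, which is one common approach and not necessarily what any real mail system uses. The four training messages, the "spam"/"ham" labels and the function names are hypothetical and far too small for practical use.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-made training messages for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        label_counts[label] += 1
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the prior probability of the label.
        score = math.log(label_counts[label] / total_messages)
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(TRAINING)
print(classify("claim your free prize", word_counts, label_counts))
print(classify("project meeting on monday", word_counts, label_counts))
```

With this toy training set the first message is labelled "spam" and the second "ham". A usable filter would need many more labelled messages and better text preprocessing.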
