Text mining, or text analysis, is the automated process by which large volumes of unstructured, natural‐language text are analysed in order to pinpoint and extract user‐specified information.
A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation. Marti Hearst, What Is Text Mining?
Text is easy for human beings to read but hard for computer programs to interpret. Natural language is full of ambiguous terms and contextual cues, and it relies on common-sense reasoning and background knowledge that computers find hard to process.
We may be more familiar with structured data. Structured data consists of numbers and values stored in tabular formats, organized and ready for computer programs to process. “Unstructured” data, on the other hand, has no such predefined organization.
Unstructured data can be created by people. Examples include text files (e.g. word processing documents, spreadsheets, presentations, email, log files), media (e.g. digital photos, audio, and video files), mobile and communications data (e.g. text messages, phone recordings, chats), and social media (e.g. data from Twitter, LinkedIn, Facebook, Instagram, YouTube). Unstructured data can also be generated by machines. Examples include scientific data, data from GPS sensors, online forms, network logs, and web server logs.
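To make the contrast concrete, here is a minimal Python sketch, using only the standard library. The record, the message text, and the helper names `tokenize` and `word_counts` are all made up for illustration. It shows one simple way free text might be turned into something tabular: a table of word frequencies.

```python
import re
from collections import Counter

# Structured data: named fields with values, ready to process.
# (Hypothetical record, for illustration only.)
structured_row = {"sender": "alice@example.org", "date": "2024-01-05", "attachments": 0}

# Unstructured data: free text with no predefined fields.
# (Hypothetical message, for illustration only.)
message = "Hi Bob, the meeting moved to Friday. Can you send the slides before the meeting?"


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text):
    """Turn free text into a structured word -> frequency table."""
    return Counter(tokenize(text))


counts = word_counts(message)
print(counts["the"], counts["meeting"])  # prints: 3 2
```

Real text mining pipelines usually go further (handling punctuation inside words, stop words, word forms and so on), so this is only a starting point.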
The fundamental limitations of text mining are first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If I tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, I would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Similarly, if I tried to write a program that processes published documents in order to guess what will happen to a bill in Washington DC, I would fail because most of the action still happens in negotiations behind closed doors. Marti Hearst, What Is Text Mining?
Text mining/analysis can be used for a variety of purposes. It can be used for exploratory data analysis that “leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known” (Marti Hearst, Untangling Text Data Mining). That is one of our focuses in this workshop. Text mining/analysis can also be used to build tools, such as spam filters.
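As a rough illustration of the tool-building use case, the sketch below scores a message against a small keyword list. The keywords, their weights, the threshold, and the function names `spam_score` and `is_spam` are all assumptions invented for this example. Real spam filters are typically trained on labelled messages rather than hand-written rules.

```python
import re

# Hypothetical keyword weights, chosen for illustration only.
SPAM_WEIGHTS = {"free": 2.0, "winner": 3.0, "prize": 2.5, "urgent": 1.5}


def spam_score(text, weights=SPAM_WEIGHTS):
    """Sum the weights of every known spam keyword found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(weights.get(token, 0.0) for token in tokens)


def is_spam(text, threshold=4.0):
    """Flag the text as spam when its score reaches the (assumed) threshold."""
    return spam_score(text) >= threshold


print(is_spam("You are a WINNER! Claim your free prize"))  # prints: True
print(is_spam("Lunch at noon tomorrow?"))                  # prints: False
```

A rule like this is easy to read but also easy to fool, which is one reason practical filters learn their weights from data.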