    Text analytics is a relatively new concept in enterprise content management (ECM) and document management. Many organizations already categorize or classify content automatically, but that is only a small part of what text analytics can do.

    Text analytics uses concepts similar to those of data analytics but adds capabilities from natural language processing (NLP) and machine learning. NLP is a field of computer science that addresses the intersection of machine and human language. Numbers are easily understood by machines, but human language requires specific techniques. The task is harder still because different languages follow different grammatical rules. NLP seeks to understand sentence structure, parts of speech, and tenses and to translate them into terms that computers understand. While NLP is a complex field, the algorithms that have emerged from it can be grouped by the questions that text analytics can answer.

    The use of text analytics continues to evolve. To some extent, what it can do depends on the language of the document. Today, text analytics can answer four questions.

    1. What Are the Key Topics of This Document?

    Auto-categorization is currently the most popular application of text analytics and also the most commonly misbranded. Auto-classification, which uses zonal optical character recognition (OCR) or full-text search, is not text analytics. While some concepts of OCR are considered "early text analytics," the space has evolved beyond recognizing letters within a specific zone of a page image.

    Today, key topics are identified by looking not only for a specific word or tense but also for similar words and concepts. Text analytics algorithms can either be trained to identify key concepts or learn to identify them on their own. The algorithms map similar words and phrases in relation to one another to identify major concepts. These concepts and their relationships can be shown using topic trees or thinking maps. The extracted data can be used to identify keywords, classify documents, or even pinpoint process workflow steps.
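
    As a rough illustration of grouping different tenses and plurals of a word before counting topics, here is a minimal Python sketch. The stopword list, the crude suffix stripping, and the function names `normalize` and `key_topics` are assumptions made for this example. Production systems typically use trained models and proper stemming or lemmatization instead.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "was", "were", "for", "on", "by", "after", "this", "that"}

def normalize(word):
    """Crudely strip common suffixes so tenses and plurals group together."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least four letters remain, to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def key_topics(text, top_n=3):
    """Return the top_n most frequent normalized terms with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [normalize(w) for w in words if w not in STOPWORDS]
    return Counter(stems).most_common(top_n)

sample = ("The claim was filed. Claims are reviewed weekly. "
          "The reviewer approved the claim after reviewing the invoice.")
print(key_topics(sample, top_n=2))  # [('claim', 3), ('review', 2)]
```

    Here "claim" and "claims" count as one topic, as do "reviewed" and "reviewing." Mapping genuinely different words to the same concept requires more than suffix stripping, which is where trained models come in.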

    2. Are These Documents Similar?

    The ability to identify key topics in a document can be applied to document collections as well. An algorithm can compare an individual document against a collection of known documents and return a similarity score. If the document scores high, it is considered very similar to the collection.

    These algorithms are being used for general classification of documents (e.g., a record versus an employment contract). Classification systems that use this approach often require sets of training documents to be accurate.
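
    One common way to score a document against a collection is cosine similarity over word counts. The sketch below assumes that technique and averages the score across the collection. The function names and the averaging step are illustrative choices, not a description of any particular product.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_to_collection(document, collection):
    """Average similarity between one document and each known document."""
    doc_vec = bag_of_words(document)
    scores = [cosine_similarity(doc_vec, bag_of_words(d)) for d in collection]
    return sum(scores) / len(scores) if scores else 0.0

contracts = [
    "This employment contract sets the salary and start date of the employee.",
    "The employee agrees to the terms of this employment contract.",
]
candidate = "Employment contract between the company and the employee."
unrelated = "Quarterly sales figures rose in every region."
print(similarity_to_collection(candidate, contracts))
print(similarity_to_collection(unrelated, contracts))
```

    The candidate that shares vocabulary with the known contracts scores higher than the unrelated text. A trained classifier would replace the raw word counts with learned features.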

    The same algorithms are used to detect anomalies. When a document receives a low score after being compared to the collection, the document is considered an anomaly. Depending on the type of document being compared, this could be an indication of a fraudulent claim or an adverse effect. The same concept could be used to find a unique situation in a series of regular reports, the proverbial “needle in a haystack.”
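
    The low-score idea can be sketched by flagging any report whose average similarity to the other reports falls below a threshold. This example assumes Jaccard similarity over word sets and a threshold of 0.2. Both are illustrative choices, and the function names are hypothetical.

```python
import re

def tokens(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_anomalies(reports, threshold=0.2):
    """Return indices of reports whose mean similarity to the rest is below threshold."""
    token_sets = [tokens(r) for r in reports]
    anomalies = []
    for i, current in enumerate(token_sets):
        scores = [jaccard(current, other)
                  for j, other in enumerate(token_sets) if j != i]
        # A lone report has nothing to compare against, so it is never flagged.
        mean_score = sum(scores) / len(scores) if scores else 1.0
        if mean_score < threshold:
            anomalies.append(i)
    return anomalies

reports = [
    "pump pressure normal no issues",
    "pump pressure normal no leaks",
    "pump pressure normal no issues found",
    "operator reported smoke near valve",
]
print(find_anomalies(reports))  # [3]
```

    The threshold determines how unusual a report must be before it is flagged. In practice it would be tuned against documents already known to be routine or exceptional.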

    3. What Is the Mood of This Document?

    Sentiment analysis is often used in social media. Office documents are typically devoid of emotion, but emails or letters may be very expressive. Algorithms can review the text of a document to identify trigger words that set its sentiment. These algorithms assess the use of verbs, adjectives, and adverbs in the document. Today, sentiment analysis is mostly used to trigger workflow actions, such as escalation, based on the emotion the author expresses.
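
    A trigger-word approach to routing can be sketched with a small lexicon. The word lists, the scoring rule, and the escalation threshold below are all hypothetical. Real sentiment systems use much larger lexicons or trained models and also handle negation and intensity.

```python
import re

# Hypothetical trigger-word lexicons, kept tiny for illustration.
NEGATIVE = {"angry", "unacceptable", "terrible", "disappointed", "frustrated", "worst"}
POSITIVE = {"thanks", "great", "pleased", "excellent", "happy", "appreciate"}

def sentiment_score(text):
    """Positive trigger words minus negative trigger words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def route(text, escalate_below=-1):
    """Pick a workflow step: escalate when the score is below the threshold."""
    return "escalate" if sentiment_score(text) < escalate_below else "standard"

print(route("This delay is unacceptable. I am frustrated and disappointed."))  # escalate
print(route("Thanks for the great service."))  # standard
```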

    4. What Is the Gist of This Document?

    Summarization algorithms look at the document or paragraph as a whole. These algorithms rely heavily on entity extraction but also recognize sentence and document structure. Entity extraction algorithms typically ignore some parts of the sentence, like adjectives and adverbs. Summarization algorithms look at all parts of the sentence as well as how sentences fit into a paragraph and an entire document.

    Unlike the algorithms presented so far, summarization algorithms also create natural language sentences. When a document is processed by a summarization algorithm, the result is typically a small number of sentences that present the most important concepts of the document. For example, a 10-page document might be summarized as a five-sentence paragraph. These algorithms can be tuned to focus on specific concepts in the document. Some implementations can summarize an entire collection of documents into a similar summary. This could be used in e-discovery or case assessment.
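
    The sketch below illustrates only the selection half of summarization. It is extractive: it picks existing sentences that carry the most frequent terms and does not generate new sentences the way the algorithms described above do. The stopword list, the scoring rule, and the `summarize` function are assumptions for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was",
             "at", "for", "on"}

def content_words(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("The contract covers software licensing. Lunch was served at noon. "
        "The licensing contract renews every year.")
print(summarize(text, max_sentences=2))
```

    With this input, the sentence about lunch is dropped because none of its words recur elsewhere in the text.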

    Text Analytics in Practice

    Often, text analytics solutions do not rely on a single algorithm. They may combine several algorithms and use them in different ways. For example, machine translation will often use summarization algorithms to understand sentence structure, classification algorithms to decide which translation is most similar to the source text, and then summarization algorithms again to recreate a translated document.

    Neither text analytics nor data analytics is about what answer you will get. It’s about knowing the types of questions you can ask. Text analytics solutions will differ across industries, business processes, and even organizations in similar spaces. Once you know the types of questions you can ask, and how to ask them, you may find some very powerful answers.

    Marko Sillanpaa is co-founder of the blog Big Men On Content and the founder of BMO Consulting. He has been working in ECM for over 18 years for vendors like Documentum, EMC, Hyland, and SDL Trados and systems integrators like CSC and Accenture. Follow him on Twitter @MSillanpaaBMOC.