Most applications today still process data from structured and semistructured sources. They connect to SQL databases to query information or present data from JSON or XML sources. Many applications still avoid the complication of parsing and extracting knowledge from unstructured sources such as open text fields, rich text editors, database CLOB (character large object) data types, social media news streams, and full documents from tools like Microsoft Word, Google Docs, and Adobe Acrobat.
But the world of information is largely unstructured. People enter, search, and manage information in a myriad of tools and formats. Modern applications are going beyond just storing and retrieving unstructured information and are incorporating elements of natural language processing (NLP) to improve user experiences, manage complex information, enable chatbot dialogs, and perform text analytics.
What is natural language processing (NLP)?
NLP engines are designed to extract data, information, knowledge, and sentiment from blocks of text and documents. They often use a mix of parsing technologies, knowledge data structures, and machine-learning algorithms to extract and present information in comprehensible formats to both people and downstream applications.
NLP engines typically have the following technical components:
- API and data storage interfaces to make it easy to connect to data sources and aggregate information for analysis.
- File parsers that extract text, metadata, and other contextual information from different file types and document storage formats.
- Document parsers that break down documents into more atomic units including sections, paragraphs, sentences, phrases, and words.
- Pattern-recognition tools such as a regular expression parser to identify patterns such as dates, currencies, phone numbers, and addresses (a minimal sketch combining this component with a dictionary lookup follows this list).
- Dictionaries and other knowledge-storage tools that can help NLP engines identify entities such as names, places, and products.
- Tools and machine-learning algorithms to aid in the creation of domain-specific entities, topics, and terms.
- Semantic and other contextual analysis functions that provide deeper analysis. Is the paragraph positive, negative, or neutral about the subject? Was the paragraph adjacent to a photograph that provides additional context? Was the document found in a folder or have links to other documents that can provide additional context? What is known about the document’s author and when the document was written that can provide additional context?
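To make these components more concrete, here is a minimal sketch in Python that combines two of them: the regular-expression pattern recognizer and the dictionary lookup. The patterns, dictionary entries, and entity labels are hypothetical examples for illustration, not any vendor's implementation.

```python
import re

# Hypothetical dictionary (gazetteer) of known entities and their labels.
ENTITY_DICTIONARY = {
    "acme corp": "ORGANIZATION",
    "jane smith": "PERSON",
    "seattle": "LOCATION",
}

# Simple regular expressions for common patterns: dates, currencies, phone numbers.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "CURRENCY": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract(text):
    """Return pattern matches and dictionary entities found in the text."""
    results = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append({"type": label, "text": match.group()})
    lowered = text.lower()
    for term, label in ENTITY_DICTIONARY.items():
        if term in lowered:
            results.append({"type": label, "text": term})
    return results

print(extract("Jane Smith of Acme Corp signed a $1,200.00 contract on 04/15/2019."))
```

A real engine layers tokenization, machine learning, and contextual analysis on top of matching like this, but the output shape, typed spans of text, is much the same.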
The combination of these components lets NLP engines provide a rich summarization of the information contained in a document. The summarization can be directly useful to users, especially for simple documents that cover a single concept or relatively few concepts. For example, an NLP engine processing today’s news articles can present back to the user the who, when, and where of each article. The summarization is also useful to downstream technologies such as search engines, chatbots, and analytics tools, which can more easily work with structured information summarized from documents.
Text size and complexity drive design decisions
While most NLP engines share some of these basic technology components, their level of sophistication in handling different content sources and types varies significantly.
The simplest engines concentrate on small documents and formats. Consider a search-engine query box that only parses words, phrases, and short Boolean terms. This engine is largely looking to separate words, identify phrases, and parse basic logical operators so it can present one or more lower-level queries to the search engine.
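As an illustration of that kind of query parsing, the sketch below separates quoted phrases, Boolean operators, and bare words. It's a toy example, not how any particular search engine is implemented.

```python
import re

def parse_query(query):
    """Split a query-box string into phrases, Boolean operators, and plain words."""
    phrases = re.findall(r'"([^"]+)"', query)       # quoted phrases
    remainder = re.sub(r'"[^"]+"', " ", query)      # strip phrases before tokenizing
    tokens = remainder.split()
    operators = [t for t in tokens if t.upper() in {"AND", "OR", "NOT"}]
    words = [t for t in tokens if t.upper() not in {"AND", "OR", "NOT"}]
    return {"phrases": phrases, "operators": operators, "words": words}

print(parse_query('"password reset" AND login NOT mobile'))
```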
In more advanced search engines, some interpretation of the query’s meaning is useful to better determine context. For example, a search with the keyword “jaguar” might be targeting the animal, the car manufacturer, the NFL football team, or something else that can be narrowed down using additional context shared by the user.
Chatbot texts are similar in that they most often work with phrases and short sentences. But while searches tend to be filled with topics and entities (that is, nouns), texts to chatbots are often a mix of nouns, verbs, and sentiment. For example, “I’m having trouble resetting my password” indicates the service requested (login), the action requested (password reset), and the sentiment (negative) that should be used in processing the request and formulating a humanlike response to the user.
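Real chatbot platforms use trained intent classifiers, but a deliberately simple, keyword-based sketch like the hypothetical one below shows the kind of structured output (service, action, sentiment) the engine needs to produce from a short message.

```python
# Hypothetical keyword cues; a production system would use trained models instead.
SERVICES = {"password": "login", "invoice": "billing"}
ACTIONS = {"reset": "password reset", "cancel": "cancellation"}
NEGATIVE_CUES = {"trouble", "can't", "cannot", "problem", "error"}

def interpret(message):
    """Map a short chatbot message to a service, an action, and a sentiment."""
    words = message.lower().split()
    service = next((label for cue, label in SERVICES.items()
                    if any(w.startswith(cue) for w in words)), "unknown")
    action = next((label for cue, label in ACTIONS.items()
                   if any(w.startswith(cue) for w in words)), "unknown")
    sentiment = "negative" if NEGATIVE_CUES & set(words) else "neutral"
    return {"service": service, "action": action, "sentiment": sentiment}

print(interpret("I'm having trouble resetting my password"))
# {'service': 'login', 'action': 'password reset', 'sentiment': 'negative'}
```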
Interpreting social media content such as tweets and updates on Facebook or LinkedIn has many additional challenges. The longer, paragraph format implies that there are likely multiple topics and entities being referenced. In addition, understanding the sentiment and intent may be more important than the subject; for example, knowing that someone intends to buy a car is more important to advertisers than knowing that someone referenced what type of car he or she rented on a recent business trip.
Engines processing larger document formats require more parsing and linguistic sophistication. For example, if the intent of the engine is to parse longform news articles, it must be able to separate sentences, paragraphs, and sections to better represent the underlying information. The level of sophistication required grows with larger document formats such as legal, financial, medical, and building-construction documents, because knowing where in the document something appears is important.
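The first step in that direction is segmentation: breaking a document into paragraphs and sentences while keeping track of where each piece sits. Here's a naive sketch; production engines use trained sentence-boundary detectors rather than a single regular expression.

```python
import re

def segment(document):
    """Split text into paragraphs and sentences, recording the position of each."""
    segments = []
    for p_index, paragraph in enumerate(document.split("\n\n")):
        # Naive sentence split on ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for s_index, sentence in enumerate(sentences):
            if sentence:
                segments.append({"paragraph": p_index, "sentence": s_index,
                                 "text": sentence})
    return segments

sample = ("Section 4 covers payment terms. Invoices are due in 30 days.\n\n"
          "Section 5 covers termination.")
for seg in segment(sample):
    print(seg)
```

Keeping paragraph and sentence positions attached to each extracted fact is what lets a downstream application report not just what was found, but where in the document it was found.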
Understanding natural-language engine capabilities
Simple natural-language processors are designed to extract basic entities by parsing text for well-identifiable names that have few ambiguities. Consider that most news sites have people, places, and brands hyperlinked and let you click through to relevant information on each entity.
Identifying dates, currencies, quantities, or descriptive attributes requires more sophistication to identify relationships and contexts. For example, dates and currencies extracted from legal and financial documents are often tied to an event, such as the start of a contract term or the period a financial performance metric covers. In construction documents, identifying paint colors along with their associated room types is useful to manufacturers and contractors. For medical documents, finding the cancer type is more valuable if the doctor knows what part of the body it was found in.
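One simple way to approximate that kind of contextual extraction is to capture the words surrounding each pattern match, as in the rough sketch below. The patterns and the six-word context window are arbitrary choices for illustration.

```python
import re

# Illustrative patterns for long-form dates and currency amounts.
PATTERNS = {
    "DATE": (r"\b(?:January|February|March|April|May|June|July|August|"
             r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"),
    "CURRENCY": r"\$\d[\d,]*(?:\.\d{2})?",
}

def extract_with_context(text, window=6):
    """Find dates and amounts, keeping the preceding words as rough context."""
    results = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            preceding = text[:match.start()].split()[-window:]
            results.append({"type": label, "value": match.group(),
                            "context": " ".join(preceding)})
    return results

print(extract_with_context(
    "The lease term begins on March 1, 2020 with a monthly rent of $2,500.00."))
```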
Beyond common entities and patterns, NLP platforms differ in how they let you define customized concepts, topics, entities, phrases, patterns, and other elements to be identified in text and documents.
Validating learning tools in NLP platforms
NLP engines sold by public cloud vendors such as Amazon, Microsoft, Google, and IBM compete on the sophistication of their algorithms, performance of processing queries, depth of their APIs, versatility in handling different text, document, and file types, and unit pricing, among other factors.
But it’s the simplicity of using the platform’s tools for training custom entities, topics, and other information artifacts that is most important to consider in early experimentation. Which tools make it easier to extract the information needed for the documents and use cases required?
Here’s a brief summary of the tools and capabilities offered by the big cloud vendors for configuring their NLP platforms to extract domain-specific knowledge.
- Amazon Comprehend has tools and APIs for creating custom classifiers and entities by submitting lists to the engine.
- Google Natural Language lets you configure data sets that map inline text or referenced text files to one or more labels.
- Microsoft Language Understanding (LUIS) lets you describe intents (buying, shopping, browsing), entities, utterances (how something is phrased in the document text), and more complex language patterns.
- IBM Watson Natural Language Understanding offers more information extraction using as many as five levels of category hierarchies, concepts that can be abstracted beyond what’s described in the text, both emotion and sentiment, and relationships between entities.
Creating entity and topic knowledge bases isn’t trivial, so some of the cloud providers are starting to build standard or starter ones. For example, there’s Amazon Comprehend Medical to extract medical information, while Microsoft has prebuilt domains for places, events, music, weather, and other common areas.
From these examples, you can see there are multiple ways to train NLP engines on entities, topics, and intents. Simple approaches start with lists of keywords that map to topics. More sophisticated engines then support learning algorithms that scan documents and present potential topics and associated phrases back to users to see if they should be included in training sets. The more sophisticated engines from Expert System, SmartLogic, and Bitext use taxonomy-management tools and integrate with NoSQL and multimodel data stores such as MarkLogic so concept matches performed on a document can be used with referenced ontologies to power more sophisticated and operational inferences.
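As a starting point for experimenting with one of these services, the call pattern is usually a thin API client. Here's a hedged sketch using Amazon Comprehend's pre-trained entity and sentiment APIs through the boto3 Python SDK; it assumes AWS credentials and a region are already configured, and note that custom classifiers or entity recognizers are trained separately and invoked against their own endpoints.

```python
import boto3

# Sketch of calling Amazon Comprehend's pre-trained entity and sentiment APIs.
# Assumes AWS credentials are configured; region and sample text are illustrative.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "Acme Corp signed a $1.2 million contract in Seattle on March 1, 2020."

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))

sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])
```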
Preparing for your NLP experiments
Before jumping into selecting technologies and doing proofs of concept, it’s important to ground any NLP experimentation with a defined scope and success criteria. Be sure to understand the volume of training texts or documents, the level of detail required from the extraction, the types of information required, the overall quality required from the extraction, and the performance required in processing new texts. The best experiments are those in which business value can be delivered on modest requirements, with more sophistication added through an agile, iterative process.
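One way to make those success criteria measurable is to score a candidate engine's output against a small hand-labeled sample. The hypothetical harness below computes precision and recall over extracted entities; for early experiments, a check of this sort is often enough to compare platforms.

```python
def score(expected, extracted):
    """Compare extracted entities against a hand-labeled set of expected entities."""
    true_positives = len(expected & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}

# Illustrative labels; in practice these come from a hand-reviewed sample of documents.
expected = {("Acme Corp", "ORGANIZATION"), ("Seattle", "LOCATION"),
            ("March 1, 2020", "DATE")}
extracted = {("Acme Corp", "ORGANIZATION"), ("Seattle", "LOCATION")}

print(score(expected, extracted))  # precision 1.0, recall ~0.67
```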