Natural language processing (NLP) is one of the most important frontiers in software. The basic idea, how to consume and generate human language effectively, has been pursued since the dawn of digital computing. The work continues today, with machine learning and graph databases on the front lines of the push to master natural language.
This article is a hands-on introduction to Apache OpenNLP, a Java-based machine learning project that delivers primitives like chunking and lemmatization, both required for building NLP-enabled systems.
What is Apache OpenNLP?
A machine learning natural language processing system such as Apache OpenNLP typically has three parts:
- Learning from a corpus, which is a set of textual data (plural: corpora)
- A model that is generated from the corpus
- Using the model to perform tasks on target text
To simplify matters, OpenNLP has pre-trained models available for many common use cases. For more sophisticated requirements, you might need to train your own models. For simpler scenarios, you can just download an existing model and apply it to the task at hand.
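As a rough illustration of the three parts listed above (corpus, model, and use of the model), the sketch below trains a toy model from two tiny corpora and applies it to new text. Everything in it is hypothetical, including the ToyLanguageGuesser class, the character-trigram scoring, and the sample sentences, and it is far simpler than what OpenNLP does. It only shows the three stages in one place.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration of the corpus -> model -> prediction pipeline.
 * Assumption: character-trigram frequencies stand in for a real model.
 * This is NOT how OpenNLP is implemented.
 */
class ToyLanguageGuesser {
    // The "model": for each language label, trigram -> relative frequency.
    private final Map<String, Map<String, Double>> model = new HashMap<>();

    /** Parts 1 and 2: learn from a (tiny) corpus and store the result as a model. */
    void train(String language, String corpus) {
        String text = corpus.toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        model.put(language, frequencies);
    }

    /** Part 3: apply the model to target text and return the best-scoring label. */
    String guess(String target) {
        String text = target.toLowerCase();
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Double>> lang : model.entrySet()) {
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                score += lang.getValue().getOrDefault(text.substring(i, i + 3), 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best; // null if no language has been trained
    }

    public static void main(String[] args) {
        ToyLanguageGuesser guesser = new ToyLanguageGuesser();
        guesser.train("eng", "the quick brown fox jumps over the lazy dog and the cat");
        guesser.train("spa", "el rapido zorro marron salta sobre el perro perezoso y el gato");
        System.out.println(guesser.guess("the dog and the fox"));
    }
}
```

A pre-trained model plays the role of the map that train() builds here: someone else has already done the learning step, so you only perform the last one.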
Language detection with OpenNLP
Let’s build up a basic application that we can use to see how OpenNLP works. We can start the layout with a Maven archetype, as shown in Listing 1.
Listing 1. Make a new project
~/apache-maven-3.8.6/bin/mvn archetype:generate -DgroupId=com.infoworld -DartifactId=opennlp -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
This archetype will scaffold a new Java project. Next, add the Apache OpenNLP dependency to the pom.xml in the project's root directory, as shown in Listing 2. (You can use whatever version of the OpenNLP dependency is most current.)
Listing 2. The OpenNLP Maven dependency
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>2.0.0</version>
</dependency>
To make it easier to execute the program, also add the following entry to the <plugins> segment of the pom.xml file:
Listing 3. Main class execution target for the Maven POM
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>3.0.0</version>
  <configuration>
    <mainClass>com.infoworld.App</mainClass>
  </configuration>
</plugin>
Now, run the program with mvn compile exec:java. (You’ll need Maven and a JDK installed to run this command.) Running it now will just give you the familiar “Hello World!” output.
Download and set up a language detection model
Now we are ready to use OpenNLP to detect the language in our example program. The first step is to download a language detection model. Download the latest Language Detector component from the OpenNLP models download page. As of this writing, the current version is langdetect-183.bin.
To make the model easy to get at, let’s go into the Maven project and mkdir a new directory at /opennlp/src/main/resources, then copy the langdetect-*.bin file in there.
Now, let’s modify an existing file to match what you see in Listing 4. We'll use /opennlp/src/main/java/com/infoworld/App.java for this example.
Listing 4. App.java
package com.infoworld;

import java.util.Arrays;
import java.io.IOException;
import java.io.InputStream;
import java.io.FileInputStream;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.Language;

public class App {
  public static void main( String[] args ) {
    System.out.println( "Hello World!" );
    App app = new App();
    try {
      app.nlp();
    } catch (IOException ioe){
      System.err.println("Problem: " + ioe);
    }
  }

  public void nlp() throws IOException {
    InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin"); // 1
    LanguageDetectorModel langModel = new LanguageDetectorModel(is); // 2
    String input = "This is a test. This is only a test. Do not pass go. Do not collect $200. When in the course of human history."; // 3
    LanguageDetector langDetect = new LanguageDetectorME(langModel); // 4
    Language langGuess = langDetect.predictLanguage(input); // 5
    System.out.println("Language best guess: " + langGuess.getLang());
    Language[] languages = langDetect.predictLanguages(input);
    System.out.println("Languages: " + Arrays.toString(languages));
  }
}
Now, you can run this program with the command mvn compile exec:java. When you do, you’ll get output similar to what is shown in Listing 5.
Listing 5. Language detection run 1
Language best guess: eng
Languages: [eng (0.09568318011427969), tgl (0.027236092538322446), cym (0.02607472496029117), war (0.023722424236917564)...
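The "ME" in this sample stands for maximum entropy. A maximum entropy model is a technique from statistics, closely related to multinomial logistic regression, that chooses the probability distribution that fits the training data while making as few additional assumptions as possible. Here it is used to assign a probability to each candidate language.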
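To make "maximum entropy" slightly more concrete: a model of this family gives each candidate label a score (a weighted sum of features) and converts the scores into probabilities by exponentiating and normalizing them. The sketch below shows only that final normalization step, with made-up scores. The class name SoftmaxSketch and the numbers are illustrative assumptions, and OpenNLP's actual feature extraction and training are not shown.

```java
/** Sketch of the normalization step used by maximum entropy style classifiers. */
class SoftmaxSketch {
    /** Turns arbitrary real-valued scores into probabilities that sum to 1. */
    static double[] softmax(double[] scores) {
        // Subtract the maximum before exponentiating for numerical stability.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] probabilities = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            probabilities[i] = Math.exp(scores[i] - max);
            sum += probabilities[i];
        }
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] /= sum;
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Hypothetical scores for three candidate labels.
        double[] p = softmax(new double[] {2.0, 1.0, 0.1});
        for (double value : p) {
            System.out.println(value);
        }
    }
}
```

Because the probabilities must sum to 1 across every language the model knows, even the winning label can end up with a small value, which is consistent with eng scoring roughly 0.096 in Listing 5.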
Evaluate the results
After running the program, you will see that the OpenNLP language detector correctly guessed that the text in the example program was English. We've also output some of the probabilities the language detection algorithm came up with. After English, it guessed the language might be Tagalog, Welsh, or Waray (the ISO 639-3 codes tgl, cym, and war). In the detector's defense, the language sample was small. Correctly identifying the language from just a handful of sentences, with no other context, is pretty impressive.
Before we move on, look back at Listing 4. The flow is pretty simple. Each commented line works like so:
1. Open the langdetect-183.bin file as an input stream.
2. Use the input stream to parameterize instantiation of the LanguageDetectorModel.
3. Create a string to use as input.
4. Make a language detector object, using the LanguageDetectorModel from step 2.
5. Run the langDetect.predictLanguage() method on the input from step 3.
Testing probability
If we add more English language text to the string and run it again, the probability assigned to eng should go up. Let's try it by pasting in the contents of the United States Declaration of Independence into a new file in our project directory: /src/main/resources/declaration.txt. We’ll load that and process it as shown in Listing 6, replacing the inline string:
Listing 6. Load the Declaration of Independence text
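// Decode explicitly as UTF-8 rather than relying on the platform default charset
String input = new String(this.getClass().getClassLoader().getResourceAsStream("declaration.txt").readAllBytes(), java.nio.charset.StandardCharsets.UTF_8);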
If you run this, you’ll see that English is still the detected language.
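A one-line load like the one in Listing 6 throws a NullPointerException if the file is missing from the classpath, and it never closes the stream. A slightly more defensive variant might look like the following sketch. The ResourceText class and its method names are hypothetical helpers, not part of OpenNLP, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for loading text files from the classpath. */
class ResourceText {
    /** Loads a classpath resource as UTF-8 text, failing clearly if it is absent. */
    static String load(String resourceName) throws IOException {
        // try-with-resources closes the stream; a null resource is simply skipped on close.
        try (InputStream in = ResourceText.class.getClassLoader().getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourceName);
            }
            return readUtf8(in);
        }
    }

    /** Reads the whole stream and decodes it as UTF-8. */
    static String readUtf8(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

With this helper in the project, the inline load could be written as String input = ResourceText.load("declaration.txt");.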
Detecting sentences with OpenNLP
You've seen the language detection model at work. Now, let's try out a model for detecting sentences. To start, return to the OpenNLP model download page, and add the latest Sentence English model component to your project's /src/main/resources directory. Notice that knowing the language of the text is a prerequisite for detecting sentences.
We’ll follow a similar pattern to what we did with the language detection model: load the file (in my case opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin) and use it to instantiate a sentence detector. Then, we'll use the detector on the input file. You can see the new code in Listing 7 (along with its imports); the rest of the code remains the same.
Listing 7. Detecting sentences
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
//...
InputStream modelFile = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin");
SentenceModel sentModel = new SentenceModel(modelFile);
SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
String[] sentences = sentenceDetector.sentDetect(input);
System.out.println("Sentences: " + sentences.length + " first line: " + sentences[2]);
Running the file now will output something like what's shown in Listing 8.
Listing 8. Output of the sentence detector
Sentences: 41 first line: In Congress, July 4, 1776
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, ...
Notice that the sentence detector found 41 sentences, which sounds about right. Notice also that this detector model is fairly simple: it relies mainly on punctuation and whitespace to find the breaks, and it has no logic for grammar. That is why we used index 2 on the sentences array to get the actual preamble: the header lines were slurped up together as two sentences. (The founding documents are notoriously inconsistent with punctuation, and the sentence detector makes no attempt to consider “When in the Course …” as a new sentence.)
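To see why lines without closing punctuation get merged, consider a deliberately naive splitter that breaks only at whitespace following a period, exclamation mark, or question mark. This is a stand-in for illustration, not OpenNLP's algorithm, and the NaiveSentenceSplitter class is hypothetical. It does show the same symptom: a header line with no period is glued to the sentence after it.

```java
/** Deliberately naive sentence splitter, for illustration only. */
class NaiveSentenceSplitter {
    /** Splits at any run of whitespace that directly follows '.', '!' or '?'. */
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String sample = "In Congress, July 4, 1776\nThe unanimous Declaration. When in the Course. Done.";
        String[] parts = split(sample);
        // The header line has no terminal punctuation, so it stays attached to the next sentence.
        System.out.println(parts.length + " pieces, first: " + parts[0]);
    }
}
```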
Tokenizing with OpenNLP
After breaking documents into sentences, tokenizing is the next level of granularity. Tokenizing is the process of breaking the document down into words and punctuation marks. We can use the code shown in Listing 9:
Listing 9. Tokenizing
import opennlp.tools.tokenize.SimpleTokenizer;
//...
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(input);
System.out.println("tokens: " + tokens.length + " : " + tokens[73] + " " + tokens[74] + " " + tokens[75]);
This will give output like what's shown in Listing 10.
Listing 10. Tokenizer output
tokens: 1704 : human events ,
So, the tokenizer broke the document into 1,704 tokens. In the array of tokens, the words “human” and “events” and the comma that follows each occupy their own element.
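To give a feel for how a tokenizer can separate words from punctuation without a trained model, here is a minimal sketch that starts a new token whenever the character class changes (letter, digit, or other) and drops whitespace. It approximates the general idea only. The CharClassTokenizer class is hypothetical, and the rule that every non-letter, non-digit character stands alone is a simplifying assumption, not a description of OpenNLP's SimpleTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative character-class tokenizer; not OpenNLP's implementation. */
class CharClassTokenizer {
    private static final int WHITESPACE = 0;
    private static final int LETTER = 1;
    private static final int DIGIT = 2;
    private static final int OTHER = 3; // punctuation and everything else

    private static int classOf(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            int startClass = classOf(text.charAt(start));
            // A token ends at the end of the text, when the class changes,
            // or immediately after a single "other" character.
            boolean boundary = i == text.length()
                    || classOf(text.charAt(i)) != startClass
                    || startClass == OTHER;
            if (boundary) {
                if (startClass != WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("human events, it becomes"));
    }
}
```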
Name finding with OpenNLP
Now, we’ll grab the "Person name finder" model for English, called en-ner-person.bin. Note that this model is located on the SourceForge model downloads page. Once you have the model, put it in the resources directory for your project and use it to find names in the document, as shown in Listing 11.
Listing 11. Name finding with OpenNLP
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.util.Span;
//...
InputStream nameFinderFile = this.getClass().getClassLoader().getResourceAsStream("en-ner-person.bin");
TokenNameFinderModel nameFinderModel = new TokenNameFinderModel(nameFinderFile);
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
Span[] names = nameFinder.find(tokens);
System.out.println("names: " + names.length);
for (Span nameSpan : names){
// A Span's start index is inclusive and its end index is exclusive
System.out.println("name: " + nameSpan + " : " + tokens[nameSpan.getStart()] + " " + tokens[nameSpan.getEnd()-1]);
}
In Listing 11 we load the model and use it to instantiate a NameFinderME object, which we then use to get an array of names, modeled as Span objects. A span has a start and an end that tell us where the detector thinks the name begins and ends in the set of tokens; the start index is inclusive and the end index is exclusive. Note that the name finder expects an array of already tokenized strings.
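Since a span is just a pair of indices into the token array, recovering the text of a name means joining the tokens from the start index up to, but not including, the end index. The sketch below shows that step with plain integers standing in for a span's start and end. The SpanText class and the sample tokens are hypothetical.

```java
import java.util.Arrays;

/** Illustrative helper for turning a token range into text. */
class SpanText {
    /** Joins tokens[start] through tokens[end - 1]; end is treated as exclusive. */
    static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = {"signed", "by", "John", "Hancock", "."};
        // A span with start 2 and end 4 covers the tokens at indices 2 and 3.
        System.out.println(cover(tokens, 2, 4));
    }
}
```

In the name-finding loop, the equivalent call would pass nameSpan.getStart() and nameSpan.getEnd() as the two indices.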
Tagging parts of speech with OpenNLP
OpenNLP allows us to tag parts of speech (POS) against tokenized strings. Listing 12 is an example of parts-of-speech tagging.
Listing 12. Parts-of-speech tagging
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
//…
InputStream posIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-pos-1.0-1.9.3.bin");
POSModel posModel = new POSModel(posIS);
POSTaggerME posTagger = new POSTaggerME(posModel);
String tags[] = posTagger.tag(tokens);
System.out.println("tags: " + tags.length);
for (int i = 0; i < 15; i++){
System.out.println(tokens[i] + " = " + tags[i]);
}
The process is similar: the model file is loaded into a model class, which is then used on the array of tokens. It outputs something like Listing 13.
Listing 13. Parts-of-speech output
tags: 1704
Declaration = NOUN
of = ADP
Independence = NOUN
: = PUNCT
A = DET
Transcription = NOUN
Print = VERB
This = DET
Page = NOUN
Note = NOUN
: = PUNCT
The = DET
following = VERB
text = NOUN
is = AUX
Unlike the name finding model, the POS tagger has done a good job. It correctly identified several different parts of speech. Examples in Listing 13 include NOUN, ADP (which stands for adposition), and PUNCT (for punctuation).
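Because the tagger returns one tag per token, the tags array is easy to summarize. As a small sketch, the following counts how often each tag occurs. The TagCounts class is a hypothetical helper, and the sample tags are the first six from Listing 13.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative helper that tallies part-of-speech tags. */
class TagCounts {
    /** Returns tag -> number of occurrences, sorted alphabetically by tag. */
    static Map<String, Integer> count(String[] tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tags = {"NOUN", "ADP", "NOUN", "PUNCT", "DET", "NOUN"};
        System.out.println(count(tags)); // prints {ADP=1, DET=1, NOUN=3, PUNCT=1}
    }
}
```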
Conclusion
In this article, you've seen how to add Apache OpenNLP to a Java project and use pre-built models for natural language processing. In some cases, you may need to develop your own model, but the pre-existing models will often do the trick. In addition to the models demonstrated here, OpenNLP includes features such as a document categorizer, a lemmatizer (which reduces words to their base forms), a chunker, and a parser. All of these are fundamental elements of a natural language processing system, and all are freely available with OpenNLP.