How to Build a Natural Language Processing App

Natural language processing—a technology that allows software applications to process human language—has become fairly ubiquitous over the last few years.

Google search is increasingly capable of answering natural-sounding questions, Apple’s Siri is able to understand a wide variety of questions, and more and more companies are using (reasonably) intelligent chat and phone bots to communicate with customers. But how does this seemingly “smart” software really work?

In this article, you will learn about the technology that makes these applications tick, and how to develop natural language processing software of your own.

The article will walk you through the example process of building a news relevance analyzer. Imagine you have a stock portfolio, and you would like an app to automatically crawl through popular news websites and identify articles that are relevant to your portfolio. For example, if your stock portfolio includes companies like Microsoft, BlackStone, and Luxottica, you would want to see articles that mention these three companies.

Getting Started with the Stanford NLP Library

Natural language processing apps, like any other machine learning apps, are built on a number of relatively small, simple, intuitive algorithms working in tandem. It often makes sense to use an external library where all of these algorithms are already implemented and integrated.

For our example, we will use the Stanford NLP library, a powerful Java-based natural-language processing library that comes with support for many languages.

One particular algorithm from this library that we are interested in is the part-of-speech (POS) tagger. A POS tagger automatically assigns a part of speech to every word in a piece of text, classifying each word based on its lexical features and on the words around it. For example, given the sentence "Luxottica makes glasses," a POS tagger would label "Luxottica" as a proper noun, "makes" as a verb, and "glasses" as a plural noun.

The exact mechanics of the POS tagger algorithm are beyond the scope of this article, but you can learn more about it here.

To begin, we’ll create a new Java project (you can use your favorite IDE) and add the Stanford NLP library to the list of dependencies. If you are using Maven, simply add it to your pom.xml file:

<dependency>
 <groupId>edu.stanford.nlp</groupId>
 <artifactId>stanford-corenlp</artifactId>
 <version>3.6.0</version>
</dependency>
<dependency>
 <groupId>edu.stanford.nlp</groupId>
 <artifactId>stanford-corenlp</artifactId>
 <version>3.6.0</version>
 <classifier>models</classifier>
</dependency>

Since the app will need to automatically extract the content of an article from a web page, you will need to specify the following two dependencies as well:

<dependency>
 <groupId>de.l3s.boilerpipe</groupId>
 <artifactId>boilerpipe</artifactId>
 <version>1.1.0</version>
</dependency>
<dependency>
 <groupId>net.sourceforge.nekohtml</groupId>
 <artifactId>nekohtml</artifactId>
 <version>1.9.22</version>
</dependency>

With these dependencies added, you are ready to move forward.

Scraping and Cleaning Articles

The first part of our analyzer will involve retrieving articles and extracting their content from web pages.

When retrieving articles from news sources, the pages are usually riddled with extraneous information (embedded videos, outbound links, advertisements, etc.) that is irrelevant to the article itself. This is where Boilerpipe comes into play.

Boilerpipe is an extremely robust and efficient algorithm for removing "clutter": it identifies the main content of a news article by analyzing different content blocks using features like the average sentence length, the types of tags used in a content block, and the density of links. The boilerpipe algorithm has proven to be competitive with other, much more computationally expensive algorithms, such as those based on machine vision. You can learn more at its project site.

The Boilerpipe library comes with built-in support for scraping web pages. It can fetch the HTML from the web, extract the text from the HTML, and clean the extracted text. You can define a function, extractFromUrl, that takes a URL and uses Boilerpipe to return the most relevant text as a string, using ArticleExtractor for this task:

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerPipeExtractor {
   public static String extractFromUrl(String userUrl)
     throws java.io.IOException,
                  org.xml.sax.SAXException,
                  de.l3s.boilerpipe.BoilerpipeProcessingException  {
       final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(userUrl));
       final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
       return CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
   }
}

The Boilerpipe library provides several extractors based on the boilerpipe algorithm, with ArticleExtractor specifically optimized for HTML-formatted news articles. It focuses on the HTML tags used in each content block and on outbound link density, which makes it better suited to our task than the faster but simpler DefaultExtractor.
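If speed ever matters more than precision for some sources, swapping extractors is a one-line change. Here is a minimal sketch, assuming the same doc variable from the function above:

// DefaultExtractor is faster but less precise on news pages than ArticleExtractor
return CommonExtractors.DEFAULT_EXTRACTOR.getText(doc);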

The built-in functions take care of everything for us:

  • HTMLFetcher.fetch gets the HTML document
  • getTextDocument extracts the text document
  • CommonExtractors.ARTICLE_EXTRACTOR.getText extracts the relevant text from the article using the boilerpipe algorithm

Now you can try it out with an example article regarding the merger of optical giants Essilor and Luxottica, which you can find here. You can feed this URL to the function and see what comes out.

Add the following code to your main function:

public class App
{
   public static void main( String[] args )
      throws java.io.IOException,
                   org.xml.sax.SAXException,
                   de.l3s.boilerpipe.BoilerpipeProcessingException {
       String urlString = "http://www.reuters.com/article/us-essilor-m-a-luxottica-group-idUSKBN14Z110";
       String text = BoilerPipeExtractor.extractFromUrl(urlString);
       System.out.println(text);
   }
}

You should see in your output the main body of the article, without the ads, HTML tags, and outbound links. Here is the beginning snippet from what I got when I ran this:

MILAN/PARIS Italy's Luxottica (LUX.MI) and France's Essilor (ESSI.PA) have agreed a 46 billion euro ($49 billion) merger to create a global eyewear powerhouse with annual revenue of more than 15 billion euros.
The all-share deal is one of Europe's largest cross-border tie-ups and brings together Luxottica, the world's top spectacles maker with brands such as Ray-Ban and Oakley, with leading lens manufacturer Essilor.
"Finally ... two products which are naturally complementary -- namely frames and lenses -- will be designed, manufactured and distributed under the same roof," Luxottica's 81-year-old  founder Leonardo Del Vecchio said in a statement on Monday.
Shares in Luxottica were up by 8.6 percent at 53.80 euros by 1405 GMT (9:05 a.m. ET), with Essilor up 12.2 percent at 114.60 euros.
The merger between the top players in the 95 billion eyewear market is aimed at helping the businesses to take full advantage of expected strong demand for prescription spectacles and sunglasses due to an aging global population and increasing awareness about eye care.
Jefferies analysts estimate that the market is growing at between...

And that is indeed the main body of the article. It is hard to imagine this being much simpler to implement.

Tagging Parts of Speech

Now that you have successfully extracted the main article body, you can work on determining if the article mentions companies that are of interest to the user.

You may be tempted to simply do a string or regular expression search, but there are several disadvantages to this approach.

First of all, a string search may be prone to false positives. An article that mentions Microsoft Excel may be tagged as mentioning Microsoft, for instance.

Secondly, depending on the construction of the regular expression, a regular expression search can lead to false negatives. For example, an article that contains the phrase “Luxottica’s quarterly earnings exceeded expectations” may be missed by a regular expression search that searches for “Luxottica” surrounded by white spaces.

Finally, if you are interested in a large number of companies and are processing a large number of articles, searching through the entire body of the text for every company in the user’s portfolio may prove extremely time-consuming, yielding unacceptable performance.
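To make the first two pitfalls concrete, here is a minimal sketch (the example strings are hypothetical) showing a substring search producing a false positive and a whitespace-delimited regular expression producing a false negative:

import java.util.regex.Pattern;

public class NaiveSearchDemo {
    public static void main(String[] args) {
        // False positive: a plain substring search matches "Microsoft" inside "Microsoft Excel"
        String excelArticle = "New features are coming to Microsoft Excel this fall.";
        System.out.println(excelArticle.contains("Microsoft")); // prints true

        // False negative: requiring whitespace around the name misses the possessive form
        String luxotticaArticle = "Luxottica's quarterly earnings exceeded expectations.";
        Pattern spaceDelimited = Pattern.compile("\\sLuxottica\\s");
        System.out.println(spaceDelimited.matcher(luxotticaArticle).find()); // prints false
    }
}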

Stanford’s CoreNLP library has many powerful features and provides a way to solve all three of these problems.

For our analyzer, we will use the part-of-speech (POS) tagger. In particular, we can use the POS tagger to find all the proper nouns in the article and compare them to our portfolio of interesting stocks.

By incorporating NLP technology, we not only improve the accuracy of our tagger and minimize the false positives and false negatives mentioned above, but we also dramatically reduce the amount of text we need to compare against our portfolio of stocks, since proper nouns comprise only a small subset of the full text of the article.

By pre-processing our portfolio into a data structure that has low membership query cost, we can dramatically reduce the time needed to analyze an article.
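As a rough sketch of this idea (the company names are just examples), a portfolio pre-processed into a HashSet answers each membership query in constant time on average, no matter how many companies it holds:

import java.util.HashSet;
import java.util.Set;

public class PortfolioLookupDemo {
    public static void main(String[] args) {
        // Build once: O(n) for n portfolio companies
        Set<String> portfolio = new HashSet<>();
        portfolio.add("Microsoft");
        portfolio.add("BlackStone");
        portfolio.add("Luxottica");

        // Each query is O(1) on average, versus rescanning the article text
        System.out.println(portfolio.contains("Luxottica")); // prints true
        System.out.println(portfolio.contains("Apple"));     // prints false
    }
}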

Stanford CoreNLP provides a very convenient tagger called MaxentTagger that can perform POS tagging in just a few lines of code.

Here is a simple implementation:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.util.HashSet;

public class PortfolioNewsAnalyzer {
    private HashSet<String> portfolio;
    // The model file ships inside the stanford-corenlp models jar and is loaded from the classpath
    private static final String modelPath = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger";
    private MaxentTagger tagger;

    public PortfolioNewsAnalyzer() {
        portfolio = new HashSet<String>();
        tagger = new MaxentTagger(modelPath);
    }

    public String tagPos(String input) {
        return tagger.tagString(input);
    }
}

The tagger function, tagPos, takes a string as input and outputs a string containing the words of the original string along with their corresponding parts of speech. In your main function, instantiate a PortfolioNewsAnalyzer, feed the output of the scraper into the tagger function, and you should see something like this:

MILAN/PARIS_NN Italy_NNP 's_POS Luxottica_NNP -LRB-_-LRB- LUX.MI_NNP -RRB-_-RRB- and_CC France_NNP 's_POS Essilor_NNP -LRB-_-LRB- ESSI.PA_NNP -RRB-_-RRB- have_VBP agreed_VBN a_DT 46_CD billion_CD euro_NN -LRB-_-LRB- $_$ 49_CD billion_CD -RRB-_-RRB- merger_NN to_TO create_VB a_DT global_JJ eyewear_NN powerhouse_NN with_IN annual_JJ revenue_NN of_IN more_JJR than_IN 15_CD billion_CD euros_NNS ._. The_DT all-share_JJ deal_NN is_VBZ one_CD of_IN Europe_NNP 's_POS largest_JJS cross-border_JJ tie-ups_NNS and_CC brings_VBZ together_RB Luxottica_NNP ,_, the_DT world_NN 's_POS top_JJ spectacles_NNS maker_NN with_IN brands_NNS such_JJ as_IN Ray-Ban_NNP and_CC Oakley_NNP ,_, with_IN leading_VBG lens_NN manufacturer_NN Essilor_NNP ._. ``_`` Finally_RB ..._: two_CD products_NNS which_WDT are_VBP naturally_RB complementary_JJ --_: namely_RB frames_NNS and_CC lenses_NNS --_: will_MD be_VB designed_VBN ,_, manufactured_VBN and_CC distributed_VBN under_IN the_DT same_JJ roof_NN ,_, ''_'' Luxottica_NNP 's_POS 81-year-old_JJ founder_NN Leonardo_NNP Del_NNP Vecchio_NNP said_VBD in_IN a_DT statement_NN on_IN Monday_NNP ._. Shares_NNS in_IN Luxottica_NNP were_VBD up_RB by_IN 8.6_CD percent_NN at_IN 53.80_CD euros_NNS by_IN 1405_CD GMT_NNP -LRB-_-LRB- 9:05_CD a.m._NN ET_NNP -RRB-_-RRB- ,_, with_IN Essilor_NNP up_IN 12.2_CD percent_NN at_IN 114.60_CD euros_NNS ._. The_DT merger_NN between_IN the_DT top_JJ players_NNS in_IN the_DT 95_CD billion_CD eyewear_NN market_NN is_VBZ aimed_VBN at_IN helping_VBG the_DT businesses_NNS to_TO take_VB full_JJ advantage_NN of_IN expected_VBN strong_JJ demand_NN for_IN prescription_NN spectacles_NNS and_CC sunglasses_NNS due_JJ to_TO an_DT aging_NN global_JJ population_NN and_CC increasing_VBG awareness_NN about_IN...

Processing the Tagged Output into a Set

So far, we’ve built functions to download, clean, and tag a news article. But we still need to determine if the article mentions any of the companies of interest to the user.

To do this, we need to collect all the proper nouns and check if stocks from our portfolio are included in those proper nouns.

To find all the proper nouns, we will first want to split the tagged string output into tokens (using spaces as the delimiters), then split each of the tokens on the underscore (_) and check if the part of speech is a proper noun.

Once we have all the proper nouns, we will want to store them in a data structure that is better optimized for our purpose. For our example, we’ll use a HashSet. In exchange for disallowing duplicate entries and not keeping track of the order of the entries, HashSet allows very fast membership queries. Since we are only interested in querying for membership, the HashSet is perfect for our purposes.

Below is the function that implements the splitting and storing of proper nouns. Place this function in your PortfolioNewsAnalyzer class:

public static HashSet<String> extractProperNouns(String taggedOutput) {
    HashSet<String> propNounSet = new HashSet<String>();
    // Tokens are separated by spaces; each token looks like "word_TAG"
    String[] split = taggedOutput.split(" ");
    for (String token : split) {
        String[] splitTokens = token.split("_");
        // NNP is the Penn Treebank tag for a singular proper noun
        if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
            propNounSet.add(splitTokens[0]);
        }
    }
    return propNounSet;
}

There is an issue with this implementation, though. If a company's name consists of multiple words (e.g., Carl Zeiss in the Luxottica example), this implementation will be unable to catch it. In the case of Carl Zeiss, "Carl" and "Zeiss" will be inserted into the set separately, so the set will never contain the single string "Carl Zeiss."

To solve this problem, we can collect all the consecutive proper nouns and join them with spaces. Here is the updated implementation (you will need to import java.util.ArrayList, java.util.List, and a join helper such as edu.stanford.nlp.util.StringUtils, which ships with CoreNLP):

public static HashSet<String> extractProperNouns(String taggedOutput) {
    HashSet<String> propNounSet = new HashSet<String>();
    String[] split = taggedOutput.split(" ");
    // Buffer consecutive proper nouns so multi-word names stay together
    List<String> propNounList = new ArrayList<String>();
    for (String token : split) {
        String[] splitTokens = token.split("_");
        if (splitTokens.length >= 2 && splitTokens[1].equals("NNP")) {
            propNounList.add(splitTokens[0]);
        } else {
            if (!propNounList.isEmpty()) {
                propNounSet.add(StringUtils.join(propNounList, " "));
                propNounList.clear();
            }
        }
    }
    // Flush a proper-noun run that ends at the very end of the text
    if (!propNounList.isEmpty()) {
        propNounSet.add(StringUtils.join(propNounList, " "));
        propNounList.clear();
    }
    return propNounSet;
}

Now the function should return a set containing both the individual proper nouns and the runs of consecutive proper nouns joined by spaces. If you print the propNounSet, you should see something like the following:

[... Monday, Gianluca Semeraro, David Goodman, Delfin, North America, Luxottica, Latin America, Rossi/File Photo, Rome, Safilo Group, SFLG.MI, Friday, Valentina Za, Del Vecchio, CEO Hubert Sagnieres, Oakley, Sagnieres, Jefferies, Ray Ban, ...]

Comparing the Portfolio against the PropNouns Set

We are almost done!

In the previous sections, we built a scraper that can download and extract the body of an article, a tagger that can parse the article body and identify proper nouns, and a processor that takes the tagged output and collects the proper nouns into a HashSet. Now all that's left to do is to take that HashSet and compare it with the list of companies we're interested in.

The implementation is very simple. The portfolio set is already declared and initialized in the constructor, so all that remains is a method for adding companies and a method for checking the article's proper nouns against the set. Add the following code to your PortfolioNewsAnalyzer class (it uses java.util.Collections):

public void addPortfolioCompany(String company) {
    portfolio.add(company);
}

public boolean arePortfolioCompaniesMentioned(HashSet<String> articleProperNouns) {
    // disjoint returns true when the two collections share no elements,
    // so its negation is true exactly when a portfolio company appears
    return !Collections.disjoint(articleProperNouns, portfolio);
}

Putting it All Together

Now we can run the entire application: scraping, cleaning, tagging, collecting, and comparing. Here is the function that ties all of these steps together. Add it to your PortfolioNewsAnalyzer class:

public boolean analyzeArticle(String urlString) throws
        IOException,
        SAXException,
        BoilerpipeProcessingException
{
    // Scrape and clean the article, tag it, collect proper nouns, then compare
    String articleText = BoilerPipeExtractor.extractFromUrl(urlString);
    String tagged = tagPos(articleText);
    HashSet<String> properNounsSet = extractProperNouns(tagged);
    return arePortfolioCompaniesMentioned(properNounsSet);
}

Finally, we can use the app!

Here is an example using the same article as above and Luxottica as the portfolio company:

public static void main( String[] args ) throws
      IOException,
      SAXException,
      BoilerpipeProcessingException
{
   PortfolioNewsAnalyzer analyzer = new PortfolioNewsAnalyzer();
   analyzer.addPortfolioCompany("Luxottica");
   boolean mentioned = analyzer.analyzeArticle("http://www.reuters.com/article/us-essilor-m-a-luxottica-group-idUSKBN14Z110");
   if (mentioned) {
       System.out.println("Article mentions portfolio companies");
   } else {
       System.out.println("Article does not mention portfolio companies");
   }
}

Run this, and the app should print “Article mentions portfolio companies.”

Change the portfolio company from Luxottica to a company not mentioned in the article (such as “Microsoft”), and the app should print “Article does not mention portfolio companies.”

Building an NLP App Doesn’t Need to Be Hard

In this article, we stepped through the process of building an application that downloads an article from a URL, cleans it using Boilerpipe, processes it using Stanford NLP, and checks whether the article makes specific references of interest (in our case, companies in our portfolio). As demonstrated, leveraging this array of technologies turns what would otherwise be a daunting task into one that is relatively straightforward.
