NLP and The White Whale

In this blog post I’m going to talk about Natural Language Processing (NLP). For my example, I’m going to use a list of book titles, including books about whales with white fins, blue whales, white whales, and Moby Dick, the white whale. I’ll use NLP to improve the results a simple search engine returns for these titles.

Searching

In this age, search engines have become very good at finding what you’re looking for. As a software developer or system designer, you’ll need to provide a similar experience for your customers. Writing code that searches a database for keywords is rather straightforward. The most obvious technique would be to search for any word that appears in the search box.

Let me show an example that I’m sure everyone has seen or implemented before. I’ll create a table of book titles in a C# list. Normally this would be a SQL database that contains records of book titles, but this is just a demonstration and I want to cut out all the ORM code, the database setup, etc. So I’ll provide a list of strings that contains the titles. You can see the final code on my GitHub account (click here). Here is the list of book titles that I have hard-coded into the sample program:

Moby Dick
The Little Engine That Could
The Whale with the White Fin
The White Fish
The White Car
How to Paint With Black and White
Black and White Photography
My Little Pony
Do Androids Dream of Electric Sheep?
Something Wicked This Way Comes
I Was Told There'd Be Cake
Sperm whale
Whaling ship
All the whales
The Curious Incident of the Dog in the Night-Time
The Hollow Chocolate Bunnies of the Apocalypse
Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation
To Kill a Mockingbird
The Unbearable Lightness of Being
A Clockwork Orange
The blue whale
Midnight in the Garden of Good and Evil
The Perks of Being a Wallflower
The Man Without Qualities
Cloudy With a Chance of Meatballs
Humpback Whale
One Hundred Years of Solitude
The Elephant Tree
One Flew Over the Cuckoo's Nest
Requiem for a Dream
A Wrinkle in Time (Time Quintet, #1)
The Whale
White Whale

Next, I want to create some code that allows a user to type in some search text and grab all the titles that contain each word in their search text. Here’s the basic code:

private static void SimpleSearch(string searchText)
{
	// lower-case the search words so they can match the lower-cased titles
	var searchWords = searchText.ToLower().Split(' ');

	var results = new List<string>();

	foreach (var word in searchWords)
	{
		results.AddRange((from t in books.Titles where t.ToLower().Contains(word) select t).ToList());
	}

	// Distinct() keeps a title from printing twice when it matches more than one word
	foreach (var result in results.Distinct())
	{
		Console.WriteLine(result);
	}
}

If we search for “White Whale”, the result will be any book title that contains the word “White” or the word “Whale”. Here’s the result from the code above:

The Whale with the White Fin
The White Fish
The White Car
How to Paint With Black and White
Black and White Photography
White Whale
The Whale

Is this the result we really wanted? I don’t think so. First, there is the problem of context. If we wanted to see all the books about white whales, then we don’t want to see books like “Black and White Photography” or “The White Car”.

The obvious fix would be to change the algorithm to an “and”: only return books whose title contains both “White” and “Whale”. That produces this list:

The Whale with the White Fin
White Whale

I don’t think this is the result that people who normally use Google or Bing would expect either.
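For reference, here’s a minimal sketch of that “and” variant. The trimmed title list and the `AndSearch` name are mine, just for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AndSearchDemo
{
    // A trimmed copy of the hard-coded title list, enough to show the behavior.
    static readonly List<string> Titles = new List<string>
    {
        "Moby Dick",
        "The Whale with the White Fin",
        "The White Fish",
        "Black and White Photography",
        "White Whale",
        "The Whale"
    };

    // Return titles that contain every word of the search text (case-insensitive).
    public static List<string> AndSearch(string searchText)
    {
        var words = searchText.ToLower().Split(' ');
        return Titles
            .Where(t => words.All(w => t.ToLower().Contains(w)))
            .ToList();
    }

    static void Main()
    {
        foreach (var title in AndSearch("White Whale"))
            Console.WriteLine(title);
    }
}
```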

The problem with both of the previous search algorithms is that they don’t recognize the context of the search. The search text has a subject: the “Whale”. What we probably want is any title that contains the word “Whale”, with the titles where the adjective “White” modifies the noun “Whale” ranked at the top. In other words, the very top search results should be titles where “White” applies to “Whale”, which means “The Whale with the White Fin” should be near the bottom. The “and” search has no idea that the word “White” applies to the “Fin” of the whale in that title. We need an algorithm that takes that into account.

Natural Language Processing – NLP

There are several NLP engines out there, including an API from Google and one from Bing (those are just off the top of my head). The code I’m going to use is from Stanford. I’m going to show what their POS (Parts Of Speech) tagger produces. What this code does is tag each word in a sentence to indicate whether the word is a noun, adjective, adverb, verb, etc. The output for a sentence looks something like this:

the/DT great/JJ white/JJ whale/NN is/VBZ named/VBN Moby/NNP Dick/NNP
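If you need that tagged output as data rather than text, it splits back into word/tag pairs with a few lines of C#. The `ParseTagged` helper is my own name for this sketch, not part of the Stanford API:

```csharp
using System;
using System.Linq;

static class TagDemo
{
    // Split "word/TAG word/TAG ..." tagger output into (word, tag) pairs.
    public static (string Word, string Tag)[] ParseTagged(string tagged)
    {
        return tagged.Split(' ')
            .Select(token =>
            {
                // The tag follows the LAST slash, so a word like "3/4" still parses.
                int i = token.LastIndexOf('/');
                return (token.Substring(0, i), token.Substring(i + 1));
            })
            .ToArray();
    }

    static void Main()
    {
        foreach (var (word, tag) in ParseTagged("the/DT great/JJ white/JJ whale/NN"))
            Console.WriteLine($"{word} -> {tag}");
    }
}
```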

For a list of the parts of speech tags you can go here. I’ll list a few here:

CC	coordinating conjunction		and
DT	determiner				the
EX	existential there			there is
JJ	adjective				big
JJR	adjective, comparative			bigger
JJS	adjective, superlative			biggest
MD	modal					could, will
NN	noun, singular or mass			door
NNS	noun plural				doors
NNP	proper noun, singular			John
NNPS	proper noun, plural			Vikings
RB	adverb					however, usually, naturally, here, good
RBR	adverb, comparative			better
RBS	adverb, superlative			best
TO	to					to go, to him
VB	verb, base form				take
VBD	verb, past tense			took
VBG	verb, gerund/present participle		taking
VBN	verb, past participle			taken
VBP	verb, sing. present, non-3d		take
VBZ	verb, 3rd person sing. present		takes
WDT	wh-determiner				which
WP	wh-pronoun				who, what

Using the chart above with the tagged sentence (“the great white whale is named Moby Dick”), you can see that the POS tagger marked “whale” as a noun (NN) and the words “great” and “white” as adjectives (JJ). With this extra metadata, we can determine which word is the subject of the sentence fragment and which words are its adjectives. Let’s build a search engine that only matches book titles whose noun matches the noun of the search text, then rank those titles by the number of matching adjectives. The titles with the greatest number of matching adjectives should appear toward the top of the search list.

To do this, I decided to use the Stanford NLP Parser. The NLP Parser produces a tree structure as well as the POS tagging. To use the NLP Parser, you’ll need to install the NuGet package (or download the sample program from my GitHub account), then download the data needed by the parser. This can be found on Stanford’s page here. I did not include the data in my GitHub repository (here), but my GitHub readme file tells where to install the data directory and how to unzip the jar file for the models directory. I used the free utility called 7-zip to extract the jar file.

I pre-process all the book titles to create a list of noun phrases for each book:

public class NounPhrase
{
    public List<string> Adjectives { get; set; }
    public string Noun { get; set; }
}

Once I filled in all the books, I used the same parser to return one noun phrase from the search text. Then my search algorithm only matches the noun phrase metadata from my search against the noun phrase metadata of each book.

When I first approached this problem I looked at a few sentence fragments that came out of the NLP Parser and I saw trees that looked like this:

(ROOT
  (FRAG
    (NP (DT the) (JJ white) (NN whale))))

I quickly wrote a recursive method in the TreeFind class to search for a noun phrase (NP) and then scan all the leaves of that level for adjectives (JJ) and nouns (NN). Then I discovered that there are proper nouns and they are marked as NNP. I added those to the noun scanning. My first real brick wall was when I parsed this phrase:

The whale with the white fin

That fragment decoded into this tree:

(ROOT
  (NP
    (NP (DT the) (NN whale))
    (PP (IN with)
      (NP (DT the) (JJ white) (NN fin)))))

There is a noun phrase that contains two noun phrases. To get around the nested noun phrases, I checked to see if a noun phrase had noun phrase child nodes. If it did, then I skipped down the tree and parsed the lowest-level noun phrases. That took care of my immediate problem, except for the fact that there are two noun phrases, which I fixed by expanding my noun phrase storage to be a list of noun phrases. Now I have to check whether the search noun or nouns are contained in the list of nouns for the book.
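That recursive scan can be sketched like this. The `Node` type is a stand-in for the parser’s tree class, `TreeFindSketch` is illustrative rather than my actual `TreeFind` code, and `NounPhrase` mirrors the class above with fields instead of properties:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the parser's tree type: a label plus child nodes.
class Node
{
    public string Label;
    public string Word;                       // set on leaf (word-level) nodes only
    public List<Node> Children = new List<Node>();
}

class NounPhrase
{
    public List<string> Adjectives = new List<string>();
    public string Noun;
}

static class TreeFindSketch
{
    // Recurse down to the lowest-level noun phrases and collect their
    // adjectives (JJ) and nouns (NN, NNP).
    public static void FindNounPhrases(Node node, List<NounPhrase> results)
    {
        bool hasNestedNp = node.Children.Any(c => c.Label == "NP");
        if (node.Label == "NP" && !hasNestedNp)
        {
            var np = new NounPhrase();
            foreach (var child in node.Children)
            {
                if (child.Label == "JJ") np.Adjectives.Add(child.Word);
                if (child.Label == "NN" || child.Label == "NNP") np.Noun = child.Word;
            }
            if (np.Noun != null) results.Add(np);
            return;
        }
        foreach (var child in node.Children)
            FindNounPhrases(child, results);
    }

    // Small builders so the tree below reads like the parser output.
    public static Node N(string label, params Node[] kids) =>
        new Node { Label = label, Children = kids.ToList() };
    public static Node W(string label, string word) =>
        new Node { Label = label, Word = word };

    static void Main()
    {
        // (NP (NP (DT the) (NN whale)) (PP (IN with) (NP (DT the) (JJ white) (NN fin))))
        var tree = N("NP",
            N("NP", W("DT", "the"), W("NN", "whale")),
            N("PP", W("IN", "with"),
                N("NP", W("DT", "the"), W("JJ", "white"), W("NN", "fin"))));

        var phrases = new List<NounPhrase>();
        FindNounPhrases(tree, phrases);
        foreach (var np in phrases)
            Console.WriteLine($"{string.Join(" ", np.Adjectives)} {np.Noun}".Trim());
    }
}
```

Run against the “whale with the white fin” tree, this yields two noun phrases, with “white” attached to “fin” rather than “whale”.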

To make that work right, I scored a matching noun as 100. That leaves room for up to 99 adjectives that each count as 1. Each noun phrase is scored according to the matching noun and adjectives, and the highest-scoring noun phrase is returned. It works, but I would also test an algorithm that adds the scores of both noun phrases together. That way, if a search string matched both noun phrases, the book would be worth more than a single match. To test something like this, I’d have to set up more test data (maybe 1,000 records or more), and my test data would need to contain many sentences with multiple noun phrases.
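Under those rules, the score for one noun phrase pair can be computed like this (a sketch; `Score` and its parameter names are mine, not the posted code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ScoringSketch
{
    // A matching noun is worth 100; each matching adjective adds 1,
    // so even 99 matching adjectives can never outrank a noun match.
    public static int Score(string searchNoun, List<string> searchAdjectives,
                            string bookNoun, List<string> bookAdjectives)
    {
        if (!string.Equals(searchNoun, bookNoun, StringComparison.OrdinalIgnoreCase))
            return 0;

        return 100 + searchAdjectives.Count(a =>
            bookAdjectives.Contains(a, StringComparer.OrdinalIgnoreCase));
    }
}
```

For the search “the white whale” (noun “whale”, adjective “white”), the title “White Whale” scores 101 while the “whale” phrase of “The Whale with the White Fin” scores only 100, because its “white” belongs to the “fin” phrase.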

The next challenge came when I tried to use a conjunction, like this:

The blue or white whale

Producing a tree like this:

(ROOT
  (FRAG
    (NP (DT The)
      (ADJP (JJ blue)
        (CC or)
        (JJ white))
      (NN whale))))

There is a new sub-tree called ADJP, which stands for adjective phrase. To handle this in a very simple manner, I recursively call one more level to collect the adjectives from that sub-tree.

What are the results of my search? Currently the search engine will produce this output for the search phrase “The white whale”:

101  White Whale
100  The Whale with the White Fin
100  Sperm whale
100  The blue whale
100  Humpback Whale
100  The Whale

I put the computed ranking on the left side for troubleshooting purposes. Notice how “The Whale with the White Fin” doesn’t count the word “white” as an adjective that belongs to the noun “whale”. This is exactly what I was looking for, and the results make more sense than the earlier “or” search.

Conjunctions

You may have noticed that I skipped right over the conjunction in one of my earlier tests (“The blue or white whale”): I just grabbed all the adjectives and pretended there was no conjunction. To enhance the algorithm, I would parse any conjunction properly. In the case of an “or”, each adjective would count as it currently does. In the case of an “and”, there would need to be logic to ensure both adjectives are present before either is counted. What could get tricky is a complex conjunction phrase. To handle all cases, I would store each “and” and “or” in an array along with the adjectives to be parsed, then use a contains() check to find out which adjectives match and combine the true/false of each adjective with the “and/or” operators to form a final true or false. This includes precedence (“and” before “or”) and parentheses. An example is:

The (white and blue) or red whale

Which produces this:

(ROOT
  (X
    (NP
      (NP
        (NP (NNP The))
        (PRN (-LRB- -LRB-)
          (NP (JJ white)
            (CC and)
            (JJ blue))
          (-RRB- -RRB-)))
      (CC or)
      (NP
        (ADJP (JJ red))
        (NN whale)))))

PRN stands for a parenthetical. It is not listed in the Parts of Speech listing (not in this one either). The parenthesis tree nodes would need to be accounted for.
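Ignoring the parentheses for a moment, the and-before-or evaluation I described can be sketched as follows. `Matches` is a hypothetical helper, and a parenthesized phrase would need a recursive version of it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ConjunctionSketch
{
    // Evaluate a flat token list like ["white", "and", "blue", "or", "red"]
    // against the adjectives of a book noun phrase. "and" binds tighter than
    // "or": split on "or" first, then require every adjective in each "and" group.
    public static bool Matches(List<string> tokens, List<string> bookAdjectives)
    {
        var orGroups = string.Join(" ", tokens)
            .Split(new[] { " or " }, StringSplitOptions.None);

        return orGroups.Any(group =>
            group.Split(new[] { " and " }, StringSplitOptions.None)
                 .All(adj => bookAdjectives.Contains(adj.Trim())));
    }
}
```

For the tokens of “white and blue or red”, a book phrase with the adjectives “white” and “blue” matches, one with only “red” matches, and one with only “white” does not.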

Pluralization

Another problem is the possibility that a word is the same but in plural form. If a person searches for “white whale” it would be nice to bring up any book titles that matched “white whales”. Maybe the pluralized version ranks a little lower than the exact match to the search. In the code I posted on GitHub, the noun “whales” will not match “whale”.

To handle this, I think I would try running each noun through a pluralization algorithm and store those as extra nouns in the noun list. The book metadata structure will need to be enhanced to indicate which version of the noun is exact and which is pluralized (or not pluralized).

For example, the search string “white whale” would match a book with “white whale” in the title, and a book with “white whales” should rank just below it. Maybe each noun gets a flag: true if it is the original noun from the title, false if it is the generated singular/plural variant.
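A sketch of what generating those extra nouns might look like. `Pluralize` is my name, and it is deliberately naive; real English has irregular plurals (mouse/mice), so a lookup table or library would do better:

```csharp
using System;

static class PluralSketch
{
    // Naive English pluralizer -- enough to generate the extra noun
    // variants described above, not a complete solution.
    public static string Pluralize(string noun)
    {
        if (noun.EndsWith("s") || noun.EndsWith("x") ||
            noun.EndsWith("ch") || noun.EndsWith("sh"))
            return noun + "es";

        // consonant + y -> ies (pony -> ponies)
        if (noun.EndsWith("y") && noun.Length > 1 &&
            "aeiou".IndexOf(noun[noun.Length - 2]) < 0)
            return noun.Substring(0, noun.Length - 1) + "ies";

        return noun + "s";
    }
}
```

Each book noun would then be stored twice: “whale” flagged as the original and “whales” flagged as the generated variant, with the variant ranking slightly lower on a match.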

Verbs

I have purposely ignored verbs and adverbs. I’m assuming that there would be verb phrases and that the algorithm to match verbs and adverbs would be along the same lines as the nouns and adjectives. This would be on my list of things to investigate. Your search engine might not need to account for verbs. Maybe you have a giant inventory of products and you just want to be able to search for items in your warehouse(s); in that case, you probably don’t need to worry about verbs. Book titles could be another matter.

Conclusion

I’m sure there are other issues that need to be resolved. If you have a set of data that you perform a simple search against, you can run that data through this quick-and-dirty program and see how it performs. Every program works great until it comes in contact with real data; that’s where the really complicated problems start. I always try to get a sample of real data that is rather large, so I can focus on the performance of my algorithm and make sure it is not going to collapse the moment it’s deployed to a production environment. I would also like to see what the outliers are. A really good test setup would be to record the search text that your customers enter against the data they are trying to search. Execute those searches against your current algorithm, then against this algorithm, and see if there is an improvement.

I want to point out that there are other search engines as well. Elasticsearch is one that I’ve used before. Elasticsearch is a NoSQL search system that you push your data into; it indexes the data and produces really good search results. It’s fast, but you have to make sure your data has been pushed before someone searches for it. If you design your system to update the Elasticsearch index from your own data with a background process, users will experience missed searches. In other words, if your users are entering data (like a recipe for chocolate cake), they are probably going to search for the data they just created. That means you’ll need to save the data in your database and immediately save it to Elasticsearch as well.

If you’re looking for another NLP processor, there are others available as well. AWS (Comprehend) and Azure (Text Analytics) both have NLP APIs. I have not yet investigated either of these products; I only point them out to make people aware that they exist. If I were seriously looking to create an NLP-based search engine, I would create demonstration programs using each of these plus the Elasticsearch engine and compare the differences. After listing all the pros and cons, I would select the best choice for my product. Some of my criteria would be ease of development (i.e., what it costs to get it rolling), recurring cost, performance, and quality of results.
