In this blog post I’m going to talk about Natural Language Processing (NLP). For my example, I’m going to use a list of book titles, including books about whales with white fins, blue whales, white whales, and Moby Dick the white whale. I’ll use NLP to enhance the search results that a simple search engine returns for these titles.
Searching
In this age, search engines have become very good at finding what you’re looking for. As a software developer or a system designer, you’ll need to provide a similar experience for your customers. Writing code that searches a database for keywords is rather straightforward. The most obvious technique is to return any title containing any word that appears in the search box.
Let me show an example that I’m sure everyone has seen or implemented before. I’ll create a table of book titles as a C# list. Normally this would be a SQL database containing records of book titles, but this is just a demonstration and I want to cut out all the ORM code, the database setup, etc. So I’ll provide a list of strings containing the titles. You can see the final code on my GitHub account (click here). Here is the list of book titles that I have hard-coded into the sample program:
Moby Dick
The Little Engine That Could
The Whale with the White Fin
The White Fish
The White Car
How to Paint With Black and White
Black and White Photography
My Little Pony
Do Androids Dream of Electric Sheep?
Something Wicked This Way Comes
I Was Told There'd Be Cake
Sperm whale
Whaling ship
All the whales
The Curious Incident of the Dog in the Night-Time
The Hollow Chocolate Bunnies of the Apocalypse
Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation
To Kill a Mockingbird
The Unbearable Lightness of Being
A Clockwork Orange
The blue whale
Midnight in the Garden of Good and Evil
The Perks of Being a Wallflower
The Man Without Qualities
Cloudy With a Chance of Meatballs
Humpback Whale
One Hundred Years of Solitude
The Elephant Tree
One Flew Over the Cuckoo's Nest
Requiem for a Dream
A Wrinkle in Time (Time Quintet, #1)
The Whale
White Whale
Next, I want to create some code that allows a user to type in some search text and grab all the titles that contain each word in their search text. Here’s the basic code:
private static void SimpleSearch(string searchText)
{
    // Lower-case the search text so matches are case-insensitive.
    var searchWords = searchText.ToLower().Split(' ');
    var results = new List<string>();

    foreach (var word in searchWords)
    {
        results.AddRange(
            (from t in books.Titles
             where t.ToLower().Contains(word)
             select t).ToList());
    }

    foreach (var result in results)
    {
        Console.WriteLine(result);
    }
}
If we search for “White Whale”, the result will be any book title that contains the word “White” or the word “Whale”. Here’s the result from the code above:
The Whale with the White Fin
The White Fish
The White Car
How to Paint With Black and White
Black and White Photography
White Whale
The Whale
Is this the result we really wanted? I don’t think so. First, there is the problem of context. If we wanted to see all the books about white whales, then we don’t want to see books like “Black and White Photography” or “The White Car”.
The obvious fix would be to change the algorithm to “and”. Only return books with a title that contains “White” and “Whale”. That will produce this list:
The Whale with the White Fin
White Whale
I don’t think this is the result that people who normally use Google or Bing are expecting either.
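For reference, the “and” variant can be sketched like this, assuming the same hard-coded title list (trimmed here). The names AndSearchDemo and AndSearch are mine, not from the original sample:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AndSearchDemo
{
    // A trimmed subset of the hard-coded titles from the sample program.
    static readonly List<string> Titles = new List<string>
    {
        "Moby Dick",
        "The Whale with the White Fin",
        "The White Car",
        "Black and White Photography",
        "White Whale",
        "The Whale"
    };

    // Return only the titles that contain EVERY word of the search text.
    public static List<string> AndSearch(string searchText)
    {
        var words = searchText.ToLower().Split(' ');
        return Titles.Where(t => words.All(w => t.ToLower().Contains(w)))
                     .ToList();
    }

    static void Main()
    {
        foreach (var title in AndSearch("White Whale"))
            Console.WriteLine(title);
    }
}
```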
The problem with the previous search algorithms is that they don’t recognize the context of the search. The search text has a subject: the “Whale”. What we probably want is any title that contains the word “Whale”, with the titles where the adjective “White” applies to the noun “Whale” ranked at the top. That means “The Whale with the White Fin” should fall toward the bottom: the “and” search has no idea that “White” applies to the whale’s “Fin” in that title. We need an algorithm that takes that into account.
Natural Language Processing – NLP
There are several NLP engines out there, including an API from Google and one from Bing (these are just off the top of my head). The code I’m going to use is from Stanford. I’m going to show what their POS (Part-Of-Speech) tagger produces. This code tags each word in a sentence to indicate whether the word is a noun, adjective, adverb, verb, etc. The output for a sentence looks something like this:
the/DT great/JJ white/JJ whale/NN is/VBZ named/VBN Moby/NNP Dick/NNP
For a full list of the part-of-speech tags you can go here. I’ll list a few:
CC    coordinating conjunction           and
DT    determiner                         the
EX    existential there                  there is
JJ    adjective                          big
JJR   adjective, comparative             bigger
JJS   adjective, superlative             biggest
MD    modal                              could, will
NN    noun, singular or mass             door
NNS   noun, plural                       doors
NNP   proper noun, singular              John
NNPS  proper noun, plural                Vikings
RB    adverb                             however, usually, naturally, here, good
RBR   adverb, comparative                better
RBS   adverb, superlative                best
TO    to                                 to go, to him
VB    verb, base form                    take
VBD   verb, past tense                   took
VBG   verb, gerund/present participle    taking
VBN   verb, past participle              taken
VBP   verb, sing. present, non-3rd       take
VBZ   verb, 3rd person sing. present     takes
WDT   wh-determiner                      which
WP    wh-pronoun                         who, what
Using the chart above with the tagged sentence (“the great white whale is named Moby Dick”), you can see that the POS tagger marked “whale” as a noun (NN) and the words “great” and “white” as adjectives (JJ). With this extra metadata, we can determine the subject of the sentence fragment and which words are its adjectives. Let’s build a search engine that only matches book titles whose noun matches the noun of the search text, then rank those results by the number of matching adjectives. The titles with the greatest number of matching adjectives should appear toward the top of the search list.
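The tagged output is easy to split apart. Here’s a small sketch that pulls nouns and adjectives out of the “word/TAG” format shown above; SplitTagged is my own helper name, not part of the Stanford API:

```csharp
using System;
using System.Collections.Generic;

static class TaggedTextDemo
{
    // Split "word/TAG" tokens into nouns (NN, NNS, NNP, NNPS) and adjectives (JJ, JJR, JJS).
    public static void SplitTagged(string tagged,
                                   out List<string> nouns,
                                   out List<string> adjectives)
    {
        nouns = new List<string>();
        adjectives = new List<string>();
        foreach (var token in tagged.Split(' '))
        {
            var parts = token.Split('/');
            if (parts[1].StartsWith("NN"))
                nouns.Add(parts[0].ToLower());
            else if (parts[1].StartsWith("JJ"))
                adjectives.Add(parts[0].ToLower());
        }
    }

    static void Main()
    {
        SplitTagged("the/DT great/JJ white/JJ whale/NN is/VBZ named/VBN Moby/NNP Dick/NNP",
                    out var nouns, out var adjectives);
        Console.WriteLine(string.Join(",", nouns));       // whale,moby,dick
        Console.WriteLine(string.Join(",", adjectives));  // great,white
    }
}
```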
To do this, I decided to use the Stanford NLP Parser. The NLP Parser produces a tree structure as well as the POS tagging. To use the NLP Parser, you’ll need to install the NuGet package (or download the sample program from my GitHub account), then download the data needed by the parser. This can be found on Stanford’s page here. I did not include the data in my GitHub repository (here), but my GitHub readme file tells where to install the data directory and how to unzip the jar file for the models directory. I used the free utility called 7-zip to extract the jar file.
I pre-process all the book titles to create a list of noun phrases for each book:
public class NounPhrase
{
    public List<string> Adjectives { get; set; }
    public string Noun { get; set; }
}
Once I filled in the metadata for all the books, I used the same parser to extract a noun phrase from the search text. My search algorithm then only matches the noun phrase metadata from the search against the noun phrase metadata of each book.
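With that structure in place, the core match check is simple. A sketch, assuming the NounPhrase class above; NounMatches is my own helper name, not from the sample program:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class NounPhrase
{
    public List<string> Adjectives { get; set; }
    public string Noun { get; set; }
}

static class MatchDemo
{
    // Does any of the book's noun phrases use the same noun as the search?
    public static bool NounMatches(List<NounPhrase> bookPhrases, NounPhrase search)
    {
        return bookPhrases.Any(np =>
            string.Equals(np.Noun, search.Noun, StringComparison.OrdinalIgnoreCase));
    }

    static void Main()
    {
        // "The Whale with the White Fin" has two noun phrases: "whale" and "white fin".
        var book = new List<NounPhrase>
        {
            new NounPhrase { Noun = "whale", Adjectives = new List<string>() },
            new NounPhrase { Noun = "fin",   Adjectives = new List<string> { "white" } }
        };
        var search = new NounPhrase { Noun = "Whale", Adjectives = new List<string> { "white" } };
        Console.WriteLine(NounMatches(book, search));   // True
    }
}
```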
When I first approached this problem I looked at a few sentence fragments that came out of the NLP Parser and I saw trees that looked like this:
(ROOT (FRAG (NP (DT the) (JJ white) (NN whale))))
I quickly wrote a recursive method in the TreeFind class to search for a noun phrase (NP) and then scan all the leaves at that level for adjectives (JJ) and nouns (NN). Then I discovered that there are proper nouns, marked as NNP, so I added those to the noun scanning. My first real brick wall came when I parsed this phrase:
The whale with the white fin
That fragment decoded into this tree:
(ROOT (NP (NP (DT the) (NN whale)) (PP (IN with) (NP (DT the) (JJ white) (NN fin)))))
There is a noun phrase that contains two noun phrases. To get around the nesting, I check whether a noun phrase has noun phrase child nodes; if it does, I skip down the tree and parse the lowest-level noun phrases. That took care of my immediate problem, except for the fact that there are two noun phrases, which I fixed by expanding my noun phrase storage to a list of noun phrases. Now I have to check whether the search noun (or nouns) is contained in the list of nouns for the book.
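The recursive walk can be sketched like this. I’m using a hypothetical, simplified Node type here; the real TreeFind class in the sample program walks the Stanford parser’s tree type instead:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified parse-tree node, standing in for the Stanford tree type.
class Node
{
    public string Label;
    public string Word;                       // set on leaves only
    public List<Node> Children = new List<Node>();

    public Node(string label, string word = null) { Label = label; Word = word; }
}

static class TreeFindSketch
{
    // Collect the lowest-level noun phrases: an NP with no NP descendants.
    public static void FindNounPhrases(Node node, List<List<string>> results)
    {
        bool hasNestedNp = node.Children.Any(ContainsNp);
        if (node.Label == "NP" && !hasNestedNp)
        {
            // Keep adjectives (JJ) and nouns (NN, NNP) at this level.
            results.Add(node.Children
                .Where(c => c.Label == "JJ" || c.Label == "NN" || c.Label == "NNP")
                .Select(c => c.Word)
                .ToList());
            return;
        }
        foreach (var child in node.Children)
            FindNounPhrases(child, results);
    }

    static bool ContainsNp(Node n) => n.Label == "NP" || n.Children.Any(ContainsNp);

    static void Main()
    {
        // (NP (NP (DT the) (NN whale)) (PP (IN with) (NP (DT the) (JJ white) (NN fin))))
        var np1 = new Node("NP");
        np1.Children.Add(new Node("DT", "the"));
        np1.Children.Add(new Node("NN", "whale"));
        var np2 = new Node("NP");
        np2.Children.Add(new Node("DT", "the"));
        np2.Children.Add(new Node("JJ", "white"));
        np2.Children.Add(new Node("NN", "fin"));
        var pp = new Node("PP");
        pp.Children.Add(new Node("IN", "with"));
        pp.Children.Add(np2);
        var tree = new Node("NP");
        tree.Children.Add(np1);
        tree.Children.Add(pp);

        var phrases = new List<List<string>>();
        FindNounPhrases(tree, phrases);
        foreach (var p in phrases)
            Console.WriteLine(string.Join(" ", p));   // "whale" then "white fin"
    }
}
```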
To make the ranking work, I score a matching noun as 100 and each matching adjective as 1, which allows up to 99 adjectives before an adjective count could outweigh a noun match. Each noun phrase is scored according to its matching noun and adjectives, and the highest-scoring noun phrase is returned. It works, but I would also test another algorithm that adds the scores of both noun phrases together; that way, a search string matching both noun phrases would be worth more than a single match. To test something like this, I’d have to set up more test data (maybe 1,000 records or more), and that data would need to contain many sentences with multiple noun phrases.
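A sketch of that scoring rule, assuming the NounPhrase class from earlier; Score is my own name for it, and the real scoring lives in the GitHub sample:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class NounPhrase
{
    public List<string> Adjectives { get; set; }
    public string Noun { get; set; }
}

static class ScoreDemo
{
    // A matching noun is worth 100; each matching adjective adds 1.
    public static int Score(NounPhrase book, NounPhrase search)
    {
        if (!string.Equals(book.Noun, search.Noun, StringComparison.OrdinalIgnoreCase))
            return 0;
        return 100 + search.Adjectives.Count(a =>
            book.Adjectives.Contains(a, StringComparer.OrdinalIgnoreCase));
    }

    static void Main()
    {
        var search     = new NounPhrase { Noun = "whale", Adjectives = new List<string> { "white" } };
        var whiteWhale = new NounPhrase { Noun = "whale", Adjectives = new List<string> { "white" } };
        var blueWhale  = new NounPhrase { Noun = "whale", Adjectives = new List<string> { "blue" } };
        Console.WriteLine(Score(whiteWhale, search));  // 101
        Console.WriteLine(Score(blueWhale, search));   // 100
    }
}
```

This reproduces the ranking shown below: “White Whale” scores 101, while titles that match only on the noun score 100.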
The next challenge came when I tried to use a conjunction, like this:
The blue or white whale
Producing a tree like this:
(ROOT (FRAG (NP (DT The) (ADJP (JJ blue) (CC or) (JJ white)) (NN whale))))
There is a new sub-tree called ADJP which stands for adjective phrase. To handle this in a very simple manner, I had to recursively call one more level to get the adjectives from that sub-tree.
What are the results of my search? Currently the search engine will produce this output for the search phrase “The white whale”:
101 White Whale
100 The Whale with the White Fin
100 Sperm whale
100 The blue whale
100 Humpback Whale
100 The Whale
I put the computed ranking on the left side for troubleshooting purposes. Notice that “The Whale with the White Fin” doesn’t count the word “white” as an adjective belonging to the noun “whale”. This is exactly what I was looking for: these results make more sense than the earlier “or” search.
Conjunctions
You may have noticed that I skipped right over the conjunction in one of my earlier tests (“The blue or white whale”): I just grabbed all the adjectives and pretended there was no conjunction. To enhance the algorithm, I would parse conjunctions properly. In the case of an “or”, each adjective would count just as it currently does. In the case of an “and”, there would need to be logic to ensure that both adjectives are present before either is counted. What could get tricky is a complex conjunction phrase. To handle all cases, I would store each “and” and “or” in an array along with the adjectives to be parsed, use a contains() check to find which adjectives match, then combine the true/false result of each adjective with the “and”/“or” operators to form a final true or false. This includes precedence (“and” binds before “or”) and parentheses. An example is:
The (white and blue) or red whale
Which produces this:
(ROOT (X (NP (NP (NP (NNP The)) (PRN (-LRB- -LRB-) (NP (JJ white) (CC and) (JJ blue)) (-RRB- -RRB-))) (CC or) (NP (ADJP (JJ red)) (NN whale)))))
PRN stands for a parenthetical. This tag is not listed in the Parts of Speech listing (not in this one either). The parenthesis tree nodes would need to be accounted for.
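Ignoring parentheses for a moment, an and-before-or evaluation over a flat adjective list could be sketched like this. Evaluate and its token format are my own invention, not part of the sample program:

```csharp
using System;
using System.Collections.Generic;

static class ConjunctionDemo
{
    // Tokens alternate adjectives and "and"/"or", e.g. ["blue","or","white"].
    // "and" binds tighter than "or"; parentheses are not handled in this sketch.
    public static bool Evaluate(IList<string> tokens, ISet<string> bookAdjectives)
    {
        bool orResult = false;    // running result of the completed "or" groups
        bool andResult = true;    // running result of the current "and" group
        for (int i = 0; i < tokens.Count; i += 2)
        {
            andResult = andResult && bookAdjectives.Contains(tokens[i]);
            bool atEnd = i + 1 >= tokens.Count;
            if (atEnd || tokens[i + 1] == "or")
            {
                orResult = orResult || andResult;
                andResult = true;  // start a new "and" group
            }
        }
        return orResult;
    }

    static void Main()
    {
        // "white and blue or red" — true if (white AND blue) OR red is satisfied.
        var tokens = new[] { "white", "and", "blue", "or", "red" };
        Console.WriteLine(Evaluate(tokens, new HashSet<string> { "red" }));            // True
        Console.WriteLine(Evaluate(tokens, new HashSet<string> { "white" }));          // False
        Console.WriteLine(Evaluate(tokens, new HashSet<string> { "white", "blue" }));  // True
    }
}
```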
Pluralization
Another problem is that a word might match but be in plural form. If a person searches for “white whale”, it would be nice to bring up any book titles that match “white whales”, perhaps ranking the pluralized version a little lower than the exact match. In the code I posted on GitHub, the noun “whales” will not match “whale”.
To handle this, I would try running each noun through a pluralization algorithm and store the results as extra nouns in the noun list. The book metadata structure would need to be enhanced to indicate which version of the noun is exact and which is the pluralized (or de-pluralized) variant.
For example, the search string “white whale” would match a book with “white whale” in the title, and a book with “white whales” should rank just below it. Each noun could carry a flag that is true when the noun is the original and false when it is the generated variant with the opposite pluralization.
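A very naive pluralizer for generating those extra noun variants might look like this. Pluralize and the variant tuple are my own sketch; a real implementation would use a proper pluralization library, which handles irregular nouns:

```csharp
using System;
using System.Collections.Generic;

static class PluralDemo
{
    // Naive English pluralization, just enough for illustration.
    public static string Pluralize(string noun)
    {
        if (noun.EndsWith("s") || noun.EndsWith("x") ||
            noun.EndsWith("ch") || noun.EndsWith("sh"))
            return noun + "es";
        if (noun.EndsWith("y") && noun.Length > 1 &&
            "aeiou".IndexOf(noun[noun.Length - 2]) < 0)
            return noun.Substring(0, noun.Length - 1) + "ies";
        return noun + "s";
    }

    // Each stored noun carries a flag: true for the original, false for the variant.
    public static List<(string Noun, bool IsOriginal)> Variants(string noun)
    {
        return new List<(string, bool)>
        {
            (noun, true),
            (Pluralize(noun), false)
        };
    }

    static void Main()
    {
        foreach (var v in Variants("whale"))
            Console.WriteLine($"{v.Noun} {v.IsOriginal}");   // whale True, whales False
    }
}
```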
Verbs
I have purposely ignored verbs and adverbs. I assume there are verb phrases, and that the algorithm to match verbs and adverbs would work along the same lines as the nouns and adjectives; this would be on my list of things to investigate. Your search engine might not need to account for verbs. Maybe you have a giant inventory of products and you just want to search for items in your warehouse(s); in that case, you probably don’t need to worry about verbs. Book titles could be another matter.
Conclusion
I’m sure there are other issues that need to be resolved. If you have a set of data that you perform a simple search against, you can feed that data to this quick-and-dirty program and see how it performs. Every program works great until it comes in contact with real data; that’s where the really complicated problems start. I always try to get a rather large sample of real data, so I can focus on the performance of my algorithm and make sure it’s not going to collapse the moment it’s deployed to a production environment. I would also like to see what the outliers are. A really good test setup would be to record the search text that your customers enter along with the data they are trying to search, execute those searches against your current algorithm, then try them against this algorithm and see if there is an improvement.
I want to point out that there are other search engines as well. Elasticsearch is one that I’ve used before. Elasticsearch is a NoSQL database system that you push your data to; it processes the data and produces really good search results. It’s fast, but you have to make sure you push your data before someone searches for it. If you design your system to update the Elasticsearch index from your own data with a background process, users will experience missed searches. In other words, if your users are entering data (like a recipe for chocolate cake), they are probably going to search for the data they just created. That means you’ll need to save the data in your database and immediately save it to Elasticsearch as well.
If you’re looking for another NLP processor, there are others available. AWS (Comprehend) and Azure (Text Analytics) both have NLP APIs. I have not yet investigated either of these products; I only point them out to make people aware that they exist. If I were seriously looking to create an NLP-based search engine, I would create demonstration programs using each of these plus the Elasticsearch engine and compare the differences. After listing the pros and cons, I would select the best choice for my product. Some of my criteria would be ease of development (i.e., what’s the development cost to get it rolling), recurring cost, performance, and quality of results.