Dorothy uses natural language processing to search the patent database. Many search platforms offer semantic search capabilities, which seem to vary in their effectiveness. As with Dorothy, the query for a semantic search is a plain English description of the thing being searched. You are probably asking yourself, “What’s the difference?”
The Boolean/keyword searching that we’ve all become accustomed to scours the text of searched documents for exact, or very close, matches to the words of the query.
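To make the limitation concrete, here is a minimal sketch of keyword matching in Python. The three-document corpus and the function names are made up for illustration; real search platforms are far more elaborate, but the core idea is the same: a document is returned only if it contains the query words.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query, documents):
    """Return the ids of documents containing every query word (Boolean AND)."""
    query_words = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if query_words <= tokenize(text)]

# Hypothetical mini-corpus, for illustration only.
documents = {
    "A": "A pickup truck with an extended cargo bed.",
    "B": "A light-duty motor vehicle having an open load platform.",
    "C": "An adjustable bed frame for a mattress.",
}

print(keyword_search("pickup truck bed", documents))
```

Document B arguably describes the same kind of thing as document A, but it shares no words with the query, so a keyword search never sees it.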
Semantic search takes this a step further by guessing what concepts are implied by the query, based on its individual words and on adjacent or co-occurring words. These concepts are associated with many more documents than those that contain exact or close matches to the words in the query, which makes the search broader.
Dorothy uses natural language processing to make guesses not only about the concepts used in the query, but also about the importance of individual concepts, how the concepts are arranged, and the inventor’s intent. The query is examined as a whole to assess the overall “gist” of the text, taking into account modifiers and indicators of intent (for example “most importantly…”). These factors are used to identify the concepts as well as their relative importance to the query as a whole. In this way, natural language processing takes semantic searching to another level.
Here’s an example:
In a basic semantic search, the machine extracts key concepts in the text of the query based on keywords and phrases used in the text. For example, the phrase “a red pickup truck having a 16 foot bed” might be extracted as a “motorized vehicle” by a semantic search engine.
Red pickup truck? Dorothy does not do visual recognition.
Typically, a search query will contain more than one concept. These concepts may or may not be applicable to the invention being searched. For example, the semantic engine might identify the “16 foot bed” in our pickup truck example as being a “sleep device,” and search for patents that describe this concept along with “motorized vehicles.” Imprecision in the English language makes these unintended concepts common, so many semantic engines allow the user to review the extracted concepts and remove unintended ones, like “sleep devices,” before performing the search. Google Patents recently began posting extracted concepts at the bottom of the returned results.
The combination of concepts identified in the query is then used to search the database. Returned results are weighted and ranked based on whether and how often the key concepts are discussed. Patents that discuss combinations of key concepts are scored more highly than patents that discuss individual concepts, and patents that discuss key concepts more frequently are scored more highly than those that discuss the concepts fewer times in the text. The numbers to the right of the concepts in the Google Patents results are most likely used as a weighting factor.
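The extract, review, and rank steps described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny concept lexicon, the three sample documents, and especially the scoring formula, which simply multiplies the number of distinct matched concepts by how often they appear. We don’t know how any particular semantic engine actually weights its results.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping words to concepts (illustration only).
CONCEPT_TERMS = {
    "truck": "motorized vehicle",
    "vehicle": "motorized vehicle",
    "bed": "sleep device",
    "mattress": "sleep device",
}

def extract_concepts(text):
    """Count how often each known concept is mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(CONCEPT_TERMS[w] for w in words if w in CONCEPT_TERMS)

def score(document, query_concepts):
    """Assumed scoring rule: distinct matched concepts times total mentions."""
    doc_counts = extract_concepts(document)
    matched = [c for c in query_concepts if doc_counts[c] > 0]
    frequency = sum(doc_counts[c] for c in matched)
    return len(matched) * frequency  # rewards combinations and repetition

query = "a red pickup truck having a 16 foot bed"
query_concepts = extract_concepts(query)  # includes the unintended "sleep device"

# The user reviews the extracted concepts and removes the unintended one.
reviewed_concepts = set(query_concepts) - {"sleep device"}

documents = {
    "truck": "A truck frame. The truck is a vehicle driven by an engine.",
    "camper": "A vehicle fitted with a folding mattress.",
    "furniture": "A mattress and bed frame.",
}

for name, text in documents.items():
    print(name, score(text, query_concepts), score(text, reviewed_concepts))
```

With the unintended “sleep device” concept left in, the camper document scores highest (4, versus 3 for the truck document) because it combines both concepts. Once the user removes that concept, the truck document moves to the top and the furniture document drops to zero.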
Dorothy extracts more information from the query by analyzing it as a coherent whole. Dorothy identifies concepts, and also makes decisions about how the concepts are related and which concepts are most relevant. This way, Dorothy “understands” that the “bed” of the “red pickup truck having a 16 foot bed” is a component of the pickup truck and not a “sleep device,” because the terms “truck” and “bed” are often found in close proximity to one another in the training data. This eliminates unintended concepts, like “sleep devices,” and allows Dorothy to extract the inventor’s intent from the search query.
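The kind of proximity signal mentioned above can be illustrated with a simple co-occurrence count. This is only a sketch under loud assumptions: the four “training” sentences are invented, “close proximity” is simplified to “in the same sentence,” and it is not a description of how Dorothy is actually trained.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical training sentences, invented for illustration.
corpus = [
    "the truck bed holds cargo",
    "a pickup truck with a long bed",
    "the bed has a mattress and pillow",
    "tailgate at the rear of the truck bed",
]

def cooccurrence_counts(sentences):
    """Count, for every pair of words, how many sentences contain both."""
    counts = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(corpus)

def association(word_a, word_b):
    """Number of sentences in which both words appear."""
    return counts[tuple(sorted((word_a, word_b)))]

print(association("bed", "truck"), association("bed", "mattress"))
```

In this toy corpus “bed” appears alongside “truck” in three sentences and alongside “mattress” in only one, so a query that mentions a truck gives a strong hint that its “bed” is a truck component.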
Dorothy “understands” by creating mathematical expressions for trucks with beds when the database is indexed. The same algorithm parses the query “a red pickup truck having a 16 foot bed” and creates a mathematical expression for this phrase that is similar to the indexed phrases for trucks with beds. If you imagine these expressions being plotted, all phrases describing trucks with beds would be points very near each other. Points from indexed patents that are closer to the query point are more relevant to the search query and are included in the returned results.
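If you want to picture the plotting idea in code, here is a minimal sketch. It assumes the “mathematical expressions” can be treated as points with coordinates, and the 2-D coordinates below are hand-picked for illustration rather than produced by any real model. Ranking is by plain straight-line distance to the query point.

```python
import math

# Hypothetical 2-D points for indexed phrases (hand-picked, not model output).
indexed = {
    "truck with cargo bed": (0.90, 0.10),
    "flatbed truck": (0.80, 0.20),
    "adjustable sleeping bed": (0.10, 0.90),
}

# Hypothetical point for the query "a red pickup truck having a 16 foot bed".
query_point = (0.88, 0.12)

def nearest(query, points, k=2):
    """Return the k indexed phrases whose points lie closest to the query."""
    ranked = sorted(points, key=lambda name: math.dist(query, points[name]))
    return ranked[:k]

print(nearest(query_point, indexed))
```

The two truck phrases sit close to the query point and are returned, while the sleeping bed lands far away and is left out.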
Specific parts of the phrase, like bed length, are also part of the mathematical expression, so the indexed points and the query point do not overlap. In our example, the 16 foot bed would distinguish our red pickup truck from most “prior art” pickup trucks, which have 4-8 foot beds. This part of the expression would create a gap between the point representing our query and the points representing indexed pickup trucks.
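A rough way to see that gap is to give each truck point an extra coordinate for bed length. The encoding below is purely hypothetical: two fixed “truck with bed” coordinates plus the bed length divided by an arbitrary scale of 20. It is only meant to show how one differing detail pushes otherwise similar points apart.

```python
import math

def truck_point(bed_length_ft):
    """Hypothetical encoding: shared truck coordinates plus a scaled bed length."""
    return (0.9, 0.1, bed_length_ft / 20.0)

query = truck_point(16)  # our 16 foot bed
prior_art = {length: truck_point(length) for length in (4, 6, 8)}

# Distance from the query to each prior-art truck.
gaps = {length: math.dist(query, point) for length, point in prior_art.items()}

# Spread among the prior-art trucks themselves, for comparison.
prior_art_spread = math.dist(prior_art[4], prior_art[8])

print(gaps, prior_art_spread)
```

Under these assumptions the prior-art trucks cluster together, and the query sits at least twice as far from any of them as they sit from one another.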
There is a lot that goes into building a mathematical expression from a search query. Perhaps we’ll get into that in another post.
Semantic searching is a much better way to retrieve information than Boolean/keyword searching. But there are real advantages to natural language processing. As we become better able to distinguish specific components of various phrases, like numerical ranges, Dorothy will be able to do things that can’t be done with a semantic search engine. In other words, Dorothy will be better than any search engine ever created.