What is a synonym?
Webster’s defines “synonym” as “one of two or more words or expressions of the same language that have the same or nearly the same meaning in some or all senses.” The average 4th grader can probably give you a good definition complete with examples. Simple, right?
No. Not when you are dealing with legal writing. Definitely not when you’re dealing with patent law, which blends legal writing and technology, and encourages inventors to “act as their own lexicographer.”
DorothyAI CEO, Curt Wadsworth, walks the line...
The complexity of patent law makes the job of building a thesaurus based on the U.S. patent database difficult, to say the least. Our best hope is machine learning, which uses algorithms that map the words of each sentence to vectors of numerical values, producing “word embeddings.” Words with similar “scores” (i.e., vectors pointing in the same direction in high-dimensional space) are considered synonyms and are saved to the thesaurus. There is no human interaction with the algorithm during this process, so Dorothy had plenty of space to draw her own conclusions while we were building our thesaurus.
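To make “vectors pointing in the same direction” concrete, here is a minimal sketch of how similar terms could be pulled out of a set of embeddings using cosine similarity. It is an illustration only, not Dorothy's actual pipeline: the three-dimensional vectors, the 0.9 threshold, and the function names are all made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_terms(word, embeddings, threshold=0.9):
    """Return (term, score) pairs whose similarity to `word` meets the threshold."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real ones have hundreds of dimensions.
toy_embeddings = {
    "candle": [0.9, 0.1, 0.3],
    "taper": [0.8, 0.2, 0.3],
    "wick": [0.7, 0.1, 0.6],
    "zirconia": [0.0, 1.0, 0.1],
}

candidates = nearest_terms("candle", toy_embeddings)
```

With these toy numbers, “taper” and “wick” clear the threshold and “zirconia” does not. Nothing in the arithmetic knows what a candle is, which is how a “sausage” can slip through.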
The results are interesting. According to Dorothy, synonyms for “candle” include “sausage” and “chocolate.” I can’t think of two sentences that use the terms “candle” and “sausage” in the same context. A chocolate-scented candle makes sense. On the other hand, it takes a special person to mask the scent of dog in the living room with a sausage-scented candle. Dorothy sees something we don’t. However, this insight is not helpful to our customers, so we corrected the thesaurus. The results returned for this test case noticeably improved.
The biotech and chemistry portions of the thesaurus were messy. It turns out there are approximately 4.5 million unique peptide sequences in the U.S. patent database that are synonyms for everything from “absent” to “zirconia.” In the chemical arts, word embedding algorithms create new “words” after every dash and between sets of parentheses and brackets, creating around 2 million additional entries. Removing these expressions made a big difference. We have plans to deal with biomolecules and chemical compounds in the future.
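For a sense of what this kind of cleanup can look like, here is a rough sketch of a vocabulary filter. The heuristics are assumptions chosen for illustration, not the rules we actually used: a token is treated as a peptide sequence if it is a run of eight or more uppercase one-letter amino acid codes, and as a chemical-name fragment if it contains a digit, a parenthesis, bracket or brace, or begins or ends with a dash.

```python
import re

# Hypothetical heuristic: 8+ uppercase one-letter amino acid codes in a row.
PEPTIDE_RE = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]{8,}")

# Hypothetical heuristic: digits, (), [], {} anywhere, or a leading/trailing dash.
CHEM_FRAGMENT_RE = re.compile(r"[\d()\[\]{}]|^-|-$")

def is_noise(token):
    """True if the token looks like a peptide sequence or a chemical fragment."""
    if PEPTIDE_RE.fullmatch(token):
        return True
    return bool(CHEM_FRAGMENT_RE.search(token))

def clean_vocabulary(tokens):
    """Keep only the tokens that pass the noise filter, preserving order."""
    return [token for token in tokens if not is_noise(token)]

sample = ["candle", "MKTAYIAKQRQISFVK", "2-methyl", "(4-", "zirconia", "-yl"]
cleaned = clean_vocabulary(sample)
```

Here only “candle” and “zirconia” survive. A filter this blunt would also discard legitimate terms that happen to contain digits, so it is a starting point rather than a finished rule.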
Despite the nonsense, enlisting a machine that can’t “read” and doesn’t understand the meanings of the words it is associating as synonyms has its advantages. The algorithm holds no bias, doesn’t care about the dictionary meaning of the words, and can’t know if the sentence it is analyzing “sounds right.” In fact, “stop words” like gerunds and conjunctions are removed, so none of the sentences analyzed sound right. The remaining words are simply numbers in an equation. The score creates word associations that humans might not make. Should we?
Where is the line between relevant word associations and nonsense? Is there a bright-line rule (or factor) that distinguishes a novel combination of elements from absurdity? Is it the same for all classes of inventions and CPC sections? I have no idea. I don’t think Jay or Mike know either. But we are going to find out, and in the process, we will create a machine that redefines what it means to search for information.