It’s back to school time. The neighborhood kids are back to work learning the basics: reading, writing, arithmetic, and Java or C, maybe Python. Let’s hope they’re also developing an appreciation for learning and a hunger for knowledge.
My hunger for knowledge came from an essay about Ham, the space monkey, in the back of my 2nd grade reader. It was never part of our reading assignments, but I read it 2 or 3 times, transfixed by the possibilities and adventure that space travel held. Science became my thing. I eventually focused on biology and biochemistry, but I never lost interest in everything else. My success as a patent lawyer grew from this fascination with science and technology, allowing me to connect with inventors and learn new technologies quickly.
As far as I know, computers cannot develop a fascination with science and technology, but computers can learn. According to Bill Gates, this technology (i.e., “machine learning”) will be worth 10 Microsofts (that’s about $10 trillion for those keeping score). I don’t know about that. I do know that machine learning is already having a big impact on how we identify relevant patents.
"Be cool. Stay in school!"
To Dorothy, the words and sentences of a patent publication are a series of mathematical equations that are assigned based on a set of rules (i.e., a “model”). Words and sentences represented by similar equations have similar meanings. We create thesauruses by grouping words that have similar equations. The results have been interesting: sometimes amazingly insightful, sometimes completely outrageous. We adjust the weights of different components of the model to eliminate outrageous results while keeping those that are insightful.
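A minimal sketch of the grouping idea, with made-up two-dimensional word vectors (real models assign each word hundreds of dimensions, and the words and scores here are invented for illustration): words whose vectors point in nearly the same direction are grouped as near-synonyms.

```python
import math

# Hypothetical 2-D vectors; a real model learns hundreds of dimensions.
vectors = {
    "fastener": (0.90, 0.10),
    "screw":    (0.85, 0.20),
    "bolt":     (0.80, 0.15),
    "polymer":  (0.10, 0.95),
}

def cosine(a, b):
    """Cosine similarity: near 1.0 means similar meaning, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def thesaurus_entry(word, threshold=0.9):
    """Group every other word whose vector is close enough to `word`'s."""
    return sorted(
        other for other, vec in vectors.items()
        if other != word and cosine(vectors[word], vec) >= threshold
    )

print(thesaurus_entry("fastener"))  # → ['bolt', 'screw']
```

Raising or lowering the `threshold` is a crude stand-in for the weight-tuning described above: too loose and “polymer” sneaks into the fastener entry (an outrageous result), too tight and genuine synonyms are lost.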
There are two parts to creating machine learning algorithms: (1) identifying good (or bad!) results and (2) adjusting the weight of the right component to get as much good and as little bad as possible. “Training data” is data that has been curated by humans and is assumed to be correct. The computer adjusts the weights of the components of the model so that it can do the best possible job of predicting good or bad results for the training data. We expect that this trained model will do nearly as well predicting new cases that it hasn’t yet seen. Since machine learning systems operate on the assumption that the training data is always correct, bias in the training data introduces bias into the system.
I suspect the $10 trillion question is training a model to deal with bias in the training data. Bias is often hard for humans to spot. Models trained with hidden bias can cause huge problems, like penalizing a job applicant’s resume that includes the term “women.”
Lucky for us, the patent database is largely free from bias… we think. “Bias” in our training data is mostly errors in fact, incorrect assumptions, and changes in technology over time. This gives us the opportunity to build models that allow Dorothy to learn virtually any technology.
Despite our best efforts, it’s unlikely that Dorothy will develop an appreciation for the knowledge we’re feeding her. Her underlying machine learning models will, however, help us better organize technical information and make it easier to identify relevant patents in your search.