AI is everywhere. It’s telling you what to buy, who to connect with, what you’re looking for when you enter a search query, and what to watch on TV. Many of these AI models, like Netflix’s recommendation engine, compare the shows you’ve watched and your preferences to those of users with similar preferences to guess what shows you’ll like.
The key to success for these AI models is the availability of data. The more data the model interacts with, the better it will be at guessing what you’ll like.
But not all data is equally easy to ingest.
Data, more specifically structured data, is essential to creating good AI models. Structured data is formatted, or includes metadata, in a way that allows the model to easily ingest it and put it to use. This “training data” ensures that unformatted input data is properly characterized by the model and produces an appropriate output. Structured data also gives the developer oversight over what data is appropriate and how it is categorized. This is referred to as “supervised learning.”
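To make this concrete, here is a minimal sketch of supervised learning on structured data. The records and labels are invented for illustration; real training sets are far larger and the models far more sophisticated. The point is that each piece of text carries a label (metadata) telling the model what it represents.

```python
from collections import Counter

# Structured training data: each record pairs raw text with a label
# (the metadata) chosen under developer supervision.
training_data = [
    ("a thriller series about a detective", "crime"),
    ("a detective hunts a serial killer", "crime"),
    ("a baking competition with amateur chefs", "cooking"),
    ("chefs compete in a kitchen showdown", "cooking"),
]

def predict(text):
    """Score each label by word overlap with its training examples."""
    words = set(text.lower().split())
    scores = Counter()
    for sample, label in training_data:
        scores[label] += len(words & set(sample.lower().split()))
    return scores.most_common(1)[0][0]

print(predict("a gritty detective drama"))  # -> crime
```

Because the developer chose the labels, the model’s categories are known in advance and its mistakes are easy to audit, which is exactly the oversight that supervised learning provides.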
Models like those used by Netflix or Google create training data from user inputs and activities. This data can be structured in a way that makes it easily accessible to the model. These models can easily “learn” what they need to and produce appropriate outputs.
Historical data is much harder for machine learning models to use as training data. The internet contains more than 1.2 million terabytes of data in text files, images, and video footage. The vast majority of this data is formatted the way the user entered it, with no distinguishing features or metadata to tell the computer what it is. This data is “unstructured.”
Unstructured training data creates a number of problems. Developers often focus on creating models that can ingest unstructured data and properly categorize it at the expense of supervision. Lack of supervision allows the model to create unintended relationships between data. Amazon’s HR machine learning model is a great example: it famously scored female candidates lower than male candidates. This bias was introduced into the model by its training data, which reflected and exacerbated intrinsic bias at Amazon. The model did not develop this bias organically.
Lucky for us, patent data is “semi-structured,” meaning that some signposts are built into patent documents in the form of sections, like the Abstract, Claims, and Detailed Description, and CPC classifications assigned to the application by patent office employees. Additional structure is added by patent lawyers. Terms like “embodiment” or “aspect” specifically call out discrete inventions. Boilerplate language and definitions describe concepts and elements of the invention used in the publication. These “signposts” produce identifiable patterns. Patent lawyers are very good at identifying these patterns and use them to quickly find information in a patent publication.
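Those signposts are regular enough that even a simple program can exploit them. The sketch below splits a toy publication into its sections and pulls out sentences flagged by “embodiment” or “aspect.” The headings and sample text are assumptions for the example, not drawn from a real publication or from any actual parsing pipeline.

```python
import re

# A toy, invented patent publication with the usual section signposts.
publication = """ABSTRACT
A widget with an improved fastener.
DETAILED DESCRIPTION
In one embodiment, the fastener comprises a threaded shaft.
In another aspect, the shaft is coated with a polymer.
CLAIMS
1. A widget comprising a fastener."""

# Split on the known section headings (the built-in structure).
parts = re.split(r"^(ABSTRACT|DETAILED DESCRIPTION|CLAIMS)$",
                 publication, flags=re.M)
sections = {heading: body.strip()
            for heading, body in zip(parts[1::2], parts[2::2])}

# Terms like "embodiment" and "aspect" call out discrete inventions.
embodiments = re.findall(r"In (?:one|another) (?:embodiment|aspect), [^.]+\.",
                         sections["DETAILED DESCRIPTION"])
print(len(embodiments))  # -> 2
```

A real system would handle far messier formatting, but the principle is the same: the document’s own conventions tell the machine where to look.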
Machine learning models are great at pattern recognition. The structure of patent documents and the patterns of “patentese” allow our machine learning models to ingest the data in patent publications, properly categorize it, and “learn” the underlying technology. Dorothy uses the information gathered by our machine learning models to understand a plain English search query and quickly identify patent publications that describe similar technology or specific elements of the search query.
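The idea of matching a plain-English query against publications can be illustrated with a bare-bones similarity search. Dorothy’s actual models are more sophisticated; this bag-of-words cosine similarity, over two invented document snippets, only sketches the underlying concept.

```python
import math
from collections import Counter

# Invented example snippets standing in for patent publication text.
documents = {
    "US-001": "a fastener with a threaded shaft for joining panels",
    "US-002": "a neural network for classifying images of animals",
}

def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = vectorize("screw for attaching panels")
best = max(documents, key=lambda d: cosine(query, vectorize(documents[d])))
print(best)  # -> US-001
```

Note that this toy version only matches exact words; learning that “screw” and “fastener” describe similar technology is precisely what the trained model adds.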
The question many data scientists and machine learning experts ask is whether machine learning can effectively make use of the 50 million+ patent publications available electronically in the U.S. database. So far, the answer appears to be “yes.” Dorothy can effectively search numerous types of technology and identify relevant results. This will improve as we begin using machine learning to “learn” patent law.