Patent Classification using Relecura Custom Categorizer

by Hariharan Ramasangu, Bratati Mohapatra, and Raja Ayyanar
Introduction
The urge to group things is a fundamental human trait that lies at the foundation of all inquiry and knowledge. Asking what is common between one tree and one flower, two trees and two flowers, and so on, led to the abstract concept of number. Through numbers, the mind can put one tree and one flower in the same group, two trees and two flowers in a different group, and so on. Thus the idea of grouping is born: collecting things that share common properties. The key challenge is assigning precise meaning to the terms ‘common’ and ‘properties’. It is no wonder that two of the most important tasks of Artificial Intelligence/Machine Learning (AI/ML) algorithms turned out to be clustering and classification, both of which are forms of grouping - one in the unsupervised learning sense and the other in the supervised learning sense.
The problem is more complex when it comes to patent document classification. How do we classify? To which category should a document be assigned? Supervised classification needs good datasets, but how do we define a ‘good’ dataset? The questions are plenty, but we need to start somewhere. The time-tested methodology for creating datasets is to have humans, typically subject matter experts, assign a label to each patent document. This process is time consuming and prone to errors.
Human Annotation Challenges
Human annotation is surrounded by myths, as outlined by Aroyo and Welty [1]. Myths grow in the same garden as scientific knowledge, but they are cultivated around the facts rather than from them. Some myths of human annotation are:
- Only one interpretation is right
- Disagreements between annotators should be removed or avoided
- Inconsistency and differences can be removed by restricting the semantics
- Experts are always better
- Annotations are valid forever
These issues have to be kept in mind while formulating a methodology to generate datasets.
Gold Standard Dataset
Some of the defining characteristics [2] of a patent classification dataset are the diversity of concepts and technologies it covers, its size, the aspects it captures, and so on. Creating a “good” dataset for patent classification is a challenging process, and many ideas have been explored for establishing a “gold standard” for the task. Harris, Trippe, et al. [2] have addressed this issue and created case studies.
The first case study in the gold standard dataset [3] relates to quantum computing. The training dataset contains around 5000 documents comprising patents from US, EP, JP, KR, AU, and CN. The set has two categories: the positive category, with 2283 documents on generating qubits for use in quantum computing systems, and the negative category, with 2800 patents on applications, algorithms, and other aspects of quantum computing. The second dataset relates to Cannabinoid [3].
Relecura Categorizer
Relecura has a generic categorizer [4] and multiple custom categorizers. Here we discuss the results of applying the Relecura custom categorizer to the gold standard datasets. We use various weighted combinations of features drawn from the title, abstract, claims, and full text, along with different classifier algorithms; DeepPatent [5], for instance, used only title and abstract information. Only a few representative results from the large number of experiments conducted are given below. All results are obtained using an 80-20 training-test split.
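As an illustration, the sketch below shows how such a setup can be assembled with scikit-learn. The file name, column names, section weights, and the choice of logistic regression are all assumptions for illustration; the Relecura categorizer itself is proprietary and its internals are not shown here.

```python
# A minimal sketch of the experimental setup, using scikit-learn.
# The file name, column names, section weights, and classifier choice
# are illustrative assumptions, not Relecura's actual implementation.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical export: one row per patent, one text column per section.
df = pd.read_csv("patents.csv")

# Weighted combination of TF-IDF features from the patent sections.
features = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "title"),
        ("abstract", TfidfVectorizer(), "abstract"),
        ("claims", TfidfVectorizer(), "claims"),
        ("fulltext", TfidfVectorizer(), "full_text"),
    ],
    # Relative section weights; in practice these are tuned per dataset.
    transformer_weights={"title": 2.0, "abstract": 1.5,
                         "claims": 1.0, "fulltext": 0.5},
)

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 80-20 train-test split, stratified on the positive/negative label.
X_train, X_test, y_train, y_test = train_test_split(
    df, df["label"], test_size=0.2, stratify=df["label"], random_state=42
)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```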
Grouping the patents by family reduces the test accuracy; in particular, the recall for the positive class in the Cannabinoid dataset drops to 0.66.
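A family-aware split keeps every member of a patent family on the same side of the split, so near-duplicate family members cannot leak from the training set into the test set, which explains the drop in measured accuracy. A minimal sketch, assuming the data frame from the previous example carries a family_id column:

```python
# Sketch of a family-aware split, assuming df carries a "family_id"
# column. GroupShuffleSplit keeps all members of a patent family on
# the same side of the split, preventing near-duplicate leakage.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(df, df["label"], groups=df["family_id"])
)

X_train, X_test = df.iloc[train_idx], df.iloc[test_idx]
y_train, y_test = df["label"].iloc[train_idx], df["label"].iloc[test_idx]
model.fit(X_train, y_train)
```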
We next study the effect of the train-test data split on accuracy and F1-score. The variation in accuracy, precision, and recall is marginal, except for the recall of the positive class in the Cannabinoid dataset.
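One way to run such a study is to sweep the test-set fraction and record the per-class precision and recall at each ratio. A sketch, reusing df and the model pipeline from the first example:

```python
# Sketch: sweep the test-set fraction and report per-class precision
# and recall, reusing df and the model pipeline from the first sketch.
from sklearn.metrics import classification_report

for test_size in (0.4, 0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        df, df["label"], test_size=test_size,
        stratify=df["label"], random_state=42,
    )
    model.fit(X_tr, y_tr)
    print(f"--- test_size = {test_size} ---")
    print(classification_report(y_te, model.predict(X_te), digits=3))
```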
Although our experimental choices differ from those in [2], the results exhibit similar performance levels. As the training set size increases, the accuracy and F1-score reported in [2] increase almost linearly before saturating at ~0.96 and ~0.98, respectively. Our results show similar behavior in recall, while the other measures remain within a narrow band. Based on our observations while investigating other datasets, this behavior is also highly dataset-dependent. We also noticed that the interaction among the dominant features from the various sections of a patent plays a key role in discriminating between the positive and negative classes.
A supervised AI/ML algorithm is expected to generalize well, but proving this ability is difficult. The classifier model is built from the given training set, and the quality of that set critically influences the model even when the features are not hand-crafted, as in deep learning algorithms. Why insist on a classifier model that works under all conditions when additional information about the possible structure of the inputs could improve its performance? This is the starting point for building a custom classifier with additional domain and operational information from subject matter experts. Parameter estimation, the relative weights of different features, and so on are then set in the context of this additional information. The custom categorizer can therefore be expected to generalize well, at least within the domain where it is used.
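As a concrete illustration of folding in such domain knowledge, one option is an extra feature block restricted to an expert-curated vocabulary and given a higher relative weight. The term list below is a hypothetical example for the quantum computing dataset, not Relecura's actual method:

```python
# Sketch: one way to inject SME domain knowledge is an extra TF-IDF
# block over the claims, restricted to an expert-curated vocabulary
# and given a higher relative weight. The term list is hypothetical.
domain_terms = ["qubit", "superconducting", "quantum gate", "entanglement"]

features_sme = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "title"),
        ("abstract", TfidfVectorizer(), "abstract"),
        ("claims", TfidfVectorizer(), "claims"),
        # Domain block: only SME-curated terms, up to bigrams.
        ("domain", TfidfVectorizer(vocabulary=domain_terms,
                                   ngram_range=(1, 2)), "claims"),
    ],
    transformer_weights={"title": 2.0, "abstract": 1.5,
                         "claims": 1.0, "domain": 3.0},
)
```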
Maintaining the status quo offers operational security, but moving away from it creates an opportunity to innovate. Resistance to innovation springs from the comfort zone of the older ways; we need to ask whether leaving that comfort zone will bring more security and effectiveness. The patent domain is no different. There is a conventional manual work pipeline that has been accepted as an industry norm. The disruptive nature of AI/ML has started making inroads into this conventional pipeline, but it is still at the primitive stage of one algorithm to cluster, another algorithm to classify, and so on. Classifier algorithms are important, but the entire pipeline - the context of relevant documents, the query-result paradigm, and the search for semantic relationships - is about to transform completely. A new pipeline beckons. The patent domain awaits.
References
[1] Aroyo, L., & Welty, C. (2015). Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine, 36(1), 15-24. https://doi.org/10.1609/aimag.v36i1.2564
[2] Harris, S., Trippe, A., Challis, D., & Swycher, N. (2020). Construction and evaluation of gold standards for patent classification—A case study on quantum computing. World Patent Information, 61, 101961.
[3] https://github.com/swh/classification-gold-standard
[4] https://categorizer.relecura.com/
[5] Li, S., Hu, J., Cui, Y., & Hu, J. (2018). DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2), 721-744.