Patent Classification using Relecura Custom Categorizer

by Hariharan Ramasangu, Bratati Mohapatra, and Raja Ayyanar
Introduction
The urge to group things is a fundamental human trait that lies at the foundation of all inquiry and knowledge. Asking what is common between one tree and one flower, two trees and two flowers, and so on, led to the abstract concept of number. Through numbers, the mind can put one tree and one flower in the same group, two trees and two flowers in a different group, and so on. Thus the idea of grouping is born: collecting things that share common properties. The key challenge is assigning precise meaning to the terms ‘common’ and ‘properties’. It is no wonder that two of the most important tasks of Artificial Intelligence/Machine Learning (AI/ML) algorithms turned out to be clustering and classification, both of which are forms of grouping - one in the unsupervised learning sense and the other in the supervised learning sense.
The problem is more complex when it comes to patent document classification. How do we classify? To which category should a document be assigned? Supervised classification needs good datasets, but how do we define a ‘good’ dataset? The questions are plenty, but we need to start somewhere. The time-tested methodology for creating datasets is to have humans, typically subject matter experts, assign a label to each patent document. This process is time consuming and prone to errors.
Human Annotation Challenges
Human annotation is surrounded by myths, as outlined by Aroyo and Welty [1]. Myths grow in the same garden as scientific knowledge, but they are cultivated around the facts rather than from them. Some myths of human annotation are:
- Only one interpretation is right
- Disagreements between annotators should be removed or avoided
- Inconsistency and differences can be removed by restricting the semantics
- Experts are always better
- Annotations are valid forever
These issues have to be kept in mind while formulating a methodology to generate datasets.
Gold Standard Dataset
Some of the defining characteristics [2] of a patent classification dataset are the diversity of concepts and technologies it covers, its size, the aspects it captures, and so on. Creating a “good” dataset for patent classification is a challenging process, and many ideas have been explored for establishing a “gold standard” for the task. Harris, Trippe, et al. [2] have addressed this issue and created case studies.
The first case study in the gold standard dataset [3] relates to quantum computing. The training dataset contains around 5000 documents comprising patents from US, EP, JP, KR, AU, and CN. The set has two categories: the positive category, with 2283 documents on generating qubits for use in quantum computing systems, and the negative category, with 2800 patents on applications, algorithms, and other aspects of quantum computing. The second dataset relates to Cannabinoid [3].
Relecura Categorizer
Relecura has a generic categorizer [4] and multiple custom categorizers. Here we discuss the results of applying the Relecura custom categorizer to the gold standard datasets. We use various weighted combinations of features drawn from the title, abstract, claims, and full text, along with different classifier algorithms; DeepPatent [5], for instance, used only title and abstract information. Only a few representative results from the large number of experiments conducted are given below. All results are obtained using an 80-20 training-test split.
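As an illustration, the sketch below shows how such a setup can be assembled with scikit-learn. The file name, column names, section weights, and the choice of logistic regression are all assumptions for illustration; the Relecura categorizer itself is proprietary and its internals are not shown here.

```python
# A minimal sketch of the experimental setup, using scikit-learn.
# The file name, column names, section weights, and classifier choice
# are illustrative assumptions, not Relecura's actual implementation.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical export: one row per patent, one text column per section.
df = pd.read_csv("patents.csv")

# Weighted combination of TF-IDF features from the patent sections.
features = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "title"),
        ("abstract", TfidfVectorizer(), "abstract"),
        ("claims", TfidfVectorizer(), "claims"),
        ("fulltext", TfidfVectorizer(), "full_text"),
    ],
    # Relative section weights; in practice these are tuned per dataset.
    transformer_weights={"title": 2.0, "abstract": 1.5,
                         "claims": 1.0, "fulltext": 0.5},
)

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 80-20 train-test split, stratified on the positive/negative label.
X_train, X_test, y_train, y_test = train_test_split(
    df, df["label"], test_size=0.2, stratify=df["label"], random_state=42
)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```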
Grouping the patents by family reduces the test accuracy; in particular, the recall for the positive class in the Cannabinoid dataset drops to 0.66.
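A family-aware split keeps every member of a patent family on the same side of the split, so near-duplicate family members cannot leak from the training set into the test set, which explains the drop in measured accuracy. A minimal sketch, assuming the data frame from the previous example carries a family_id column:

```python
# Sketch of a family-aware split, assuming df carries a "family_id"
# column. GroupShuffleSplit keeps all members of a patent family on
# the same side of the split, preventing near-duplicate leakage.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(df, df["label"], groups=df["family_id"])
)

X_train, X_test = df.iloc[train_idx], df.iloc[test_idx]
y_train, y_test = df["label"].iloc[train_idx], df["label"].iloc[test_idx]
model.fit(X_train, y_train)
```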
We next study the effect of the train-test data split on accuracy and F1-score. The variation in accuracy, precision, and recall is marginal, except for the recall of the positive class in the Cannabinoid dataset.
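One way to run such a study is to sweep the test-set fraction and record the per-class precision and recall at each ratio. A sketch, reusing df and the model pipeline from the first example:

```python
# Sketch: sweep the test-set fraction and report per-class precision
# and recall, reusing df and the model pipeline from the first sketch.
from sklearn.metrics import classification_report

for test_size in (0.4, 0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        df, df["label"], test_size=test_size,
        stratify=df["label"], random_state=42,
    )
    model.fit(X_tr, y_tr)
    print(f"--- test_size = {test_size} ---")
    print(classification_report(y_te, model.predict(X_te), digits=3))
```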
Although our experimental choices differ from those in [2], the results exhibit similar performance levels. As the training set size increases, the accuracy and F1-score reported in [2] increase almost linearly before saturating at ~0.96 and ~0.98, respectively. Our results show similar behavior in recall, while the other measures remain within a narrow band. Based on our observations while investigating other datasets, this behavior is also highly dataset-dependent. We also noticed that the interaction among the dominant features from the various sections of a patent plays a key role in discriminating between the positive and negative classes.
A supervised AI/ML algorithm is expected to generalize well, but proving this ability is difficult. The classifier model is built from the given training set, and the quality of that set critically influences the model even when the features are not hand-crafted, as in deep learning algorithms. Why insist on a classifier model that works under all conditions when additional information about the possible structure of the inputs could improve its performance? This is the starting point for building a custom classifier with additional domain and operational information from subject matter experts. Parameter estimation, the relative weights of different features, and so on are then set in the context of this additional information. The custom categorizer can therefore be expected to generalize well, at least within the domain where it is used.
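As a concrete illustration of folding in such domain knowledge, one option is an extra feature block restricted to an expert-curated vocabulary and given a higher relative weight. The term list below is a hypothetical example for the quantum computing dataset, not Relecura's actual method:

```python
# Sketch: one way to inject SME domain knowledge is an extra TF-IDF
# block over the claims, restricted to an expert-curated vocabulary
# and given a higher relative weight. The term list is hypothetical.
domain_terms = ["qubit", "superconducting", "quantum gate", "entanglement"]

features_sme = ColumnTransformer(
    transformers=[
        ("title", TfidfVectorizer(), "title"),
        ("abstract", TfidfVectorizer(), "abstract"),
        ("claims", TfidfVectorizer(), "claims"),
        # Domain block: only SME-curated terms, up to bigrams.
        ("domain", TfidfVectorizer(vocabulary=domain_terms,
                                   ngram_range=(1, 2)), "claims"),
    ],
    transformer_weights={"title": 2.0, "abstract": 1.5,
                         "claims": 1.0, "domain": 3.0},
)
```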
Maintaining the status quo offers operational security, but moving away from it creates an opportunity to innovate. Resistance to innovation springs from the comfort zone of the older ways; we need to ask whether leaving that comfort zone will bring more security and effectiveness. The patent domain is no different. There is a conventional manual work pipeline that has been accepted as an industry norm. The disruptive nature of AI/ML has started making inroads into this conventional pipeline, but it is still at the primitive stage of one algorithm to cluster, another algorithm to classify, and so on. Classifier algorithms are important, but the entire pipeline - the context of relevant documents, the query-result paradigm, and the search for semantic relationships - is about to transform completely. A new pipeline beckons. The patent domain awaits.
References
[1] Aroyo, L., & Welty, C. (2015). Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine, 36(1), 15-24. https://doi.org/10.1609/aimag.v36i1.2564
[2] Harris, S., Trippe, A., Challis, D., & Swycher, N. (2020). Construction and evaluation of gold standards for patent classification—A case study on quantum computing. World Patent Information, 61, 101961.
[3] https://github.com/swh/classification-gold-standard
[4] https://categorizer.relecura.com/
[5] Li, S., Hu, J., Cui, Y., & Hu, J. (2018). DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117(2), 721-744.