NLP: Finding Syntactic Similarity in Text

Natural Language Processing (NLP), also known as computational linguistics, is a technology that is garnering the interest of many researchers because it blends language, machine learning, and artificial intelligence. After a detailed discussion of the transformer architecture in NLP in a past blog, Lumenci shares an analysis of two methods for calculating syntactic similarity in text, namely Jaccard similarity and Cosine similarity.

Natural Language Processing (NLP) has emerged as a critical area in artificial intelligence. With the vast amount of textual information generated every day, NLP has garnered the interest of many researchers hoping to identify ways to process this information and make it more comprehensible, efficient to handle, and accessible. As depicted in the graph below, the number of papers presented at the annual conference of the Association for Computational Linguistics (ACL) has risen significantly over the past 20 years.

Figure: Number of papers published at the ACL conference, by year

In domains such as Intellectual Property (IP), which involve a lot of paperwork and documentation, NLP techniques can help cluster similar documents, which simplifies patent analysis for inventors and patent attorneys. Such an application of NLP not only increases productivity by reducing processing time but also enhances the precision with which large patent portfolios are handled and analyzed. In this article, we shed some light on one specific application of NLP: finding similarities between patents.

A patent is typically divided into an abstract, background, description of embodiments, and claims. Although the claims are considered the most important part of a patent, the abstract usually summarizes the overall technology described in the patent. Similarity analysis of a patent can cover any of its structural parts, including the claims, abstract, references, and so on. However, in most cases, it is desirable that patents with similar claims be clustered together for analysis.

To formulate a strategy for this task, the concept of similarity must first be defined in a quantitative manner. Usually, two major similarity indices are encountered in similarity analysis of text: syntactic similarity and semantic similarity. Syntactic similarity is based on the assumption that the similarity between two texts is proportional to the number of identical words in them (appropriate measures can be adopted to ensure that the method does not become biased towards the text with the larger word count, as explained in [1]). Semantic similarity, on the other hand, focuses on the similarity in meaning and interpretation between the two texts. While a syntactic similarity value can be obtained by constructing measures around the word counts of the two documents, semantic analysis relies on a more sophisticated method that employs WordNet representations to extract meaning-based values for the two texts.

This article focuses on two measures for calculating the syntactic similarity between two texts, namely Jaccard similarity and Cosine similarity. (Methods for calculating semantic similarity will be discussed in a future blog.)

Before proceeding with the analysis, the text must be pre-processed to remove all special characters, HTML tags, and any predefined set of words (for example, articles and prepositions). The remaining words are then reduced to their respective word roots; in other words, lemmatization is performed on the text to be analyzed. For our purpose, we consider only the unique words present in each text, which is a common approach.
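The pre-processing steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list and the lemma lookup table are tiny, hypothetical stand-ins (a real pipeline would normally use a proper lemmatizer from an NLP library), and lowercasing every word is our own choice rather than something the article prescribes.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list and lemma table (illustration only).
STOP_WORDS = {"a", "an", "the", "by", "of", "in", "on", "to", "and"}
LEMMAS = {"transformed": "transform", "works": "work", "taken": "take"}


def preprocess(text: str) -> set:
    """Return the set of unique word roots found in `text`."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # strip every non-letter character
    tokens = text.lower().split()              # assumption: case is ignored
    # Drop stop words, then map each remaining word to its root if we know it.
    return {LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS}


print(sorted(preprocess("<p>AI has transformed the way the world works!</p>")))
```

Returning a set rather than a list reflects the choice made above to keep only the unique word roots of each text.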

Jaccard Similarity

The Jaccard similarity index is proportional to the number of unique word roots common to the two texts. It is inversely proportional to the total number of unique word roots found across the two texts [2]. To put it simply, if A and B are respectively the sets of all unique word roots present in text A and text B, then:

J(A, B) = |A ∩ B| / |A ∪ B|

For instance, if the text

A = "AI has transformed the way the world works" and

B = "AI has taken the world by storm",

we have,

set A = {'AI', 'has', 'transform', 'way', 'world', 'work'} and

set B = {'AI', 'has', 'take', 'world', 'storm'}.

Here, we have assumed that the text is pre-processed, and words like articles, prepositions, and the like have been filtered out.

Computing our results, the union of A and B contains 8 unique word roots, and the intersection of A and B contains 3 (AI, has, and world).

Thus, the Jaccard similarity index for this case will be 3/8 or 0.375. Hence, we conclude that the two texts are syntactically similar by 37.5%.
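The same calculation can be reproduced with Python's built-in set operations. The sketch below is illustrative only: the function name is our own, the two sets are copied from the example above, and returning 0 for two empty texts is an assumption made to avoid dividing by zero.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty texts are given a score of 0
    return len(a & b) / len(union)


set_a = {"AI", "has", "transform", "way", "world", "work"}
set_b = {"AI", "has", "take", "world", "storm"}

print(sorted(set_a & set_b))              # ['AI', 'has', 'world']
print(len(set_a | set_b))                 # 8
print(jaccard_similarity(set_a, set_b))   # 0.375
```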

Here, one can quickly notice the shortcoming of Jaccard similarity. By looking at the Jaccard similarity score, it appears that A and B are not very similar despite the semantic coherence of the two texts.

Cosine Similarity

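Mathematically, the cosine similarity measures the angular separation (θ) between two N-dimensional vectors [2]. The vectors corresponding to each text (say, vectors ā and ē respectively) are formed based on the unique word roots present in the two texts: each dimension corresponds to one unique word root, and a component is 1 if that root occurs in the text and 0 otherwise. The cosine similarity index between the two texts is the dot product of the two vectors divided by the product of their magnitudes, that is, cos θ = (ā · ē) / (|ā| |ē|).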

Therefore, considering the example used above and taking the eight unique word roots in the order AI, has, transform, way, world, work, take, storm, we get the following vectors for the two texts:

ā = {1,1,1,1,1,1,0,0} and ē = {1,1,0,0,1,0,1,1}, corresponding to text A and B respectively.

The dot product ā · ē = 1×1 + 1×1 + 1×0 + 1×0 + 1×1 + 1×0 + 0×1 + 0×1 = 3

Also, |ā| = √6 ≈ 2.449 and |ē| = √5 ≈ 2.236

Hence, the cosine similarity index equals 3 / (2.449 × 2.236) ≈ 0.548

Since cosine is a decreasing function of the angle between 0° and 180°, the higher the value of the cosine, the smaller the angular separation of the two vectors, and the higher the similarity.
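For readers who want to check the arithmetic, here is a minimal Python sketch of the cosine calculation on binary word-presence vectors. The function name and the alphabetical ordering of the vocabulary are our own assumptions (the ordering does not affect the result), and returning 0 when either text is empty is likewise an assumption.

```python
import math


def cosine_similarity(a: set, b: set) -> float:
    """Cosine of the angle between the binary word-presence vectors of a and b."""
    vocabulary = sorted(a | b)  # one dimension per unique word root
    vec_a = [1 if word in a else 0 for word in vocabulary]
    vec_b = [1 if word in b else 0 for word in vocabulary]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


set_a = {"AI", "has", "transform", "way", "world", "work"}
set_b = {"AI", "has", "take", "world", "storm"}

print(round(cosine_similarity(set_a, set_b), 3))
```

Because every component is either 0 or 1, the dot product is simply the number of shared word roots and each magnitude is the square root of the number of unique word roots in that text.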

There are other syntactic similarity measures as well that rely on counting identical words to derive a similarity index for two texts. However, syntactic similarity measures have their own drawbacks, and the need for semantic analysis cannot be avoided. Semantic measures do not rely solely on identical word counts but instead explore the meaning-based closeness between the two documents [3]. Hence, we believe that semantic measures, when used in combination with syntactic measures, lead to the best outcomes.

Read more about NLP and the use of the transformer architecture in this area of technology in our earlier blog, NLP - Transformers.

References

2. An Effective Similarity Metric for Application Traffic Classification
