Large Scale Pre-populating Similarity Clusters without Vectors/Embeddings/LLMs - NaturalText Graph AI.
- Rajasankar Viswanathan

- Feb 1
- 3 min read
In information retrieval, the word similarity takes on different meanings in different contexts. It is used in tasks where exact matching is not enough. In business settings where searching for information takes precedence, similarity search is the way to understand trends and extract insights. In other cases, such as facial recognition and image comparison, similarity is a basic functional requirement.
Most business data sits as textual information. Even as other formats such as video and images are produced and consumed, the most commonly consumed data is still text. Text is commonly assumed to be a synonym for human natural language, but here it also covers other languages, including programming languages, biological sequences and specialized formats.
Textual or linguistic similarity means contextual similarity. Contextual meaning assigns meaning to a word based on the situation in which it is used. There is no fixed method to define this pattern, nor to extract it. Adding to the problem, raw text cannot be used directly in text-processing or language-processing methods: it has to be transformed into numbers before it can be processed with statistical methods.
Enter vector space. A vector space is a mathematical structure, a set whose elements, called vectors, can be added together and multiplied by scalars under certain conditions. Text is transformed into vectors through vectorization. There are several vectorization methods, and the resulting values are also called features in text-processing methods, including in Machine Learning and Artificial Intelligence.
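As a minimal sketch of what vectorization looks like, here is a bag-of-words count vectorizer in plain Python. The function name and texts are illustrative, not part of any NaturalText API:

```python
from collections import Counter

def bag_of_words(texts):
    """Build a shared vocabulary and turn each text into a count vector."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        # One dimension per vocabulary word, valued by its count in this text.
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat ran"])
# vocab → ["cat", "ran", "sat", "the"]; each text is now a 4-dimensional vector.
```

Once every text shares the same dimensions, the vectors can be compared numerically.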
Once text is transformed into vectors, mathematical operations can be performed on it. Vectors can be plotted in a 2D space and compared to find similarity. Voila, getting the similarity of the text becomes easy. That is the idea, and it works for simple things. Why it won't capture the nuances of language is a topic for another article.
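The standard comparison operation on such vectors is cosine similarity, the cosine of the angle between them. A small sketch, using the count vectors for "the cat sat" and "the cat ran" over the vocabulary ["cat", "ran", "sat", "the"]:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two shared words out of three each: similarity is 2/3.
cosine_similarity([1, 0, 1, 1], [1, 1, 0, 1])
```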
Even then, similarity has to be searched one comparison at a time. In other words, the similarity for each word or sentence must be computed anew every time. There is no way to know the similarity beforehand, or to find it once and store it in a database.
This is where NaturalText introduces Graph Based Symbolic AI to solve the fundamental problem of making computers understand language.
NaturalText AI relies on symbols, not on vectors. There is no need to convert text into vectors for processing.
NaturalText AI pre-populates similarity at a large scale. By taking the entire corpus of text data and training, or clustering, on it, businesses can understand their data without any manual intervention.
That large-scale clustering is equivalent to comparing each data point, or textual fragment, to every other data point. Textual fragments can be sentences, paragraphs or whole documents.
This large-scale clustering needs no GPUs or dedicated systems; it can run on a local setup.
All this is made possible by Graph AI.
NaturalText algorithms keep text as symbols and create graph representations that plot those symbols on a 3D graph, deriving similarity from the connections between them. This extracts the similarity between textual fragments while preserving contextual meaning.
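The article does not describe NaturalText's actual algorithm, but the general idea of deriving similarity from graph connections, and pre-populating it for a whole corpus, can be sketched as follows. Everything here (the fragment IDs, the shared-neighbor measure, the threshold) is an illustrative assumption, not the NaturalText method:

```python
def build_graph(fragments):
    """Link each fragment node to the word symbols it contains.

    ILLUSTRATIVE ONLY: real systems would use richer symbols than words.
    """
    return {fid: set(text.lower().split()) for fid, text in fragments.items()}

def shared_neighbor_similarity(edges, a, b):
    """Jaccard overlap of the symbol neighborhoods of two fragments."""
    shared = edges[a] & edges[b]
    total = edges[a] | edges[b]
    return len(shared) / len(total) if total else 0.0

def precompute_clusters(edges, threshold=0.3):
    """Pre-populate all pairwise similarities above a threshold, once."""
    ids = sorted(edges)
    return {
        (a, b): s
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if (s := shared_neighbor_similarity(edges, a, b)) >= threshold
    }

edges = build_graph({"d1": "the cat sat", "d2": "the cat ran", "d3": "dogs bark"})
clusters = precompute_clusters(edges)
# Only the related pair survives the threshold: {("d1", "d2"): 0.5}
```

The point of the sketch is the last step: pairwise similarity is computed once over the whole corpus and stored, rather than searched for one query at a time.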
This symbolic representation of textual data extends to enterprise relational data, biological sequences such as DNA, chemical fingerprints and other formats. Graph representations are not limited to text; they can be applied to image and video formats too.
It goes without saying that NaturalText AI works well on any language. Without training data or an external corpus, it extracts patterns from the data itself.
A mathematical argument for why vectors can't match contextual meaning will follow in the next article. Get in touch via info@naturaltext.com.