Enterprise Search without Embeddings/LLMs. All-Against-All Comparison with NaturalText Graph AI
- Rajasankar Viswanathan

Open-source search frameworks such as OpenSearch, Facebook AI Similarity Search (FAISS), Weaviate and Qdrant use vectors for similarity search. A later improvement uses LLMs to create embeddings. Here, vector means a one-dimensional numerical representation of data, while embedding means a multi-dimensional numerical representation of data. Though the two terms are now used interchangeably, "vector" is pre-LLM era usage.
Let us see why vectors and embeddings are needed, and why they still fail to bring relevant search results in Enterprise Search.
When Google showed the world what its PageRank algorithm could do, it was revolutionary. Yet it carried a basic limitation within. It still works great for public data. However, every attempt to replicate the same success on private data, i.e. documents within companies, social media data, etc., failed, because what made PageRank successful is not present in private data. That problem is yet to be solved.
PageRank still works because it uses metadata, not content. Google uses information about a website and other metadata to decide on which results page that website appears. The prevailing understanding was that it uses the links between websites. Yes, those links are understood as literal links, i.e. one website refers to another via backlinks. However, it is much more than that: PageRank works because all the other metadata is taken into consideration.
Now to private data search, or Enterprise Search. As there are no links between documents, chat, email, social media posts, etc., the first attempt to get search working was vectors. The second attempt is embeddings. Text data is converted into vectors/embeddings, then distance metrics are used to compare them. So we need to understand two things: what a vector/embedding is, and what a distance metric is.
First, vectors. In the NLP era before LLMs, this representation was called a feature, and features were extracted via vectorizers. A vector is mostly a one-dimensional array structure, but for a corpus or group of documents it becomes a matrix.
For the sentences below, the resulting vectors are shown. This example is from Scikit-learn, an open-source library; take a look at its Vectorization documentation.
[ 'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
  'Is this the first document?' ]

Resulting vectors:

[ [0 1 1 1 0 0 1 0 1]
  [0 2 0 1 0 1 1 0 1]
  [1 0 0 1 1 0 1 1 1]
  [0 1 1 1 0 0 1 0 1] ]

There are different types of vectors; take a look at Feature Extraction in the Scikit-learn documentation.
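As a minimal sketch, here is how the matrix above can be reproduced with Scikit-learn's CountVectorizer (the column order corresponds to the alphabetically sorted vocabulary):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Fit the vectorizer and turn each sentence into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix, 4 rows x 9 columns

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())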
Second, embeddings. Embeddings are the same idea as vectors, i.e. a numerical representation of data, but a little more sophisticated. Embeddings are created via local models, i.e. transformer-trained models, or by using LLMs to generate them.
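As an illustrative sketch (assuming the open-source sentence-transformers library and the all-MiniLM-L6-v2 model are available locally), embedding the same corpus looks like this:

from sentence_transformers import SentenceTransformer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# A locally downloaded transformer model turns each sentence into a
# dense embedding (here 384 dimensions) instead of raw word counts.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(corpus)
print(embeddings.shape)   # (4, 384)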
Distance metrics are not about distance in the literal sense. The simplest ones subtract one vector from another, element by element; in the vector space, that denotes the distance. If you have two vectors 0 1 1 1 0 0 1 0 1 and 1 0 0 1 1 0 1 1 1, then the (Manhattan) distance is 5. It is calculated by taking the element-wise absolute difference, 1 1 1 0 1 0 0 1 0, then summing it. It can be a little confusing at first, and there are several distance metrics, too.
Take a look at Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, Vector Embeddings, and kNN Spaces.
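A minimal sketch of that calculation, using NumPy and the two vectors from the example above:

import numpy as np

a = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1])
b = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1])

# Element-wise absolute difference, then sum: the Manhattan (L1) distance.
diff = np.abs(a - b)          # [1 1 1 0 1 0 0 1 0]
print(diff.sum())             # 5

# Euclidean (L2) distance takes the square root of the summed squares instead.
print(np.sqrt(((a - b) ** 2).sum()))   # ~2.236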
Now, understanding top-k results, k-nearest neighbour (kNN) search, and k-means clustering is the key to understanding how the search works. In kNN, k is the number of nearest neighbours returned; in k-means, k is the number of clusters wanted. Clustering happens by checking whether the neighbours of a vector are similar or not: if a vector is similar, it is added to the cluster; if not, it might become the starting point of a new cluster. This seems simple, but it is ineffective and costly on large datasets.
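A minimal sketch of both steps with Scikit-learn, reusing the four count vectors from the example above (n_clusters=2 is an arbitrary choice for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

X = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
])

# kNN: k (n_neighbors) is the number of nearest neighbours returned per query.
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X[:1])   # top-2 neighbours of the first document

# k-means: k (n_clusters) is the number of clusters wanted.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(indices, labels)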
To make this faster on large but dense collections, the Hierarchical Navigable Small Worlds (HNSW) algorithm was introduced in 2016. This algorithm maps the vectors into layered graphs and finds the nearest neighbour by moving greedily from the coarse upper layers down to the denser lower layers. The data must be dense vectors for this to find the clusters; it does not suit sparse vectors.
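A minimal sketch of building and querying an HNSW index with the open-source hnswlib library (random dense vectors stand in for real embeddings here, and the index parameters are illustrative defaults):

import numpy as np
import hnswlib

dim = 128
num_elements = 10000
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Build a layered HNSW graph over the dense vectors.
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, ids)

# Query: greedy descent through the graph layers returns the top-k neighbours.
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=3)
print(labels, distances)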
A dense vector indicates that the vector space is fully filled; a sparse vector indicates empty points in the vector space. What does this mean for Enterprise text data? Replace the vectors with words. A dense space means most of the words occur in all the text documents. A sparse space means each document may have unique words, and more than 50% of the words don't occur in all documents. Enterprise data creates only a very sparse vector space, because that is the real-world scenario: names, places, products, services, etc. won't be repeated across all the documents.
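A minimal sketch of measuring that sparsity on a few hypothetical enterprise-style snippets; with realistic corpora the fraction of zero entries quickly approaches 1:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical enterprise-style snippets: names, products, places rarely repeat.
docs = [
    'Invoice 4711 for Acme GmbH, payable in Berlin.',
    'Meeting notes: Priya to review the Q3 roadmap with legal.',
    'Ticket: VPN login fails for user jsmith on the Madrid site.',
]

X = CountVectorizer().fit_transform(docs)
total = X.shape[0] * X.shape[1]
sparsity = 1.0 - X.nnz / total
print(f'{X.shape[1]} distinct words, {sparsity:.0%} of the matrix is zero')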
These methods are the basis for Enterprise Search and RAG. Now we can see why this is not working for businesses. Enterprise data, especially text data, is diverse, sparse, and a mix of unstructured and structured content: diverse in the sense of containing both text and numbers in both unstructured and structured formats.
When this data is simply converted into vectors, it loses context and semantic understanding. Once the contextual meaning is lost, similarity has no meaning, because the definition of similarity is just a numerical operation, not a contextual understanding. But that is only the second problem. The first problem is that it doesn't work with real-world data.
Real-world data creates a sparse vector space, while these methods/algorithms work only for a dense space. Assuming that all the data is related to everything else is the fundamental problem these methods suffer from. Creating similarity where none exists generates bad similarity scores and results.
It doesn't stop there: k-nearest neighbour search and k-means clustering work only for small datasets. To cluster or rank the data, the top-k result is used for clustering, which adds more errors to the results than it removes. For large datasets, k-means clustering is impractical.
Add to that, this search and clustering happens every time a user enters a query. Users can only search; they can't understand the bigger picture. They have to know beforehand what they are searching for.
Questions such as how to understand the data, how the concepts are spread across the data, and how many concepts there are can't be answered via these vector-space search methods.
NaturalText AI with graph representations fixes these issues in Enterprise Search. NaturalText AI is a symbolic, zero-shot AI, which means that text is kept as text data, preserving its meaning and contextual understanding.
The biggest advantage NaturalText brings is that it creates pre-populated clusters beforehand, i.e. during training itself. These clusters are automatic, which means there is no need to predict or adjust the number of clusters each time. The clusters are contextual and based on the grammar/pattern structure of the data. This is equivalent to an all-against-all comparison of the data, i.e. comparing every sentence/datapoint to every other sentence/datapoint in the text corpus.
From these clusters, documents/sentences/datapoints get a ranking score. These ranking scores are based on the links between the documents. This makes searching more relevant and doesn't depend on remembering the exact keywords. Similar words are automatically included in the search results, because the search happens in the pre-populated clusters.
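NaturalText's own graph construction isn't described in this post; purely as an illustration of the general idea of ranking documents by their links, here is a sketch using the open-source networkx library and a hypothetical set of document links:

import networkx as nx

# Hypothetical links between documents in a cluster (not NaturalText's
# actual algorithm; this only illustrates link-based ranking).
links = [('doc1', 'doc2'), ('doc2', 'doc3'), ('doc1', 'doc3'), ('doc4', 'doc1')]

g = nx.DiGraph(links)
scores = nx.pagerank(g)            # one link-based ranking score per document
for doc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 3))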
There is also an advantage for the physical infrastructure: only CPUs are used. NaturalText AI doesn't need a GPU for any of this. The clustering and searching can happen on commodity servers, with no specialized hardware. The results can be stored in an RDBMS, which reduces the cost of integrating search into an existing setup.
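A minimal sketch (using SQLite as a stand-in RDBMS, with hypothetical table and column names) of storing pre-computed cluster and ranking results so an existing application can query them with plain SQL:

import sqlite3

conn = sqlite3.connect('search_results.db')
conn.execute('''CREATE TABLE IF NOT EXISTS doc_rank (
                  doc_id TEXT PRIMARY KEY,
                  cluster_id INTEGER,
                  rank_score REAL)''')

# Hypothetical pre-computed results: written once, read on every query.
rows = [('doc1', 0, 0.42), ('doc2', 0, 0.31), ('doc3', 1, 0.27)]
conn.executemany('INSERT OR REPLACE INTO doc_rank VALUES (?, ?, ?)', rows)
conn.commit()

# At query time the application only filters and orders rows, no GPU involved.
top = conn.execute('SELECT doc_id FROM doc_rank WHERE cluster_id = 0 '
                   'ORDER BY rank_score DESC LIMIT 10').fetchall()
print(top)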
Thus, NaturalText AI offers real and natural search results for any language, any data, any size. Using graph representations, NaturalText AI solves issues in both software and hardware spaces.
Get in touch at info@naturaltext.com to solve your business's search issues.