Context Graphs: Defining Context and Creating the Graph Representation
- Rajasankar Viswanathan
It is widely agreed that enterprises need Context Graphs to analyze data and make better decisions. The open questions are what context is and how to create context graphs from data. Context is generally understood as meaning or description that is not stated explicitly but understood implicitly.
In linguistics, the Distributional Hypothesis states that the meaning of a word depends upon the surrounding words. It remains a hypothesis; it has never been proven as a theorem. Simply put, a word does not have a single constant meaning: its meaning is fluid and can shift depending on its place in the sentence. Why it remains unproven is a debate for another post.
For example, "bank" can mean multiple things:
River Bank
Money Bank
I am banking on you
In the same manner, the prepositions used in English can change the meaning itself.
"Sitting in the car" is different from "Sitting on the car". Usually the sitting-on-the-car never happens or rarely happens as an accident. Just one word changes the whole meaning of the sentence. This is called Context in Linguistics.Â
Understanding this is important because most of what we deal with is text-based: descriptions, decisions, entries, documentation and so on.
Context in enterprise data is all about understanding the whole sentence, document, email or other input. Even data living in a relational database still needs processing before its context can be extracted and used. Today data lives in silos: in systems of record such as ERP or CRM, in textual data, and in relational databases. Collecting that data to train an AI, as with LLMs, is one thing; extracting context from it is another.
Now we need to define context for enterprise data, and it mirrors the linguistic notion. Information is defined by multiple variables or data points: numbers, text, a mixture of both, even symbols, chemical sequences or biological sequences. In simple words, more than one datapoint defines context, and how many variables or datapoints are needed depends on the data; it cannot be fixed outright.
Extracting context does not follow any definitive pattern. It varies with the domain, the situation and the information itself. We need something that can handle this complexity.
Let me offer an analogy.
You are walking in a forest. A candle lights up a foot ahead. A torch lights a few feet. A modern flashlight lights several feet. Floodlights reach a few yards. Flares light up an area for a few seconds. Or you wait for the sun and climb a tree to get a better view.
That is exactly what happens in modern information systems. We need the sun and a bird's-eye view to make better decisions.
The sun and the bird's-eye view do not translate into collecting all the data, but into grouping it, finding similarities and creating a graph. Similarity-based groups at scale are the bird's-eye view; the graph is the sun's rays.
The point of similarity-based clustering is to extract all the information surrounding a datapoint, action or incident. Knowing which datapoints are similar reveals more context and enables the search for the best possible solution. Without similarity clustering, it is hard to even find a starting point for that search.
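As a concrete illustration, here is a minimal sketch of similarity-based clustering over a handful of invented text records, assuming scikit-learn. TF-IDF stands in for whatever embedding an enterprise pipeline would actually use.

```python
# Minimal sketch: group text records by similarity so each cluster
# gathers the datapoints surrounding one incident. Assumes scikit-learn.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

records = [
    "invoice 1042 overdue, payment reminder sent",
    "payment received for invoice 1042",
    "server outage in region eu-west, ticket opened",
    "ticket closed: eu-west outage resolved",
]

vectors = TfidfVectorizer().fit_transform(records)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Each cluster is a starting point for the search the post describes.
for cluster in set(labels):
    print(cluster, [r for r, l in zip(records, labels) if l == cluster])
```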
Now we need to examine whether similarity clustering at scale is possible with existing methods.
Let us start with LLMs.
LLMs are not a browsing solution, i.e. you cannot view pages the way you can from a search.
Vector-based embedding might as well be called hash embedding, because that is what is happening, and it does not give you a browsing-based solution either. Retrieval-Augmented Generation is built on vector embeddings.
Private document search engines such as Solr/ElasticSearch are the original vector-based search, in the classic vector space model sense, and they still require a search for every keyword.
In all of these existing systems, search or query dominates information retrieval rather than knowledge comprehension or compression. For better visibility we need to extract knowledge we can comprehend, and the way to do that is to combine a graph-based data structure with similarity clusters.
To comprehend something, all of the information is needed, not just a small window of it.
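A minimal sketch of that "whole neighbourhood, not a small window" idea, assuming networkx; the nodes and edges are invented for illustration.

```python
# A top-k vector search returns a fixed window of lookalike records;
# a graph lets us expand outward and collect everything connected.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("invoice 1042", "customer Acme"),
    ("invoice 1042", "payment reminder"),
    ("customer Acme", "support ticket 77"),
    ("support ticket 77", "eu-west outage"),
])

# Everything within two hops of the starting datapoint.
context = nx.ego_graph(g, "invoice 1042", radius=2)
print(sorted(context.nodes))
```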
With that, we can define context. In the enterprise data world, context is not just the data but how it is connected:
Context must extract how the workflow flows
Context must extract how things are connected
Context must extract how to connect things, i.e. create new information
Context must make meaning out of the situation
Now comes the main step: extracting these things and storing them in a graph database for reasoning and access. Data must be transformed into nodes, edges and properties, and this must be done automatically, i.e. AI must do it.
The similarity clusters already exist, and extracting these elements from them is fairly straightforward. Here the analogy from language helps again:
In language, nouns are numerous and varied; they become nodes
Verbs are limited in number but commonly connect nouns; they become edges
Prepositions point to directions, properties and so on; they become labels and properties
For text data, this pattern can be used directly to populate a graph, as the sketch below shows.
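Here is a minimal sketch of the noun/verb/preposition mapping, assuming spaCy (with its small English model) and networkx; the post does not prescribe a particular library, and the example sentences are invented.

```python
# Nouns become nodes, verbs become edges, prepositions become
# edge properties. Assumes spaCy and networkx.
import spacy
import networkx as nx

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
g = nx.DiGraph()

doc = nlp("The vendor shipped the parts. "
          "The warehouse stored the parts in bin 7.")
for token in doc:
    if token.pos_ != "VERB":
        continue
    subjects = [c for c in token.children if c.dep_ == "nsubj"]
    objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
    # prepositions attached to the verb become properties on the edge
    props = {}
    for prep in (c for c in token.children if c.dep_ == "prep"):
        for pobj in (c for c in prep.children if c.dep_ == "pobj"):
            props[prep.text] = pobj.text
    for s in subjects:
        for o in objects:
            # noun -> node, verb lemma -> edge relation
            g.add_edge(s.lemma_, o.lemma_, relation=token.lemma_, **props)

print(list(g.edges(data=True)))
```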
For mixed data, i.e. data with both numbers and text, the same method applies: data from the similarity clusters is taken directly and populated into the graph via graph algorithms.
Similarity-based clustering is a necessity because it is what makes transforming the data into a graph structure tractable. Automatic disambiguation of nodes and edges is easier when phrases are extracted from similar data points; without it, creating structure from unstructured data is not possible. A small sketch of that disambiguation step follows.
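A minimal sketch of node disambiguation: surface forms that are near-duplicates within one similarity cluster get merged into a single node. The difflib string ratio is a stand-in for whatever string or embedding similarity a real pipeline would use, and the names are invented.

```python
# Merge node names that are likely the same entity. Standard library only.
from difflib import SequenceMatcher

def merge_aliases(names, threshold=0.7):
    """Group node names whose surface forms are near-duplicates."""
    canonical = {}
    for name in names:
        # reuse an existing canonical name if one is similar enough
        match = next((c for c in canonical
                      if SequenceMatcher(None, name.lower(),
                                         c.lower()).ratio() >= threshold),
                     name)
        canonical.setdefault(match, []).append(name)
    return canonical

# "Acme Corp", "ACME Corp." and "Acme Corporation" collapse to one node
print(merge_aliases(["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex"]))
```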
With that, context and context graphs can be created with ease, and they can then be used by AI agents, passed to LLMs, or managed and queried with a graph query language.