
Building a Universal Data Agent Quickly with LlamaIndex and Gravitino


In today's rapidly evolving landscape of data and generative AI, teams responsible for data infrastructure face challenges in making company data accessible in a streamlined, efficient, and regulation-compliant manner. This need is particularly pressing for the development of Large Language Models (LLMs) and Agentic Retrieval-Augmented Generation (RAG) systems, which have significantly impacted the analytics domain. This article will guide you through creating a data agent from the ground up, utilizing an open-source data catalog for interaction.

What is an LLM Agent?

To begin, it's essential to understand the function of Agents within RAG Pipelines. While LLMs can generate and comprehend language, they lack advanced reasoning skills. Agents enhance this capability by interpreting instructions to execute complex, domain-specific reasoning, which is then relayed back to the LLM.

Role of LLM agents at a glance

Agents can serve various purposes across different fields, such as solving mathematical problems, facilitating retrieval-augmented conversations, and acting as personal assistants. A data agent is generally aimed at achieving an extractive objective by directly interacting with the data, which can greatly enhance overall application performance and response accuracy.

Below is a typical architecture of a data agent.

Typical architecture of a data agent

The agent receives instructions from the LLM and, depending on its design, can interface with the user or the LLM via APIs or other agents. It decomposes larger tasks into smaller, manageable ones through planning, incorporating reflection and refinement capabilities. With the aid of memory, the agent can retain and retrieve information over extended contexts using vector storage and retrieval techniques. Furthermore, agents can access external APIs to gather information from various data sources, a feature that is especially beneficial.
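As a rough illustration of that loop (not tied to any particular framework), an agent can be sketched as a plan → act → remember cycle. The `llm` and `tools` objects below are assumed stand-ins, not part of any specific library:

# Minimal, illustrative agent loop: plan sub-tasks, call tools, keep memory.
# `llm` is assumed to be any object with a complete(prompt) -> str method,
# and `tools` maps tool names to callables; both are placeholders for this sketch.
def run_agent(llm, tools, user_request):
    memory = []  # retained observations (a stand-in for vector-store memory)
    # Planning: break the request into smaller, manageable steps
    plan = llm.complete(f"List the steps needed to answer: {user_request}")
    for step in plan.splitlines():
        # Tool use: ask the LLM which external tool or API fits this step
        tool_name = llm.complete(f"Pick one of {list(tools)} for: {step}").strip()
        observation = tools.get(tool_name, lambda s: "no tool available")(step)
        memory.append((step, observation))
    # Reflection: compose the final answer from everything gathered so far
    return llm.complete(f"Answer {user_request!r} using these notes: {memory}")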

Production Issues in RAG Development

Numerous demonstrations, proofs of concept, and tutorials exist detailing the construction of simple data agents; however, production deployment introduces its own set of challenges.

Data Quality and Integrity

The precision of responses is directly influenced by the quality and integrity of the data, regardless of the LLM employed. The effectiveness of generated SQL statements hinges on the quality of metadata associated with structured data, which can lead to undesirable outcomes if compromised. Poor source data can adversely affect vector embeddings, leading to nonsensical or erroneous retrieval results. The adage "garbage in, garbage out" is particularly relevant in the context of generative AI.

Retrieving Information from Diverse Sources

Organizations typically ingest data from a myriad of sources, encompassing various formats and storage solutions. Data may need to migrate across data centers and regions, complicating retrieval efforts. Failing to efficiently connect to a broad range of data sources can result in significant disadvantages, as critical data and relationships may be overlooked. The traditional ETL approach to centralizing data can also hinder response accuracy, often requiring additional time for data preparation.

Data Privacy, Security, and Compliance

Ensuring data privacy, security, and compliance is crucial when developing production-level data systems, including data agents and APIs. This task becomes more complex when LLMs are involved due to their high dimensionality and complexity, making it challenging to trace outputs back to their origins. Troubleshooting such systems, especially with multiple external tool and API calls, is inherently difficult while maintaining privacy and security. It's essential to design data infrastructures with continuous visibility, observability, measurability, and robustness.

What is Apache Gravitino (incubating)?

Apache Gravitino (incubating) is a high-performance, geo-distributed, and federated metadata lake. With a technical data catalog and metadata lake, users can manage access and implement data governance across various data sources (including file stores, relational databases, and event streams) while utilizing multiple engines such as Spark, Trino, or Flink across different formats and cloud providers. This integration is invaluable for quickly deploying LlamaIndex across multiple data sources.

Apache Gravitino's Architecture at a Glance

With Gravitino, users can achieve:

  • A Single Source of Truth for multi-regional data with geo-distributed architecture support.
  • Unified management of Data and AI assets for both users and engines.
  • Centralized security for various sources.
  • Integrated data management and access management.
  • An AI-ready and cost-efficient metadata fabric that standardizes across all data stores.

For further details on Gravitino, please refer to our blog post on Gravitino — the unified metadata lake.

Without Gravitino, a standard agentic RAG system would appear as follows:

LlamaIndex flow

Users would need to utilize different readers to connect to various sources individually, a process that becomes increasingly complex when data is distributed across clouds with differing security protocols.
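For example, reading from just a local folder and an S3 bucket directly might look like the sketch below; the paths, bucket name, and credentials are placeholders, not part of the demo:

from llama_index.core import SimpleDirectoryReader
import s3fs

# One reader per source, each with its own credentials and configuration
local_docs = SimpleDirectoryReader(input_dir="/data/local_pdfs").load_data()

s3 = s3fs.S3FileSystem(key="AWS_KEY", secret="AWS_SECRET")  # placeholder credentials
s3_docs = SimpleDirectoryReader(input_dir="my-bucket/reports", fs=s3).load_data()

Each additional source multiplies the readers, credentials, and governance rules that must be maintained by hand.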

With Gravitino, the new architecture is streamlined:

LlamaIndex flow

Using Gravitino and LlamaIndex to Build a Universal Data Agent

Let's delve into how to construct a data agent in just 15 minutes. The resulting agent offers several benefits:

  • No data movement: The data remains in its original location, eliminating the need for preprocessing or aggregation.
  • Capability to retrieve answers from both structured and unstructured data.
  • A natural language interface that converts user queries into subqueries and generates SQL as needed.

Environment Setup

Below, we outline the code necessary to replicate this setup. If you wish to follow along step by step, a local setup is available. Note that an OpenAI API key is required to run this demo.

To explore the playground further, see: Apache Gravitino Demo Playground

git clone git@github.com:apache/gravitino-playground.git
cd gravitino-playground
./launch-playground.sh

Next, navigate to the Jupyter Notebook with these steps:

  1. Open the Jupyter Notebook in your browser at http://localhost:8888
  2. Open the gravitino_llamaIndex_demo.ipynb notebook
  3. Start the notebook and execute the cells

The overall architecture of the demo included in the local playground is illustrated below:

Demo Architecture

Managing Datasets Using Gravitino

First, we will establish our initial catalog and link it to our file sets, with Hadoop serving as the data source. Next, we define the schemas and specify the storage locations.
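The snippets that follow assume a Gravitino client and the name identifiers have already been created. A rough sketch of that setup, modeled on the playground demo, is shown below; the server address, metalake name, and identifier construction are assumptions and may differ by environment and client version:

# Assumed setup for the snippets below (names follow the playground demo;
# the exact client API may vary between Gravitino versions).
from gravitino import GravitinoClient, NameIdentifier, Catalog, Fileset

gravitino_url = "http://localhost:8090"   # Gravitino server endpoint (assumed)
metalake_name = "metalake_demo"           # metalake used by the playground (assumed)
catalog_name = "catalog_fileset"

gravitino_client = GravitinoClient(uri=gravitino_url, metalake_name=metalake_name)

# Identifiers referenced by the catalog, schema, and fileset calls below
schema_ident = NameIdentifier.of(metalake_name, catalog_name, "countries")
fileset_ident = NameIdentifier.of(metalake_name, catalog_name, "countries", "cities")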

demo_catalog = None
try:
    # Reuse the catalog if it already exists
    demo_catalog = gravitino_client.load_catalog(name=catalog_name)
except Exception as e:
    # Otherwise create a Hadoop-backed fileset catalog
    demo_catalog = gravitino_client.create_catalog(name=catalog_name,
                                                   catalog_type=Catalog.Type.FILESET,
                                                   comment="demo",
                                                   provider="hadoop",
                                                   properties={})

# Create schema and file set
schema_countries = None
try:
    schema_countries = demo_catalog.as_schemas().load_schema(ident=schema_ident)
except Exception as e:
    schema_countries = demo_catalog.as_schemas().create_schema(ident=schema_ident,
                                                               comment="countries",
                                                               properties={})

fileset_cities = None
try:
    fileset_cities = demo_catalog.as_fileset_catalog().load_fileset(ident=fileset_ident)
except Exception as e:
    # EXTERNAL filesets point at data that already lives in storage
    fileset_cities = demo_catalog.as_fileset_catalog().create_fileset(ident=fileset_ident,
                                                                      fileset_type=Fileset.Type.EXTERNAL,
                                                                      comment="cities",
                                                                      storage_location="/tmp/gravitino/data/pdfs",
                                                                      properties={})

Building a Gravitino Structured Data Reader

Once our data sources are connected, we need a way to query them. For this purpose, we will use Trino, connected via SQLAlchemy. Alternatively, PySpark can be used if that aligns better with your team's existing tools.

from sqlalchemy import create_engine
from trino.sqlalchemy import URL
from sqlalchemy.sql.expression import select, text

# Trino federates the catalogs managed by Gravitino
trino_engine = create_engine('trino://admin@trino:8080/catalog_mysql/demo_llamaindex')

with trino_engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM catalog_mysql.demo_llamaindex.city_stats")
    print(cursor.fetchall())

Building a Gravitino Unstructured Data Reader

With our foundational data infrastructure established, we can now read the data directly into LlamaIndex. Gravitino exposes the fileset through a virtual file system, presenting it as a directory that LlamaIndex can read.

from llama_index.core import SimpleDirectoryReader
from gravitino import gvfs

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri=gravitino_url,
    metalake_name=metalake_name
)

fileset_virtual_location = "fileset/catalog_fileset/countries/cities"

reader = SimpleDirectoryReader(
    input_dir=fileset_virtual_location,
    fs=fs,
    recursive=True
)

wiki_docs = reader.load_data()

Building SQL Metadata Index from the Structured Data Connection

With these connections in place, we can build our index and vector stores from the metadata alone.

from llama_index.core import SQLDatabase

sql_database = SQLDatabase(trino_engine, include_tables=["city_stats"])
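As an optional sanity check, you can ask the wrapper which tables it exposes using LlamaIndex's standard SQLDatabase helper:

# Optional check: the wrapper should report the table we included above.
print(sql_database.get_usable_table_names())  # expected to include 'city_stats'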

Building Vector Index from Unstructured Data

from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Insert documents into the vector index
# Each document carries metadata of the city
# `cities` is the list of city names defined earlier in the demo notebook
vector_indices = {}
vector_query_engines = {}

for city, wiki_doc in zip(cities, wiki_docs):
    vector_index = VectorStoreIndex.from_documents([wiki_doc])
    query_engine = vector_index.as_query_engine(
        similarity_top_k=2, llm=OpenAI(model="gpt-3.5-turbo")
    )
    vector_indices[city] = vector_index
    vector_query_engines[city] = query_engine

Defining Query Engines and Asking Questions

To transform this into a fully operational chat application, a text-to-SQL interface must be developed. We will leverage LlamaIndex’s built-in functions to interact directly with the index we established earlier.

from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.query_engine import SQLJoinQueryEngine

# Define the NL to SQL engine
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)

# Define the vector query engines for each city
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = []
for city in cities:
    query_engine = vector_query_engines[city]
    query_engine_tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=city, description=f"Provides information about {city}"
        ),
    )
    query_engine_tools.append(query_engine_tool)

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools, llm=OpenAI(model="gpt-3.5-turbo")
)

# Convert engines to tools and combine them
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query over"
        " a table containing: city_stats, encompassing the population/country of"
        " each city"
    ),
)

s_engine_tool = QueryEngineTool.from_defaults(
    query_engine=s_engine,
    description=(
        "Useful for answering semantic questions about different cities"
    ),
)

query_engine = SQLJoinQueryEngine(
    sql_tool, s_engine_tool, llm=OpenAI(model="gpt-4")
)

# Issue query
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest"
    " population"
)
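Printing the response object shows the merged answer:

# Response objects render as plain text when printed
print(str(response))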

The final answer merges responses from two components. The first comes from the SQL engine, which translates the natural language input into the SQL statement "SELECT city_name, population, country FROM city_stats ORDER BY population DESC LIMIT 1", revealing that Tokyo has the highest population.

Based on this answer, the data agent then formulates follow-up questions concerning Tokyo's arts and culture, including: "Can you provide more details about the museums, theaters, and performance venues in Tokyo?"

The final response integrates these elements:

Final response: The city with the highest population is Tokyo, Japan. Tokyo is renowned for its vibrant arts and culture scene, showcasing a blend of traditional and modern influences. Visitors and residents alike can experience a wide array of cultural offerings, ranging from ancient temples and traditional tea ceremonies to cutting-edge technology and contemporary art galleries. The city is also home to numerous museums, theaters, and performance spaces that highlight Japan's rich history and creativity. Unfortunately, based on the context information provided, I cannot supply further details regarding the museums, theaters, and performance venues in Tokyo.

So What’s Next?

This demonstration illustrates the utilization of Gravitino for data ingestion and LlamaIndex for effective data retrieval. With Gravitino’s production-ready features, users can effortlessly construct a universal data agent. We are continuously enhancing Gravitino to ensure it meets the standards necessary for enterprise-grade data agents.

Are you ready to elevate your data agent? Explore the guides and join our ASF Community Slack Channel for support.

A special thanks to co-author Jerry Shao for his collaboration on this piece.

