Building a RAG-Based QA System with LlamaIndex

    In this tutorial, we will implement a RAG (Retrieval-Augmented Generation) chatbot using LlamaIndex, Hugging Face Transformers, and the Flan-T5 model. We use sample industrial equipment documentation as our knowledge base and let an LLM (Flan-T5) generate responses grounded in the retrieved external data. We also add relevance filtering for accuracy control. The tutorial covers:

  1. Introduction to RAG
  2. Why LlamaIndex?
  3. Setup and custom data preparation
  4. Creating a vector store index
  5. Load a pre-trained LLM (Flan-T5)
  6. Retrieval with relevance check
  7. Enhanced QA method
  8. Execution
  9. Conclusion
  10. Full code listing


Introduction to RAG

    Retrieval-Augmented Generation makes LLMs smarter by giving them access to an external knowledge base. Instead of just relying on what they’ve learned during training, RAG pulls in relevant documents in real time, helping the model generate more accurate and well-informed responses. By using RAG, we can achieve more accurate answers, incorporate domain-specific knowledge, and update the knowledge dynamically.
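    Before we bring in any framework, the loop itself is easy to sketch. The snippet below is a minimal, framework-free illustration of the retrieve-augment-generate cycle: a toy bag-of-words similarity stands in for a real embedding model, and the generation step is left as a placeholder. We build the real versions of both later in this tutorial.

from collections import Counter
import math

def toy_embed(text):
    """Bag-of-words vector: a crude stand-in for a real embedding model."""
    return Counter(text.lower().split())

def toy_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rag_answer(query, documents):
    # 1. Retrieve: rank the documents by similarity to the query.
    best = max(documents, key=lambda d: toy_similarity(toy_embed(query), toy_embed(d)))
    # 2. Augment: splice the retrieved context into the prompt.
    prompt = f"Context: {best}\nQuestion: {query}\nAnswer:"
    # 3. Generate: a real system would pass this prompt to an LLM.
    return prompt

docs = ["The spindle power is 15 kW.", "The worktable is 1200mm x 600mm."]
print(rag_answer("What is the spindle power?", docs))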

 

Why LlamaIndex?

    In the previous tutorial, we built a RAG chatbot using the FAISS (Facebook AI Similarity Search) library. In this tutorial, we will use the LlamaIndex framework to implement the RAG chatbot. FAISS and LlamaIndex each play unique but complementary roles. FAISS is great for fast and efficient similarity searches in vector spaces, while LlamaIndex makes it easy to connect and retrieve external knowledge for LLM-powered applications.

     LlamaIndex is a framework designed to connect external data sources (like documents and databases) with LLMs. It structures and indexes data, making it easy to query and provide relevant context to LLMs. LlamaIndex can be used to build tools such as:

  • Knowledge Bases: Create searchable, AI-ready databases for accurate responses.
  • Data Indexing: Organize unstructured data (e.g., reports, research papers, logs) for efficient retrieval.
  • Conversational AI: Enhance chatbots with domain-specific knowledge for smarter interactions.

 

Setup and custom data preparation

    Before starting, make sure you have the following Python libraries installed. You can install them using pip.

 
 pip install llama-index-core llama-index-embeddings-huggingface transformers

    We start by importing the necessary libraries.

 
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 

    We define the equipment specification text that the RAG system will use as its external knowledge source.

 
# Custom data (Knowledge Base)
custom_data = """
Equipment Specification: Industrial PMC Milling Machine
Model: XZ-Mill Pro 5000
Manufacturer: XYZ Industrial Solutions
Application: High-precision milling, drilling, and cutting of metal and composite materials
Description
The XZ-Mill Pro 5000 is a high-performance PMC milling machine designed for precision
machining in industrial applications. It features a robust cast-iron frame, high-speed
spindle, and advanced control system for accurate and efficient material processing.
The machine is equipped with an automated tool changer and real-time monitoring system,
ensuring consistent performance and minimal downtime.
Technical Specifications:
Spindle Power: 15 kW (20 HP)
Spindle Speed: 100–12,000 RPM
Worktable Size: 1200mm x 600mm
Max Load Capacity: 1000 kg
Tool Changer: 24-tool automatic carousel
Precision: 0.005mm
Control System: Siemens SINUMERIK 840D / Fanuc 31i
Cooling System: Integrated liquid cooling
Power Requirements: 400V, 50Hz, 3-phase
Safety Features: Emergency stop, interlock system, overload protection

Caution & Alerts
- Operational Safety: Ensure that only trained personnel operate the
machine. Improper use can lead to serious injuries.
- Material Compatibility: The machine is designed for metal and
composite materials. Using incompatible materials may cause damage
to the spindle or cutting tools.
- Regular Maintenance: Perform routine maintenance, including lubrication
and spindle inspection, to prevent malfunctions.
- Emergency Stop Usage: The E-stop button should be used only in critical
situations, as frequent use may cause system calibration issues.
- Electrical Safety: Always disconnect the machine from the power supply
before performing maintenance to prevent electrical hazards.
"""


Creating a vector store index

    In the code below, we configure LlamaIndex for storing and retrieving vector-based document embeddings. The given text (custom_data) is first wrapped in a Document object. We use the HuggingFaceEmbedding model (sentence-transformers/all-MiniLM-L6-v2) to transform the text into numerical vector embeddings. These embeddings are then indexed using VectorStoreIndex, enabling efficient similarity-based search and retrieval.

 
# --- Initialize LlamaIndex ---
# Create document
documents = [Document(text=custom_data)]

# Configure embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store index
index = VectorStoreIndex.from_documents(documents)

print("LlamaIndex vector store created.")

 

Load a pre-trained LLM (Flan-T5)

    In this section, we initialize the FLAN-T5-base model from Google, a fine-tuned version of T5 specialized for instruction-following tasks like summarization and question answering. We load the corresponding tokenizer, which converts text into token IDs, and the pretrained sequence-to-sequence model, which generates responses from the tokenized input.

 
# --- Load FLAN-T5 Model ---
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
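
    Before wiring the model into the retrieval pipeline, a quick standalone call confirms that the tokenizer and model loaded correctly. The prompt here is an arbitrary sanity check, not part of the QA system:

# Sanity check: tokenize a prompt and generate a short completion.
test_inputs = tokenizer("Translate English to German: Hello, world!", return_tensors="pt")
test_outputs = model.generate(**test_inputs, max_new_tokens=20)
print(tokenizer.decode(test_outputs[0], skip_special_tokens=True))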

 

Retrieval with relevance check

    We implement a retrieval function with relevance checking to ensure that only the most relevant context is returned for a given query. 

    Using index.as_retriever(similarity_top_k=1), we configure the retriever to fetch the top matching document chunk based on semantic similarity. When a query is submitted, we retrieve relevant results and check if any match is found. If no results are retrieved, we return None with a confidence score of 0.0. Otherwise, we return the most relevant text chunk along with its similarity score, enabling precise context-based responses.


# --- Improved Retrieval with Relevance Check ---
def retrieve_relevant_chunk(query):
    """Retrieve context with relevance checking."""
    retriever = index.as_retriever(similarity_top_k=1)
    results = retriever.retrieve(query)
    if not results:
        return None, 0.0
    return results[0].text, results[0].score
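
    A quick check shows how the similarity score separates in-scope from out-of-scope queries. The exact scores depend on the embedding model, so treat the printed values as illustrative:

# Illustrative check of the retriever's similarity scores.
for q in ["What is the spindle power?", "Who won the World Cup?"]:
    chunk, score = retrieve_relevant_chunk(q)
    print(f"{q} -> score {score:.3f}")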


 

 Enhanced QA method

    We design an enhanced RAG system that intelligently retrieves relevant context and generates accurate responses. The function first calls retrieve_relevant_chunk(query) to fetch the most relevant document snippet along with its similarity score. If the score falls below a specified confidence threshold, we return a fallback response, indicating insufficient information.

    We then construct a well-structured prompt, instructing the model to answer strictly based on the provided context. If the answer is not present in the retrieved content, the model is guided to respond with "I don't know." The query is tokenized and processed using a pretrained FLAN-T5 model, generating a response with controlled decoding parameters (num_beams=5, max_new_tokens=150, early_stopping=True). 

    Finally, we perform post-processing to ensure clarity, explicitly handling cases where the model admits uncertainty by returning a standardized response.

 
# --- Enhanced QA System ---
def ask_rag(query, confidence_threshold=0.1):
    """Improved RAG system with relevance detection."""
    context, score = retrieve_relevant_chunk(query)

    # Handle low similarity scores
    if score < confidence_threshold:
        return "I don't know. This question seems unrelated to equipment specifications."

    # Enhanced prompt engineering
    prompt = f"""You are a technical assistant. Answer the question based only on the
following context. If the answer isn't in the context, say "I don't know."

Context: {context}
Question: {query}
Answer:"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        num_beams=5,
        early_stopping=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Post-process response
    response = response.strip()
    if "don't know" in response.lower():
        return "I don't have enough information to answer that question."
    return response
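
    With the function in place, a single end-to-end call is a useful smoke test before running the full suite. Based on the specification text above, the expected answer is 15 kW (20 HP):

# One end-to-end call before running the full test cases.
print(ask_rag("What is the spindle power?"))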

 

Execution

    Now we can execute the code and pose test questions to get answers.

    We create a set of test cases to evaluate the performance of our RAG-based QA system. The questions list contains a mix of relevant technical queries (e.g., "What is the maximum load capacity?") and irrelevant questions (e.g., "What's the capital of France?") to test the system's ability to differentiate between valid and out-of-scope inputs.

 
# --- Test Cases ---
questions = [
    "What is the maximum load capacity?",
    "Explain the cooling system requirements",
    "What's the capital of France?",
    "How do I bake a chocolate cake?",
    "What safety features does it have?"
]
 

    Using a loop, we pass each question to the ask_rag function and print both the query and the generated answer. The output is formatted clearly, with separators ("-" * 50) to improve readability. This structured approach helps verify that the system retrieves accurate responses for relevant queries while rejecting unrelated questions with an appropriate fallback message.


for question in questions:
    answer = ask_rag(question)
    print(f"Q: {question}")
    print(f"A: {answer}\n")
    print("-" * 50)

    It may take a few seconds to load the models; the result is as follows:

  
LlamaIndex vector store created.
Q: What is the maximum load capacity?
A: 1000 kg

--------------------------------------------------
Q: Explain the cooling system requirements
A: 400V, 50Hz, 3-phase

--------------------------------------------------
Q: What's the capital of France?
A: I don't know. This question seems unrelated to equipment specifications.

--------------------------------------------------
Q: How do I bake a chocolate cake?
A: I don't know. This question seems unrelated to equipment specifications.

--------------------------------------------------
Q: What safety features does it have?
A: Emergency stop, interlock system, overload protection 
 

 


Conclusion

    In this tutorial, we built an enhanced Retrieval-Augmented Generation (RAG) system that combines semantic search and prompt engineering to deliver precise, context-aware answers. We initialized LlamaIndex for vector-based retrieval, integrated FLAN-T5 for natural language generation, and implemented confidence-based filtering to handle irrelevant queries effectively. Finally, we tested the system with diverse questions to ensure its robustness. This approach can be extended to various domain-specific applications, such as technical support, document Q&A, and knowledge-based assistants.

 

Full code listing 

 
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Custom data
custom_data = """
Equipment Specification: Industrial PMC Milling Machine
Model: XZ-Mill Pro 5000
Manufacturer: XYZ Industrial Solutions
Application: High-precision milling, drilling, and cutting of metal and composite materials
Description
The XZ-Mill Pro 5000 is a high-performance PMC milling machine designed for precision
machining in industrial applications. It features a robust cast-iron frame, high-speed
spindle, and advanced control system for accurate and efficient material processing.
The machine is equipped with an automated tool changer and real-time monitoring system,
ensuring consistent performance and minimal downtime.
Technical Specifications:
Spindle Power: 15 kW (20 HP)
Spindle Speed: 100–12,000 RPM
Worktable Size: 1200mm x 600mm
Max Load Capacity: 1000 kg
Tool Changer: 24-tool automatic carousel
Precision: 0.005mm
Control System: Siemens SINUMERIK 840D / Fanuc 31i
Cooling System: Integrated liquid cooling
Power Requirements: 400V, 50Hz, 3-phase
Safety Features: Emergency stop, interlock system, overload protection

Caution & Alerts
- Operational Safety: Ensure that only trained personnel operate the
machine. Improper use can lead to serious injuries.
- Material Compatibility: The machine is designed for metal and
composite materials. Using incompatible materials may cause damage
to the spindle or cutting tools.
- Regular Maintenance: Perform routine maintenance, including lubrication
and spindle inspection, to prevent malfunctions.
- Emergency Stop Usage: The E-stop button should be used only in critical
situations, as frequent use may cause system calibration issues.
- Electrical Safety: Always disconnect the machine from the power supply
before performing maintenance to prevent electrical hazards.
"""

# --- Initialize LlamaIndex ---
# Create document
documents = [Document(text=custom_data)]

# Configure embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store index
index = VectorStoreIndex.from_documents(documents)

print("LlamaIndex vector store created.")

# --- Load FLAN-T5 Model ---
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# --- Improved Retrieval with Relevance Check ---
def retrieve_relevant_chunk(query):
    """Retrieve context with relevance checking."""
    retriever = index.as_retriever(similarity_top_k=1)
    results = retriever.retrieve(query)
    if not results:
        return None, 0.0
    return results[0].text, results[0].score

# --- Enhanced QA System ---
def ask_rag(query, confidence_threshold=0.1):
    """Improved RAG system with relevance detection."""
    context, score = retrieve_relevant_chunk(query)

    # Handle low similarity scores
    if score < confidence_threshold:
        return "I don't know. This question seems unrelated to equipment specifications."

    # Enhanced prompt engineering
    prompt = f"""You are a technical assistant. Answer the question based only on the
following context. If the answer isn't in the context, say "I don't know."

Context: {context}
Question: {query}
Answer:"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        num_beams=5,
        early_stopping=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Post-process response
    response = response.strip()
    if "don't know" in response.lower():
        return "I don't have enough information to answer that question."
    return response

# --- Test Cases ---
questions = [
    "What is the maximum load capacity?",
    "Explain the cooling system requirements",
    "What's the capital of France?",
    "How do I bake a chocolate cake?",
    "What safety features does it have?"
]

for question in questions:
    answer = ask_rag(question)
    print(f"Q: {question}")
    print(f"A: {answer}\n")
    print("-" * 50)

 

