Building a RAG-Based QA System with LlamaIndex

    In this tutorial, we will implement a RAG (Retrieval-Augmented Generation) chatbot using LlamaIndex, Hugging Face Transformers, and the Flan-T5 model. We use sample industrial equipment documentation as our knowledge base and let an LLM (Flan-T5) generate responses grounded in the retrieved external data. We also add relevance filtering for accuracy control. The tutorial covers:

  1. Introduction to RAG
  2. Why LlamaIndex?
  3. Setup and custom data preparation
  4. Creating a vector store index
  5. Load a pre-trained LLM (Flan-T5)
  6. Retrieval with relevance check
  7. Enhanced QA method
  8. Execution
  9. Conclusion
  10. Full code listing


Introduction to RAG

    Retrieval-Augmented Generation makes LLMs smarter by giving them access to an external knowledge base. Instead of just relying on what they’ve learned during training, RAG pulls in relevant documents in real time, helping the model generate more accurate and well-informed responses. By using RAG, we can achieve more accurate answers, incorporate domain-specific knowledge, and update the knowledge dynamically.
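    Before we bring in any framework, the loop itself is easy to sketch. The snippet below is a minimal, framework-free illustration of the retrieve-augment-generate cycle: a toy bag-of-words similarity stands in for a real embedding model, and the generation step is left as a placeholder. We build the real versions of both later in this tutorial.

from collections import Counter
import math

def toy_embed(text):
    """Bag-of-words vector: a crude stand-in for a real embedding model."""
    return Counter(text.lower().split())

def toy_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rag_answer(query, documents):
    # 1. Retrieve: rank the documents by similarity to the query.
    best = max(documents, key=lambda d: toy_similarity(toy_embed(query), toy_embed(d)))
    # 2. Augment: splice the retrieved context into the prompt.
    prompt = f"Context: {best}\nQuestion: {query}\nAnswer:"
    # 3. Generate: a real system would pass this prompt to an LLM.
    return prompt

docs = ["The spindle power is 15 kW.", "The worktable is 1200mm x 600mm."]
print(rag_answer("What is the spindle power?", docs))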

 

Why LlamaIndex?

    In the previous tutorial, we built a RAG chatbot using the FAISS (Facebook AI Similarity Search) library. In this tutorial, we will use the LlamaIndex framework to implement the RAG chatbot. FAISS and LlamaIndex each play unique but complementary roles. FAISS is great for fast and efficient similarity searches in vector spaces, while LlamaIndex makes it easy to connect and retrieve external knowledge for LLM-powered applications.

     LlamaIndex is a framework designed to connect external data sources (like documents and databases) with LLMs. It structures and indexes data, making it easy to query and provide relevant context to LLMs. LlamaIndex can be used to build tools such as:

  • Knowledge Bases: Create searchable, AI-ready databases for accurate responses.
  • Data Indexing: Organize unstructured data (e.g., reports, research papers, logs) for efficient retrieval.
  • Conversational AI: Enhance chatbots with domain-specific knowledge for smarter interactions.

 

Setup and custom data preparation

    Before starting, make sure you have the following Python libraries installed. You can install them using pip.

 
 pip install llama-index-core llama-index-embeddings-huggingface transformers

    We start by importing the necessary libraries.

 
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 

    We define the equipment specification text that the RAG system will use as its external knowledge source.

 
# Custom data (Knowledge Base)
custom_data = """
Equipment Specification: Industrial PMC Milling Machine
Model: XZ-Mill Pro 5000
Manufacturer: XYZ Industrial Solutions
Application: High-precision milling, drilling, and cutting of metal and composite materials
Description
The XZ-Mill Pro 5000 is a high-performance PMC milling machine designed for precision
machining in industrial applications. It features a robust cast-iron frame, high-speed
spindle, and advanced control system for accurate and efficient material processing.
The machine is equipped with an automated tool changer and real-time monitoring system,
ensuring consistent performance and minimal downtime.
Technical Specifications:
Spindle Power: 15 kW (20 HP)
Spindle Speed: 100–12,000 RPM
Worktable Size: 1200mm x 600mm
Max Load Capacity: 1000 kg
Tool Changer: 24-tool automatic carousel
Precision: 0.005mm
Control System: Siemens SINUMERIK 840D / Fanuc 31i
Cooling System: Integrated liquid cooling
Power Requirements: 400V, 50Hz, 3-phase
Safety Features: Emergency stop, interlock system, overload protection

Caution & Alerts
- Operational Safety: Ensure that only trained personnel operate the
machine. Improper use can lead to serious injuries.
- Material Compatibility: The machine is designed for metal and
composite materials. Using incompatible materials may cause damage
to the spindle or cutting tools.
- Regular Maintenance: Perform routine maintenance, including lubrication
and spindle inspection, to prevent malfunctions.
- Emergency Stop Usage: The E-stop button should be used only in critical
situations, as frequent use may cause system calibration issues.
- Electrical Safety: Always disconnect the machine from the power supply
before performing maintenance to prevent electrical hazards.
"""


Creating a vector store index

    In the code below, we configure LlamaIndex for storing and retrieving vector-based document embeddings. The given text (custom_data) is first wrapped in a Document object. We use the HuggingFaceEmbedding model (sentence-transformers/all-MiniLM-L6-v2) to transform the text into numerical vector embeddings. These embeddings are then indexed using VectorStoreIndex, enabling efficient similarity-based search and retrieval.

 
# --- Initialize LlamaIndex ---
# Create document
documents = [Document(text=custom_data)]

# Configure embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store index
index = VectorStoreIndex.from_documents(documents)

print("LlamaIndex vector store created.")

 

Load a pre-trained LLM (Flan-T5)

    In this section, we initialize the FLAN-T5-base model from Google, a fine-tuned version of T5 specialized for instruction-following tasks like summarization and question answering. We load the corresponding tokenizer, which converts text into token IDs, and the pretrained sequence-to-sequence model, which generates responses from the tokenized input.

 
# --- Load FLAN-T5 Model ---
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
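
    Before wiring the model into the retrieval pipeline, a quick standalone call confirms that the tokenizer and model loaded correctly. The prompt here is an arbitrary sanity check, not part of the QA system:

# Sanity check: tokenize a prompt and generate a short completion.
test_inputs = tokenizer("Translate English to German: Hello, world!", return_tensors="pt")
test_outputs = model.generate(**test_inputs, max_new_tokens=20)
print(tokenizer.decode(test_outputs[0], skip_special_tokens=True))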

 

Retrieval with relevance check

    We implement a retrieval function with relevance checking to ensure that only the most relevant context is returned for a given query. 

    Using index.as_retriever(similarity_top_k=1), we configure the retriever to fetch the top matching document chunk based on semantic similarity. When a query is submitted, we retrieve relevant results and check if any match is found. If no results are retrieved, we return None with a confidence score of 0.0. Otherwise, we return the most relevant text chunk along with its similarity score, enabling precise context-based responses.


# --- Improved Retrieval with Relevance Check ---
def retrieve_relevant_chunk(query):
    """Retrieve context with relevance checking."""
    retriever = index.as_retriever(similarity_top_k=1)
    results = retriever.retrieve(query)
    if not results:
        return None, 0.0
    return results[0].text, results[0].score
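
    A quick check shows how the similarity score separates in-scope from out-of-scope queries. The exact scores depend on the embedding model, so treat the printed values as illustrative:

# Illustrative check of the retriever's similarity scores.
for q in ["What is the spindle power?", "Who won the World Cup?"]:
    chunk, score = retrieve_relevant_chunk(q)
    print(f"{q} -> score {score:.3f}")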


 

 Enhanced QA method

    We design an enhanced RAG system that intelligently retrieves relevant context and generates accurate responses. The function first calls retrieve_relevant_chunk(query) to fetch the most relevant document snippet along with its similarity score. If the score falls below a specified confidence threshold, we return a fallback response, indicating insufficient information.

    We then construct a well-structured prompt, instructing the model to answer strictly based on the provided context. If the answer is not present in the retrieved content, the model is guided to respond with "I don't know." The query is tokenized and processed using a pretrained FLAN-T5 model, generating a response with controlled decoding parameters (num_beams=5, max_new_tokens=150, early_stopping=True). 

    Finally, we perform post-processing to ensure clarity, explicitly handling cases where the model admits uncertainty by returning a standardized response.

 
# --- Enhanced QA System ---
def ask_rag(query, confidence_threshold=0.1):
    """Improved RAG system with relevance detection."""
    context, score = retrieve_relevant_chunk(query)

    # Handle low similarity scores
    if score < confidence_threshold:
        return "I don't know. This question seems unrelated to equipment specifications."

    # Enhanced prompt engineering
    prompt = f"""You are a technical assistant. Answer the question based only on the
following context. If the answer isn't in the context, say "I don't know."

Context: {context}
Question: {query}
Answer:"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        num_beams=5,
        early_stopping=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Post-process response
    response = response.strip()
    if "don't know" in response.lower():
        return "I don't have enough information to answer that question."
    return response
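
    With the function in place, a single end-to-end call is a useful smoke test before running the full suite. Based on the specification text above, the expected answer is 15 kW (20 HP):

# One end-to-end call before running the full test cases.
print(ask_rag("What is the spindle power?"))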

 

Execution

    Now we can execute the code and pose test questions to get answers.

    We create a set of test cases to evaluate the performance of our RAG-based QA system. The questions list contains a mix of relevant technical queries (e.g., "What is the maximum load capacity?") and irrelevant questions (e.g., "What's the capital of France?") to test the system's ability to differentiate between valid and out-of-scope inputs.

 
# --- Test Cases ---
questions = [
    "What is the maximum load capacity?",
    "Explain the cooling system requirements",
    "What's the capital of France?",
    "How do I bake a chocolate cake?",
    "What safety features does it have?"
]
 

    Using a loop, we pass each question to the ask_rag function and print both the query and the generated answer. The output is formatted clearly, with separators ("-" * 50) to improve readability. This structured approach helps verify that the system retrieves accurate responses for relevant queries while rejecting unrelated questions with an appropriate fallback message.


for question in questions:
    answer = ask_rag(question)
    print(f"Q: {question}")
    print(f"A: {answer}\n")
    print("-" * 50)

    It may take a few seconds to load the models; the result is as follows:

  
LlamaIndex vector store created.
Q: What is the maximum load capacity?
A: 1000 kg

--------------------------------------------------
Q: Explain the cooling system requirements
A: 400V, 50Hz, 3-phase

--------------------------------------------------
Q: What's the capital of France?
A: I don't know. This question seems unrelated to equipment specifications.

--------------------------------------------------
Q: How do I bake a chocolate cake?
A: I don't know. This question seems unrelated to equipment specifications.

--------------------------------------------------
Q: What safety features does it have?
A: Emergency stop, interlock system, overload protection 
 

 


Conclusion

    In this tutorial, we built an enhanced Retrieval-Augmented Generation (RAG) system that combines semantic search and prompt engineering to deliver precise, context-aware answers. We initialized LlamaIndex for vector-based retrieval, integrated FLAN-T5 for natural language generation, and implemented confidence-based filtering to handle irrelevant queries effectively. Finally, we tested the system with diverse questions to ensure its robustness. This approach can be extended to various domain-specific applications, such as technical support, document Q&A, and knowledge-based assistants.

 

Full code listing 

 
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Custom data
custom_data = """
Equipment Specification: Industrial PMC Milling Machine
Model: XZ-Mill Pro 5000
Manufacturer: XYZ Industrial Solutions
Application: High-precision milling, drilling, and cutting of metal and composite materials
Description
The XZ-Mill Pro 5000 is a high-performance PMC milling machine designed for precision
machining in industrial applications. It features a robust cast-iron frame, high-speed
spindle, and advanced control system for accurate and efficient material processing.
The machine is equipped with an automated tool changer and real-time monitoring system,
ensuring consistent performance and minimal downtime.
Technical Specifications:
Spindle Power: 15 kW (20 HP)
Spindle Speed: 100–12,000 RPM
Worktable Size: 1200mm x 600mm
Max Load Capacity: 1000 kg
Tool Changer: 24-tool automatic carousel
Precision: 0.005mm
Control System: Siemens SINUMERIK 840D / Fanuc 31i
Cooling System: Integrated liquid cooling
Power Requirements: 400V, 50Hz, 3-phase
Safety Features: Emergency stop, interlock system, overload protection

Caution & Alerts
- Operational Safety: Ensure that only trained personnel operate the
machine. Improper use can lead to serious injuries.
- Material Compatibility: The machine is designed for metal and
composite materials. Using incompatible materials may cause damage
to the spindle or cutting tools.
- Regular Maintenance: Perform routine maintenance, including lubrication
and spindle inspection, to prevent malfunctions.
- Emergency Stop Usage: The E-stop button should be used only in critical
situations, as frequent use may cause system calibration issues.
- Electrical Safety: Always disconnect the machine from the power supply
before performing maintenance to prevent electrical hazards.
"""

# --- Initialize LlamaIndex ---
# Create document
documents = [Document(text=custom_data)]

# Configure embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store index
index = VectorStoreIndex.from_documents(documents)

print("LlamaIndex vector store created.")

# --- Load FLAN-T5 Model ---
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# --- Improved Retrieval with Relevance Check ---
def retrieve_relevant_chunk(query):
    """Retrieve context with relevance checking."""
    retriever = index.as_retriever(similarity_top_k=1)
    results = retriever.retrieve(query)
    if not results:
        return None, 0.0
    return results[0].text, results[0].score

# --- Enhanced QA System ---
def ask_rag(query, confidence_threshold=0.1):
    """Improved RAG system with relevance detection."""
    context, score = retrieve_relevant_chunk(query)

    # Handle low similarity scores
    if score < confidence_threshold:
        return "I don't know. This question seems unrelated to equipment specifications."

    # Enhanced prompt engineering
    prompt = f"""You are a technical assistant. Answer the question based only on the
following context. If the answer isn't in the context, say "I don't know."

Context: {context}
Question: {query}
Answer:"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        num_beams=5,
        early_stopping=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Post-process response
    response = response.strip()
    if "don't know" in response.lower():
        return "I don't have enough information to answer that question."
    return response

# --- Test Cases ---
questions = [
    "What is the maximum load capacity?",
    "Explain the cooling system requirements",
    "What's the capital of France?",
    "How do I bake a chocolate cake?",
    "What safety features does it have?"
]

for question in questions:
    answer = ask_rag(question)
    print(f"Q: {question}")
    print(f"A: {answer}\n")
    print("-" * 50)

 

