Implementing Retrieval-Augmented Generation (RAG) for Custom Data Q&A

          In this tutorial, we will implement a Retrieval-Augmented Generation (RAG) system in Python using LangChain, Hugging Face Transformers, and FAISS. We will use custom equipment specifications as our knowledge base and allow an LLM (Flan-T5) to generate responses using retrieved external data. The tutorial covers:

  1. Introduction to RAG
  2. Setup and custom data preparation
  3. Creating a vector store (FAISS)
  4. Loading a pre-trained LLM (Flan-T5)
  5. Building the RAG system
  6. Execution
  7. Conclusion
  8. Full code listing


Introduction to RAG

    Retrieval-Augmented Generation (RAG) is an approach that enhances LLMs (Large Language Models) by incorporating an external knowledge base to generate more factually grounded responses. Instead of relying solely on pre-trained knowledge, a RAG model retrieves relevant documents and uses them to improve its answer.

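    Conceptually, a RAG system answers a question in two steps: retrieve the documents most relevant to the question, then generate an answer from a prompt that contains those documents. The toy sketch below illustrates this flow in plain Python only; its naive keyword-overlap retriever and stub "LLM" are placeholders for the FAISS retriever and Flan-T5 model we build later in this tutorial.

 
# Toy illustration of the RAG idea: retrieval by naive keyword overlap and a
# stub generator. The rest of the tutorial replaces these with FAISS and Flan-T5.
knowledge_base = [
    "The engine starts at 6 AM and shuts down at 10 PM.",
    "Oil replacement is required every 1,000 operating hours.",
]

def retrieve(question, docs, top_k=1):
    # Score each document by how many words it shares with the question
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def generate(prompt):
    # Stand-in for an LLM call; a real system sends `prompt` to a model
    return f"(an LLM would answer from) {prompt}"

question = "When does the engine start?"
context = retrieve(question, knowledge_base)
prompt = f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:"
print(generate(prompt))
 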
Why do we need RAG?

  • More Accurate Answers: LLMs sometimes make mistakes or "hallucinate." RAG helps fix this by pulling real facts from trusted documents.
  • Domain-Specific Knowledge: Regular LLMs might not know specific details like machine specs or medical info. RAG gives them access to custom knowledge bases.
  • Dynamic Knowledge Updates: Traditional models need retraining to learn new facts. With RAG, you just update the database, and the AI stays up to date.
  

Setup and custom data preparation

    Before starting, make sure you have the following Python libraries installed. You can install them using pip.

 
pip install langchain transformers faiss-cpu sentence-transformers 
 

    We start by importing the necessary libraries.

 
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
 

    We define a list of equipment specifications that the RAG system will use as its external knowledge source.

 
# Custom Equipment Specifications (Knowledge Base)
custom_documents = [
    "The engine starts at 6 AM and shuts down at 10 PM.",
    "High-pressure alert triggers when the pressure exceeds 250 PSI.",
    "The cooling fan activates when the temperature reaches 80 degrees C.",
    "Battery recharge initiates if the voltage drops below 11.5V.",
    "The emergency alarm sounds if vibration levels exceed 5.0 mm/s.",
    "Oil replacement is required every 1,000 operating hours.",
    "System enters standby mode after 15 minutes of inactivity.",
    "Fuel consumption rate should not exceed 3.5L per hour.",
    "Air filter must be cleaned when airflow drops below 70%.",
    "The backup generator activates if the main power is lost for more than 5 seconds."
]


Creating a vector store (FAISS)

    Since LLMs can't search text directly, we first turn our knowledge base into vector embeddings using HuggingFaceEmbeddings. These embeddings are then stored in FAISS (Facebook AI Similarity Search), a fast and efficient library for finding similar vectors in large datasets.

    Each document in custom_documents is converted into a numerical vector and saved in FAISS for quick retrieval. When a question is asked, our retriever searches FAISS to find the most relevant document.
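
    To build intuition for what these embeddings are, the optional snippet below uses the sentence-transformers library directly (the same all-MiniLM-L6-v2 model that LangChain wraps for us) to embed the documents and score them against a query by cosine similarity; this is, roughly, what the retriever will do behind the scenes.

 
# Optional: peek at the embeddings themselves using sentence-transformers directly
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("all-MiniLM-L6-v2")

doc_vectors = st_model.encode(custom_documents)       # one vector per document
query_vector = st_model.encode("When does the engine start?")

print(doc_vectors.shape)                              # (10, 384): 384-dimensional vectors
scores = util.cos_sim(query_vector, doc_vectors)      # cosine similarity to each document
print(custom_documents[int(scores.argmax())])         # most similar document wins
 

    With that intuition in place, we build the actual vector store and retriever with LangChain: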

 
# Convert text into vector embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Store document embeddings in FAISS (a fast similarity search library)
vector_store = FAISS.from_texts(custom_documents, embeddings)

# Define a retriever to find relevant documents (top-1 match)
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

 
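
    Before adding the LLM, it is worth sanity-checking the retriever on its own; the call below should return the document about the engine schedule.

 
# Optional sanity check: fetch the closest document for a sample question
docs = retriever.get_relevant_documents("When does the engine start?")
print(docs[0].page_content)
 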

Loading a pre-trained LLM (Flan-T5)

    In this section, we use Flan-T5 (Google's instruction-tuned T5 model) as our base LLM for answering questions. Flan-T5 is a lightweight model that runs efficiently on a CPU and handles question-answering and reasoning tasks well. Because it is fine-tuned to follow instructions, it is a good fit for a RAG system.

 
# Load Flan-T5 Model and Tokenizer
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define a text generation pipeline
text_generation_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)

# Integrate LLM into LangChain
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

 
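
    The model can already be queried on its own at this point, but without retrieval it knows nothing about our equipment, so its answer is a guess rather than something grounded in our specifications; this is exactly the gap the retriever fills in the next section.

 
# Optional: ask the model directly, without retrieval. The answer is not
# grounded in our equipment specs, since the model has never seen them.
print(text_generation_pipeline("When does the engine start?")[0]["generated_text"])
 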

Building the RAG system

    Now, we integrate our retriever (FAISS) and LLM (Flan-T5) into a RetrievalQA pipeline.

    In this setup, the retriever identifies the most relevant document from the knowledge base, while the LLM uses the retrieved document to generate a more precise response. The "stuff" chain type simply appends the retrieved text to the prompt.

 
# Create a Retrieval-Augmented Generation (RAG) pipeline
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
 
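
    Under the hood, the "stuff" chain builds a single prompt containing the retrieved document followed by the question and sends it to the LLM. The small function below is a simplified, manual equivalent of that behavior (the real chain uses its own prompt template); it is shown only to make the mechanism concrete.

 
# Simplified manual equivalent of the "stuff" chain, for intuition only
# (RetrievalQA uses its own prompt template internally).
def manual_stuff_answer(question):
    context = retriever.get_relevant_documents(question)[0].page_content
    prompt = (
        "Use the following context to answer the question.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)
 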

    Next, we create a function to ask questions using our RAG system. 


# Function to ask a question and retrieve answers
def ask(query):
    # Run the RAG pipeline and print the answer with its source document(s)
    result = rag_pipeline({"query": query})
    print(f"Question: {query}")
    print(f"Answer: {result['result']}")
    print("Sources:")
    for doc in result['source_documents']:
        print(f"- {doc.page_content}")
    print("\n")
 

 

Execution

    Now, we can run the code and ask a few test questions to see the answers.

 
# Example usage
ask("When does the engine start?")
ask("What triggers the high-pressure alert?")
ask("When should we clean the air filter?")
 

    It may take a few seconds to load the models; the result is as follows:

  
Question: When does the engine start?
Answer: 6 AM
Sources:
- The engine starts at 6 AM and shuts down at 10 PM.

Question: What triggers the high-pressure alert?
Answer: the pressure exceeds 250 PSI
Sources:
- High-pressure alert triggers when the pressure exceeds 250 PSI.

Question: When should we clean the air filter?
Answer: When airflow drops below 70%
Sources:
- Air filter must be cleaned when airflow drops below 70%. 
 

 

Conclusion

    In this tutorial, we explored how to implement RAG with a custom external data source. Here are some key takeaways:

  • RAG enhances LLMs by retrieving relevant knowledge before generating responses. 
  • FAISS enables fast retrieval of relevant facts.
  • Flan-T5 generates human-like answers based on retrieved information.
  • Domain-specific knowledge can be integrated into LLMs without retraining; a quick sketch of this appears below.

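    As an illustration of that last point, new facts can be added to the running vector store without touching the model at all. The snippet below is a small sketch, assuming the objects built in this tutorial are still in memory and using a made-up specification, that adds one document and immediately queries it.

 
# Add a hypothetical new specification to the existing FAISS store; no retraining needed
vector_store.add_texts(["The hydraulic pump must be inspected every 500 operating hours."])

ask("How often should the hydraulic pump be inspected?")
 
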
The full source code is listed below.

 

Full code listing 

 
# Install Required Libraries
# pip install langchain transformers faiss-cpu sentence-transformers

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Custom Equipment Specifications (Knowledge Base)
custom_documents = [
    "The engine starts at 6 AM and shuts down at 10 PM.",
    "High-pressure alert triggers when the pressure exceeds 250 PSI.",
    "The cooling fan activates when the temperature reaches 80 degrees C.",
    "Battery recharge initiates if the voltage drops below 11.5V.",
    "The emergency alarm sounds if vibration levels exceed 5.0 mm/s.",
    "Oil replacement is required every 1,000 operating hours.",
    "System enters standby mode after 15 minutes of inactivity.",
    "Fuel consumption rate should not exceed 3.5L per hour.",
    "Air filter must be cleaned when airflow drops below 70%.",
    "The backup generator activates if the main power is lost for more than 5 seconds."
]

# Convert text into vector embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Store document embeddings in FAISS (a fast similarity search library)
vector_store = FAISS.from_texts(custom_documents, embeddings)

# Define a retriever to find relevant documents (top-1 match)
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

# Load Flan-T5 Model and Tokenizer
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define a text generation pipeline
text_generation_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)

# Integrate LLM into LangChain
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create a Retrieval-Augmented Generation (RAG) pipeline
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Function to ask a question and retrieve answers
def ask(query):
    # Run the RAG pipeline and print the answer with its source document(s)
    result = rag_pipeline({"query": query})
    print(f"Question: {query}")
    print(f"Answer: {result['result']}")
    print("Sources:")
    for doc in result['source_documents']:
        print(f"- {doc.page_content}")
    print("\n")

# Example usage
ask("When does the engine start?")
ask("What triggers the high-pressure alert?")
ask("When should we clean the air filter?")
