Large Language Models (LLMs) are powerful, but they have a critical limitation: their knowledge is frozen at training time. Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your own data. In this article, we will build a complete RAG pipeline from scratch.
What Is RAG?
RAG is a pattern where you:
- Retrieve relevant documents from a knowledge base based on the user's query.
- Augment the LLM prompt with those retrieved documents as context.
- Generate a response that is grounded in the retrieved information.
Architecture Overview
A typical RAG pipeline has two phases:
Indexing Phase (offline):
Documents → Chunking → Embedding → Vector Store
Query Phase (online):
User Query → Embedding → Similarity Search → Context Assembly → LLM → Response
Document Ingestion and Chunking
First, load your documents and split them into manageable chunks. LangChain provides document loaders for PDFs, websites, Notion, and more:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf'
// Load a PDF document
const loader = new PDFLoader('./knowledge-base/product-docs.pdf')
const docs = await loader.load()
// Split into chunks of ~1000 characters with 200 character overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', '. ', ' ', ''],
})
const chunks = await splitter.splitDocuments(docs)
console.log(`Split ${docs.length} document(s) into ${chunks.length} chunks`)
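The chunkOverlap parameter is critical: each chunk repeats up to 200 characters from the end of the previous one, so text that straddles a chunk boundary is less likely to be cut off from its surrounding context.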
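To make the effect of overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in for illustration only, not how RecursiveCharacterTextSplitter works internally (that splitter also tries each separator in turn). The helper name `chunkWithOverlap` and the sample string are assumptions.

```typescript
// Hypothetical helper: naive fixed-size chunking with overlap.
// Each chunk starts (chunkSize - chunkOverlap) characters after the previous one.
function chunkWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive')
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be in [0, chunkSize)')
  }
  const step = chunkSize - chunkOverlap
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

const sample = 'abcdefghijklmnopqrstuvwxyz'
const pieces = chunkWithOverlap(sample, 10, 4)
console.log(pieces)
// [ 'abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz' ]
```

Each chunk begins with the last four characters of the one before it, which is the same idea as the 200-character overlap above at a smaller scale.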
Embedding and Vector Storage
Next, convert each chunk into a vector embedding and store it in a vector database. Here we use Pinecone, but the pattern is similar for Weaviate, Qdrant, or Chroma:
import { OpenAIEmbeddings } from '@langchain/openai'
import { PineconeStore } from '@langchain/pinecone'
import { Pinecone } from '@pinecone-database/pinecone'
// With no arguments, the client reads the API key from the PINECONE_API_KEY environment variable
const pinecone = new Pinecone()
const index = pinecone.Index('abyte-knowledge-base')
const embeddings = new OpenAIEmbeddings({
  modelName: 'text-embedding-3-small',
  dimensions: 1536,
})
// Store chunks in Pinecone
const vectorStore = await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex: index,
  namespace: 'product-docs',
})
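Each chunk is converted into a 1536-dimensional vector. When a query comes in, we embed the query with the same model and find the most similar vectors using cosine similarity. This assumes the Pinecone index was created with 1536 dimensions and the cosine metric.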
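For intuition, the similarity search can be sketched as a brute-force scan over an in-memory list. This is an illustration only: a vector database typically uses an approximate nearest-neighbor index rather than scoring every vector, and the names `StoredVector`, `topK`, and the three-dimensional toy vectors are assumptions made for the example.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical in-memory record standing in for a stored chunk embedding.
interface StoredVector {
  id: string
  values: number[]
}

// Score every stored vector against the query and keep the k highest scores.
function topK(query: number[], store: StoredVector[], k: number): { id: string; score: number }[] {
  return store
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}

const toyStore: StoredVector[] = [
  { id: 'rate-limiting', values: [1, 0, 0] },
  { id: 'billing', values: [0, 1, 0] },
  { id: 'throttling', values: [1, 1, 0] },
]
const hits = topK([2, 0, 0], toyStore, 2)
console.log(hits.map((h) => h.id))
// [ 'rate-limiting', 'throttling' ]
```

Note that the query `[2, 0, 0]` scores 1 against `[1, 0, 0]` even though the magnitudes differ: cosine similarity compares direction only.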
Retrieval
At query time, embed the user's question and retrieve the top-K most relevant chunks:
const retriever = vectorStore.asRetriever({
  k: 4, // return top 4 chunks
  filter: { // optional metadata filter; the value must match the `source` metadata stored with each chunk
    source: 'product-docs',
  },
})
const relevantDocs = await retriever.invoke('How do I configure rate limiting?')
console.log(relevantDocs.map((doc) => doc.pageContent))
Generation with Context
Finally, assemble the retrieved context into a prompt and call the LLM:
import { ChatOpenAI } from '@langchain/openai'
import { PromptTemplate } from '@langchain/core/prompts'
import { StringOutputParser } from '@langchain/core/output_parsers'
const model = new ChatOpenAI({ modelName: 'gpt-4o', temperature: 0 })
const prompt = PromptTemplate.fromTemplate(
  `You are a technical support assistant for abyte. Answer the question based ONLY on the following context. If the context doesn't contain the answer, say "I don't have enough information to answer that."

Context: {context}
Question: {question}
Answer:`
)
// Pipe the prompt into the model, then parse the model output into a plain string
const chain = prompt.pipe(model).pipe(new StringOutputParser())
const response = await chain.invoke({
  context: relevantDocs.map((d) => d.pageContent).join('\n\n---\n\n'),
  question: 'How do I configure rate limiting?',
})
console.log(response)
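The context assembly step above joins every retrieved chunk unconditionally. If the chunks are large, it can help to cap how much context goes into the prompt. The sketch below shows one way to do that. The character budget and the helper name `buildContext` are assumptions for illustration; a production pipeline would more likely count tokens than characters.

```typescript
// Minimal shape of a retrieved document; mirrors the pageContent field used above.
interface RetrievedChunk {
  pageContent: string
}

// Hypothetical helper: join chunks in rank order, stopping before the budget is exceeded.
function buildContext(chunks: RetrievedChunk[], maxChars: number, separator = '\n\n---\n\n'): string {
  const parts: string[] = []
  let used = 0
  for (const chunk of chunks) {
    // The separator only costs characters once there is a previous part to join to.
    const cost = chunk.pageContent.length + (parts.length > 0 ? separator.length : 0)
    if (used + cost > maxChars) break
    parts.push(chunk.pageContent)
    used += cost
  }
  return parts.join(separator)
}

const assembled = buildContext(
  [
    { pageContent: 'Rate limits are set per API key.' },
    { pageContent: 'Billing runs monthly.' },
    { pageContent: 'Webhooks retry failed deliveries.' },
  ],
  70, // character budget: the third chunk does not fit
)
console.log(assembled)
```

Because the chunks arrive in relevance order, dropping from the tail discards the least relevant material first.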
Improving Retrieval Quality
Raw similarity search is a starting point. To improve results:
- Hybrid search -- combine vector similarity with keyword (BM25) search. Pinecone and Weaviate support this natively.
- Re-ranking -- use a cross-encoder model (e.g., Cohere Rerank) to re-score retrieved chunks before passing them to the LLM.
- Metadata filtering -- filter by document type, date, or category before doing similarity search.
- Query transformation -- use the LLM to rewrite ambiguous queries into clearer search queries.
// Example: Multi-query retrieval
import { MultiQueryRetriever } from 'langchain/retrievers/multi_query'
const multiRetriever = MultiQueryRetriever.fromLLM({
  llm: model,
  retriever: vectorStore.asRetriever({ k: 4 }),
  queryCount: 3, // generate 3 variations of the query
})
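Hybrid search and multi-query retrieval both produce several ranked lists that have to be merged into one. One common way to do that is reciprocal rank fusion (RRF), sketched below. This is an illustrative assumption, not a description of what Pinecone, Weaviate, or MultiQueryRetriever do internally. The document IDs are made up, and `k = 60` is the constant conventionally used with RRF.

```typescript
// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d in that list),
// with ranks starting at 1. Documents missing from a list contribute nothing for that list.
function reciprocalRankFusion(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1))
    })
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// Hypothetical result lists from a vector search and a keyword (BM25) search.
const vectorHits = ['doc-a', 'doc-b', 'doc-c']
const keywordHits = ['doc-c', 'doc-a', 'doc-d']
const fused = reciprocalRankFusion([vectorHits, keywordHits])
console.log(fused.map((f) => f.id))
// [ 'doc-a', 'doc-c', 'doc-b', 'doc-d' ]
```

RRF only uses rank positions, so it sidesteps the problem that cosine scores and BM25 scores live on different scales.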
Key Takeaways
- RAG grounds LLM responses in your own data, reducing hallucination and keeping answers current.
- Chunk size and overlap matter enormously -- experiment with different values for your domain.
- Use hybrid search and re-ranking to improve retrieval quality beyond basic vector similarity.
- Always include an "I don't know" fallback in your prompt to handle cases where the context does not contain the answer.
Crafted by
Abiyyu Abidiffatir Al Majid
Software Engineer passionate about building scalable web applications and sharing knowledge about modern web development, system design, and emerging technologies.



