Building RAG Pipelines with LangChain and Vector Databases

Learn how to build Retrieval-Augmented Generation (RAG) pipelines using LangChain and vector databases -- from document ingestion and embedding to retrieval and generation.

Abiyyu Abidiffatir Al Majid

Large Language Models (LLMs) are powerful, but they have a critical limitation: their knowledge is frozen at training time. Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your own data. In this article, we will build a complete RAG pipeline from scratch.

What Is RAG?

RAG is a pattern where you:

  1. Retrieve relevant documents from a knowledge base based on the user's query.
  2. Augment the LLM prompt with those retrieved documents as context.
  3. Generate a response that is grounded in the retrieved information.

This reduces hallucination on domain-specific questions and keeps answers up to date without retraining the model.
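
The three steps can be sketched as a single function. This is a minimal illustration, not LangChain's implementation: `RetrieveFn`, `GenerateFn`, and `answerWithRag` are hypothetical names, and the retriever and model are passed in as plain async functions so the pattern is visible on its own.

```typescript
// Hypothetical function types standing in for a vector store and an LLM client.
type RetrieveFn = (query: string) => Promise<string[]>
type GenerateFn = (prompt: string) => Promise<string>

async function answerWithRag(
  query: string,
  retrieve: RetrieveFn,
  generate: GenerateFn,
): Promise<string> {
  // 1. Retrieve: fetch passages relevant to the query.
  const passages = await retrieve(query)

  // 2. Augment: place the passages in the prompt as context.
  const prompt = [
    'Answer using only the context below.',
    `Context:\n${passages.join('\n\n')}`,
    `Question: ${query}`,
  ].join('\n\n')

  // 3. Generate: let the model answer from the augmented prompt.
  return generate(prompt)
}
```

Because both dependencies are injected, the same function can be exercised with stubs in a test and wired to a real vector store and model later.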

Architecture Overview

A typical RAG pipeline has two phases:

Indexing Phase (offline):
  Documents → Chunking → Embedding → Vector Store

Query Phase (online):
  User Query → Embedding → Similarity Search → Context Assembly → LLM → Response

Document Ingestion and Chunking

First, load your documents and split them into manageable chunks. LangChain provides document loaders for PDFs, websites, Notion, and more:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf'

// Load a PDF document
const loader = new PDFLoader('./knowledge-base/product-docs.pdf')
const docs = await loader.load()

// Split into chunks of ~1000 characters with 200 character overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ['\n\n', '\n', '. ', ' ', ''],
})
const chunks = await splitter.splitDocuments(docs)

console.log(`Split ${docs.length} document(s) into ${chunks.length} chunks`)

The chunkOverlap parameter is critical: each chunk repeats up to 200 characters from the end of the previous one, so a sentence that straddles a chunk boundary is less likely to lose its context.
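
To see what overlap does, here is a deliberately simplified fixed-size chunker. It is not the RecursiveCharacterTextSplitter algorithm (which also splits on the configured separators); `chunkWithOverlap` is a hypothetical helper that only illustrates how consecutive chunks share text.

```typescript
// Simplified sliding-window chunker: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so neighbours share chunkOverlap characters.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0) {
    throw new Error('chunkSize must be positive')
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize')
  }

  const step = chunkSize - chunkOverlap
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

// chunkWithOverlap('abcdefghij', 4, 2) returns ['abcd', 'cdef', 'efgh', 'ghij']
// chunkWithOverlap('abcdefghij', 4, 0) returns ['abcd', 'efgh', 'ij']
```

With an overlap of 2, the pair "cd" appears in both the first and second chunk; with an overlap of 0, anything spanning the boundary between "d" and "e" is split across chunks.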

Embedding and Vector Storage

Next, convert each chunk into a vector embedding and store it in a vector database. Here we use Pinecone, but the pattern is similar for Weaviate, Qdrant, or Chroma:

import { OpenAIEmbeddings } from '@langchain/openai'
import { PineconeStore } from '@langchain/pinecone'
import { Pinecone } from '@pinecone-database/pinecone'

const pinecone = new Pinecone()
const index = pinecone.Index('abyte-knowledge-base')

const embeddings = new OpenAIEmbeddings({
  modelName: 'text-embedding-3-small',
  dimensions: 1536,
})

// Store chunks in Pinecone
const vectorStore = await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex: index,
  namespace: 'product-docs',
})

Each chunk is converted into a 1536-dimensional vector. When a query comes in, we embed it with the same model and find the most similar stored vectors, using cosine similarity if the index is configured with that metric.
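
For intuition, cosine similarity and a top-K lookup can be written in a few lines. This is a brute-force sketch over an in-memory array; `StoredVector` and `topK` are hypothetical names, and production vector databases typically use approximate nearest-neighbor indexes rather than a linear scan.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical record shape for this sketch.
interface StoredVector {
  id: string
  values: number[]
}

// Score every stored vector against the query and keep the k best.
function topK(
  query: number[],
  stored: StoredVector[],
  k: number,
): { id: string; score: number }[] {
  return stored
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, so ranking by this score surfaces the chunks whose embeddings are closest in direction to the query's.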

Retrieval

At query time, embed the user's question and retrieve the top-K most relevant chunks:

const retriever = vectorStore.asRetriever({
  k: 4,           // return top 4 chunks
  filter: {       // optional metadata filter
    source: 'product-docs',
  },
})

const relevantDocs = await retriever.invoke('How do I configure rate limiting?')

console.log(relevantDocs.map((doc) => doc.pageContent))
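
Conceptually, the `filter` and `k` options narrow the candidates by metadata and then keep the best-scoring matches. The sketch below illustrates that idea on an in-memory array; it is not how Pinecone implements filtering, and `ScoredChunk` and `filterThenTopK` are hypothetical names.

```typescript
// Hypothetical shape: a chunk with metadata and a precomputed similarity score.
interface ScoredChunk {
  pageContent: string
  metadata: Record<string, string>
  score: number
}

// Keep chunks whose metadata matches every key in the filter,
// then return the k highest-scoring ones.
function filterThenTopK(
  chunks: ScoredChunk[],
  filter: Record<string, string>,
  k: number,
): ScoredChunk[] {
  return chunks
    .filter((chunk) =>
      Object.keys(filter).every((key) => chunk.metadata[key] === filter[key]),
    )
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```

Note that a high-scoring chunk from another source is excluded before ranking, so a restrictive filter can return fewer than k results.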

Generation with Context

Finally, assemble the retrieved context into a prompt and call the LLM:

import { ChatOpenAI } from '@langchain/openai'
import { PromptTemplate } from '@langchain/core/prompts'
import { StringOutputParser } from '@langchain/core/output_parsers'

const model = new ChatOpenAI({ modelName: 'gpt-4o', temperature: 0 })

const prompt = PromptTemplate.fromTemplate(`You are a technical support assistant for abyte.
Answer the question based ONLY on the following context. If the context doesn't contain the answer, say "I don't have enough information to answer that."

Context: {context}

Question: {question}

Answer:`)

const chain = prompt.pipe(model).pipe(new StringOutputParser())

const response = await chain.invoke({
  context: relevantDocs.map((d) => d.pageContent).join('\n\n---\n\n'),
  question: 'How do I configure rate limiting?',
})

console.log(response)
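
Joining every retrieved chunk works for small values of k, but the context can outgrow the model's input limit as k or the chunk size increases. One possible guard, sketched here under the assumption that a simple character budget is good enough, is to add whole chunks in rank order until the budget is used up. `assembleContext`, `RetrievedDoc`, and `maxChars` are hypothetical names, not part of LangChain.

```typescript
// Hypothetical minimal shape; LangChain documents also expose pageContent.
interface RetrievedDoc {
  pageContent: string
}

const SEPARATOR = '\n\n---\n\n'

// Add whole chunks in order until the next one would exceed maxChars.
function assembleContext(docs: RetrievedDoc[], maxChars: number): string {
  const parts: string[] = []
  let used = 0
  for (const doc of docs) {
    // Every chunk after the first also costs one separator.
    const cost = doc.pageContent.length + (parts.length > 0 ? SEPARATOR.length : 0)
    if (used + cost > maxChars) break
    parts.push(doc.pageContent)
    used += cost
  }
  return parts.join(SEPARATOR)
}
```

If the result is an empty string, no chunk fit (or none was retrieved), which is a reasonable point to return the fallback message directly instead of calling the model.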

Improving Retrieval Quality

Raw similarity search is a starting point. To improve results:

  • Hybrid search -- combine vector similarity with keyword (BM25) search. Pinecone and Weaviate support this natively.
  • Re-ranking -- use a cross-encoder model (e.g., Cohere Rerank) to re-score retrieved chunks before passing them to the LLM.
  • Metadata filtering -- filter by document type, date, or category before doing similarity search.
  • Query transformation -- use the LLM to rewrite ambiguous queries into clearer search queries.

// Example: Multi-query retrieval
import { MultiQueryRetriever } from 'langchain/retrievers/multi_query'

const multiRetriever = MultiQueryRetriever.fromLLM({
  llm: model,
  retriever: vectorStore.asRetriever({ k: 4 }),
  queryCount: 3, // generate 3 variations of the query
})
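
For hybrid search, the vector and keyword results have to be merged into one ranking. One common technique for that is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general idea, not a description of what Pinecone or Weaviate do internally; `reciprocalRankFusion` is a hypothetical helper, and k = 60 is a commonly used default for the smoothing constant.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) to a document's
// score, so documents ranked well by several retrievers rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1 // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

// Hypothetical document IDs from a vector search and a keyword (BM25) search.
const vectorRanking = ['a', 'b', 'c']
const keywordRanking = ['c', 'a', 'd']
const fused = reciprocalRankFusion([vectorRanking, keywordRanking])
// fused is ['a', 'c', 'b', 'd']: 'a' and 'c' appear in both lists and outrank the rest.
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. The same function could also merge the ranked lists returned for each query variant in multi-query retrieval.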

Key Takeaways

  • RAG grounds LLM responses in your own data, reducing hallucination and keeping answers current.
  • Chunk size and overlap matter enormously -- experiment with different values for your domain.
  • Use hybrid search and re-ranking to improve retrieval quality beyond basic vector similarity.
  • Always include an "I don't know" fallback in your prompt to handle cases where the context does not contain the answer.