RAG Architecture: Building Retrieval-Augmented Generation Systems That Work

Retrieval-Augmented Generation (RAG) has become the default architecture for grounding LLM responses in real, verifiable data. Instead of relying solely on a model's training data, RAG retrieves relevant documents from a knowledge base at inference time and feeds them to the model as context. This reduces hallucinations, because the model answers from supplied text rather than from memory, and keeps answers current without retraining. Here's how to build a RAG system that works in production.

Document Ingestion and Chunking

The quality of your RAG system begins with how you prepare your documents. The goal is to create chunks that are semantically self-contained—each chunk should make sense on its own.

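Semantic chunking often outperforms fixed-size chunking because it splits where the topic shifts rather than at an arbitrary character count. A threshold-based approach places a breakpoint wherever the embedding distance between adjacent sentences exceeds a chosen cutoff (here, the 70th percentile of all distances):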

# SemanticChunker ships in the langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",  # or "gradient", "standard_deviation"
    breakpoint_threshold_amount=70
)

chunks = semantic_splitter.split_documents(documents)
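To make the percentile threshold concrete, here is a minimal pure-Python sketch of the idea. It is not SemanticChunker's actual implementation: the helper names are hypothetical, and the distances are made-up numbers standing in for real embedding distances between adjacent sentences.

```python
def percentile(values, pct):
    """Linear-interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(sentences, distances, pct=70):
    """Group sentences into chunks, breaking where a gap exceeds the percentile cutoff.

    distances[i] is the embedding distance between sentences[i] and sentences[i + 1].
    """
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], distances):
        if gap > threshold:
            # Large jump in meaning: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy input: the distances are illustrative, not real embedding output.
sentences = [
    "Cats purr.",
    "Cats nap often.",
    "GPUs run kernels.",
    "CUDA targets GPUs.",
    "Bread needs yeast.",
]
distances = [0.10, 0.80, 0.15, 0.90]
chunks = split_at_breakpoints(sentences, distances, pct=50)
for chunk in chunks:
    print(chunk)
```

A lower percentile produces more, smaller chunks; a higher one produces fewer, larger chunks.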

For PDFs and structured documents, consider hierarchical chunking: split by section headers first, then sub-chunk large sections. Store both levels in your vector database so you can retrieve either granular details or broad context.
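As a rough sketch of the two-level idea, assuming the document has already been converted to text with markdown-style `#` headers (real PDFs need a parsing step first), the function below emits one record per section plus paragraph-level records for long sections. The function names, record fields, and `max_chars` cutoff are all illustrative choices.

```python
def split_sections(text):
    """Return (heading, body) pairs; a line starting with '#' opens a new section.

    Text before the first heading is ignored in this sketch.
    """
    sections = []
    for line in text.splitlines():
        if line.startswith("#"):
            sections.append([line.lstrip("#").strip(), []])
        elif sections:
            sections[-1][1].append(line)
    return [(heading, "\n".join(body).strip()) for heading, body in sections]


def hierarchical_chunks(text, max_chars=100):
    """Emit a section-level record for every section, plus paragraph-level
    records for sections whose body is longer than max_chars."""
    records = []
    for section_id, (heading, body) in enumerate(split_sections(text)):
        records.append(
            {"level": "section", "section_id": section_id, "heading": heading, "text": body}
        )
        if len(body) > max_chars:
            for para in body.split("\n\n"):
                if para.strip():
                    records.append(
                        {
                            "level": "paragraph",
                            "section_id": section_id,
                            "heading": heading,
                            "text": para.strip(),
                        }
                    )
    return records


doc = (
    "# Setup\n"
    "Install the package.\n"
    "\n"
    "# Usage\n"
    "Call the client with your API key and a query string to get ranked results.\n"
    "\n"
    "Results include a score and the source document for each hit.\n"
)
records = hierarchical_chunks(doc)
for record in records:
    print(record["level"], "|", record["heading"], "|", record["text"][:40])
```

Keeping `section_id` on both levels lets a paragraph-level hit be expanded to its parent section when broader context is needed.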

Metadata tagging is often underestimated. Attach source URL, document title, section heading, page number, and creation date to each chunk. This enables filtered retrieval and citation generation later.
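Here is one way this could look in practice. The `Chunk` class, the field names, and the example.com URLs are hypothetical; most vector stores accept an equivalent metadata dictionary alongside each vector.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every key=value condition."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in conditions.items())
    ]


def format_citation(chunk):
    """Build a human-readable citation from the chunk's metadata."""
    m = chunk.metadata
    return f'{m["title"]}, "{m["section"]}", p. {m["page"]} ({m["source_url"]})'


# Illustrative data only; the URLs and titles are placeholders.
chunks = [
    Chunk(
        "Refunds are issued within 14 days.",
        {
            "source_url": "https://example.com/policy.pdf",
            "title": "Refund Policy",
            "section": "Timelines",
            "page": 3,
            "created": "2024-01-15",
        },
    ),
    Chunk(
        "Shipping takes 3 to 5 business days.",
        {
            "source_url": "https://example.com/shipping.pdf",
            "title": "Shipping Guide",
            "section": "Delivery",
            "page": 1,
            "created": "2023-11-02",
        },
    ),
]

policy_chunks = filter_chunks(chunks, title="Refund Policy")
citation = format_citation(policy_chunks[0])
print(citation)
```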

Embedding Model Selection

Your embedding model converts text into vectors. The choice directly impacts retrieval quality.

| Model | Dimensions | Cost | Quality (MTEB) |
|-------|-----------|------|-----------------|
| text-embedding-3-small | 1,536 | $0.02/1M tokens | 62.3 |
| text-embedding-3-large | 3,072 | $0.13/1M tokens | 64.6 |
| BAAI/bge-large-en-v1.5 | 1,024 | Free (self-hosted) | 64.2 |
| intfloat/e5-mistral-7b-instruct | 4,096 | Free (self-hosted) | 66.6 |

Rule of thumb: Use text-embedding-3-small for prototyping; it's cheap and fast. For production with 500K+ chunks, the roughly 2-point MTEB gain from text-embedding-3-large is usually worth the 6.5x higher per-token price, though its 3,072-dimension vectors also double your storage. Self-host bge-large-en-v1.5 when data must stay on your own infrastructure.
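To put numbers on that trade-off, here is a back-of-the-envelope estimate using the prices and dimensions from the table. The 400-tokens-per-chunk average is an assumption, and the storage figure counts raw float32 vectors only, not index overhead.

```python
# Prices and dimensions copied from the comparison table above.
MODELS = {
    "text-embedding-3-small": {"dims": 1536, "usd_per_million_tokens": 0.02},
    "text-embedding-3-large": {"dims": 3072, "usd_per_million_tokens": 0.13},
}


def estimate(model, num_chunks, avg_tokens_per_chunk=400):
    """Return (one-time embedding cost in USD, raw vector storage in GB).

    avg_tokens_per_chunk is an assumed average; measure your own corpus.
    """
    spec = MODELS[model]
    total_tokens = num_chunks * avg_tokens_per_chunk
    cost_usd = total_tokens / 1_000_000 * spec["usd_per_million_tokens"]
    storage_gb = num_chunks * spec["dims"] * 4 / 1e9  # float32 = 4 bytes per dimension
    return round(cost_usd, 2), round(storage_gb, 2)


for name in MODELS:
    cost, storage = estimate(name, num_chunks=500_000)
    print(f"{name}: ${cost:.2f} to embed, {storage:.2f} GB of vectors")
```

Under these assumptions the one-time embedding bill is small for either model; the doubled vector storage and the extra memory for the index are usually the larger ongoing cost.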

Vector Database Selection

Choose your vector database based on scale and operational complexity:

pgvector: A strong fit if you already run PostgreSQL. Handles up to ~10M vectors with IVFFlat indexing (HNSW indexes are also available from pgvector 0.5.0 onward). No separate service to operate.
Pinecone: Fully managed and serverless. Ideal when you want zero infrastructure management. Supports real-time index updates.
Weaviate: Best when you need hybrid search (dense + sparse vectors) natively. Strong multi-tenancy support.
Qdrant: Open-source with excellent filtering performance. Can deliver lower latency than Pinecone at comparable scales for filtered searches; benchmark against your own workload before committing.

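-- pgvector setup example
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536),  -- must match the embedding model's dimensions (text-embedding-3-small)
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create the IVFFlat index after loading data: its clusters are computed from existing rows
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);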
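The `lists = 100` value above suits a table of roughly 100K rows. The small helper below encodes the starting points suggested in the pgvector README (rows / 1000 for up to 1M rows, sqrt(rows) beyond that, and sqrt(lists) for query-time probes). The function name is hypothetical, and these are starting values to tune, not guarantees.

```python
import math


def ivfflat_params(row_count):
    """Suggest starting (lists, probes) values for a pgvector IVFFlat index.

    Follows the pgvector README heuristics; tune against measured recall.
    """
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = round(math.sqrt(row_count))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


for rows in (100_000, 4_000_000):
    lists, probes = ivfflat_params(rows)
    print(f"{rows} rows: lists = {lists}, probes = {probes}")
```

Probes are set per session with `SET ivfflat.probes = <n>;`. Higher values improve recall at the cost of query speed.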

Hybrid Search: Dense + Sparse Together

Pure vector search can miss exact keyword matches such as error codes, product SKUs, or function names. Hybrid search combines dense embeddings with sparse (BM25) retrieval:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

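# Dense retriever (vector similarity); assumes vector_store is an existing LangChain vector store
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25 keyword match); requires the rank_bm25 package
sparse_retriever = BM25Retriever.from_documents(documents)
sparse_retriever.k = 5  # match the dense retriever's result count

# Ensemble: fuses both ranked lists with weighted Reciprocal Rank Fusion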
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.7, 0.3]
)

Adjust the weights based on your domain: code documentation benefits from higher sparse weight (keywords matter), while conceptual queries favor dense retrieval.
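To see what the weights actually do, here is a minimal sketch of weighted Reciprocal Rank Fusion, the scheme EnsembleRetriever uses to merge ranked lists. The function name, the document IDs, and the constant `c=60` (a commonly used default) are illustrative assumptions, not LangChain internals.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document scores sum(weight / (c + rank))."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from the two retrievers, best match first.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]

dense_heavy = weighted_rrf([dense, sparse], weights=[0.7, 0.3])
sparse_heavy = weighted_rrf([dense, sparse], weights=[0.3, 0.7])
print(dense_heavy)
print(sparse_heavy)
```

Because fusion works on ranks rather than raw scores, the two retrievers do not need comparable score scales. Shifting weight toward the sparse list moves its top keyword match to the front.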

Evaluation and Continuous Improvement

Without evaluation, you can't improve. Set up a CI pipeline that runs on every knowledge base update:

Hit Rate: Does the retrieved context contain the answer? Target >90% for top-5.
MRR (Mean Reciprocal Rank): How early in the results does the correct answer appear?
Answer Faithfulness: Does the LLM response strictly use retrieved context? Use LLM-as-judge to score.
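The two retrieval metrics are simple to compute once you have a labeled test set. This sketch assumes each test query has exactly one gold chunk ID; the function names and IDs are illustrative.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of queries whose gold chunk appears in the top-k results."""
    hits = sum(1 for retrieved, gold in zip(results, relevant) if gold in retrieved[:k])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the gold chunk; a query with no hit contributes 0."""
    total = 0.0
    for retrieved, gold in zip(results, relevant):
        if gold in retrieved:
            total += 1 / (retrieved.index(gold) + 1)
    return total / len(results)


# Retrieved chunk IDs per query (best first) and the gold chunk for each query.
results = [
    ["c1", "c2", "c3"],  # gold at rank 1
    ["c9", "c4", "c7"],  # gold at rank 3
    ["c5", "c6", "c8"],  # gold not retrieved
]
relevant = ["c1", "c7", "c0"]

hit_rate = hit_rate_at_k(results, relevant, k=3)
mrr = mean_reciprocal_rank(results, relevant)
print(f"hit rate@3 = {hit_rate:.2f}, MRR = {mrr:.2f}")
```

In a CI job, these values can be compared against stored baselines so that a knowledge base update that degrades retrieval fails the build.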

At SoniNow, we design and deploy RAG systems that integrate with your existing data stack. Our AI automation services include custom chunking strategies, embedding optimization, and evaluation dashboards.

Production Considerations

Caching: Cache embedding vectors for stable documents. Invalidate only on content change.
Streaming: Implement streaming responses so users see tokens as they arrive—critical for UX.
Guardrails: Add a content safety layer that blocks out-of-scope queries before they hit your LLM.
Monitoring: Track query latency, chunk retrieval time, and LLM generation time separately.
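The caching and monitoring points can be sketched together. Below, embeddings are cached under a hash of the chunk text, so only changed content is re-embedded, and a small context manager accumulates time per pipeline stage. `fake_embed` is a stand-in for a real embedding call, and the class and stage names are hypothetical.

```python
import hashlib
import time
from contextlib import contextmanager


class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # New or edited content: compute and store the embedding.
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def fake_embed(text):
    # Placeholder for a real embedding API call.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
with timed("embedding"):
    cache.get("unchanged paragraph")
    cache.get("unchanged paragraph")  # same hash: served from cache
    cache.get("edited paragraph")     # different hash: embedded again

print(f"embedding calls: {cache.misses}, stages timed: {list(timings)}")
```

Wrapping retrieval and generation in their own `timed(...)` blocks gives the separate latency figures needed to tell a slow vector query from a slow LLM call.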

RAG is the most practical way to deploy LLMs in business contexts today. Done right, it delivers accurate, up-to-date, and verifiable answers. Talk to our team about building a RAG pipeline tailored to your knowledge base.
