Building a Semantic Search Engine with Embeddings and Vector Search | SoniNow Blog

Tags: semantic search, embeddings, vector search, ai, information retrieval

Published: 2026-06-23

Read Time: 4 mins

Keyword search is fundamentally limited: it matches tokens, not meaning. A search for "budget-friendly laptop" won't match "affordable notebook" unless you've manually curated synonyms. Semantic search uses embeddings to understand meaning, delivering results that actually match user intent. Here's how to build a semantic search engine that handles real-world queries.

The Semantic Search Pipeline

A production semantic search system has four stages:

Query → Embed → Retrieve → Rerank → Present

Each stage has specific trade-offs and implementation choices.

Stage 1: Embedding the Query

The query embedding converts user input into a vector that represents its semantic meaning. The same embedding model used for your document index must be used for queries:

from openai import OpenAI

client = OpenAI()

def embed_text(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=text,
        dimensions=1536  # Can reduce for speed; never increase
    )
    return response.data[0].embedding
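
The dimensions comment above can be illustrated locally. OpenAI's embeddings guide describes shortening a text-embedding-3 vector by keeping its leading values and rescaling the result to unit length. The sketch below does that in pure Python so the effect is visible without an API call. The helper names shorten_embedding and cosine_similarity and the toy vectors are assumptions made for illustration, not part of the OpenAI client.

```python
import math

def shorten_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if dims > len(vector):
        # Truncation can only reduce dimensionality, never increase it.
        raise ValueError("cannot increase dimensions by truncation")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embedding" (made up for illustration)
full = [0.6, 0.0, 0.8, 0.0]
short = shorten_embedding(full, 2)
```

Whatever dimensionality you choose, the index and the queries must use the same one, since vectors of different lengths cannot be compared.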

Model selection: text-embedding-3-small offers a strong cost-quality ratio for general-purpose semantic search. For domains with specialized vocabulary (legal, medical, technical), consider open-source models such as BAAI/bge-large-en-v1.5, which you can fine-tune on your own data, or domain-specific variants from the MTEB leaderboard.

Query expansion improves retrieval by generating multiple query variants:

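import numpy as np

def expand_query(query):
    """Generate semantically related queries to improve recall."""
    # `llm` is assumed to be a client whose invoke() returns plain text.
    expansions = llm.invoke(f"""
    Generate 3 alternative phrasings of this search query that capture the same intent:
    "{query}"
    Return one per line, no numbering.
    """)
    # Drop blank lines and stray whitespace from the LLM output
    lines = [line.strip() for line in expansions.splitlines()]
    variants = [query] + [line for line in lines if line]

    embeddings = [embed_text(v) for v in variants]
    # Average the embeddings into a single query vector
    return np.mean(embeddings, axis=0)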
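
One detail of the averaging step is worth a sketch: the mean of several unit-length vectors is usually shorter than unit length. That does not matter for cosine similarity, but it can skew scores if your vector store ranks by inner product. A minimal pure-Python version of "average, then renormalize" might look like this. The function name mean_unit_vector and the toy vectors are assumptions for illustration.

```python
import math

def mean_unit_vector(vectors):
    """Average equal-length vectors, then rescale the mean to unit length."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero mean (variants cancelling out) is returned as-is.
    return [x / norm for x in mean] if norm else mean

# Two toy query-variant embeddings pointing in different directions
variants = [[1.0, 0.0], [0.0, 1.0]]
pooled = mean_unit_vector(variants)
```

An alternative design is to search with each variant separately and fuse the ranked lists, which keeps each phrasing's signal intact at the cost of extra queries.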

Stage 2: Vector Retrieval

Retrieve the top-K candidates from your vector database:

from collections import defaultdict

class SemanticSearchEngine:
    def __init__(self, vector_store, embedding_model):
        self.vector_store = vector_store  # pgvector, Pinecone, etc.
        # Callable mapping text -> vector, e.g. embed_text from Stage 1
        self.embedding_model = embedding_model

    def search(self, query, k=20, filters=None):
        query_embedding = self.embedding_model(query)

        # Hybrid search: combine vector similarity with keyword boost.
        # Both store methods are assumed to return (doc_id, score) pairs,
        # best match first; adapt the names to your store's actual API.
        vector_results = self.vector_store.similarity_search_with_score(
            embedding=query_embedding,
            k=k,
            filter=filters
        )

        keyword_results = self.vector_store.bm25_search(
            query=query,
            k=k // 2,
            filter=filters
        )

        # Combine and deduplicate
        combined = self._reciprocal_rank_fusion(vector_results, keyword_results)
        return combined[:k]

    def _reciprocal_rank_fusion(self, *result_sets, k=60):
        """RRF combines multiple ranking signals effectively."""
        # Here k is the RRF smoothing constant, not the number of results.
        scores = defaultdict(float)
        for results in result_sets:
            for rank, (doc_id, _) in enumerate(results, 1):
                scores[doc_id] += 1 / (k + rank)
        return sorted(scores.items(), key=lambda x: -x[1])
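
To make the fusion step concrete, here is a small standalone sketch of Reciprocal Rank Fusion on two toy rankings. It uses plain lists of document IDs rather than (doc_id, score) pairs, and the IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked lists of doc IDs; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank) to the doc's score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

vector_ranking = ["a", "b", "c"]   # from vector similarity
keyword_ranking = ["b", "d"]       # from keyword search
fused = reciprocal_rank_fusion(vector_ranking, keyword_ranking)
order = [doc_id for doc_id, _ in fused]
# order is ["b", "a", "d", "c"]: "b" wins because both rankers returned it
```

Because RRF only looks at ranks, it needs no score normalization between the vector and keyword signals, which is a large part of why it is easy to adopt.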

Retrieval tuning:

  • Retrieve k=20 to k=50 candidates (generous top-K), then rely on reranking
  • Use RRF (Reciprocal Rank Fusion) to combine vector and keyword results—it's simple and effective
  • Apply filters (category, date range, author) at retrieval time, not after
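
As a rough sketch of filtering at retrieval time with pgvector, the helper below builds a parameterized SQL query in which the WHERE clause and the distance ordering run in the same statement. The `<=>` operator is pgvector's cosine distance, and the `%(name)s` placeholders follow the psycopg style. The table name documents, the embedding column, and the allowed filter columns are assumptions for illustration.

```python
ALLOWED_FILTERS = {"category", "author"}  # hypothetical filterable columns

def build_filtered_query(query_vec, filters=None, k=20):
    """Return (sql, params) for a filtered cosine-distance search."""
    clauses = []
    # pgvector accepts vectors in the text form "[x1,x2,...]"
    params = {"query_vec": "[" + ",".join(str(x) for x in query_vec) + "]", "k": k}
    for column, value in (filters or {}).items():
        # Column names are interpolated, so only whitelisted names are allowed.
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {column}")
        clauses.append(f"{column} = %({column})s")
        params[column] = value
    where = f"WHERE {' AND '.join(clauses)} " if clauses else ""
    sql = (
        "SELECT id, embedding <=> %(query_vec)s::vector AS distance "
        "FROM documents "
        f"{where}"
        "ORDER BY embedding <=> %(query_vec)s::vector "
        "LIMIT %(k)s"
    )
    return sql, params

sql, params = build_filtered_query([0.1, 0.2], {"category": "laptops"}, k=5)
```

The resulting pair can be passed to a database cursor's execute call. Filtering after retrieval instead would risk returning fewer than k results whenever many of the nearest neighbors fail the filter.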

Stage 3: Reranking

The initial retrieval is fast but imprecise. Reranking applies a slower, more accurate model to the top candidates:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    """Apply cross-encoder reranking to improve relevance ordering."""
    pairs = [(query, doc.content) for doc in candidates]
    scores = reranker.predict(pairs)
    
    scored = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [doc for doc, score in scored[:top_k]]

Why reranking matters: The bi-encoder (embedding) model encodes the query and each document independently, which is fast because document vectors can be precomputed, but it loses query-document interaction information. The cross-encoder reranker reads the query and document together, which typically produces more accurate relevance scores.

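A good reranker can improve NDCG@10 by roughly 10-15% over embedding-only search, though the gain depends on the dataset and the models involved. The trade-off: cross-encoders are on the order of 50-100x slower per pair, because nothing can be precomputed. This is why you rerank only the top 20-50 results.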
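
The retrieve-then-rerank funnel can be sketched with the two models injected as plain callables, which makes the cost structure easy to see: the expensive scorer runs once per candidate, never once per document in the corpus. Everything here (retrieve_then_rerank, the toy corpus, and the word-overlap scorer standing in for a cross-encoder) is a hypothetical illustration, not a real retrieval stack.

```python
def retrieve_then_rerank(query, retrieve, score_pair, k=20, top_k=10):
    """Cheap retrieval of k candidates, then expensive scoring of only those."""
    candidates = retrieve(query, k)
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

corpus = [
    "cheap laptop deals",
    "affordable notebook computer",
    "garden hose",
    "laptop sleeve",
]

def toy_retrieve(query, k):
    # Stand-in for vector search: just return the first k documents.
    return corpus[:k]

calls = []

def toy_score(query, doc):
    # Stand-in for a cross-encoder: count shared words, and log each call.
    calls.append(doc)
    return len(set(query.split()) & set(doc.split()))

top = retrieve_then_rerank("cheap laptop", toy_retrieve, toy_score, k=3, top_k=1)
```

With k=3, the scorer is called three times even though the corpus has four documents. In a real system the same bound is what keeps reranking latency predictable.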

Stage 4: Presentation and Snippets

The final stage presents results with context. Use an LLM to generate relevant snippets:

def generate_snippet(query, document):
    """Extract the most relevant passage from a document for this query."""
    prompt = f"""
    Given this search query and document, extract the single most relevant
    passage (2-3 sentences) that best answers the query.
    
    QUERY: {query}
    DOCUMENT: {document[:3000]}
    
    Return only the passage, verbatim from the document.
    """
    return llm.invoke(prompt)
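
Because the prompt asks for a verbatim passage but an LLM may still paraphrase, it can help to check the snippet against the document before showing it. The sketch below is one possible guard, assuming whitespace and case differences are acceptable. The name ensure_verbatim and the fallback of using the document's opening characters are illustrative choices, not part of any library.

```python
import re

def _normalize(text):
    """Collapse whitespace and lowercase so line breaks don't cause mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()

def ensure_verbatim(snippet, document, fallback_chars=200):
    """Return the snippet if it occurs in the document, else a plain excerpt."""
    normalized = _normalize(snippet)
    if normalized and normalized in _normalize(document):
        return snippet.strip()
    # The model paraphrased or invented text: fall back to the document's start.
    return document[:fallback_chars].strip()

doc = "Intro text. The X200 is an affordable notebook\nwith 16 GB of RAM. More text."
good = ensure_verbatim("The X200 is an affordable notebook with 16 GB of RAM.", doc)
bad = ensure_verbatim("The X200 has a 4K OLED display.", doc)
```

Here good is the snippet itself, while bad falls back to the start of the document because the claimed passage does not appear in it.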

Production Deployment

# docker-compose.yml for semantic search stack
version: '3.8'
services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: search
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

  embedder:
    build: ./embedder
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
    ports:
      - "8001:8001"

  search-api:
    build: ./search-api
    ports:
      - "8080:8080"
    depends_on:
      - postgres
      - embedder

volumes:
  pgdata:

Monitoring Semantic Search Quality

  • NDCG@10: Normalized Discounted Cumulative Gain at 10 results. Track daily.
  • Zero-result rate: Percentage of queries returning nothing. Alert above 5%.
  • Click-through rate: Are users clicking the first result? Above 40% is good.
  • User reformulation rate: How often do users immediately re-query? Below 15% is healthy.
  • Latency P99: Total pipeline latency. Target under 200ms.
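
For reference, NDCG@10 can be computed from graded relevance labels in a few lines. This sketch uses the common linear-gain form (relevance divided by log2 of rank plus one) and, as a simplifying assumption, takes the ideal ordering from the same list of labels rather than from every relevant document in the corpus. The label values are made up for illustration.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (rank starts at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1])   # already in ideal order
swapped = ndcg_at_k([1, 3])      # best document shown second
empty = ndcg_at_k([0, 0, 0])     # nothing relevant returned
```

A perfectly ordered list scores 1.0, a list with the best document demoted scores below 1.0, and a list with no relevant results scores 0.0.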

SoniNow builds semantic search engines that dramatically improve discovery on content platforms, e-commerce sites, and knowledge bases. Our web development and AI automation services cover the full pipeline from embedding strategy to production deployment.

Stop searching for better search. Contact us to build a semantic search engine that understands what your users actually mean.