
The Embeddings Handbook: Implementing Semantic Search from Scratch

Learn how to implement semantic search from scratch: embeddings, vector DBs, hybrid retrieval, reranking & scaling for production RAG systems.

How to Implement Semantic Search (The Fast Answer)

How to implement semantic search in five core steps:

  1. Chunk your documents — Split text into overlapping segments (e.g., 500 tokens with 50-token overlap) to preserve context.
  2. Generate embeddings — Convert each chunk into a vector using a model like all-MiniLM-L6-v2 or OpenAI’s text-embedding-3-small.
  3. Store vectors in a database — Index embeddings using a vector store like FAISS, pgvector, or Pinecone.
  4. Embed the query — At search time, convert the user’s query into a vector using the same model.
  5. Retrieve by similarity — Find the closest vectors using cosine similarity or Euclidean distance and return the matching documents.

Traditional keyword search breaks in a very specific, frustrating way.

A user asks “vehicle ignition problems” — and your search engine returns nothing, because the documentation says “car won’t start.” Same meaning. Zero overlap. No match.


This is the core problem semantic search solves.

Instead of matching words, semantic search matches meaning. It converts both documents and queries into high-dimensional numerical vectors — called embeddings — where similar concepts land close together in vector space. The result: searches that understand intent, handle paraphrasing, and surface relevant content even when the exact words don’t align.

This matters more than ever because:

  • RAG (Retrieval-Augmented Generation) systems depend on semantic retrieval to ground LLM responses in real data
  • Users search in natural language, not keyword strings
  • Unstructured data — PDFs, tickets, docs — can’t be searched with term frequency alone

The good news? Building a working semantic search system is now practical for any developer, thanks to affordable embedding APIs, open-source tools like Sentence Transformers, and extensions like pgvector that plug directly into PostgreSQL — the most popular database in Stack Overflow’s 2024 Developer Survey.

This guide walks you through the full implementation: from chunking and vectorization to hybrid search, reranking, and production scaling.

Infographic: lexical keyword matching vs. semantic, meaning-based vector retrieval.

At its heart, semantic search moves away from “lexical” matching (counting how many times a word appears) and toward “semantic” matching (understanding what the words represent). To do this, it relies on a specific type of machine learning architecture: the transformer model.

Illustration: a neural network converting text into vector embeddings.

When you look at vector embeddings, you are looking at dense numerical arrays. Unlike “sparse” vectors used in traditional search (where a vector might have 50,000 dimensions but most are zeros), dense vectors are compact. Every number in a dense vector represents a feature of the text’s meaning.

The breakthrough for this technology came with the Sentence-BERT paper, which introduced an efficient way to create these embeddings for entire sentences rather than just individual words. The famous arithmetic “king – man + woman ≈ queen” comes from earlier word-level models; sentence embeddings extend that same ability to capture relationships to whole passages. For a deeper look at the SEO implications of these shifts, our semantic-seo-guide explores how search engines have transitioned to this intent-based model.

To determine which documents are “closest” to a query, the system uses mathematical distance metrics:

  • Cosine Similarity: Measures the angle between two vectors. It is the gold standard for text because it focuses on the orientation (meaning) rather than the length of the text.
  • Euclidean Distance: Measures the straight-line distance between points.
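
Both metrics are one-liners with NumPy (the 3-dimensional vectors are toy examples; real embeddings have hundreds of dimensions):

```python
# Computing both distance metrics with NumPy on toy vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length

# Cosine similarity: angle only, so scaling a vector doesn't change it.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance, sensitive to vector length.
euclidean = np.linalg.norm(a - b)

print(round(float(cosine), 4))     # 1.0 — identical orientation
print(round(float(euclidean), 4))  # 3.7417 — yet far apart in absolute terms
```

This is exactly why cosine is preferred for text: a short paragraph and a long article about the same topic point in the same direction, even though their vectors differ in magnitude.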

The Role of Text Embeddings in Modern Retrieval

Text embeddings translate human language into a language computers can calculate. Each document is assigned a coordinate in a high-dimensional space—often ranging from 384 to over 1,500 dimensions. In this “latent space,” words with similar meanings naturally form clusters.

As explained in the Illustrated Word2Vec, these models learn “latent meaning” by looking at the company a word keeps. If “Apple” and “Microsoft” frequently appear near words like “software” and “stock,” the model learns they are semantically related. This clustering is what allows a search for “large constricting reptiles” to find an article about boa constrictors even if the word “reptile” never appears in the text.

Why Semantic Search is Essential for RAG Systems

Retrieval-Augmented Generation (RAG) is the process of giving an AI model (like GPT-4) specific facts to look at before it answers a question. If the retrieval step fails to find the right facts, the AI will likely “hallucinate” or make up an answer.

Semantic search is the engine that powers this retrieval. Because most enterprise data is unstructured—think of thousands of PDFs, Slack messages, and Notion pages—traditional keyword search often fails to find the specific paragraph needed to answer a nuanced question. By implementing a semantic layer, you ensure the AI has the correct context within its “context window,” significantly reducing errors. For more on how this impacts AI visibility, see our semantic-seo-for-ai-ultimate-guide.

How to Implement Semantic Search: A Step-by-Step Technical Workflow

Building a search engine is no longer a multi-year research project. With Python and a few key libraries, you can have a functional prototype running in an afternoon.

The standard stack for how to implement semantic search involves:

  1. Python: The primary language for AI and data processing.
  2. Sentence Transformers: A library for generating high-quality embeddings locally.
  3. pgvector or FAISS: Tools for storing and searching through those embeddings.

When starting a semantic-search-implementation, your first task is building an indexing pipeline. This is the “factory” that takes raw documents and turns them into searchable vectors.

Data Preprocessing and Chunking for Semantic Search

You cannot simply embed a 50-page PDF as a single vector. If you do, the “meaning” becomes too diluted, and the search will be inaccurate. Instead, you must perform “chunking.”

The most effective method is recursive splitting. This involves breaking text at logical points (paragraphs, then sentences) until you reach a target size, such as 500 tokens. A critical best practice is to include token overlap (usually 10-15%). If one chunk ends and the next begins exactly at a specific sentence, the context of that sentence might be lost. Overlap ensures that the “tail” of one chunk is the “head” of the next, preserving the narrative flow.
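
A simplified chunker shows the overlap mechanics; it uses whitespace-separated words as stand-in tokens, whereas production pipelines split recursively at paragraph and sentence boundaries and count model tokens:

```python
# A simplified chunker: sliding window over whitespace "tokens" with overlap.
# Real pipelines split recursively (paragraphs, then sentences) and count
# model tokens, but the overlap mechanics are the same.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    tokens = text.split()
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail of the document
    return chunks

# 1,200 numbered tokens, window 500, step 450: windows start at 0, 450, 900.
chunks = chunk_text(" ".join(str(i) for i in range(1200)), chunk_size=500, overlap=50)
```

The last 50 tokens of each chunk reappear as the first 50 of the next, so no sentence is stranded at a boundary without context.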

You can find practical examples of this logic in this implementation on GitHub, which demonstrates how to process millions of documents without crashing your system’s memory.

Vectorization and Indexing for Semantic Search

Once your text is chunked, you need to turn it into numbers. You have two main paths:

  • Open-Source Models: Using the all-MiniLM-L6-v2 model via the Sentence Transformers library is free and fast. It produces 384-dimensional vectors and is excellent for most general-purpose tasks.
  • Hosted APIs: OpenAI’s text-embedding-3-small is highly accurate and handles larger chunks (up to 8,191 tokens), though it comes with a per-request cost.

After generating the vectors, you need a place to put them. For many developers, pgvector is the ideal choice because it allows you to store your vectors right next to your existing relational data in PostgreSQL. By using pgai, you can even automate the embedding process directly within the database, removing the need to manage a separate vectorization server. This streamlined approach is a key part of how google-semantic-search handles massive scales of information.
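
Before reaching for a library, it helps to see what a vector store does at its core. The brute-force class below is what FAISS and pgvector replace with indexed, persistent storage (names and toy vectors are illustrative):

```python
# What a vector store does, stripped to its essentials: FAISS or pgvector
# replace this brute-force NumPy version with an indexed, persistent one.
import numpy as np

class TinyVectorStore:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads: list[str] = []

    def add(self, vector: np.ndarray, payload: str) -> None:
        v = vector / np.linalg.norm(vector)        # normalize once at insert time
        self.vectors = np.vstack([self.vectors, v.astype(np.float32)])
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q                  # cosine similarity via dot product
        top = np.argsort(scores)[::-1][:k]
        return [(self.payloads[i], float(scores[i])) for i in top]

store = TinyVectorStore(3)
store.add(np.array([1.0, 0.0, 0.0]), "getting started guide")
store.add(np.array([0.0, 1.0, 0.0]), "billing FAQ")
store.add(np.array([0.0, 0.0, 1.0]), "API reference")
results = store.search(np.array([0.9, 0.1, 0.0]), k=2)
```

The real systems add what this sketch lacks: persistence, ANN indexing so search doesn't scan every row, and (in pgvector's case) SQL joins against your existing tables.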

Optimizing Retrieval with Hybrid Search and Reranking

While semantic search is powerful, it isn’t a silver bullet. It sometimes struggles with very specific technical terms or acronyms that weren’t in its training data. This is where “Hybrid Search” comes in.

| Feature | Keyword Search (BM25) | Semantic Search (Vector) |
| --- | --- | --- |
| Matching type | Exact word overlap | Meaning and context |
| Strength | Product IDs, names, acronyms | Natural language, synonyms |
| Weakness | Synonyms, typos, phrasing | Specific technical jargon |
| Speed | Extremely fast | Fast (with indexing) |

The most robust systems use Reciprocal Rank Fusion (RRF) to combine the results of both methods. If a document ranks #2 in keyword search and #5 in semantic search, RRF gives it a high combined score, ensuring it appears at the top of the user’s results.
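
RRF itself is only a few lines. The sketch below reproduces the #2/#5 scenario; `k = 60` is the constant from the original RRF formulation:

```python
# Reciprocal Rank Fusion: each ranked list votes 1 / (k + rank) per document.
# k = 60 is the conventional constant; ranks are 1-based.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_b", "doc_a", "doc_c"]                      # doc_a is #2 here
semantic_results = ["doc_d", "doc_e", "doc_f", "doc_g", "doc_a"]   # and #5 here

fused = rrf([keyword_results, semantic_results])
# doc_a appears in both lists, so its combined score beats every
# document that ranks highly in only one of them.
```

Because RRF works on ranks rather than raw scores, it never has to reconcile BM25 scores with cosine similarities, which live on completely different scales.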

To measure if your implementation is actually improving, you should track metrics like nDCG@10 (Normalized Discounted Cumulative Gain), which rewards the system for putting the most relevant results at the very top of the list.
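
nDCG can be computed from graded relevance judgments in a few lines (the judgments below are toy values; in practice they come from human or LLM labeling):

```python
# nDCG@K: `relevances` are graded judgments for the results in the order
# the system returned them; the ideal ordering defines the normalizer.
import math

def dcg(relevances: list[float], k: int) -> float:
    # Gains are discounted logarithmically by position (rank 1 -> log2(2)).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int = 10) -> float:
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg([3.0, 2.0, 1.0])  # best result first -> 1.0
swapped = ndcg([1.0, 2.0, 3.0])  # best result last  -> below 1.0
```

The discount term is what makes nDCG@10 reward systems for putting the best answer at position 1 rather than position 9.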

Hybrid search is the industry standard for production systems. In a PostgreSQL environment, you can achieve this by Combining Semantic Search and Full-Text Search.

A common weighting strategy is the 70/30 rule: give semantic similarity 70% of the “vote” and keyword matching 30%. This ensures that if a user searches for a specific part number like “XJ-9000,” the keyword engine finds it immediately, but if they search for “the best tool for fixing a leaky pipe,” the semantic engine takes the lead.
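
One way to sketch the 70/30 weighting, assuming each engine returns raw scores per document id: min-max normalization makes the two score scales comparable before the weighted sum (names and weights are illustrative starting points, not tuned values):

```python
# Weighted score fusion: normalize each signal to [0, 1], then blend 70/30.
def hybrid_score(semantic: dict[str, float], keyword: dict[str, float],
                 w_semantic: float = 0.7) -> dict[str, float]:
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0          # avoid division by zero on ties
        return {d: (s - lo) / span for d, s in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    docs = set(sem) | set(kw)            # a doc may appear in only one list
    return {d: w_semantic * sem.get(d, 0.0) + (1 - w_semantic) * kw.get(d, 0.0)
            for d in docs}

scores = hybrid_score({"doc_a": 0.92, "doc_b": 0.55}, {"doc_b": 12.0, "doc_c": 3.0})
```

Unlike RRF, this approach lets you tune the balance continuously; the 70/30 split is a common default worth validating against your own relevance metrics.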

Improving Precision with Cross-Encoder Reranking

Standard vector search uses “Bi-Encoders,” where documents and queries are embedded separately. This is fast but can occasionally miss subtle nuances.

To achieve maximum precision, you can add a second stage called reranking. This uses a “Cross-Encoder” model. You take the top 50 results from your initial search and pass them—along with the original query—into a more powerful model that looks at both simultaneously. While computationally heavier, it is much more accurate at determining the final order of results. This multi-stage approach is essential for semantic-entity-seo-for-ai strategies where accuracy is paramount.

Scaling and Production Considerations for Vector Databases

As your document collection grows from thousands to millions, “brute force” searching (comparing the query to every single document) becomes too slow. You need an Approximate Nearest Neighbor (ANN) index.

The HNSW algorithm (Hierarchical Navigable Small World) is currently the most popular choice for high-performance search. It creates a multi-layered graph that allows the search engine to “skip” through the data, finding the right neighborhood in milliseconds.

However, HNSW is memory-intensive. For massive datasets where costs must be kept low, StreamingDiskANN is a newer alternative that allows you to store the majority of the index on disk rather than in expensive RAM. This choice significantly impacts the performance and cost-efficiency of your system, as detailed in our semantic-seo-agency-ultimate-guide.

Efficient Indexing for Large Document Collections

When managing millions of vectors, memory management is everything. Using numpy.memmap allows you to map large files on your disk directly into your application’s memory space without actually loading the whole file. The operating system handles the heavy lifting, only pulling in the “pages” of data you are currently searching.
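
A small sketch of the memmap pattern (array sizes here are scaled down for illustration; the mechanics are identical for multi-gigabyte files):

```python
# Memory-mapping an embedding matrix: the OS pages in only the rows you
# touch, so the full file never has to fit in RAM.
import os
import tempfile
import numpy as np

dim, n = 384, 10_000
path = os.path.join(tempfile.mkdtemp(), "embeddings.dat")

# One-time write (in practice, done by your indexing pipeline).
big = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, dim))
big[:] = 0.0
big[42] = 1.0            # mark one row so we can find it again
big.flush()
del big                  # release the write handle

# Read side: opening the memmap is near-instant; nothing is loaded eagerly.
embeddings = np.memmap(path, dtype=np.float32, mode="r", shape=(n, dim))
row = embeddings[42]     # only the pages backing this row are read from disk
```

The same file can be shared by multiple worker processes, since each one maps the file read-only rather than holding its own copy in memory.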

For production PostgreSQL users, the philosophy that PostgreSQL is all you need is becoming a reality. By using extensions like pgvectorscale, you can achieve performance that rivals dedicated vector databases while keeping your stack simple and your data synchronized.

Enterprise Security and Access Controls

In a corporate environment, you cannot show every user every document. If an employee searches for “salary ranges,” the semantic search should not retrieve the CEO’s private HR file unless the user has permission.

This requires metadata filtering. When you index your embeddings, you must also store “Access Control Lists” (ACLs) as metadata. When the user performs a search, the query is modified to include a filter: “Find the most similar vectors where user_group = ‘finance’.”
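
A pre-filtering sketch in plain Python: the `group` field and toy vectors are illustrative, and a production system pushes this filter down into the vector index or the SQL `WHERE` clause rather than filtering in application code:

```python
# Metadata pre-filtering: restrict the candidate set by ACL first,
# then rank only the permitted vectors by cosine similarity.
import numpy as np

chunks = [
    {"text": "Q3 revenue summary",     "group": "finance", "vec": np.array([1.0, 0.0])},
    {"text": "Executive salary bands", "group": "hr",      "vec": np.array([0.9, 0.1])},
    {"text": "Office wifi password",   "group": "all",     "vec": np.array([0.0, 1.0])},
]

def search(query_vec: np.ndarray, user_groups: set[str], k: int = 5) -> list[str]:
    allowed = [c for c in chunks if c["group"] in user_groups]   # ACL filter first
    scored = sorted(
        allowed,
        key=lambda c: float(np.dot(c["vec"], query_vec)
                            / (np.linalg.norm(c["vec"]) * np.linalg.norm(query_vec))),
        reverse=True,
    )
    return [c["text"] for c in scored[:k]]

results = search(np.array([1.0, 0.0]), user_groups={"finance", "all"})
# "Executive salary bands" never appears: the user lacks the 'hr' group,
# even though its vector is the second-closest match.
```

Filtering before ranking, rather than after, is the safer pattern: a post-filter can silently return fewer than `k` results, and worse, the forbidden document still participates in scoring.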

Tools like pgvectorscale are designed to handle these filtered searches efficiently, ensuring that security doesn’t slow down the user experience. Additionally, robust monitoring and error handling (such as retrying failed API calls to embedding providers) are non-negotiable for production resilience.

What is the difference between fuzzy search and semantic search?

Fuzzy search is a lexical technique. It uses algorithms like Levenshtein distance to find words that look similar (e.g., matching “apple” to “aple”). It is great for catching typos.

Semantic search is intent-based. It doesn’t care if the words look similar; it cares if they mean the same thing. Fuzzy search won’t help you find “automobile” when you search for “car,” but semantic search will, because their vectors are in close proximity in the embedding space.

Which embedding model should I choose for production?

The “best” model depends on your budget and latency requirements.

  • OpenAI: Best for ease of use and high accuracy, but requires an internet connection and has per-query costs.
  • Open-Source (Sentence Transformers): Best for privacy, speed, and zero per-query costs. Consult the MTEB leaderboard to compare the latest models on specific tasks like retrieval or summarization.

How do I evaluate the quality of my semantic search implementation?

You should use standard Information Retrieval (IR) metrics:

  • Precision@K: What percentage of the top K results are actually relevant?
  • Recall@K: Did the system find all the relevant documents in the top K?
  • Mean Reciprocal Rank (MRR): How far down the list did the user have to look to find the first relevant result?
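
All three metrics are a few lines each, given a ranked result list and the set of known-relevant document ids (the function below computes the per-query reciprocal rank; average it across queries to get MRR):

```python
# Standard IR metrics over a ranked result list and a relevance set.
def precision_at_k(results: list[str], relevant: set[str], k: int) -> float:
    # Of the top K results, what fraction are relevant?
    return sum(1 for d in results[:k] if d in relevant) / k

def recall_at_k(results: list[str], relevant: set[str], k: int) -> float:
    # Of all relevant documents, what fraction made it into the top K?
    return sum(1 for d in results[:k] if d in relevant) / len(relevant)

def reciprocal_rank(results: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant result; average over queries for MRR.
    for i, d in enumerate(results, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

results = ["d3", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
# precision@2 = 0.5, recall@2 = 0.5, reciprocal rank = 0.5 (first hit at rank 2)
```

These three answer different questions, so track them together: precision catches noisy results, recall catches missing ones, and MRR reflects how quickly users reach their first good hit.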

If you don’t have human-labeled data, you can use synthetic query generation. Use an LLM to look at your documents and generate 5-10 questions that each document could answer. Then, run those questions through your search engine and see if it retrieves the correct source document.

Conclusion

Learning how to implement semantic search is one of the most valuable skills in the modern AI landscape. By moving beyond simple keyword matching, you create systems that truly understand user intent and provide more accurate, human-like responses.

Whether you are building a small internal knowledge base or a massive RAG system, the tools are now available to make it happen. Start with a simple semantic-search-implementation using Python and FAISS, then scale up to production-grade PostgreSQL systems as your needs grow.

At eOptimize, we believe that data-driven research and structured content are the foundations of the next generation of search. By following the best practices of chunking, vectorization, and hybrid retrieval, you can build a search experience that doesn’t just find words—it finds answers.
