Exploring vector embeddings with ChromaDB

Norbert Aberor
29th June 2026

Introduction

Ever wondered how ChatGPT remembers context from your conversation, or how Spotify finds songs similar to your favorites? The secret lies in vector embeddings, a technology that's revolutionizing how applications understand and work with data.

What Are Vector Embeddings?

Vector embeddings transform any data (text, images, audio) into numerical arrays that capture semantic meaning. Similar concepts cluster together in this "embedding space" (more on this below), enabling machines to understand relationships and context just like humans do.

For example, just as (40.7589, -73.9851) represents Times Square in two-dimensional space, embeddings represent concepts in high-dimensional space. Words like "dog" and "puppy" end up close together, while "dog" and "mathematics" are far apart.

What does high-dimensional mean? If all you need to describe is position and altitude, then you only need three dimensions: two horizontally and one for height. Call them X, Y, and Z. But what if you want to describe five characteristics: two for horizontal position, one for altitude, one for “greenness”, and one for “ability to thrive in a cold climate”? Each characteristic needs its own additional dimension, essentially perpendicular to the others. You can’t visualize this beyond three dimensions, but it’s perfectly valid mathematically. To represent rich, nuanced knowledge, embeddings typically use hundreds or even thousands of dimensions. This is a high-dimensional vector space.

As another example, the RGB colors (255, 0, 0) and (250, 5, 5) are both "red": similar vectors represent similar concepts. The distance between color values determines visual similarity, just as embedding distance determines semantic similarity.
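The color comparison can be checked with a few lines of arithmetic. This is a minimal sketch using plain Euclidean distance; the third color (blue) is added here only for contrast and is not part of the original example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

red = (255, 0, 0)
near_red = (250, 5, 5)
blue = (0, 0, 255)  # added for contrast

d_red = euclidean_distance(red, near_red)
d_blue = euclidean_distance(red, blue)

print(f"red vs near-red: {d_red:.2f}")   # small distance: visually similar
print(f"red vs blue:     {d_blue:.2f}")  # large distance: visually different
```

The two reds are a few units apart, while red and blue are hundreds of units apart. Embedding search relies on the same idea, just with many more dimensions.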

In biology, similar genetic sequences indicate related organisms. Vector embeddings work the same way.

The Embedding Space

What happens: A sentence like "The cat sat on the mat" becomes something like [0.23, -0.45, 0.67, ..., 0.12] - a list of 384 or 1536 numbers.

These numbers aren't random. They're learned patterns that capture meaning:

- Words with similar meanings cluster together
- Relationships are preserved as vector math operations
- Context and intent are encoded in the numerical patterns

Imagine you have a computer program that turns words into lists of numbers (vectors) so that similar words have similar numbers. It is smarter than just matching similar words: it can also capture relationships between words, as in this example:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")
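The analogy can be reproduced with toy vectors. The sketch below uses hand-made two-dimensional vectors (a "royalty" axis and a "gender" axis) that are invented purely for illustration; real embeddings are learned, have hundreds of dimensions, and only approximately satisfy the relationship.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D vectors: (royalty, gender). Not output from a real model.
toy_vectors = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.1, 1.0),
    "woman": (0.1, -1.0),
    "apple": (-0.5, 0.1),
}

# king - man + woman, computed component by component
target = tuple(
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
)

# Look for the closest word, excluding the three words used in the query
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best_match = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(best_match)
```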

This shows how embeddings capture not just similarity, but relationships and analogies.

Practical applications & problems vector embeddings solve

Before we dive into the examples, here’s what to expect. Each scenario starts with a real problem, shows why traditional methods fall short, and then demonstrates how embeddings change the outcome with just enough code to make it practical. You don’t need an ML background; just remember that similar things live close together in vector space. Keep an eye on the business impact callouts to connect the technical shift to measurable results.

Semantic Search

The Problem: Traditional keyword search fails on context and synonyms. A user searches for "car troubles" but the document says "automotive issues", so keyword search returns zero results despite a close semantic match. Users must guess the exact terminology the content creator used.

The Solution: Semantic search using embeddings

Traditional: "car troubles" → No matches for "automotive issues"

Semantic: "car troubles" → Finds "automotive issues", "vehicle problems", "car repair needs"
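The contrast between the two approaches can be sketched in a few lines. The three-dimensional "embeddings" below are invented by hand for illustration (roughly: vehicle-ness, problem-ness, food-ness), and "pasta recipes" is an extra document added for contrast; a real system would obtain vectors from an embedding model.

```python
import math

def keyword_matches(query, document):
    """Naive keyword search: count the words shared by query and document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical vectors, not produced by a real model
toy_embeddings = {
    "car troubles":      (0.9, 0.8, 0.0),
    "automotive issues": (0.8, 0.9, 0.1),
    "vehicle problems":  (0.9, 0.7, 0.0),
    "pasta recipes":     (0.0, 0.1, 0.9),
}

query = "car troubles"
documents = ["automotive issues", "vehicle problems", "pasta recipes"]

# Keyword search finds nothing because no words overlap
keyword_hits = [d for d in documents if keyword_matches(query, d) > 0]

# Semantic search ranks documents by vector similarity to the query
ranked = sorted(
    documents,
    key=lambda d: cosine_similarity(toy_embeddings[query], toy_embeddings[d]),
    reverse=True,
)

print("keyword hits:", keyword_hits)
print("semantic ranking:", ranked)
```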

Real-world impact example: Shopify improved product search accuracy by 37% using semantic search. Users find relevant products even with imprecise queries.

Retrieval-Augmented Generation (RAG)

The Problem: LLMs can hallucinate and lack domain-specific knowledge

- GPT-5 doesn't know your company's internal policies
- Medical chatbots shouldn't guess about treatments
- Customer service needs accurate, up-to-date information

The Solution: A RAG pipeline. This lets your AI actively search for and retrieve the most relevant documents and information, rather than relying only on what it already "knows."

1. Convert the knowledge base into embeddings
2. The user asks: "What's our vacation policy for remote workers?"
3. Find similar embedded documents
4. The LLM generates an answer using the retrieved context

Code Example

Query: "vacation policy remote workers"
ChromaDB finds: Employee_Handbook_Section_4.2, Remote_Work_Guidelines.pdf
LLM Response: "Based on our employee handbook, remote workers receive..."
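The retrieve-then-generate flow can be sketched as two small functions. This is a minimal sketch, not a full pipeline: `retrieve` assumes a Chroma-style collection whose `query` method returns one list of documents per query text, `build_prompt` and its wording are hypothetical, and the final LLM call is left out because it depends on the provider you use.

```python
def retrieve(collection, question, n_results=2):
    """Return the top matching documents for a question."""
    results = collection.query(query_texts=[question], n_results=n_results)
    return results["documents"][0]  # results hold one inner list per query text

def build_prompt(question, context_docs):
    """Ground the LLM by placing the retrieved passages ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(context_docs, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In use, the prompt returned by `build_prompt(question, retrieve(collection, question))` is what gets sent to the LLM, so the answer is grounded in the retrieved passages rather than the model's memory.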

Business Impact: Companies reduce customer service costs by 60% while improving accuracy and response time.

Success Story: An internal support assistant at a mid size SaaS company used RAG over policies, release notes, and architecture docs, reducing average handle time for complex tickets and increasing answer accuracy by grounding responses in vetted sources.

Reverse image search

The Problem: Keyword search can’t describe visual nuance (style, color, layout). Users want to search by uploading an image and finding near duplicates, visually similar items, or the same product in your image vector database.

The Solution: Generate image embeddings (e.g., CLIP) for your catalog and for the query image, then retrieve nearest neighbors in embedding space. Optionally enable cross modal search so text queries ("red midi dress with floral pattern") also find visually similar images.

Success Story: An e‑commerce marketplace added vision search so shoppers could upload a photo and instantly see similar products across sellers, improving discovery for long-tail inventory and increasing conversion on visually led categories.

Recommendation Systems

Ever wondered how Netflix seems to know you’ll love that obscure documentary, or how Spotify keeps surfacing songs you’ve never heard but instantly enjoy? The secret is in vector embeddings. Instead of just matching genres or categories, modern recommendation systems represent both users and content as vectors in a high-dimensional space.

A user’s vector might capture their viewing or listening history, favorite genres, and even subtle behavior patterns. Each movie, show, or song is also embedded as a vector based on its features: genre, mood, themes, or even visual and narrative style. When it’s time to recommend something, the system finds the content vectors that are closest to the user’s vector, meaning the most similar in taste and style, not just in category.
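One simple way to build a user vector is to average the vectors of the items the user has already enjoyed, then rank unseen items by similarity to that average. The sketch below assumes exactly that; the titles and three-dimensional vectors (action, documentary, comedy) are invented for illustration, and production systems typically learn user vectors rather than averaging.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

# Hypothetical item vectors: (action, documentary, comedy)
item_vectors = {
    "Space Heist":  (0.9, 0.1, 0.2),
    "Ocean Deep":   (0.1, 0.9, 0.0),
    "Forest Lives": (0.2, 0.8, 0.1),
    "Laugh Track":  (0.1, 0.0, 0.9),
    "Arctic Year":  (0.0, 0.95, 0.1),
}

watched = ["Ocean Deep", "Forest Lives"]
user_vector = mean_vector([item_vectors[title] for title in watched])

# Rank everything the user has not seen by closeness to their taste vector
unseen = [title for title in item_vectors if title not in watched]
recommendations = sorted(
    unseen,
    key=lambda title: cosine_similarity(user_vector, item_vectors[title]),
    reverse=True,
)
print(recommendations)
```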

This approach powers features like Spotify’s Discover Weekly, which uses embedding similarity to build personalized playlists. The result? Listeners discover 40% more new music, and recommendations feel uncannily relevant no matter how niche your interests.

Anomaly Detection

Detecting fraud among millions of transactions is a moving target. Every user has their own unique credit card usage patterns, and fraudsters are constantly inventing new tricks that don’t resemble anything seen before. Traditional rule-based systems struggle to keep up, especially when you need to spot suspicious activity in real time.

With vector embeddings, you can capture the subtle patterns of normal user behavior by mapping transactions into a high-dimensional space. Most legitimate activity forms tight clusters; think of them as neighborhoods of “normal.” When something truly unusual happens, like a fraudulent transaction, it stands out as an outlier, far from the usual clusters. By measuring the distance from a new transaction to these clusters, you can flag anomalies quickly, even if the fraud technique is brand new.
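The distance-to-cluster idea can be sketched with a single cluster. The example assumes one centroid for all normal behavior and a threshold of mean plus three standard deviations of the training distances; both are simplifying assumptions, and the two-dimensional transaction vectors are invented for illustration.

```python
import math
import statistics

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean: the center of the 'normal' neighborhood."""
    return tuple(sum(values) / len(vectors) for values in zip(*vectors))

# Hypothetical embeddings of a user's past legitimate transactions
normal_transactions = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.95), (0.95, 1.05)]

center = centroid(normal_transactions)
distances = [euclidean_distance(v, center) for v in normal_transactions]

# Anything much farther from the center than past behavior is flagged
threshold = statistics.mean(distances) + 3 * statistics.stdev(distances)

def is_anomaly(vector):
    return euclidean_distance(vector, center) > threshold

print(is_anomaly((1.05, 1.0)))  # close to the usual cluster
print(is_anomaly((5.0, 4.0)))   # far from anything seen before
```

Real deployments usually track several clusters per user and tune the threshold against labeled fraud data, but the flagging step stays the same: compare a distance with a limit.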

Case Study: PayPal reduces false positives by 50% while catching 95% of actual fraud using transaction embedding anomaly detection.

Why ChromaDB?

ChromaDB stands out as the "SQLite of vector databases," offering a remarkably simple and accessible approach to working with vector embeddings. With a single command (pip install chromadb), you can get started with no servers, configuration, or DevOps required. Its embedded-by-default design means you can run ChromaDB locally without complex infrastructure, making it ideal for both prototyping and production. The same API scales seamlessly from small projects to distributed deployments, and as an open-source solution, ChromaDB delivers enterprise-grade performance while remaining easy to use and highly flexible.

ChromaDB is multi-modal by design, supporting text embeddings with built-in sentence transformers, image embeddings via CLIP model integration, custom embeddings with your own models, and the ability to store mixed data types (text, images, and metadata) together in a single collection.

Installation & Setup

# Install ChromaDB (optional embedding backends such as sentence-transformers or open-clip-torch are installed separately)

pip install chromadb

This command installs the library. If you're in a virtual environment, activate it first to keep dependencies isolated.

Environment Configuration

import chromadb
from chromadb.config import Settings

# In-memory client (the default): handy for local experiments, data is lost on exit
client = chromadb.Client()

# You can also initialise the client in persistent mode or client-server mode

# Persistent storage
# client = chromadb.PersistentClient(path="./chroma_db")

# Client-server mode
# client = chromadb.HttpClient(host="localhost", port=8000)

# Production settings
# Note: the legacy chroma_db_impl="duckdb+parquet" and persist_directory settings
# were dropped in Chroma 0.4; PersistentClient takes the storage path directly.
# client = chromadb.PersistentClient(
#     path="./chroma_data",
#     settings=Settings(anonymized_telemetry=False)
# )

Creating Collections

To create a collection, call the create_collection method on the Chroma client instance.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with default embedding function
collection = client.create_collection(
    name="knowledge_base",
    metadata={"description": "Company knowledge base"}
)

# Or specify custom embedding function
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="documents",
    embedding_function=sentence_transformer_ef
)

In this example, we initialize a persistent client so data survives restarts, create collections (similar to database tables) to group related items, optionally pass an embedding_function so ChromaDB auto generates embeddings when you add text, and can include collection metadata to describe the collection's purpose or ownership.

Creating a collection with a name that already exists in your ChromaDB instance raises an exception. ChromaDB enforces unique collection names to prevent accidental data overwrites.

Error: "Collection already exists"

# ❌ Problem
collection = client.create_collection("my_docs")  # First time: OK
collection = client.create_collection("my_docs")  # Second time: Error!

# ✅ Better approach
try:
    collection = client.create_collection("my_docs")
except Exception:
    collection = client.get_collection("my_docs")

# Or use get_or_create_collection
collection = client.get_or_create_collection("my_docs")

Adding Documents

To add one or more documents, call the .add() method on the collection created earlier and pass a list of documents and a matching list of document IDs.

# Simple add - embeddings generated automatically
collection.add(
    documents=[
        "ChromaDB is an open-source vector database",
        "Vector databases specialize in similarity search",
        "Embeddings capture semantic meaning in numbers"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Add with metadata for filtering
collection.add(
    documents=[
        "Q: How do I install ChromaDB? A: Use pip install chromadb",
        "Q: What is vector search? A: Finding similar items using embeddings",
    ],
    metadatas=[
        {"type": "faq", "category": "installation", "difficulty": "beginner"},
        {"type": "faq", "category": "concepts", "difficulty": "beginner"}
    ],
    ids=["faq1", "faq2"]
)

Each id must be unique within a collection, so use stable IDs if you plan to update documents later. The metadatas parameter enables fast filtering during queries, such as by type, category, or difficulty. For large imports, adding documents in batches reduces overhead. The add() method has no batch_size parameter of its own, so split the lists yourself and pick a batch size that fits your available memory and CPU resources.
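Batching can be handled by a small helper that slices the input lists and calls add() once per slice. This is a minimal sketch: `add_in_batches` and its default batch size of 500 are hypothetical choices, not part of the ChromaDB API, and the helper only assumes the collection exposes the add(documents=..., ids=..., metadatas=...) call shown above.

```python
def add_in_batches(collection, documents, ids, metadatas=None, batch_size=500):
    """Add documents to a Chroma-style collection in fixed-size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError("metadatas and ids must have the same length")

    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            ids=ids[start:end],
            # Pass None through unchanged when no metadata was supplied
            metadatas=metadatas[start:end] if metadatas is not None else None,
        )
```

Smaller batches lower peak memory use during embedding, while larger batches reduce per-call overhead, so the right value depends on document size and the embedding model.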

Querying for Similarity

# Basic similarity search
results = collection.query(
    query_texts=["How to use vector databases"],
    n_results=5
)

print(f"Found {len(results['documents'][0])} similar documents:")
for doc, distance in zip(results['documents'][0], results['distances'][0]):
    print(f"Distance: {distance:.3f} | Text: {doc[:100]}...")

# Query with metadata filtering (multiple conditions are combined with $and)
results = collection.query(
    query_texts=["installation help"],
    where={"$and": [{"category": "installation"}, {"difficulty": "beginner"}]},
    n_results=3
)

# Multiple queries at once
results = collection.query(
    query_texts=[
        "database setup",
        "similarity search",
        "embedding models"
    ],
    n_results=2
)

When reading results, the n_results parameter determines how many nearest neighbors you retrieve, while the where parameter filters by metadata to keep results relevant. Results are nested lists, with one inner list per query text. Each hit comes with a distance value, where lower means more similar. ChromaDB's default distance is squared L2, so 1 - distance is only a meaningful similarity score when the collection is configured for cosine distance (for example, by creating it with metadata={"hnsw:space": "cosine"} in versions that support that key).
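Because query results are nested (one inner list per query text), a small helper can make them easier to walk through. The sketch below assumes the "ids", "documents", and "distances" keys used in the examples above; `iter_hits` is a hypothetical helper, and the sample dictionary, including its distance values, is made up to mirror the result shape.

```python
def iter_hits(results):
    """Yield (query_index, id, document, distance) for every hit."""
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            yield (
                query_index,
                doc_id,
                results["documents"][query_index][rank],
                results["distances"][query_index][rank],
            )

# Made-up results for two query texts, shaped like collection.query() output
sample_results = {
    "ids": [["faq1", "doc2"], ["doc3"]],
    "documents": [
        ["Q: How do I install ChromaDB?", "Vector databases specialize in similarity search"],
        ["Embeddings capture semantic meaning in numbers"],
    ],
    "distances": [[0.21, 0.58], [0.33]],
}

hits = list(iter_hits(sample_results))
for query_index, doc_id, document, distance in hits:
    print(f"query {query_index} | {doc_id} | distance {distance:.2f} | {document[:40]}")
```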

Updates and Deletes for Dynamic Datasets
# Update existing document
collection.update(
    ids=["doc1"],
    documents=["Updated: ChromaDB is the best vector database"],
    metadatas=[{"updated": "2024-01-15", "version": "2.0"}]
)

# Delete specific documents
collection.delete(ids=["doc1", "doc2"])

# Delete with metadata filter
collection.delete(where={"type": "temporary"})

# Upsert operation (update if exists, insert if not)
collection.upsert(
    ids=["doc_new"],
    documents=["This document will be updated or inserted"],
    metadatas=[{"operation": "upsert"}]
)

When to use which:

- update: The document already exists; replace text and/or metadata for the same ID.
- delete: Remove by IDs or via where filters.
- upsert: Convenient when you don’t know if an ID exists. It updates if found and inserts otherwise.

Advanced Features

As mentioned earlier, vector databases can also store embeddings of images and audio. Let's take a look at an example that combines text and image embeddings.

Multimodal Embeddings (Text + Images)

# Requires: pip install open-clip-torch pillow numpy
import numpy as np
from PIL import Image

# CLIP model for text and image embeddings
clip_ef = embedding_functions.OpenCLIPEmbeddingFunction()

# Create collection that handles both text and images
multimodal_collection = client.create_collection(
    name="products",
    embedding_function=clip_ef
)

# Add text descriptions
multimodal_collection.add(
    documents=["Red sports car with leather interior"],
    ids=["car_desc_1"]
)

# Add an image as a NumPy array (ChromaDB embeds it using CLIP).
# A base64 string passed via documents= would be embedded as plain text instead.
image_array = np.array(Image.open("car_image.jpg"))

multimodal_collection.add(
    images=[image_array],
    ids=["car_image_1"],
    metadatas=[{"type": "image", "format": "jpg"}]
)

# Search with text, find both text and images
results = multimodal_collection.query(
    query_texts=["luxury vehicle"],
    n_results=5
)

This works because the same collection can store both text and images, with CLIP embedding them into a shared vector space. Images are added as NumPy arrays through the images parameter and distinguished from text entries by a type field in their metadata. This enables cross-modal search, so a text query can retrieve relevant images and vice versa.

Custom Embedding Functions

By default, ChromaDB uses its own built-in embedding models for vectorization. However, you can easily plug in your own custom embedding functions, including those from providers like OpenAI, Cohere, or HuggingFace. This allows you to tailor the vector representations to your specific use case, language, or quality requirements.

# OpenAI embeddings (high quality, API cost)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-ada-002"
)

# Cohere embeddings (good multilingual support)
cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="your-api-key",
    model_name="embed-multilingual-v2.0"
)

# Custom model from HuggingFace
multilingual_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# Use exactly one embedding function per collection
collection = client.create_collection(
    name="multilingual_docs",
    embedding_function=multilingual_ef
)

Selecting the appropriate embedding model is crucial for achieving good results in your application. Consider factors such as language coverage, domain specificity, model size, and performance requirements. Experiment with different models and evaluate their effectiveness on your own data to find the best fit for your use case. Here are a few examples:

General Text Applications
# Fast and efficient for most use cases
"all-MiniLM-L6-v2"           # 384 dimensions, 80MB, good quality/speed balance
"all-mpnet-base-v2"          # 768 dimensions, 420MB, highest quality

# Multilingual support
"paraphrase-multilingual-MiniLM-L12-v2"  # 50+ languages
"distiluse-base-multilingual-cased"      # Good performance, compact

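Domain-Specific Models

# Note: several of these are plain BERT-style checkpoints rather than models trained
# for sentence similarity; evaluate retrieval quality on your own data before adopting one.

# Legal documents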
"nlpaueb/legal-bert-base-uncased"

# Medical/Biomedical text
"dmis-lab/biobert-base-cased-v1.1"
"allenai/scibert_scivocab_uncased"

# Code and technical documentation
"microsoft/codebert-base"
"sentence-transformers/multi-qa-mpnet-base-dot-v1"  # Good for Q&A

# Financial documents
"ProsusAI/finbert"

Common Pitfalls & Troubleshooting

Problem: Inconsistent Text Preprocessing

# ❌ Bad: Inconsistent preprocessing leads to poor similarity
documents = [
    "ChromaDB is AMAZING!!!",           # Uppercase, exclamation
    "chromadb works well",              # Lowercase, no punctuation
    "   ChromaDB    performs great   ", # Extra whitespace
    "ChromaDB's performance is good."   # Possessive, punctuation
]

# ✅ Good: Consistent preprocessing pipeline
import re

def preprocess_text(text):
    """Standardize text before embedding"""
    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

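    # Handle possessives (must run before punctuation is stripped,
    # otherwise the apostrophe is already gone and "'s" never matches)
    text = re.sub(r"['’]s\b", '', text)

    # Remove special characters (optional, depends on use case)
    text = re.sub(r'[^\w\s]', '', text)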

    return text

processed_docs = [preprocess_text(doc) for doc in documents]
# Result: More consistent embeddings and better similarity matching

Error: "Embedding dimension mismatch"

When a collection is created with a custom embedding function, use that same embedding function for every later add and query. Mixing embedding models results in errors like the one below.

# ❌ Problem: Mixed embedding dimensions
collection.add(
    ids=["doc1"],
    documents=["Text document"],
    embeddings=openai_embedding_function(["Text document"])  # OpenAI embeddings used here
)

# Error! Cohere embeddings and OpenAI embeddings don't have the same dimensions.
# To use both models, store their embeddings in separate collections.
collection.add(
    ids=["doc2"],
    documents=["Another document"],
    embeddings=cohere_embedding_function(["Another document"])
)

# ✅ Solution: Consistent embedding functions
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction()
collection = client.create_collection(
    name="consistent_embeddings",
    embedding_function=embedding_function  # Ensures consistent dimensions
)
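When you pass precomputed embeddings yourself, a quick guard can catch a mismatch before the data reaches the collection. This is a minimal sketch; `check_same_dimension` is a hypothetical helper, not a ChromaDB function, and the expected dimension is something you supply (for example 384 for all-MiniLM-L6-v2 or 1536 for text-embedding-ada-002).

```python
def check_same_dimension(embeddings, expected_dim):
    """Raise ValueError if any embedding does not have expected_dim entries."""
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {index} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )

# Passes silently: both vectors have 384 entries
check_same_dimension([[0.0] * 384, [0.1] * 384], expected_dim=384)

# Fails loudly: the second vector came from a 1536-dimensional model
try:
    check_same_dimension([[0.0] * 384, [0.0] * 1536], expected_dim=384)
except ValueError as error:
    print(f"Rejected: {error}")
```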

Conclusion

ChromaDB is an easy-to-use, open-source vector database. It’s simple to install (pip install chromadb), works out of the box, and supports text, images, and custom embeddings. It’s fast, efficient, and doesn’t lock you in.

Use ChromaDB if you want a simple, flexible, and local vector database that works out of the box and can scale from prototyping to larger deployments. Consider other options if you require a fully managed cloud service, advanced large scale features, or specialized hardware for ultra low latency.

Whether you're building semantic search, recommendation systems, or retrieval-augmented generation pipelines, ChromaDB provides a robust foundation to accelerate your projects. Happy embedding!

If you think we can help your business with any of the techniques discussed in this article, please get in touch with us.

Additional Resources

- Official Documentation: ChromaDB Docs
- API Reference
- GitHub Repository
