Vector Search and Embedding Models with Chroma DB and OpenAI

01 Apr 2024

Vector search enables text searches based on semantic relationships between words and on complex concepts. This method can serve as the foundation for building advanced search engines that integrate the general knowledge and text-understanding capabilities of Large Language Models (LLMs).

Have you ever struggled with searching a database because you didn't know which keywords to use? Maybe you could describe what you were looking for, but without using the exact terms, the search results were not as relevant as you expected.
Traditional searches primarily rely on keywords, with search results heavily influenced by keyword matching and keyword frequency within texts.
In this scenario, the relevance and quality of our search results are closely tied to the explicit use of those keywords in the search query.

Vector Search: How Does it Work? 

Vector search relies on algorithms that compute distances between numeric vectors, which represent word and sentence meanings. This allows it to return results that are semantically closest to the input text description.
In a nutshell, vector search helps you find the information you're looking for within large amounts of unstructured text data, even if you don't know the exact terms to use in your query. This is what makes it a highly relevant concept in the Large Language Models (LLMs) domain.

But what exactly are these vectors? They're essentially numerical representations of texts, called "vector embeddings", that capture word relationships and meaning. Vector numbers are generated through machine learning models and represent specific text features and attributes.
The more similar two texts are to each other, the closer their vector numbers will be within their vector space. This means that the distance between vectors for texts related to "cats" and "feline" will be shorter than the distance between vectors for texts related to "fish" and "mountains".
Leveraging this approach, vector search identifies complex relationships between texts, which leads to more accurate and context-aware results.
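
To make the idea of "distance between vectors" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and cosine distance is just one of several possible metrics.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up toy "embeddings", NOT produced by a real model
toy_vectors = {
    "cats": [0.9, 0.8, 0.1],
    "feline": [0.85, 0.75, 0.2],
    "fish": [0.7, 0.1, 0.6],
    "mountains": [0.1, 0.2, 0.9],
}

related = cosine_distance(toy_vectors["cats"], toy_vectors["feline"])
unrelated = cosine_distance(toy_vectors["fish"], toy_vectors["mountains"])

print(f"cats vs feline:    {related:.3f}")
print(f"fish vs mountains: {unrelated:.3f}")
```

With these toy values, the first distance comes out smaller than the second, mirroring the "cats"/"feline" example above.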

Let's now focus on an example that might sound familiar to you: searching for the ideal movie to watch for a relaxing night at home.
Without knowing the exact title, you probably start by browsing predefined categories (e.g., comedy, drama, thriller, etc.), or maybe you rely on the recommendation algorithm from your streaming platform.

Imagine you could now include broader concepts or emotions you wish to feel while watching the movie. Your queries could be something like:

  • "A sci-fi movie with time travel and temporal paradoxes"
  • "A historical drama about fighting for civil rights"
  • "An action movie set during a global pandemic"

Through vector search, the provided descriptions are translated into numerical vectors, which are then compared with the vectors of movie plots in the database.
This results in a selection of films that are more relevant and consistent with the genre you are truly looking for. 
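
Conceptually, the comparison step is a nearest-neighbor search. The sketch below does it by brute force over three made-up plot vectors. The vectors, titles, and the `nearest` helper are hypothetical, and a real vector database would use an index rather than comparing against every entry.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector: list[float], plot_vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k titles whose vectors are closest to the query vector."""
    ranked = sorted(plot_vectors, key=lambda title: squared_l2(query_vector, plot_vectors[title]))
    return ranked[:k]


# Made-up vectors standing in for embedded movie plots
toy_plot_vectors = {
    "Time-travel sci-fi": [0.9, 0.1, 0.0],
    "Civil-rights drama": [0.1, 0.9, 0.1],
    "Pandemic action": [0.2, 0.1, 0.9],
}

# Pretend this is the embedding of "A sci-fi movie with time travel"
toy_query = [0.8, 0.2, 0.1]

top = nearest(toy_query, toy_plot_vectors, k=2)
print(top)
```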

If this example of searching for the perfect movie has intrigued you, good news!
In this article, we will work with the "IMDB Movies Dataset - Top 1000 Movies by IMDB Rating" dataset to set up a vector database with Chroma DB. We will also use OpenAI APIs to generate embeddings from movie titles and descriptions.

The final result will be a small semantic search engine that will allow us to find movies of interest using textual descriptions of plots, without relying on predefined categories.

Technical Requirements

To develop the vector search engine, we will use the following tools:

  • Python 3.10 to write the source code
  • Chroma DB, an open-source vector database, to store numerical vectors and execute search queries
  • OpenAI Embeddings to generate vectors based on movie descriptions

Before moving forward with the development, run the following command in the terminal to install the required libraries.

pip install pandas chromadb openai

Chroma DB and OpenAI Embeddings API Setup

Now, let's create the file chroma_search.py in the project directory. We will import the necessary libraries, include our OpenAI API key, and assign the model "text-embedding-3-small" to a new variable. This model will allow us to generate the required vectors.

Next, we initialize the chromadb client, which will create the path for our vector database in the project directory. Then, we create the "imdb_movies" collection, which will contain all the vectors generated from the movie descriptions provided later.


import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import pandas as pd

# Insert your OpenAI API key
OPENAI_API_KEY = "sk-XXXXXXXXX"

# Define the embedding model to be used for text embedding
EMBEDDING_MODEL = "text-embedding-3-small"

# Initialize an instance of OpenAIEmbeddingFunction with the API key and embedding model
openai_embedding_function = OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name=EMBEDDING_MODEL)

# Initialize a persistent ChromaDB client
client_chromadb = chromadb.PersistentClient(path="chromadb")

# Get or create the "imdb_movies" collection in ChromaDB
collection = client_chromadb.get_or_create_collection("imdb_movies", embedding_function=openai_embedding_function)
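
Hard-coding the API key is fine for a quick experiment, but it is easy to commit it to version control by accident. One common alternative, sketched here as an assumption rather than as part of the original script, is to read the key from an environment variable. The helper name `load_openai_api_key` is hypothetical.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the script")
    return api_key
```

In chroma_search.py, the assignment could then become OPENAI_API_KEY = load_openai_api_key().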

Dataset Import and Pre-Processing

In this phase, we will import our dataset and transform our data to feed the embedding model. To do this, we will create three new functions.

The import_movie_dataset function imports a CSV file and returns a Pandas dataframe.
Since the movies in the original file did not have a unique identifier, we will use the dataframe index as an incremental ID. This step is crucial because Chroma DB requires an ID for each document in its collection; without one, the insert operation fails.
In this stage, we also convert the movie release year to a numeric format. This will allow us to filter the results based on the year, such as showing all movies released after 2010.


def import_movie_dataset(path: str = "dataset/imdb_top_1000.csv") -> pd.DataFrame:
    """
    Import movie dataset from the specified path and return it as a pandas DataFrame.

    Args:
        path (str, optional): The file path to the dataset. Defaults to "dataset/imdb_top_1000.csv".

    Returns:
        pd.DataFrame: A DataFrame containing the imported movie dataset.
    """
    movies = pd.read_csv(path)
    movies["id"] = (movies.index + 1).astype(str)

    # Convert released year to integer; non-numeric or missing values become 0
    movies["Released_Year"] = pd.to_numeric(movies["Released_Year"], errors="coerce").fillna(0).astype(int)

    return movies

The create_text_from_movie function is responsible for assembling the text corpus that will be used to generate vectors using the embedding model.
The content selected at this stage will determine the values within our vectors and influence the relevance of search results. Since potential search queries will primarily be general movie descriptions, we decided to create the text corpus by including the title, genre, and a brief description of the movie.

If you want to convert a large amount of data into vectors, it is recommended to estimate the cost in advance by counting tokens with OpenAI's tiktoken library. In this case, the operation will only cost a few cents.
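
A rough sketch of such an estimate follows. The token counter here is a crude characters-divided-by-four heuristic so that the example runs without extra dependencies. The price of 0.02 USD per million tokens is an assumption that should be checked against OpenAI's current pricing page, and `estimate_embedding_cost` is a hypothetical helper.

```python
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    """Very rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def estimate_embedding_cost(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    usd_per_million_tokens: float,
) -> tuple[int, float]:
    """Return (total_tokens, estimated_cost_in_usd) for embedding all texts."""
    total_tokens = sum(count_tokens(text) for text in texts)
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens


# For an accurate count, swap in tiktoken (requires `pip install tiktoken`):
#   import tiktoken
#   encoding = tiktoken.get_encoding("cl100k_base")
#   count_tokens = lambda text: len(encoding.encode(text))

# 1,000 placeholder texts of 400 characters each
sample_texts = ["a" * 400] * 1000
tokens, cost = estimate_embedding_cost(sample_texts, rough_token_count, usd_per_million_tokens=0.02)
print(f"{tokens} tokens, about ${cost:.4f}")
```

In the article's pipeline, the texts would be the strings produced by create_text_from_movie.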

def create_text_from_movie(movie: dict) -> str:
    """
    Returns a string containing the title, genre, and description of the movie
    """
    return f"""Title: {movie["Series_Title"]}
    Genre: {movie["Genre"]}
    Description: {movie["Overview"]}
    """

The extract_metadata_from_movie function is optional and involves extracting movie metadata for insertion into Chroma DB, in addition to the text corpus of the documents. Adding this metadata can be useful if you want to filter your searches based on additional parameters, such as the release year of the movies you want to include in the results.


def extract_metadata_from_movie(movie: dict, meta_keys: list[str]) -> dict:
    """
    Returns a dictionary with metadata from the movie based on the selected keys.
    """
    metadata = {}
    for key in meta_keys:
        if key in movie and pd.notnull(movie[key]):
            metadata[key] = movie[key]
    return metadata

Writing Data to the Database

The add_movie_vectors_to_db function handles writing data to the vector database using the upsert method from the chromadb package. This method creates a document if its ID is not yet in the collection and updates it otherwise, so running it again with the same data leaves the collection unchanged.
Each run still sends every document to the embedding model, which takes time and incurs API costs. It is therefore recommended to execute it only for the initial insert or for later updates to the database, not every time the search engine is used.
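
One way to avoid redundant runs is to check whether the collection already contains documents before ingesting. The helper below is a hypothetical sketch. It relies only on the collection's count() method and takes the actual ingestion step as a callable.

```python
from typing import Callable


def ingest_if_empty(collection, ingest: Callable[[], None]) -> bool:
    """Run `ingest` only when the collection holds no documents yet.

    Returns True if ingestion ran, False if it was skipped.
    """
    if collection.count() > 0:
        return False
    ingest()
    return True
```

With the objects defined in this article, a call could look like ingest_if_empty(collection, lambda: add_movie_vectors_to_db(movies, collection, ["Released_Year"])). Note that this simple check skips updates too, so it only suits the "load once" scenario.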


def add_movie_vectors_to_db(
    movies: pd.DataFrame,
    collection: chromadb.Collection,
    meta_keys: list[str] | None = None
) -> None:
    """
    Adds movie vectors to the specified ChromaDB collection.

    Args:
    - movies: DataFrame containing movie data.
    - collection: Collection in the database to which movie vectors will be added.
    - meta_keys: List of metadata keys to extract from movies. Defaults to None.
    """

    # Convert DataFrame to dictionary
    movies_dict = movies.to_dict(orient="records")

    # Build the text to embed for each movie (Chroma generates the embeddings on upsert)
    movies_text = [create_text_from_movie(movie) for movie in movies_dict]

    # Prepare list of movie IDs
    movies_ids = [movie["id"] for movie in movies_dict]

    if meta_keys:
        # Select and extract metadata fields for each movie
        movies_meta = [extract_metadata_from_movie(movie, meta_keys) for movie in movies_dict]

        try:
            # Upsert movie vectors and metadata into the database collection
            collection.upsert(
                ids=movies_ids,
                documents=movies_text,
                metadatas=movies_meta
            )
            print("UPLOADED MOVIES TO DB WITH METADATA")

        except Exception as exc:
            # Report the cause instead of silently swallowing every error
            print(f"FAILED TO LOAD MOVIES TO DB: {exc}")
    else:
        try:
            # Upsert movie vectors into the database collection without metadata
            collection.upsert(
                ids=movies_ids,
                documents=movies_text
            )
            print("UPLOADED MOVIES TO DB WITHOUT METADATA")

        except Exception as exc:
            # Report the cause instead of silently swallowing every error
            print(f"FAILED TO LOAD MOVIES TO DB: {exc}")

To write data into the Chroma DB collection, we can execute the following commands from our Python terminal. The third argument of the add_movie_vectors_to_db function is optional and allows us to use the values in the selected columns as document metadata.


>>> from chroma_search import (
...     client_chromadb,
...     collection,
...     import_movie_dataset,
...     create_text_from_movie,
...     extract_metadata_from_movie,
...     add_movie_vectors_to_db,
... )
>>> movies = import_movie_dataset()
>>> add_movie_vectors_to_db(movies, collection, ["Released_Year", "IMDB_Rating", "Director", "Gross"])

Using Search Queries

We are almost ready to run our first queries on the Chroma DB collection to get immediate movie results. To do this, we first declare the query_text_vector_db function.


def query_text_vector_db(
    collection: chromadb.Collection,
    query_text: str,
    dataframe: pd.DataFrame,
    n_results: int = 5,
    where_clause: dict | None = None
) -> pd.DataFrame:
    """
    Perform a vector search on a chromadb collection from a query text and return a DataFrame with results sorted by the nearest vector distances.

    Args:
        collection (chromadb.Collection): A collection object representing the chromadb collection to be searched.
        query_text (str): A query text to search for in the collection.
        dataframe (pd.DataFrame): A DataFrame containing the data associated with the collection.
        n_results (int, optional): Number of results to retrieve for the query text. Defaults to 5.
        where_clause (dict, optional): Additional query constraints on metadata. Defaults to None (no filter).

    Returns:
        pd.DataFrame: A DataFrame containing the search results
    """

    # Query the collection with the provided query text
    results = collection.query(
        query_texts=[query_text],
        n_results=n_results,
        # An empty dict is treated like "no filter"
        where=where_clause or None
    )
    
    # Get result ids and distances for the query text (both ordered by relevance)
    term_result_ids = results["ids"][0]
    distance_by_id = dict(zip(term_result_ids, results["distances"][0]))

    # Filter the main DataFrame to get the suggested movies
    suggested_movies = dataframe[dataframe["id"].isin(term_result_ids)].copy()

    # Attach each distance to its movie by id: the filtered rows keep the
    # DataFrame order, which generally differs from the order of the results
    suggested_movies["vector_distance"] = suggested_movies["id"].map(distance_by_id)

    # Sort search results by vector distance for relevance
    suggested_movies = suggested_movies.sort_values(by=["vector_distance"])

    # Define columns to filter from the DataFrame
    filter_cols = [
        'vector_distance',
        'Series_Title',
        'Overview',
        'Released_Year',
        'Director'
    ]
    
    return suggested_movies[filter_cols]

The search function returns a Pandas dataframe with results sorted by relevance and accepts the following arguments:

  • the chromadb collection with our movies
  • a search query
  • the original dataframe from which the matching movies are extracted
  • the number of desired results
  • optional additional filters based on metadata

Let’s now examine the key points that make up our function.

Search Results Structure

To understand the response structure saved in the intermediate variable results, here is an example of the result for the query "a movie about wizards", with the number of results limited to one for readability.

{
  "ids": [["782"]],
  "distances": [[0.28548556566238403]],
  "metadatas": [[
    {
      "Director": "Mike Newell",
      "Gross": "290,013,036",
      "IMDB_Rating": 7.7,
      "Released_Year": 2005
    }
  ]],
  "embeddings": null,
  "documents": [
    [
      "Title: Harry Potter and the Goblet of Fire\n    Genre: Adventure, Family, Fantasy\n    Description: Harry Potter finds himself competing in a hazardous tournament between rival schools of magic, but he is distracted by recurring nightmares.\n    "
    ]
  ],
  "uris": null,
  "data": null
}

We can observe that, by default, the "embeddings" key does not return the result vectors, as they are very long and not particularly useful when reading results. The "distances" key is more informative: it holds the distance between the vector generated from our text query and the vector of the retrieved movie, in this case "Harry Potter and the Goblet of Fire." Note that a collection created without an explicit distance setting, as in our setup, uses Chroma's default metric, squared L2, rather than cosine distance. Because OpenAI embeddings are normalized to unit length, both metrics rank results in the same order.

In essence, the response provides the top n results sorted by the proximity of their vectors to the query vector entered.

It is interesting to note that, although the movie description text used to feed the search engine does not contain the word "wizards" at all, the embedding model used to generate the vectors still captured the semantic relationships between words, allowing it to identify a highly relevant and meaningful result.
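
Since every value in the response is a list of lists (one inner list per query text), it can help to flatten the first inner list into one record per result. The sketch below uses a hypothetical helper, `flatten_query_results`, on a shortened copy of the response shown above.

```python
def flatten_query_results(results: dict) -> list[dict]:
    """Turn the parallel lists for the first query text into one dict per result."""
    rows = []
    for movie_id, distance, document, metadata in zip(
        results["ids"][0],
        results["distances"][0],
        results["documents"][0],
        results["metadatas"][0],
    ):
        # Metadata can be missing for documents stored without it
        rows.append({"id": movie_id, "distance": distance, "document": document, **(metadata or {})})
    return rows


# Shortened copy of the example response above
sample_results = {
    "ids": [["782"]],
    "distances": [[0.28548556566238403]],
    "metadatas": [[{"Director": "Mike Newell", "IMDB_Rating": 7.7, "Released_Year": 2005}]],
    "documents": [["Title: Harry Potter and the Goblet of Fire"]],
}

rows = flatten_query_results(sample_results)
print(rows[0]["id"], rows[0]["Director"], rows[0]["distance"])
```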

Using Metadata Filters

With the addition of metadata to documents in Chroma DB, we now have the option to use the where_clause parameter in the search function. For example, if we wanted to search only for movies released before 2000, we could pass the following dictionary as a value.


where_clause = {
    "Released_Year": {
        "$lt": 2000
    }
}

Search filters can also be combined using AND and OR operators. For a comprehensive list, refer to the Chroma DB documentation on using filters.
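
As an illustration, the two dictionaries below combine conditions with Chroma's "$and" and "$or" operators. The thresholds and director names are arbitrary example values, not recommendations.

```python
# Movies released in the 1990s with an IMDB rating of at least 8.0
nineties_highly_rated = {
    "$and": [
        {"Released_Year": {"$gte": 1990}},
        {"Released_Year": {"$lt": 2000}},
        {"IMDB_Rating": {"$gte": 8.0}},
    ]
}

# Movies directed by either of two directors (example names)
by_either_director = {
    "$or": [
        {"Director": {"$eq": "Christopher Nolan"}},
        {"Director": {"$eq": "Ridley Scott"}},
    ]
}
```

Either dictionary can be passed as the where_clause argument of query_text_vector_db, provided the corresponding metadata keys were stored during the upsert.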

Presenting the Results

Below is the main function from our file. We can execute it with any query of our choice to query the database using vector search.


def main(query):
    # Importing the movie dataset
    movies = import_movie_dataset()

    # Querying the text vector database to find relevant search results
    search_results = query_text_vector_db(collection=collection, query_text=query, dataframe=movies)

    print(search_results)
    
if __name__ == "__main__":
    main("A cyberpunk movie")

Here are the results obtained from the query "A cyberpunk movie."

[Figure: search results for "A cyberpunk movie."]

Limitations

Our search engine does not set a minimum similarity threshold between vectors for returning results. This means that if there are no relevant items for the search query, results that are mathematically closer will still be shown, even if they are not useful.
To evaluate the accuracy of the results, we can examine the values in the "distances" key and set an acceptance threshold to determine which results are considered relevant.
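
A minimal sketch of such a threshold is shown below. The helper name and the cutoff value of 1.0 are hypothetical. A sensible cutoff depends on the distance metric and the embedding model, so it should be tuned by inspecting real queries.

```python
def filter_by_max_distance(results: dict, max_distance: float) -> list[tuple[str, float]]:
    """Keep only (id, distance) pairs whose distance does not exceed the threshold."""
    pairs = zip(results["ids"][0], results["distances"][0])
    return [(movie_id, distance) for movie_id, distance in pairs if distance <= max_distance]


# Made-up response with three results for a single query text
toy_results = {
    "ids": [["782", "12", "5"]],
    "distances": [[0.28, 0.9, 1.4]],
}

relevant = filter_by_max_distance(toy_results, max_distance=1.0)
print(relevant)
```

If the filtered list is empty, the search engine can tell the user that nothing relevant was found instead of showing unrelated movies.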

For educational purposes, we used a dataset with a small number of records and brief descriptions, containing at most one or two sentences per movie plot. This means that our search engine has a limited knowledge of movie plot details and can be significantly influenced by vague descriptions or titles of movies.
For example, the query "a movie about car racing" ranks "Gran Torino" third, even though there are other films more relevant to the query in question.
The search engine is influenced in this case by the fact that the movie title contains the name of a car, even though the film is not about car racing.

In designing our vector search engine, it is therefore crucial to focus on the careful selection and pre-processing of textual data, ensuring that we have a sufficient amount of data to guarantee result relevance.

Conclusions

In this article, we began with a theoretical explanation of vector searches, describing how the process of retrieving information works through vector embeddings, which are essentially numerical representations of textual data in vector space.

Based on these concepts, we created a semantic search engine using Chroma DB, an open-source vector database. This tool helps facilitate the search for the ideal movie, leveraging the language understanding provided by OpenAI’s embeddings.

Additional notes on what we developed:

  • The techniques described in this article can similarly be applied to create recommendation systems based on the same vector proximity concept. For example, you can use the vector of a movie in our database as a starting point to find other similar movies.
  • Besides Chroma DB, there are many other options to consider for designing your vector database, including both open-source and commercial solutions. The same applies to embedding models other than those from OpenAI, such as those available on Hugging Face, which provides a wide range of open-source models.
  • Using a larger amount of textual data to process into vectors can lead to more precise and contextually relevant search results. However, it is important to consider the limitations on the maximum number of tokens that models can handle and adopt chunking strategies to optimize the relevance of our search results.
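
As a rough illustration of the chunking idea in the last note, the sketch below splits a long text into overlapping windows of words. It is only one simple strategy: the helper name and default sizes are hypothetical, and words are used as an approximation of tokens.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy_text, chunk_size=4, overlap=1)
print(toy_chunks)
```

Each chunk would then be stored as its own document with its own ID (for example, the movie ID plus a chunk index), with the movie ID kept in the metadata so that results can be grouped back by movie.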