Developing a Vector Search Engine with Langchain and Streamlit

12 min read
01 May 2024

In this project, we will leverage the linguistic capabilities of Large Language Models (LLMs) to develop a vector search engine. This engine will allow us to perform detailed searches across large volumes of unstructured text data. 

In a previous article, we introduced the concept and functionality of vector searches with textual data. In a nutshell, this search method relies on algorithms that calculate the distance between numeric vectors, which are used to semantically represent words and phrases, in order to return the most relevant results that match the meaning of the text used as a query.
Unlike traditional searches, which are based on keyword matching, vector search also considers the conceptual relationships between the search query and the texts stored in a vector database, ensuring more contextually relevant and accurate results.
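
As a toy illustration of the underlying idea, the sketch below compares small hand-made vectors using cosine similarity. The numbers are made up for the example; real embeddings have hundreds or thousands of dimensions.

def cosine_similarity(a: list, b: list) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Vectors of semantically related texts point in similar directions...
print(cosine_similarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.25]))  # close to 1.0
# ...while unrelated texts produce a much lower score
print(cosine_similarity([0.9, 0.1, 0.3], [0.1, 0.95, 0.0]))  # roughly 0.2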

Practical Applications of Vector Searches

In the business world, it's common to deal with large volumes of unorganized and poorly categorized textual data. However, within this data may lie valuable information that we want to access quickly and efficiently.

Here are some examples of practical applications of vector search in different contexts:

  • E-commerce: Allow users to perform more descriptive and detailed searches to find the right product to purchase.
  • Content recommendation: Provide users with suggestions for similar content to view, based on the semantic similarity of previously viewed texts.
  • Customer service: Offer customers a system that helps them search for answers to questions and concerns from a large collection of Frequently Asked Questions (FAQs).
  • Internal knowledge base: Provide a team with a search system for internal documentation to meet specific operational needs.

Building on the example presented in the introductory article, we will develop a vector search engine that allows us to find the ideal movie based on the plot description. The dataset we will use is Wikipedia Movie Plots with AI Plot Summaries, which includes over 34,000 movie plots, both in full and summarized versions.

Here is the result we will achieve by the end of the project.

[Screenshot: the finished vector search engine built with Langchain and Streamlit]

Why Use Langchain?

Langchain has become the go-to framework for developing applications based on Large Language Models (LLMs), thanks to the ease with which it integrates useful components and third-party tools. Its approach allows for seamless transitions between external libraries while keeping much of the code structure intact. This flexibility enables us to easily experiment with different models, compare the results, and optimize our solution.

In our project, we will leverage this feature of Langchain to allow users to choose the model they want to use for searches. The two available options are 'all-MiniLM-L6-v2', a sentence-transformer model freely available on Hugging Face, and OpenAI's 'text-embedding-3-small'.

Technical Requirements

To develop the vector search engine and its UI, we will use the following tools:

  • Python 3.10 to write the source code
  • Langchain as the main framework to integrate and orchestrate the components of the search engine
  • Chroma DB, an open-source vector database, to store the numerical vectors and execute search queries
  • OpenAI Embeddings and Sentence Transformers to generate vectors based on movie descriptions
  • Streamlit for creating a user-friendly graphical interface to query the vector database.

Before proceeding with the development, let's save the CSV file containing the movies in the dataset folder and run the following command in the terminal to install the required libraries.

pip install python-dotenv pandas chromadb langchain langchain_openai sentence-transformers streamlit

After the library installation is complete, we can create the app.py file and import the necessary packages. Additionally, you need to create the environment variable OPENAI_API_KEY with your OpenAI API key inside the .env file.
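
For reference, the .env file only needs a single line; the value below is a placeholder for your own key:

OPENAI_API_KEY=your-openai-api-key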


import pandas as pd
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
import streamlit as st
import os
from dotenv import load_dotenv

load_dotenv()

Preparing the Textual Content

Before diving into the technical aspects of data preparation, it’s important to understand a key characteristic of the numerical vectors we will be working with. Each embedding model generates a fixed-dimension vector, regardless of the length of the input text.
For instance, the 'text-embedding-3-small' model produces vectors of 1,536 numbers, representing specific attributes and features of the text. This means that the vector generated from a movie plot consisting of 10 paragraphs will have the same length as the vector for a plot consisting of just one sentence.
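
As a quick sanity check (a sketch that assumes the OpenAI API key is already configured in the environment), you can verify that the vector length does not depend on the input length:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# A one-sentence text and a much longer one produce vectors of the same size
short_vector = embeddings.embed_query("A detective loses his memory.")
long_vector = embeddings.embed_query("A detective loses his memory. " * 100)

print(len(short_vector), len(long_vector))  # 1536 1536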

Although the vector size generated by each model will remain constant, the length of the input text used to generate the vectors still plays a significant role during the content preparation phase. There are two key aspects to consider:

  1. Each embedding model has an optimal input length at which it performs best and most effectively captures the essence of the text.
  2. Every embedding model has a maximum input length (its context window, usually measured in tokens) that it can process and convert into a vector in a single operation.

To address these constraints and limitations, a common practice known as "chunking" is often employed. This involves dividing the text into smaller blocks of a predefined length. This allows embedding models to process each chunk of text within the limits of their context window, maximizing their ability to capture semantic relationships and build meaningful vector representations.
This approach is critical in the development of LLM-based applications, such as chatbots using the Retrieval-Augmented Generation (RAG) mechanism, which focus on a targeted question-and-answer system that retrieves granular information from texts.
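
To make the idea concrete, here is a minimal sketch of chunking with the same splitter we will use later in the project; the "long plot" is a dummy placeholder string:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# A dummy "long plot" of roughly 9,000 characters
long_plot = "The hero faces yet another unexpected twist. " * 200
chunks = splitter.split_text(long_plot)

print(len(chunks))                   # several chunks instead of one long text
print(max(len(c) for c in chunks))   # each chunk stays within the 1,000-character limit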

In the case of our project, however, this practice might introduce an undesirable effect, as movies with longer plots would generate more chunks, which would then be indexed in the database. This could result in a higher likelihood of these movies appearing in search results, potentially overshadowing other equally relevant films with shorter plots.

An alternative strategy would be to use an LLM to summarize the text of interest, in order to produce plots of uniform length, better suited for the embedding models we will be using.

By summarizing text content, we inevitably trade off some of the detailed information in the movie plots. However, this approach allows for a better balance in the weighting of the search engine results, at the cost of reduced accuracy for searches that involve more detail-rich queries.


In the case of our dataset, this process has already been completed, allowing us to use the content in the 'PlotSummary' column, which has been summarized from the 'Plot' column.

The load_csv_to_docs function reads the CSV file into a Pandas dataframe and then passes it to Langchain's DataFrameLoader, which formats its content into documents that can be inserted into the vector database.
The content_col argument of the function allows us to specify which column of the dataset to extract the text from for creating the embeddings, while the remaining columns of the dataframe will be treated as metadata for the documents.


def load_csv_to_docs(file_path:str="./dataset/wiki_movie_plots_deduped_with_summaries.csv", 
                     content_col:str="PlotSummary"
                     ) -> list:
    """
    Load a CSV file into documents using Langchain DataFrame loader.

    Args:
        file_path (str): The file path to the CSV file.
        content_col (str): The name of the column containing the content of each document.

    Returns:
        list: A list of documents loaded from the CSV file.
    """

    df = pd.read_csv(file_path)

    loader = DataFrameLoader(df, page_content_column=content_col)

    documents = loader.load()

    return documents
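
A quick way to inspect what the loader produces (a sketch, assuming the CSV has been placed in the dataset folder as described above):

documents = load_csv_to_docs()

print(len(documents))                    # one document per row (movie) in the CSV
print(documents[0].page_content[:200])   # the summarized plot used for the embeddings
print(documents[0].metadata)             # remaining columns, e.g. Title, Genre, Release Year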

The next function, split_docs_to_chunks, takes the documents generated by the previous function as input and splits them into text chunks that are ready to be processed by the embedding model.


def split_docs_to_chunks(documents:list, chunk_size:int=1000, chunk_overlap:int=0) -> list:
    """
    Split documents into chunks and format each chunk.

    Args:
        documents (list): A list of documents to be split.
        chunk_size (int, optional): The size of each chunk. Defaults to 1000.
        chunk_overlap (int, optional): The overlap between consecutive chunks. Defaults to 0.

    Returns:
        list: A list of formatted chunks.
    """
    # Create a RecursiveCharacterTextSplitter instance
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    # Split documents into chunks using the text splitter
    chunks = text_splitter.split_documents(documents)
    
    # Iterate over each chunk
    for chunk in chunks:
        # Extract metadata from the chunk
        title = chunk.metadata['Title']
        origin = chunk.metadata['Origin/Ethnicity']
        genre = chunk.metadata['Genre']
        release_year = chunk.metadata['Release Year']
        
        # Extract content from the chunk
        content = chunk.page_content
        
        # Format the content with metadata
        final_content = f"TITLE: {title}\nORIGIN: {origin}\nGENRE: {genre}\nYEAR: {release_year}\nBODY: {content}\n"
        
        # Update the page content of the chunk with formatted content
        chunk.page_content = final_content
    
    return chunks

This step serves two main purposes:

  1. Ensuring that each chunk does not exceed a certain length, set to a default of 1,000 characters, to prevent the text from exceeding the model's context window. In our case, we'll be working with content that is already below this threshold, but this function is still useful to include in the program, in case you want to experiment with alternative texts that exceed the maximum length.
  2. Providing the chunks with additional contextual information, such as the movie metadata, which will be included in the text string used to generate the vector embeddings. This practice improves the accuracy of vector retrieval operations by introducing global context into the chunks.
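
For illustration, after this step the page_content of a chunk follows this layout (the actual values come from the corresponding dataset row):

TITLE: <movie title>
ORIGIN: <origin/ethnicity>
GENRE: <genre>
YEAR: <release year>
BODY: <summarized plot text>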

Configuring Chroma DB

The create_or_get_vectorstore function leverages the two previously created functions to either create a new instance of the Chroma DB vector database or load an existing one from the project directory. This function also manages the process of generating vectors from our dataset, based on the embedding model chosen as an argument. The available options for the embedding model are 'OpenAI' and 'SentenceTransformer'.


def create_or_get_vectorstore(file_path: str, content_col: str, selected_embedding: str) -> Chroma:
    """
    Create or get a Chroma vector store based on the selected embedding model.

    Args:
        file_path (str): The file path to the dataset.
        content_col (str): The name of the column containing the content of each document.
        selected_embedding (str): The selected embedding model ('OpenAI' or 'SentenceTransformer').

    Returns:
        Chroma: A Chroma vector store.
    """
    # Determine the embedding function and database path based on the selected embedding model
    if selected_embedding == 'OpenAI':
        embedding_function = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=100, show_progress_bar=True)
        db_path = './chroma_openai'

    elif selected_embedding == 'SentenceTransformer':
        embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
        db_path = './chroma_hf'

    # Check if the database directory exists
    if not os.path.exists(db_path):
        # If the directory does not exist, create the database
        documents = load_csv_to_docs(file_path, content_col)
        chunks = split_docs_to_chunks(documents)

        print("CREATING DB...")
        db = Chroma.from_documents(chunks, embedding_function, persist_directory=db_path)

    else:
        # If the directory exists, load the existing database
        print('LOADING DB...')
        db = Chroma(persist_directory=db_path, embedding_function=embedding_function)

    return db

It's important to note that, depending on the chosen embedding model, two separate database instances will be created in different directories. The vectors generated by the two models have different dimensions, so it is essential to later load each database instance with the same embedding model that was used to create it.
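
As a side note (a sketch that assumes the OpenAI key is configured and the Hugging Face model can be downloaded), you can confirm that the two models are not interchangeable by comparing the size of the vectors they produce:

from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

openai_vector = OpenAIEmbeddings(model="text-embedding-3-small").embed_query("test")
hf_vector = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2").embed_query("test")

print(len(openai_vector))  # 1536
print(len(hf_vector))      # 384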

In the code for OpenAI embedding, we specified the value of the chunk_size parameter as 100. This parameter, despite its name, doesn't apply actual chunking to the content but determines how many documents to process at most in each batch for generating the embeddings. It is advisable to set this parameter to a value lower than the default of 1,000 to avoid OpenAI API usage limit errors. By also setting show_progress_bar to True, we display a progress bar during the embedding process, which should take approximately 10 minutes for our dataset.

For the Sentence Transformer embedding, batch splitting is not required, and the vectorization process should complete within a few seconds.

Now we can proceed by creating two instances of our vector database by running the following commands in the Python terminal.


>>> from app import load_csv_to_docs, split_docs_to_chunks, create_or_get_vectorstore
>>> documents = load_csv_to_docs()
>>> chunks = split_docs_to_chunks(documents)
>>> create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv','PlotSummary','OpenAI')
>>> create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv','PlotSummary','SentenceTransformer')

Handling Search Queries

Our final function, query_vectorstore, is designed to find the top k most relevant results using vector search within the database.


def query_vectorstore(db:Chroma, 
                      query:str, 
                      k:int=20, 
                      filter_dict:dict={}
                      ) -> pd.DataFrame:
    """
    Query a Chroma vector store for similar documents based on a query.

    Args:
        db (Chroma): The Chroma vector store to query.
        query (str): The query string.
        k (int, optional): The number of similar documents to retrieve. Defaults to 20.
        filter_dict (dict, optional): A dictionary specifying additional filters. Defaults to {}.

    Returns:
        pd.DataFrame: A DataFrame containing metadata of the similar documents.
    """
    # Perform similarity search on the vector store
    results = db.similarity_search(query, filter=filter_dict, k=k)

    # Initialize an empty list to store metadata from search results
    results_metadata = []

    # Extract metadata from results
    for doc in results:
        results_metadata.append(doc.metadata)

    # Convert metadata to a DataFrame
    df = pd.DataFrame(results_metadata)
    
    # Drop duplicate rows based on the 'Wiki Page' column
    df.drop_duplicates(subset=['Wiki Page'], keep='first', inplace=True)

    return df

This function requires two mandatory inputs: the instance of the vector database to query and a textual search query. Additionally, it allows for specifying the number of desired results (default is 20) and applying search filters to the metadata via a dictionary. In our interface, we will include a filter box enabling users to narrow down results based on the movie’s release year.

It’s important to note that each chunk created earlier corresponds to a distinct document entity in our database. This means that a search could return multiple documents referring to the same movie. To ensure the proposed films are not duplicates, we’ve included a final step in the function to deduplicate results. In our case, we’ve created only one chunk per movie, but when working with longer texts, it's important to consider that the number of unique results might be lower than the value specified in the k parameter of the function.
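
Here is a quick sketch of how the function can be used from a Python shell, including an optional metadata filter; the query string and year are arbitrary examples, and the database is assumed to have been created as shown earlier.

>>> from app import create_or_get_vectorstore, query_vectorstore
>>> db = create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv', 'PlotSummary', 'OpenAI')
>>> results = query_vectorstore(db, "A thriller movie about memory loss", k=10, filter_dict={"Release Year": {"$gt": 1990}})
>>> results[['Title', 'Release Year', 'Genre']].head()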

Developing the UI with Streamlit

Now that we have all the necessary functions for the search engine to work, we can focus on developing an interface that allows the user to interact with the application easily.

To build this interface, we will use Streamlit, a free and open-source framework that simplifies transforming a Python script into a web application. The way Streamlit works is that it re-executes the entire Python script every time the user interacts with the interface, such as by entering new input in a form.

We will proceed by placing the entire logic of the Streamlit application inside the main function of our file.

The first part of this function handles the toggle for selecting which model to use for vector searches. Depending on the model chosen, the corresponding vector database instance will be initialized.

By using Streamlit's session_state, we can create session variables that persist user input and other data even when the script is re-executed. This prevents certain data from being overwritten with each user interaction, ensuring that key information remains intact throughout the session.
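
As a minimal, standalone illustration of how st.session_state behaves across reruns (not part of our app's code):

if "rerun_count" not in st.session_state:
    st.session_state.rerun_count = 0   # initialized only on the first run of the session

st.session_state.rerun_count += 1      # the value survives every rerun triggered by user interaction
st.write(f"Script reruns in this session: {st.session_state.rerun_count}")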

def main():
    # Apply Streamlit page config
    st.set_page_config(
        page_title="Vector Search Engine | Datasense",
        page_icon="https://143998935.fs1.hubspotusercontent-eu1.net/hubfs/143998935/Datasense_Favicon-2.svg"
    )

    # Read and apply custom CSS style
    with open('./css/style.css') as f:
        css = f.read()
    st.markdown(f"<style>{css}</style>", unsafe_allow_html=True)
    
    # Display logo and title
    st.image("https://143998935.fs1.hubspotusercontent-eu1.net/hubfs/143998935/Datasense%20Logo_White.svg", width=180)
    st.title("Building a Vector Search Engine with Langchain and Streamlit")
    st.markdown("Find your ideal movie among over 30k films with an AI-powered vector search engine.")

    # Toggle for using OpenAI embeddings
    if "openai_on" not in st.session_state:
        st.session_state.openai_on = False

    openai_on = st.toggle('Use OpenAI embeddings')

    # Check if the toggle value has changed
    if openai_on != st.session_state.openai_on:
        # Clear the existing database from the session state
        if "db" in st.session_state:
            del st.session_state.db

    # Determine selected embedding model
    if openai_on:
        selected_embedding = "OpenAI"
        st.session_state.openai_on = True
    else:
        selected_embedding = "SentenceTransformer"
        st.session_state.openai_on = False

    # Create or load the vector store only when it's not already in the session state
    if "db" not in st.session_state:
        file_path = './dataset/wiki_movie_plots_deduped_with_summaries.csv'
        content_col = 'PlotSummary'
        st.session_state.db = create_or_get_vectorstore(file_path, content_col, selected_embedding)
    
    # ...

The next part of the function sets up the text area where the user can input their search query and manages the creation of a dictionary to filter the search, which can optionally be passed to the query_vectorstore function.

This filter allows users to select movies released before, after, or in a specific year. Active filters are displayed in a dedicated section assigned to the filter_box variable.
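
For example, selecting "is before" and the year 2000 produces a filter dictionary in the format expected by Chroma (this is exactly what the code below builds from the form inputs):

filter_dict = {"Release Year": {"$lt": 2000}}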


def main():

    # ...
    
    # Text input for query
    query = st.text_input("Tonight I'd like to watch...", "A thriller movie about memory loss")

    # Display filter options
    filter_box = st.empty()
    with st.expander("Filter by year"):
        with st.form("filters"):
            filter_dict = {}
            st.write("Release year...")
            year_operator = st.selectbox(
                label="Operator",
                options=("is equal to", "is after", "is before")
            )
            year = st.number_input(
                label="Year",
                min_value=1900,
                max_value=2023,
                value=2000
            )
            submitted = st.form_submit_button("Apply filter")
            operator_signs = {
                "is equal to": "$eq",
                "is after": "$gt",
                "is before": "$lt"
            }

            if submitted:
                filter_dict = {
                    "Release Year": {
                        f"{operator_signs[year_operator]}": year
                    }
                }
                # Render the active filter summary as HTML
                filter_box.markdown(
                    f"<p><b>Active filter</b>:</p> <span class='active-filter'>Released year {year_operator} {year}</span>", 
                    unsafe_allow_html=True
                )
                
    # ...

In the final phase, the function performs the search in the vector database using the query_vectorstore function. Here, we provide the vector database corresponding to the chosen model, the text query, and any additional search filters specified by the user.

Since this function returns a dataframe, the search results are displayed by iterating over each row. The relevant metadata columns are then selected and presented in a dedicated box for each movie. This ensures that the user can see the key details of each film that matches their query.


def main():
    # ...

    # Perform search if query exists
    if query:
        # Perform vector store query
        results_df = query_vectorstore(
            db=st.session_state.db,
            query=query,
            filter_dict=filter_dict
        )

        # Display search results
        for index, row in results_df.iterrows():
            # Render each result card as HTML
            st.markdown(
                f"""
                <div class='result-item-box'>
                    <span class='label-genre'>{row['Genre']}</span>
                    <h4>{row['Title']}</h4>
                    <div class='metadata'>
                        <p><b>Year:</b> {row['Release Year']}</p>
                        <p><b>Director:</b> {row['Director']}</p>
                        <p><b>Origin:</b> {row['Origin/Ethnicity']}</p>
                    </div>
                    <a href='{row['Wiki Page']}'>Read more →</a>
                </div>
                """,
                unsafe_allow_html=True
            )

if __name__ == '__main__':
    main()

To launch our search engine, simply run the following command in the terminal:

streamlit run app.py

The application will automatically open in the browser at localhost:8501.

[Screenshot: application running on localhost:8501]

Conclusion

In this project, we explored how vector search can provide an effective and context-aware way to access large amounts of unstructured textual data, particularly when combined with the language understanding capabilities of LLMs and their respective embedding vectors.

For building our search engine, we used Langchain, a framework that simplifies the integration of various components related to LLMs, making it easier to experiment with different models and optimize the application's performance.

During the data preparation phase, we emphasized the importance of the quality of the data provided to the embedding models. Specifically, we demonstrated how chunking and/or summarizing content can be used to optimize textual data before proceeding with vectorization.

The final result of our project is a vector search engine equipped with a Streamlit-based interface, allowing users to search through a large collection of movies using two different embedding models.

If you're interested in the complete codebase of the search engine, you can find it in the project's GitHub repository.
