In a previous article, we introduced the concept and functionality of vector searches with textual data. In a nutshell, this search method relies on algorithms that calculate the distance between numeric vectors, which are used to semantically represent words and phrases, in order to return the most relevant results that match the meaning of the text used as a query.
Unlike traditional searches, which are based on keyword matching, vector search also considers the conceptual relationships between the search query and the texts stored in a vector database, ensuring more contextually relevant and accurate results.
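To make the idea concrete, here is a minimal sketch, using NumPy and purely illustrative three-dimensional vectors, of how cosine similarity can be used to measure how close two embeddings are:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (real models produce hundreds of dimensions)
query_vec = np.array([0.9, 0.1, 0.3])
doc_vec_a = np.array([0.8, 0.2, 0.25])  # conceptually close to the query
doc_vec_b = np.array([0.1, 0.9, 0.7])   # conceptually distant

print(cosine_similarity(query_vec, doc_vec_a))  # higher score, more relevant
print(cosine_similarity(query_vec, doc_vec_b))  # lower score, less relevant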
In the business world, it's common to deal with large volumes of unorganized and poorly categorized textual data. However, within this data may lie valuable information that we want to access quickly and efficiently.
Vector search lends itself to practical applications in many different contexts.
Building on the example presented in the introductory article, we will develop a vector search engine that allows us to find the ideal movie based on the plot description. The dataset we will use is Wikipedia Movie Plots with AI Plot Summaries, which includes over 34,000 movie plots, both in full and summarized versions.
Here is the result we will achieve by the end of the project.
Langchain has become the go-to framework for developing applications based on Large Language Models (LLMs), thanks to the ease with which it integrates useful components and third-party tools. Its approach allows for seamless transitions between external libraries while keeping much of the code structure intact. This flexibility enables us to easily experiment with different models, compare the results, and optimize our solution.
In our project, we will leverage this feature of Langchain to allow users to choose the model they want to use for searches. The two available options are 'all-MiniLM-L6-v2', a sentence transformer available for free on Hugging Face, and OpenAI's 'text-embedding-3-small'.
Before proceeding with the development, let's save the CSV file containing the movies in the dataset folder and run the following command in the terminal to install the required libraries.
pip install python-dotenv pandas chromadb langchain langchain_openai sentence-transformers streamlit
After the library installation is complete, we can create the app.py file and import the necessary packages. Additionally, you need to create the environment variable OPENAI_API_KEY with your OpenAI API key inside the .env file.
import pandas as pd
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
import streamlit as st
import os
from dotenv import load_dotenv
load_dotenv()
Before diving into the technical aspects of data preparation, it’s important to understand a key characteristic of the numerical vectors we will be working with. Each embedding model generates a fixed-dimension vector, regardless of the length of the input text.
For instance, the 'text-embedding-3-small' model produces vectors of 1,536 numbers, representing specific attributes and features of the text. This means that the vector generated from a movie plot consisting of 10 paragraphs will have the same length as the vector for a plot consisting of just one sentence.
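As a quick illustrative check (using the sentence-transformers model installed above, which produces 384-dimensional vectors), we can verify that inputs of very different lengths always yield vectors of the same size:
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedder = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

short_text = "A detective loses his memory."
long_text = "A detective investigates a complex case spanning several decades. " * 50

short_vec = embedder.embed_query(short_text)
long_vec = embedder.embed_query(long_text)

# Both vectors have the same length (384 for this model), regardless of the input size
print(len(short_vec), len(long_vec))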
Although the vector size generated by each model remains constant, the length of the input text used to generate the vectors still plays a significant role during the content preparation phase. There are two key aspects to consider:
• Each embedding model can process only a limited amount of text at a time (its context window); anything beyond that limit is typically truncated and lost.
• The longer the input text, the more information has to be compressed into a vector of fixed size, which dilutes the representation and weakens the semantic signal of individual passages.
To address these constraints and limitations, a common practice known as "chunking" is often employed. This involves dividing the text into smaller blocks of a predefined length. This allows embedding models to process each chunk of text within the limits of their context window, maximizing their ability to capture semantic relationships and build meaningful vector representations.
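As a minimal illustration (with arbitrary parameters, separate from our project code), the RecursiveCharacterTextSplitter we imported earlier can split a long plot into fixed-size, slightly overlapping blocks:
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_plot = "In a city gripped by fear, a retired detective is pulled back into one last case. " * 40

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_plot)

print(len(chunks))     # number of blocks produced
print(chunks[0][:80])  # beginning of the first block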
This approach is critical in the development of LLM-based applications, such as chatbots using the Retrieval-Augmented Generation (RAG) mechanism, which focus on a targeted question-and-answer system that retrieves granular information from texts.
In the case of our project, however, this practice might introduce an undesirable effect, as movies with longer plots would generate more chunks, which would then be indexed in the database. This could result in a higher likelihood of these movies appearing in search results, potentially overshadowing other equally relevant films with shorter plots.
An alternative strategy would be to use an LLM to summarize the text of interest, in order to produce plots of uniform length, better suited for the embedding models we will be using.
In the case of our dataset, this process has already been completed, allowing us to use the content of the 'PlotSummary' column, which has been summarized from the 'Plot' column.
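Purely for illustration, such a summarization step could be sketched with Langchain and an OpenAI chat model roughly as follows; the model name and prompt are assumptions, not the pipeline actually used to produce the dataset's summaries:
from langchain_openai import ChatOpenAI

# Hypothetical summarization helper, not part of the project code
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def summarize_plot(plot: str, max_words: int = 100) -> str:
    """Ask the LLM for a plot summary of roughly uniform length."""
    prompt = f"Summarize the following movie plot in at most {max_words} words:\n\n{plot}"
    return llm.invoke(prompt).content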
The load_csv_to_docs function loads the CSV file into a Pandas dataframe using the Langchain loader, with the purpose of formatting its content into documents that can be inserted into the vector database.
The content_col argument of the function allows us to specify which column of the dataset to extract the text from for creating embeddings, while the remaining columns of the dataframe will be treated as metadata for the documents.
def load_csv_to_docs(file_path: str = "./dataset/wiki_movie_plots_deduped_with_summaries.csv",
                     content_col: str = "PlotSummary"
                     ) -> list:
    """
    Load a CSV file into documents using the Langchain DataFrame loader.

    Args:
        file_path (str): The file path to the CSV file.
        content_col (str): The name of the column containing the content of each document.

    Returns:
        list: A list of documents loaded from the CSV file.
    """
    df = pd.read_csv(file_path)
    loader = DataFrameLoader(df, page_content_column=content_col)
    documents = loader.load()
    return documents
The next function, split_docs_to_chunks, takes the documents generated by the previous function as input and splits them into text chunks that are ready to be processed by the embedding model.
def split_docs_to_chunks(documents: list, chunk_size: int = 1000, chunk_overlap: int = 0) -> list:
    """
    Split documents into chunks and format each chunk.

    Args:
        documents (list): A list of documents to be split.
        chunk_size (int, optional): The size of each chunk. Defaults to 1000.
        chunk_overlap (int, optional): The overlap between consecutive chunks. Defaults to 0.

    Returns:
        list: A list of formatted chunks.
    """
    # Create a RecursiveCharacterTextSplitter instance
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    # Split documents into chunks using the text splitter
    chunks = text_splitter.split_documents(documents)
    # Iterate over each chunk
    for chunk in chunks:
        # Extract metadata from the chunk
        title = chunk.metadata['Title']
        origin = chunk.metadata['Origin/Ethnicity']
        genre = chunk.metadata['Genre']
        release_year = chunk.metadata['Release Year']
        # Extract content from the chunk
        content = chunk.page_content
        # Format the content with metadata
        final_content = f"TITLE: {title}\nORIGIN: {origin}\nGENRE: {genre}\nYEAR: {release_year}\nBODY: {content}\n"
        # Update the page content of the chunk with formatted content
        chunk.page_content = final_content
    return chunks
This step serves two main purposes: it enriches the text that will be embedded with key metadata such as the title, origin, genre, and release year, so that these attributes also contribute to the semantic match, and it makes each retrieved chunk self-describing and easier to read when the results are displayed.
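After this step, the page_content of each chunk follows the template below, where the placeholders stand for the actual column values:
TITLE: <Title>
ORIGIN: <Origin/Ethnicity>
GENRE: <Genre>
YEAR: <Release Year>
BODY: <the chunk of text taken from the PlotSummary column>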
The create_or_get_vectorstore function leverages the two previously created functions to either create a new instance of the Chroma DB vector database or load an existing one from the project directory. This function also manages the process of generating vectors from our dataset, based on the embedding model chosen as an argument. The available options for the embedding model are 'OpenAI' and 'SentenceTransformer'.
def create_or_get_vectorstore(file_path: str, content_col: str, selected_embedding: str) -> Chroma:
    """
    Create or get a Chroma vector store based on the selected embedding model.

    Args:
        file_path (str): The file path to the dataset.
        content_col (str): The name of the column containing the content of each document.
        selected_embedding (str): The selected embedding model ('OpenAI' or 'SentenceTransformer').

    Returns:
        Chroma: A Chroma vector store.
    """
    # Determine the embedding function and database path based on the selected embedding model
    if selected_embedding == 'OpenAI':
        embedding_function = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=100, show_progress_bar=True)
        db_path = './chroma_openai'
    elif selected_embedding == 'SentenceTransformer':
        embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
        db_path = './chroma_hf'
    # Check if the database directory exists
    if not os.path.exists(db_path):
        # If the directory does not exist, create the database
        documents = load_csv_to_docs(file_path, content_col)
        chunks = split_docs_to_chunks(documents)
        print("CREATING DB...")
        db = Chroma.from_documents(chunks, embedding_function, persist_directory=db_path)
    else:
        # If the directory exists, load the existing database
        print('LOADING DB...')
        db = Chroma(persist_directory=db_path, embedding_function=embedding_function)
    return db
It's important to note that, depending on the chosen embedding model, two separate database instances will be created in different directories. The vectors generated by the two models have different lengths and characteristics, so each database instance must later be loaded with the same embedding model that was used to create it.
In the code for the OpenAI embeddings, we set the chunk_size parameter to 100. Despite its name, this parameter doesn't apply actual chunking to the content but determines how many documents are processed at most in each batch when generating the embeddings. It is advisable to set it to a value lower than the default of 1,000 to avoid OpenAI API usage limit errors. By also setting show_progress_bar to True, we display a progress bar during the embedding process, which should take approximately 10 minutes for our dataset.
For the Sentence Transformer embedding, batch splitting is not required, and the vectorization process should complete within a few seconds.
Now we can proceed by creating two instances of our vector database by running the following commands in the Python terminal.
>>> from app import load_csv_to_docs, \
split_docs_to_chunks, \
create_or_get_vectorstore
>>> documents = load_csv_to_docs()
>>> chunks = split_docs_to_chunks(documents)
>>> create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv','PlotSummary','OpenAI')
>>> create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv','PlotSummary','SentenceTransformer')
Our final function, query_vectorstore, is designed to find the top k most relevant results using vector search within the database.
def query_vectorstore(db: Chroma,
                      query: str,
                      k: int = 20,
                      filter_dict: dict = {}
                      ) -> pd.DataFrame:
    """
    Query a Chroma vector store for similar documents based on a query.

    Args:
        db (Chroma): The Chroma vector store to query.
        query (str): The query string.
        k (int, optional): The number of similar documents to retrieve. Defaults to 20.
        filter_dict (dict, optional): A dictionary specifying additional filters. Defaults to {}.

    Returns:
        pd.DataFrame: A DataFrame containing metadata of the similar documents.
    """
    # Perform similarity search on the vector store
    results = db.similarity_search(query, filter=filter_dict, k=k)
    # Initialize an empty list to store metadata from search results
    results_metadata = []
    # Extract metadata from results
    for doc in results:
        results_metadata.append(doc.metadata)
    # Convert metadata to a DataFrame
    df = pd.DataFrame(results_metadata)
    # Drop duplicate rows based on the 'Wiki Page' column
    df.drop_duplicates(subset=['Wiki Page'], keep='first', inplace=True)
    return df
This function requires two mandatory inputs: the instance of the vector database to query and a textual search query. Additionally, it allows for specifying the number of desired results (default is 20) and applying search filters to the metadata via a dictionary. In our interface, we will include a filter box enabling users to narrow down results based on the movie’s release year.
It’s important to note that each chunk created earlier corresponds to a distinct document entity in our database. This means that a search could return multiple documents referring to the same movie. To ensure the proposed films are not duplicates, we’ve included a final step in the function to deduplicate results. In our case, we’ve created only one chunk per movie, but when working with longer texts, it's important to consider that the number of unique results might be lower than the value specified in the k parameter of the function.
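As a quick sanity check, the search can also be tried directly from the Python terminal; the query and filter below are just examples:
>>> from app import create_or_get_vectorstore, query_vectorstore
>>> db = create_or_get_vectorstore('./dataset/wiki_movie_plots_deduped_with_summaries.csv', 'PlotSummary', 'SentenceTransformer')
>>> results = query_vectorstore(db, "A thriller movie about memory loss", k=5, filter_dict={"Release Year": {"$gt": 2000}})
>>> results[['Title', 'Release Year', 'Genre']].head()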
Now that we have all the necessary functions for the search engine to work, we can focus on developing an interface that allows the user to interact with the application easily.
To build this interface, we will use Streamlit, a free and open-source framework that simplifies transforming a Python script into a web application. The way Streamlit works is that it re-executes the entire Python script every time the user interacts with the interface, such as by entering new input in a form.
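To see how this execution model works in practice, here is a minimal, self-contained sketch, unrelated to our application, showing how st.session_state preserves values across these re-executions:
import streamlit as st

# Without session_state, this counter would reset to 0 on every interaction,
# because Streamlit re-runs the whole script each time.
if "clicks" not in st.session_state:
    st.session_state.clicks = 0

if st.button("Click me"):
    st.session_state.clicks += 1

st.write(f"Button clicked {st.session_state.clicks} times")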
We will proceed by placing the entire logic of the Streamlit application inside the main function of our file.
The first part of this function handles the toggle for selecting which model to use for vector searches. Depending on the model chosen, the corresponding vector database instance will be initialized.
def main():
    # Apply Streamlit page config
    st.set_page_config(
        page_title=" Vector Search Engine | Datasense",
        page_icon="https://143998935.fs1.hubspotusercontent-eu1.net/hubfs/143998935/Datasense_Favicon-2.svg"
    )
    # Read and apply custom CSS style
    with open('./css/style.css') as f:
        css = f.read()
    st.markdown(f"<style>{css}</style>", unsafe_allow_html=True)
    # Display logo and title
    st.image("https://143998935.fs1.hubspotusercontent-eu1.net/hubfs/143998935/Datasense%20Logo_White.svg", width=180)
    st.title("Building a Vector Search Engine with Langchain and Streamlit")
    st.markdown("Find your ideal movie among over 30k films with an AI-powered vector search engine.")
    # Toggle for using OpenAI embeddings
    if "openai_on" not in st.session_state:
        st.session_state.openai_on = False
    openai_on = st.toggle('Use OpenAI embeddings')
    # Check if the toggle value has changed
    if openai_on != st.session_state.openai_on:
        # Clear the existing database from the session state
        if "db" in st.session_state:
            del st.session_state.db
    # Determine selected embedding model
    if openai_on:
        selected_embedding = "OpenAI"
        st.session_state.openai_on = True
    else:
        selected_embedding = "SentenceTransformer"
        st.session_state.openai_on = False
    # Create or get the vector store database
    file_path = './dataset/wiki_movie_plots_deduped_with_summaries.csv'
    content_col = 'PlotSummary'
    st.session_state.db = create_or_get_vectorstore(file_path, content_col, selected_embedding)
    # ...
The next part of the function sets up the text input where the user can type their search query and manages the creation of a dictionary to filter the search, which can optionally be passed to the query_vectorstore function.
This filter allows users to select movies released before, after, or in a specific year. Active filters are displayed in a dedicated section assigned to the filter_box variable.
def main():
    # ...
    # Text input for query
    query = st.text_input("Tonight I'd like to watch...", "A thriller movie about memory loss")
    # Display filter options
    filter_box = st.empty()
    with st.expander("Filter by year"):
        with st.form("filters"):
            filter_dict = {}
            st.write("Release year...")
            year_operator = st.selectbox(
                label="Operator",
                options=("is equal to", "is after", "is before")
            )
            year = st.number_input(
                label="Year",
                min_value=1900,
                max_value=2023,
                value=2000
            )
            submitted = st.form_submit_button("Apply filter")
            operator_signs = {
                "is equal to": "$eq",
                "is after": "$gt",
                "is before": "$lt"
            }
            if submitted:
                filter_dict = {
                    "Release Year": {
                        f"{operator_signs[year_operator]}": year
                    }
                }
                # Render the active filter badge as HTML
                filter_box.markdown(
                    f"<p><b>Active filter</b>:</p> <span class='active-filter'>Released year {year_operator} {year}</span>",
                    unsafe_allow_html=True
                )
    # ...
In the final phase, the function performs the search in the vector database using the query_vectorstore function. Here, we provide the vector database corresponding to the chosen model, the text query, and any additional search filters specified by the user.
Since this function returns a dataframe, the search results are displayed by iterating over each row. The relevant metadata columns are then selected and presented in a dedicated box for each movie. This ensures that the user can see the key details of each film that matches their query.
def main():
    # ...
    # Perform search if query exists
    if query:
        # Perform vector store query
        results_df = query_vectorstore(
            db=st.session_state.db,
            query=query,
            filter_dict=filter_dict
        )
        # Display search results
        for index, row in results_df.iterrows():
            # Render each result box as HTML
            st.markdown(
                f"""
                <div class='result-item-box'>
                    <span class='label-genre'>{row['Genre']}</span>
                    <h4>{row['Title']}</h4>
                    <div class='metadata'>
                        <p><b>Year:</b> {row['Release Year']}</p>
                        <p><b>Director:</b> {row['Director']}</p>
                        <p><b>Origin:</b> {row['Origin/Ethnicity']}</p>
                    </div>
                    <a href='{row['Wiki Page']}'>Read more →</a>
                </div>
                """,
                unsafe_allow_html=True
            )

if __name__ == '__main__':
    main()
To launch our search engine, simply run the following command in the terminal:
streamlit run app.py
The application will automatically open in the browser at localhost on port 8501.
In this project, we explored how vector search can provide an effective and context-aware way to access large amounts of unstructured textual data, particularly when combined with the language understanding capabilities of LLMs and their respective embedding vectors.
For building our search engine, we used Langchain, a framework that simplifies the integration of various components related to LLMs, making it easier to experiment with different models and optimize the application's performance.
During the data preparation phase, we emphasized the importance of the quality of the data provided to the embedding models. Specifically, we demonstrated how chunking and/or summarizing content can be used to optimize textual data before proceeding with vectorization.
The final result of our project is a vector search engine equipped with a Streamlit-based interface, allowing users to search through a large collection of movies using two different embedding models.
If you're interested in the complete codebase of the search engine, you can find it in the project's GitHub repository.