Chroma is an embedded vector database: it runs inside your application as a package rather than as a separately installed service. Its main advantage is simplicity. If you need a vector database to implement memory for an LLM application or to run text similarity search, and you do not want to install a standalone vector database, Chroma is a good choice. The Chroma library currently supports two languages, Python and JavaScript. This tutorial uses Python.
1. Install Chromadb
pip install chromadb
Note: At the time of writing, the current version of chromadb does not work well with Python 3.11. It is recommended to use an earlier Python version.
2. Initialize the Chroma client
import chromadb
chroma_client = chromadb.Client()
3. Create a collection
A collection in Chroma is similar to a table in a relational database: it stores vectors together with their documents and other source data (metadata). Create a collection as follows:
collection = chroma_client.create_collection(name="tizi365")
4. Add data
After creating a collection, add data to it. Chroma stores the data and builds a vector index from the embeddings of the text so that it can be queried efficiently later.
4.1. Calculate vectors using the built-in embedding model
collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
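Because collection.add takes parallel lists, the i-th document, metadata entry and id must all describe the same record. A minimal sketch of a helper that builds those lists from a list of records could look like the following. The helper name build_add_kwargs and the record keys "id", "text" and "meta" are hypothetical and are not part of Chroma.

```python
from typing import Any


def build_add_kwargs(records: list[dict[str, Any]]) -> dict[str, list]:
    """Turn records into the parallel lists expected by collection.add.

    Each record is assumed (hypothetically) to look like
    {"id": "id1", "text": "...", "meta": {"source": "..."}}.
    """
    ids = [r["id"] for r in records]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a collection")
    return {
        "documents": [r["text"] for r in records],
        "metadatas": [r["meta"] for r in records],
        "ids": ids,
    }


records = [
    {"id": "id1", "text": "This is a document", "meta": {"source": "my_source"}},
    {"id": "id2", "text": "This is another document", "meta": {"source": "my_source"}},
]
kwargs = build_add_kwargs(records)
```

The resulting dictionary can then be passed on with collection.add(**kwargs).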
4.2. Specify vector values when adding data
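# Alternative to 4.1: supply precomputed vectors yourself.
# All vectors in a collection must have the same dimension and ids must be
# unique, so do not run this against a collection that already holds the 4.1 data.
collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)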
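When you supply vectors yourself, it can help to check them before calling collection.add. The sketch below verifies that there is one embedding per id and that all vectors share the same dimension. The function name check_embeddings is hypothetical and is not part of Chroma.

```python
def check_embeddings(ids: list[str], embeddings: list[list[float]]) -> int:
    """Return the shared vector dimension, or raise ValueError on a mismatch."""
    if not embeddings:
        raise ValueError("no embeddings given")
    if len(ids) != len(embeddings):
        raise ValueError("need exactly one embedding per id")
    # Collect the distinct vector lengths; a valid batch has exactly one.
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")
    return dims.pop()


dim = check_embeddings(["id1", "id2"], [[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]])
```

Running a check like this first turns a malformed batch into a clear error message before any data is written.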
5. Query data
Now you can query for similar text, and Chroma will return the n_results most similar entries. The example below searches for documents similar to the text passed in the query_texts parameter:
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)
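Conceptually, a query embeds each query text and returns the stored items whose vectors are closest to it. The toy function below illustrates this with an exhaustive scan in plain Python. It is only a sketch: a real vector index avoids comparing the query against every stored vector, and squared Euclidean distance is assumed here as the metric. The name brute_force_query is hypothetical. The return value uses one inner list per query, mirroring the nested shape of the results that collection.query returns.

```python
def brute_force_query(
    query_embeddings: list[list[float]],
    ids: list[str],
    embeddings: list[list[float]],
    n_results: int,
) -> dict[str, list]:
    """Exhaustive nearest-neighbour search using squared Euclidean distance."""
    all_ids, all_distances = [], []
    for query in query_embeddings:
        # Score every stored vector, then keep the n_results closest ones.
        scored = sorted(
            (sum((q - v) ** 2 for q, v in zip(query, vec)), item_id)
            for item_id, vec in zip(ids, embeddings)
        )[:n_results]
        all_ids.append([item_id for _, item_id in scored])
        all_distances.append([dist for dist, _ in scored])
    # One inner list per query.
    return {"ids": all_ids, "distances": all_distances}


toy = brute_force_query(
    query_embeddings=[[1.0, 2.0, 4.0]],
    ids=["id1", "id2"],
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    n_results=1,
)
best_id = toy["ids"][0][0]  # closest item for the first (and only) query
```

With real results, the same indexing applies: results["ids"][0] holds the ids of the matches for the first query text.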
By default, Chroma stores data in memory, so the data is lost when the program exits. You can also configure Chroma to persist data to disk, in which case the program loads the data from disk when it starts.
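A minimal sketch of a persistent setup follows, assuming chromadb 0.4 or later, where chromadb.PersistentClient is available (older releases configure persistence differently). The helper name open_persistent_collection is hypothetical.

```python
def open_persistent_collection(path: str, name: str):
    """Open (or create) a collection whose data is stored under `path`.

    Assumes chromadb 0.4 or later, where PersistentClient is available.
    """
    import chromadb  # imported lazily so this module loads without chromadb

    client = chromadb.PersistentClient(path=path)
    # get_or_create_collection avoids an error when the collection already exists,
    # which is the normal case on the second and later program runs.
    return client.get_or_create_collection(name=name)
```

For example, open_persistent_collection("./chroma_data", "tizi365") would reopen the collection created in step 3 on every run; the directory name ./chroma_data is only an illustration.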