Python Chromadb Detailed Development Guide

Installation

pip install chromadb

Persisting Chromadb Data

import chromadb

PersistentClient stores the Chroma database on disk at the path you specify. If data already exists at that path, it is loaded automatically when the client is created.

client = chromadb.PersistentClient(path="/data/tizi365.db")

The path parameter is the location on disk where Chroma stores its database files.

Note: For a Chroma database, creating a client object once is sufficient. Loading and saving multiple clients in the same path may lead to unexpected behavior, including data deletion. Generally, only one Chroma client should be created in the application.
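One way to follow this advice is to route every access through a small accessor that caches one client per path. The sketch below is only an illustration: get_client and its factory parameter are hypothetical names that are not part of Chroma, and the default branch assumes chromadb.PersistentClient as used above.

```python
# Cache of clients keyed by storage path (hypothetical helper, not part of Chroma).
_clients = {}


def get_client(path, factory=None):
    """Return the single shared client for `path`, creating it on first use.

    `factory` is any callable that accepts `path=...`. It exists so the helper
    can be exercised without Chroma installed. When it is omitted, the helper
    falls back to chromadb.PersistentClient, imported lazily.
    """
    if path not in _clients:
        if factory is None:
            import chromadb  # imported only when a real client is needed

            factory = chromadb.PersistentClient
        _clients[path] = factory(path=path)
    return _clients[path]
```

Every module in the application can then call get_client("/data/tizi365.db") and receive the same object instead of constructing its own client.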

Some commonly used functions of the client object:

client.reset()  # Clears and completely resets the database; this is destructive and cannot be undone

Collection Operations

Chromadb uses the collection primitive to manage groups of vector data. A collection can be likened to a table in MySQL.

Creating, Viewing, and Deleting Collections

Chroma uses the collection name in the URL, so it has some naming restrictions:

The name length must be between 3 and 63 characters.
The name must start and end with a lowercase letter or number, and can contain periods, hyphens, and underscores in between.
The name cannot contain two consecutive periods.
The name cannot be a valid IP address.
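The four rules above can be checked before calling create_collection. The following is a minimal sketch based on one reading of those rules, not Chroma's own validation code; is_valid_collection_name is a hypothetical helper, and it assumes that only lowercase letters and digits are allowed besides periods, hyphens, and underscores.

```python
import ipaddress
import re

# Assumed reading of the rules: 3 to 63 characters, lowercase letters and
# digits anywhere, with '.', '-' and '_' allowed only between the first and
# last character.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_collection_name(name):
    """Return True if `name` satisfies the naming rules listed above."""
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:  # two consecutive periods are not allowed
        return False
    try:
        ipaddress.ip_address(name)  # a valid IP address is rejected
    except ValueError:
        return True
    return False
```

For example, "my_collection" passes, while "ab" (too short), "a..b" (consecutive periods), and "192.168.1.1" (an IP address) are rejected.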

To create a collection, you need to specify the collection name and an optional vector calculation function (also known as an embedding function). If an embedding function is provided, it must be provided each time the collection is accessed.

Note: The purpose of the vector calculation function (embedding function) is to compute the text vector.

collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)

The embedding function takes text as input and returns a computed vector.
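To make the shape of an embedding function concrete, here is a toy sketch that turns each text into a fixed-length vector by hashing character trigrams. It is not a semantic model and is not something Chroma ships; HashingEmbeddingFunction is a hypothetical name. The sketch assumes the embedding function is a callable that receives a list of texts through a parameter named input and returns one vector per text. Some Chroma versions may expect a different parameter name or a specific base class, so check this against your installed version.

```python
import math
import zlib


class HashingEmbeddingFunction:
    """Toy embedding function: hashes character trigrams into a fixed-size vector."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        # `input` is a list of strings; one vector is returned per string.
        return [self._embed(text) for text in input]

    def _embed(self, text):
        vec = [0.0] * self.dim
        padded = f"  {text.lower()} "
        # Count each trigram in the bucket selected by its CRC32 checksum.
        for i in range(len(padded) - 2):
            bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % self.dim
            vec[bucket] += 1.0
        # Scale to unit length so vectors are comparable across text lengths.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]
```

Because the same text always yields the same vector, this kind of stand-in is handy for tests; real applications would use a proper text embedding model instead.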

Note: If you are new to embeddings, it is worth reading a tutorial on text embedding models first.

You can reference an existing collection with the .get_collection function, and use .delete_collection to delete a collection. You can also use .get_or_create_collection to reference a collection (if it exists) or create it if it does not exist.

collection = client.get_collection(name="tizi365")
collection = client.get_or_create_collection(name="tizi365")
client.delete_collection(name="tizi365")

Other commonly used collection operations:

collection.peek() # Returns the first 10 items in the collection
collection.count() # Returns the total number of items in the collection
collection.modify(name="new_name") # Renames the collection

Specifying Vector Distance Calculation Method

The create_collection function also accepts an optional metadata parameter. Set the value of hnsw:space in it to choose how distance is calculated in the vector space.

Note: The similarity between two vectors is measured by the distance between them in the vector space. The smaller the distance, the higher the similarity, and vice versa.

collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"}  # l2 is the default calculation method
)

The valid options for hnsw:space are "l2", "ip", or "cosine". The default is "l2".
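To give a feel for what the three options compute, the sketch below implements them in plain Python. The formulas follow the definitions commonly given for these spaces (squared Euclidean distance, one minus the inner product, and one minus cosine similarity). Treat them as an assumption rather than a statement of Chroma's exact implementation; the function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """Inner-product distance: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance: one minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

With these definitions, two vectors pointing in the same direction have a cosine distance close to 0 regardless of their length, whereas their l2 distance grows with the difference in magnitude.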

Adding Data to a Collection

Use the .add method to add data to Chroma.

Add data directly without specifying document vectors:

collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

If Chroma receives a list of documents, it automatically uses the collection's embedding function to calculate their vectors (if no embedding function was provided when the collection was created, the default embedding function is used). Chroma also stores the documents themselves. If a document is too large for the selected embedding function to process, an exception is raised.

Each document must have a unique ID (ids). Adding the same ID twice will result in storing the initial value only. Optionally, you can provide a list of metadata dictionaries (metadatas) for each document, to store additional information that can be used for filtering data during queries.

Alternatively, you can directly provide a list of the document's related vector data, and Chroma will use the vector data you provide without automatically calculating the vectors.

collection.add(
    documents=["doc1", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

If the provided vector data dimensions (length) do not match the collection's dimensions, an exception will occur.
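The constraints described above (unique ids, one entry per id in every list, and a consistent vector length) can be checked before the call to .add, so that problems surface with a clear message. This is a minimal sketch; validate_add_payload and its dim parameter are hypothetical and not part of Chroma.

```python
def validate_add_payload(ids, embeddings=None, documents=None, metadatas=None, dim=None):
    """Raise ValueError if the arguments intended for collection.add are inconsistent."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")

    # Every optional list must line up one-to-one with ids.
    optional = (("embeddings", embeddings), ("documents", documents), ("metadatas", metadatas))
    for name, values in optional:
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} has {len(values)} entries, expected {len(ids)}")

    # All vectors must share one dimension (the collection's, if it is known).
    if embeddings:
        expected = dim if dim is not None else len(embeddings[0])
        for item_id, vector in zip(ids, embeddings):
            if len(vector) != expected:
                raise ValueError(f"vector for {item_id!r} has length {len(vector)}, expected {expected}")
```

Calling validate_add_payload(ids, embeddings=embeddings, documents=documents) just before collection.add(...) keeps the check next to the write it protects.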

You can also store the documents elsewhere and provide Chroma with the vector data and metadata list. You can use ids to associate the vectors with the documents stored elsewhere.

collection.add(
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

Note: The core function of a vector database is semantic similarity search over vector data. To keep the vector database small and efficient, you can store only the vectors and the attributes needed for filtering in it. Other data, such as article content, can be stored in a database like MySQL, as long as the records are associated by ID.
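The idea of keeping the content in a separate database and joining by ID can be sketched as follows. The example uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the articles table and the helper names are hypothetical.

```python
import sqlite3


def create_article_store(conn):
    """Create the table that holds the full article content."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (id TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )


def save_articles(conn, articles):
    """Store a mapping of id -> content; the same ids are used in the vector database."""
    conn.executemany(
        "INSERT OR REPLACE INTO articles (id, content) VALUES (?, ?)",
        list(articles.items()),
    )
    conn.commit()


def fetch_articles(conn, ids):
    """Return the content for `ids`, preserving the order of `ids` (the search ranking)."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, content FROM articles WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    by_id = dict(rows)
    return [by_id[item_id] for item_id in ids if item_id in by_id]
```

After a similarity search returns a ranked list of ids, fetch_articles(conn, ids) loads the matching content in that same order.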

Querying Collection Data

The .query method can be used to query Chroma data sets in multiple ways.

You can query using a set of query_embeddings (vector data).

Tip: In real development scenarios, query_embeddings are usually obtained by first calculating the vector of the user's query through a text embedding model, and then using this vector to query similar content.

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

For each query vector (query_embedding), the query returns the n_results closest matches, in order. An optional where filter dictionary can be provided to filter results based on the metadata associated with each document. Additionally, an optional where_document filter dictionary can be provided to filter results based on the document content.

If the provided query_embeddings are not consistent with the dimensions of the collection, an exception will occur. To ensure consistent vector dimensions, use the same text embedding model to calculate vectors.

You can also query using a set of query texts. Chroma will first calculate the vector for each query text using the collection's embedding function, and then perform the query using the generated text vectors.

collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
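Because a single call can carry several query vectors or texts, the result is grouped per query. The sketch below regroups such a result into one list of hits per query. It assumes the result is a dictionary whose "ids", "documents", "metadatas", and "distances" entries each hold one inner list per query, with fields that were not returned set to None. That shape is an assumption to verify against your Chroma version, and hits_per_query is a hypothetical helper.

```python
def hits_per_query(result):
    """Regroup a query result into a list of hit dictionaries for each query."""
    hits = []
    for query_index, ids in enumerate(result["ids"]):
        rows = []
        for rank, item_id in enumerate(ids):
            row = {"id": item_id}
            # Copy over only the fields that were actually returned.
            for key in ("documents", "metadatas", "distances"):
                column = result.get(key)
                if column is not None:
                    row[key] = column[query_index][rank]
            rows.append(row)
        hits.append(rows)
    return hits
```

With this helper, hits_per_query(result)[0] is the ranked list of matches for the first query text or vector.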

You can also use .get to query data from the collection by id.

collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"}
)

.get also supports where and where_document filters. If no id is provided, it will return all items in the collection that match the where and where_document filters.

Specifying Return Fields

When using get or query, you can use the include parameter to specify which data is returned: embeddings, documents, or metadatas, and, for queries, distances. By default, Chroma returns documents and metadatas, plus distances for queries; ids are always returned. To choose the returned fields, pass a list of field names to the include parameter of the query or get method.

collection.get(
    include=["documents"]
)

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    include=["documents"]
)

Using Where Filters

Chroma supports filtering queries based on metadata and document content. The where filter applies to metadata, and the where_document filter applies to document content. The following sections explain how to write filter expressions.

Filtering by Metadata

To filter metadata, you must provide a where filter dictionary for the query. The dictionary must have the following structure:

{
    "metadata_field": {
        <Operator>: <Value>
    }
}

Filtering metadata supports the following operators:

$eq - equal to (string, int, float)
$ne - not equal to (string, int, float)
$gt - greater than (int, float)
$gte - greater than or equal to (int, float)
$lt - less than (int, float)
$lte - less than or equal to (int, float)

Using the $eq operator is equivalent to passing the value directly in the where filter, so the following two filters are equivalent:

{
    "metadata_field": "search_string"
}

{
    "metadata_field": {
        "$eq": "search_string"
    }
}

Filtering Document Content

To filter document content, you must provide a where_document filter dictionary for the query. The dictionary must have the following structure:

{
    "$contains": "search_string"
}

Using Logical Operators

You can also use the logical operators $and and $or to combine multiple filters.

The $and operator will return results matching all the filters in the list.

{
    "$and": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

The $or operator will return results matching any of the filter conditions in the list.

{
    "$or": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}
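The semantics of these filters can be illustrated with a small evaluator that tests one metadata dictionary against a where filter. This is a sketch of the behavior described above, not Chroma's implementation; matches is a hypothetical function. It makes two assumptions: a missing metadata field never matches, and several top-level keys are combined as if with $and.

```python
import operator

# Comparison operators listed in the section on filtering by metadata.
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        else:
            ok = _field_matches(metadata, key, condition)
        if not ok:
            return False
    return True


def _field_matches(metadata, field, condition):
    if field not in metadata:
        return False  # assumption: a missing field never matches
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # shorthand form: {"field": value}
    value = metadata[field]
    return all(_COMPARATORS[op](value, expected) for op, expected in condition.items())
```

For instance, with the metadata {"chapter": 3, "verse": 16}, the filter {"$and": [{"chapter": {"$lte": 3}}, {"verse": {"$ne": 5}}]} matches, while {"chapter": {"$gt": 3}} does not.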

Updating Data in a Collection

Using .update allows you to update any properties of the data in a collection.

collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

If an id is not found in the collection, an error is logged and the update for that id is ignored. If a document is provided without a corresponding vector, the collection's embedding function is used to calculate the vector.

If the provided vector data is of a different dimension than the collection, an exception will occur.

Chroma also supports the upsert operation, which can update existing data and insert new data if it doesn’t exist.

collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
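The difference between .add, .update, and .upsert, as described in this guide, can be modeled with a plain dictionary. ToyCollection below is a hypothetical in-memory stand-in that stores documents only; it mirrors the behavior described above (add keeps the initial value for an existing id, update ignores unknown ids, upsert does both) and is not Chroma's implementation.

```python
class ToyCollection:
    """In-memory model of the add / update / upsert semantics described above."""

    def __init__(self):
        self.items = {}

    def add(self, ids, documents):
        # An id that already exists keeps its initial value.
        for item_id, document in zip(ids, documents):
            self.items.setdefault(item_id, document)

    def update(self, ids, documents):
        # Ids that are not in the collection are ignored.
        for item_id, document in zip(ids, documents):
            if item_id in self.items:
                self.items[item_id] = document

    def upsert(self, ids, documents):
        # Existing ids are updated; new ids are inserted.
        for item_id, document in zip(ids, documents):
            self.items[item_id] = document
```

In short, choose upsert when you do not know whether an id already exists, and update when a missing id should leave the collection unchanged.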

Deleting Collection Data

Chroma supports using .delete to remove items from a collection by id. The vectors, documents, and metadata associated with each item are deleted as well.

collection.delete(
    ids=["id1", "id2", "id3",...],
    where={"chapter": "20"}
)

.delete also supports a where filter. If no id is provided, it will delete all items in the collection that match the where filter.
