Initializing Persistent Chroma Data

import { ChromaClient } from 'chromadb'

Initializing the Client

const client = new ChromaClient();

Common Client Operations

await client.reset() // Clear the database

Working with Collections

Chroma uses collections to manage sets of vector data. A collection is roughly comparable to a table in MySQL.

Creating, Viewing, and Deleting Collections

Chroma uses the collection name in the URL, so there are some naming restrictions:

  • The name length must be between 3 and 63 characters.
  • The name must start and end with lowercase letters or numbers, and can contain periods, hyphens, and underscores in between.
  • The name cannot contain two consecutive periods.
  • The name cannot be a valid IP address.
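
The four rules above can be checked locally before calling the server. The following is a minimal sketch: isValidCollectionName is a hypothetical helper, not part of the chromadb package, and it assumes that interior characters are limited to lowercase letters, digits, periods, hyphens, and underscores, and that "IP address" means a dotted IPv4 address.

```javascript
// Hypothetical helper: checks a candidate collection name against the four
// naming rules listed above. It is not part of the chromadb package.
function isValidCollectionName(name) {
  if (typeof name !== "string") return false;
  // Rule 1: length between 3 and 63 characters.
  if (name.length < 3 || name.length > 63) return false;
  // Rule 2: starts and ends with a lowercase letter or digit; periods,
  // hyphens, and underscores are allowed in between (assumed character set).
  if (!/^[a-z0-9][a-z0-9._-]*[a-z0-9]$/.test(name)) return false;
  // Rule 3: no two consecutive periods.
  if (name.includes("..")) return false;
  // Rule 4: not a valid IPv4 address (four numeric octets, each 0-255).
  const octets = name.split(".");
  const looksLikeIp =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  return !looksLikeIp;
}

console.log(isValidCollectionName("my_collection")); // true
console.log(isValidCollectionName("My_Collection")); // false
console.log(isValidCollectionName("192.168.0.1"));   // false
```

Validating on the client side gives a clearer error message than waiting for the request to be rejected.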

To create a collection, specify the collection name and, optionally, an embedding function (a function that calculates vectors). If you supply an embedding function, you must supply it again every time you access the collection.

Note: The embedding function is used to calculate text vectors.

import { ChromaClient } from 'chromadb'

Create and reference a collection as shown below:

let collection = await client.createCollection({name: "my_collection", embeddingFunction: emb_fn})

let collection2 = await client.getCollection({name: "my_collection", embeddingFunction: emb_fn})

The embedding function takes text as input and returns the calculated vector data.
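
The examples above use emb_fn without defining it. The sketch below shows one way such an object could look, assuming the client expects an object with an async generate(texts) method that returns one vector per input text. ToyEmbeddingFunction is a hypothetical name, and the character-hashing scheme is for illustration only: it carries no semantic meaning and is no substitute for a real embedding model.

```javascript
// Toy embedding function for illustration only. The generate(texts) shape is
// an assumption about what the client expects from an embedding function.
class ToyEmbeddingFunction {
  constructor(dimensions = 8) {
    this.dimensions = dimensions;
  }

  // Takes an array of texts and resolves to one numeric vector per text.
  async generate(texts) {
    return texts.map((text) => {
      const vector = new Array(this.dimensions).fill(0);
      // Count characters into buckets chosen by their code point.
      for (const ch of text.toLowerCase()) {
        vector[ch.codePointAt(0) % this.dimensions] += 1;
      }
      // Scale to unit length so texts of different lengths are comparable.
      const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0)) || 1;
      return vector.map((x) => x / norm);
    });
  }
}

const emb_fn = new ToyEmbeddingFunction(8);
emb_fn.generate(["hello", "world"]).then((vectors) => {
  console.log(vectors.length, vectors[0].length); // 2 8
});
```

Every vector has the same length (8 here), which matters because all vectors in a collection must share one dimension.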

Note: Beginners can learn about text embedding models from this tutorial.

Existing collections can be referenced using .getCollection by name, and can also be deleted using .deleteCollection.

const collection = await client.getCollection({name: "tizi365"}) // Reference the collection tizi365
await client.deleteCollection({name: "my_collection"}) // Delete the collection

Common Collection Functions

await collection.peek() // Returns the first 10 data records in the collection
await collection.count() // Returns the total number of data records in the collection

Adjusting Vector Distance Calculation Methods

createCollection also accepts an optional metadata parameter. Setting its hnsw:space value customizes how distance is calculated in the vector space.

Note: Similarity between vectors is measured by the distance between them in the vector space. The smaller the distance, the higher the similarity, and vice versa.

let collection = await client.createCollection({name: "collection_name", metadata: { "hnsw:space": "cosine" }})

The valid options for hnsw:space are "l2", "ip", or "cosine". The default is "l2".
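
To make the three options concrete, the sketch below writes out the distance formulas. It assumes the usual hnswlib conventions: "l2" is the squared Euclidean distance, "ip" is one minus the inner product, and "cosine" is one minus the cosine similarity. Treat these exact definitions as an assumption and check them against the Chroma documentation for your version. The function names are hypothetical.

```javascript
// Inner product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// "l2" (assumed): squared Euclidean distance.
function squaredL2(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// "ip" (assumed): one minus the inner product.
function innerProductDistance(a, b) {
  return 1 - dot(a, b);
}

// "cosine" (assumed): one minus the cosine similarity.
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = [1, 0];
const b = [0, 1];
console.log(squaredL2(a, b));            // 2
console.log(innerProductDistance(a, b)); // 1
console.log(cosineDistance(a, b));       // 1
```

Cosine distance ignores vector length and compares direction only, which is why it is a common choice for text embeddings that are not normalized.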

Adding Data to a Collection

Use .add to add data to the Chroma collection.

Add data directly without specifying document vectors:

await collection.add({
    ids: ["id1", "id2", "id3", ...],
    metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents: ["lorem ipsum...", "doc2", "doc3", ...],
})
// Parameters
// ids - required
// embeddings - optional
// metadatas - optional
// documents - optional

If Chroma receives a list of documents, it automatically uses the collection's embedding function to calculate their vectors (if no embedding function was provided when the collection was created, the default embedding function is used). Chroma also stores the documents themselves. If a document is too large for the selected embedding function, an exception is thrown.

Each document must have a unique ID (ids). Adding the same ID twice will result in only the initial value being stored. An optional list of metadata dictionaries (metadatas) can be provided for each document to store additional information for filtering data during queries.
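
Because a repeated ID is silently ignored rather than rejected, it can help to check a batch before sending it. The following is a minimal sketch of such a pre-flight check. checkAddBatch is a hypothetical helper, not part of the chromadb package, and the rule that every optional array must have one entry per id is an assumption that follows from the parallel-array layout of the examples above.

```javascript
// Hypothetical pre-flight check for the argument passed to collection.add.
// Returns a list of human-readable problems; an empty list means no issue found.
function checkAddBatch({ ids, embeddings, metadatas, documents }) {
  const problems = [];
  if (!Array.isArray(ids) || ids.length === 0) {
    problems.push("ids is required and must be a non-empty array");
    return problems;
  }
  // Report every id that appears more than once in the batch.
  const seen = new Set();
  for (const id of ids) {
    if (seen.has(id)) problems.push(`duplicate id: ${id}`);
    seen.add(id);
  }
  // Each optional array, if present, should line up one-to-one with ids.
  const optional = { embeddings, metadatas, documents };
  for (const [field, values] of Object.entries(optional)) {
    if (values !== undefined && values.length !== ids.length) {
      problems.push(`${field} has ${values.length} entries, expected ${ids.length}`);
    }
  }
  return problems;
}

// Reports one duplicate id and one length mismatch.
console.log(checkAddBatch({ ids: ["id1", "id2", "id1"], documents: ["a", "b"] }));
```

This only detects duplicates within one batch. IDs that already exist in the collection would need a separate lookup.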

Alternatively, you can supply the vectors for the documents yourself. Chroma then stores the vectors you provide instead of calculating them.

await collection.add({
    ids: ["id1", "id2", "id3", ...],
    embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents: ["lorem ipsum...", "doc2", "doc3", ...],
})

If the provided vector data dimensions (length) do not match the collection's dimensions, an exception will occur.
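
A dimension mismatch can be caught before the call. The sketch below is a hypothetical helper (findDimensionMismatches is not part of the chromadb package) that assumes you already know the dimension your collection uses.

```javascript
// Hypothetical helper: lists the vectors whose length differs from the
// dimension the collection is expected to use.
function findDimensionMismatches(embeddings, expectedDimension) {
  const mismatches = [];
  embeddings.forEach((vector, index) => {
    if (vector.length !== expectedDimension) {
      mismatches.push({ index, length: vector.length });
    }
  });
  return mismatches;
}

const bad = findDimensionMismatches([[1.1, 2.3, 3.2], [4.5, 6.9]], 3);
console.log(bad.length, bad[0].index, bad[0].length); // 1 1 2
```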

You can also store documents elsewhere and simply provide the vector data and metadata list to Chroma. You can use ids to associate the vectors with the documents stored elsewhere.

await collection.add({
    ids: ["id1", "id2", "id3", ...],
    embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
})

Note: The core function of a vector database is semantic similarity search over vector data. To keep the vector database small and efficient, you can store only the vectors and the attributes you need for filtering in it. Other data, such as article content, can live in a database like MySQL, as long as the records are associated through IDs.
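
The ID-based association described in the note can be sketched as follows. Everything here is illustrative: articleStore is an in-memory Map standing in for an external database such as MySQL, attachArticles is a hypothetical helper, and queryResult is hand-written sample data shaped on the assumption that a query result holds one inner array of ids and distances per query vector.

```javascript
// Stand-in for an external database keyed by the same ids used in Chroma.
const articleStore = new Map([
  ["id1", { title: "First article" }],
  ["id2", { title: "Second article" }],
]);

// Hand-written sample of a query result (assumed shape: one inner array per query).
const queryResult = {
  ids: [["id2", "id1", "id9"]],
  distances: [[0.12, 0.48, 0.9]],
};

// Joins each returned id with its article from the external store.
// Ids with no matching article get article: null.
function attachArticles(result, store) {
  return result.ids.map((idsForQuery, q) =>
    idsForQuery.map((id, rank) => ({
      id,
      distance: result.distances[q][rank],
      article: store.get(id) ?? null,
    }))
  );
}

const rows = attachArticles(queryResult, articleStore);
const titles = rows[0].map((r) => (r.article ? r.article.title : "(missing)"));
console.log(titles.join(", ")); // Second article, First article, (missing)
```

In a real application the Map lookup would be a single batched query against the external database, using the returned ids as the key list.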

Querying Collection Data

The .query method can be used to query a Chroma collection in several ways.

You can query using a set of queryEmbeddings (vector data).

Tip: In real applications, queryEmbeddings are usually obtained by first passing the user's query through a text embedding model. The resulting vector is then used to search for similar content.

const result = await collection.query({
    queryEmbeddings: [[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    nResults: 10,
    where: {"metadata_field": "is_equal_to_this"},
})
// Parameters
// queryEmbeddings - optional
// nResults - required
// where - optional
// queryTexts - optional
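
The embed-then-query flow described in the tip can be sketched as one small function. searchByText is a hypothetical helper. It assumes that `embedder` is any object with an async generate(texts) method, that `collection` is a Chroma collection, and that the result exposes one inner array of ids per query vector.

```javascript
// Hypothetical helper: embeds the user's text, then searches with the vector.
async function searchByText(collection, embedder, text, nResults = 5) {
  // 1. Turn the user's text into a query vector.
  const [queryVector] = await embedder.generate([text]);
  // 2. Search the collection for the nearest stored vectors.
  const result = await collection.query({
    queryEmbeddings: [queryVector],
    nResults,
  });
  // 3. One query vector was sent, so take the first (and only) list of ids.
  return result.ids[0];
}
```

Use the same embedding model for queries as for the stored data. Vectors from different models are not comparable even when their dimensions happen to match.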

The query returns the nResults closest matches for each query vector, in order. An optional where filter can be provided to filter results by the metadata associated with each document. An optional whereDocument filter can also be provided to filter results by document content.

If the dimensions of the supplied queryEmbeddings do not match the collection's dimensions, an exception is thrown. To keep vectors consistent, use the same text embedding model for queries as for the stored data.

You can also query using a set of query texts. Chroma will first calculate the vector for each query text using the collection's embedding function, and then execute the query using the generated text vectors.

await collection.query({
    nResults: 10, // number of results to return per query text
    where: {"metadata_field": "is_equal_to_this"}, // metadata filter
    queryTexts: ["doc10", "thus spake zarathustra", ...], // texts to embed and search with
})

You can also use .get to query data from the collection by id.

await collection.get({
    ids: ["id1", "id2", "id3", ...], //ids
    where: {"style": "style1"} // where
})

.get also supports the where and whereDocument filters. If no ids are provided, it returns all data in the collection that matches the where and whereDocument filters.

Specifying Returned Fields

When using get or query, you can use the include parameter to specify which data fields are returned: embeddings (vector data), documents, metadatas, and, for query, distances. By default, Chroma returns documents and metadata, plus vector distances for query. To change this, pass an array of field names to the include parameter of get or query.

await collection.get({
    include: ["documents"]
})

await collection.query({
    queryEmbeddings: [[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    include: ["documents"]
})

Using Where Filters

Chroma supports filtering queries by metadata and by document content. The where filter applies to metadata, and the whereDocument filter applies to document content. The sections below explain how to write filter expressions.

Filtering by Metadata

To filter metadata, a where filter dictionary must be provided for the query. The dictionary must have the following structure:

{
    "metadata_field": {
        <Operator>: <Value>
    }
}

Filtering metadata supports the following operators:

  • $eq - Equals (string, integer, float)
  • $ne - Not equals (string, integer, float)
  • $gt - Greater than (integer, float)
  • $gte - Greater than or equal to (integer, float)
  • $lt - Less than (integer, float)
  • $lte - Less than or equal to (integer, float)

Using the $eq operator is equivalent to giving the value directly in the where filter, so the following two filters match the same records.

{
    "metadata_field": "search_string"
}


{
    "metadata_field": {
        "$eq": "search_string"
    }
}

Filtering Document Content

To filter document content, a whereDocument filter dictionary must be provided for the query. The dictionary must have the following structure:

{
    "$contains": "search_string"
}

Using Logical Operators

You can also use logical operators $and and $or to combine multiple filters.

The $and operator will return results that match all the filters in the list.

{
    "$and": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

The $or operator will return results that match any filtering condition in the list.

{
    "$or": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}
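
To show how the comparison and logical operators compose, the sketch below evaluates a where filter against a single metadata object locally. matchesWhere is a hypothetical helper that mirrors the filter grammar described above. It is not how Chroma evaluates filters internally, and its treatment of a missing field (never a match) is an assumption.

```javascript
// Comparison operators from the list above, as plain functions.
const comparators = {
  $eq: (a, b) => a === b,
  $ne: (a, b) => a !== b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
};

// Hypothetical local evaluator for a where filter.
function matchesWhere(metadata, where) {
  // $and: every sub-filter must match. $or: at least one must match.
  if (where.$and) return where.$and.every((f) => matchesWhere(metadata, f));
  if (where.$or) return where.$or.some((f) => matchesWhere(metadata, f));
  return Object.entries(where).every(([field, condition]) => {
    // A bare value is shorthand for { $eq: value }.
    const expr =
      typeof condition === "object" && condition !== null
        ? condition
        : { $eq: condition };
    return Object.entries(expr).every(([op, value]) => {
      const compare = comparators[op];
      if (!compare) throw new Error(`Unsupported operator: ${op}`);
      // Assumption: a record without the field never matches.
      return field in metadata && compare(metadata[field], value);
    });
  });
}

const record = { chapter: 3, verse: 16 };
console.log(matchesWhere(record, { $and: [{ chapter: { $eq: 3 } }, { verse: { $gt: 10 } }] })); // true
console.log(matchesWhere(record, { $or: [{ chapter: 29 }, { verse: { $lt: 10 } }] }));          // false
```

Filters nest, so an entry inside $and may itself be an $or filter, and the recursive calls handle that case.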

Updating Data

Chroma supports the upsert operation, which updates records whose ids already exist and inserts records whose ids do not.

await collection.upsert({
    ids: ["id1", "id2", "id3"],
    embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
    documents: ["doc1", "doc2", "doc3"]
})
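
The difference between add and upsert for an id that already exists can be modeled with a plain Map. This is an in-memory illustration of the behavior described in this document (add keeps the initial value, upsert overwrites it), not Chroma's implementation. addRecords and upsertRecords are hypothetical names.

```javascript
// Models add: an id that is already present keeps its initial value.
function addRecords(store, records) {
  for (const { id, document } of records) {
    if (!store.has(id)) store.set(id, document);
  }
}

// Models upsert: existing ids are overwritten, new ids are inserted.
function upsertRecords(store, records) {
  for (const { id, document } of records) {
    store.set(id, document);
  }
}

const store = new Map([["id1", "old text"]]);

addRecords(store, [
  { id: "id1", document: "new text" },
  { id: "id2", document: "doc2" },
]);
console.log(store.get("id1")); // old text

upsertRecords(store, [
  { id: "id1", document: "new text" },
  { id: "id3", document: "doc3" },
]);
console.log(store.get("id1"), store.size); // new text 3
```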

Deleting Data

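Use .delete to delete data from the collection by id. .delete also accepts a where filter. If no ids are supplied, it deletes every record in the collection that matches the filter.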

await collection.delete({
    ids: ["id1", "id2", "id3",...], //ids
    where: {"chapter": "20"} //where
})