Initializing Persistent Chroma Data
import { ChromaClient } from 'chromadb'
Initializing the Client
const client = new ChromaClient();
Common Client Operations
await client.reset() // Clear the database
Working with Collections
Chroma uses collections to manage sets of vector data; a collection is comparable to a table in MySQL.
Creating, Viewing, and Deleting Collections
Chroma uses the collection name in the URL, so there are some naming restrictions:
- The name length must be between 3 and 63 characters.
- The name must start and end with lowercase letters or numbers, and can contain periods, hyphens, and underscores in between.
- The name cannot contain two consecutive periods.
- The name cannot be a valid IP address.
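The rules above can be checked locally before calling the server. The following is a minimal sketch, not Chroma's own validation code: the function name isValidCollectionName is hypothetical, the characters in the middle of the name are assumed to be limited to lowercase letters, digits, periods, hyphens, and underscores, and only IPv4 addresses are treated as IP addresses.

```javascript
// Hypothetical helper: checks a name against the restrictions listed above.
function isValidCollectionName(name) {
  if (typeof name !== "string") return false;
  // Length must be between 3 and 63 characters.
  if (name.length < 3 || name.length > 63) return false;
  // Must start and end with a lowercase letter or digit; periods, hyphens,
  // and underscores are allowed in between (assumed character set).
  if (!/^[a-z0-9][a-z0-9._-]*[a-z0-9]$/.test(name)) return false;
  // No two consecutive periods.
  if (name.includes("..")) return false;
  // Must not be a valid IPv4 address (IPv6 is not considered here).
  const parts = name.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (parts && parts.slice(1).every((part) => Number(part) <= 255)) return false;
  return true;
}
```

For example, isValidCollectionName("my_collection") passes, while "ab" (too short), "a..b" (consecutive periods), and "192.168.0.1" (an IP address) are rejected.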
To create a collection, specify the collection name and, optionally, a vector calculation function (also known as an embedding function). If an embedding function is provided at creation, it must be provided every time the collection is accessed.
Note: The embedding function is used to calculate text vectors.
import { ChromaClient } from 'chromadb'
Create and reference a collection as shown below:
let collection = await client.createCollection({name: "my_collection", embeddingFunction: emb_fn})
let collection2 = await client.getCollection({name: "my_collection", embeddingFunction: emb_fn})
The embedding function takes text as input and returns the computed vector.
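To make the "text in, vector out" contract concrete, here is a deliberately toy sketch. It assumes the JavaScript client accepts an object with an async generate(texts) method that resolves to one vector per input text; the names embedText and ToyEmbeddingFunction are hypothetical, and the character-bucket scheme is not semantically meaningful, so a real text embedding model should be used in practice.

```javascript
// Toy embedding: counts characters into a fixed number of buckets, then
// scales the vector to unit length. For illustration only.
function embedText(text, dimension = 8) {
  const vector = new Array(dimension).fill(0);
  for (const char of text.toLowerCase()) {
    vector[char.codePointAt(0) % dimension] += 1;
  }
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  // An empty string has no characters, so its vector stays all zeros.
  return norm === 0 ? vector : vector.map((x) => x / norm);
}

// Assumed interface: generate(texts) resolves to an array of vectors,
// one per input text, all with the same dimension.
class ToyEmbeddingFunction {
  async generate(texts) {
    return texts.map((text) => embedText(text));
  }
}
```

An instance such as new ToyEmbeddingFunction() could then stand in for emb_fn in the calls above while experimenting.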
Note: Beginners can learn about text embedding models from this tutorial.
Existing collections can be referenced by name using .getCollection, and can also be deleted using .deleteCollection.
const collection = await client.getCollection({name: "tizi365"}) // Reference the collection tizi365
await client.deleteCollection({name: "my_collection"}) // Delete the collection
Common Collection Functions
await collection.peek() // Returns the first 10 data records in the collection
await collection.count() // Returns the total number of data records in the collection
Adjusting Vector Distance Calculation Methods
The createCollection method also accepts an optional metadata parameter, where the value of hnsw:space can be set to customize the distance calculation method for the vector space.
Note: Vector data represents the similarity between vectors by calculating the spatial distance between them, with closer distances indicating higher similarity and vice versa.
let collection = await client.createCollection({name: "collection_name", metadata: { "hnsw:space": "cosine" }})
The valid options for hnsw:space are "l2", "ip", or "cosine". The default is "l2".
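As a rough illustration of what the three options measure, the sketch below computes each distance for two plain arrays. The formulas follow the conventions of the underlying HNSW library as commonly documented (squared L2, one minus inner product, one minus cosine similarity); treat them as an assumption rather than a statement of Chroma's exact implementation. The function names are hypothetical.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// "l2": squared Euclidean distance.
function squaredL2Distance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// "ip": one minus the inner product.
function innerProductDistance(a, b) {
  return 1 - dot(a, b);
}

// "cosine": one minus the cosine similarity.
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
```

In all three cases a smaller value means the vectors are more similar, which matches the note above.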
Adding Data to a Collection
Use .add to add data to a Chroma collection.
Add data directly without specifying document vectors:
await collection.add({
ids: ["id1", "id2", "id3", ...],
metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
documents: ["lorem ipsum...", "doc2", "doc3", ...],
})
// Parameter Explanation
// ids - required
// embeddings - optional
// metadatas - optional
// documents - optional
If Chroma receives a list of documents, it automatically uses the collection's embedding function to calculate vectors for them (if no embedding function was provided when the collection was created, the default embedding function is used). Chroma also stores the documents themselves. If a document is too large for the selected embedding function, an exception is raised.
Each document must have a unique ID (ids). Adding the same ID twice will result in only the initial value being stored. An optional list of metadata dictionaries (metadatas) can be provided for each document to store additional information for filtering data during queries.
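Because a repeated ID is silently ignored rather than rejected, it can help to check a payload before calling .add. The helper below is a hypothetical sketch (checkAddPayload is not part of the Chroma client); it only looks for duplicate IDs and for lists whose length differs from ids.

```javascript
// Hypothetical pre-flight check for an .add payload.
// Returns a list of human-readable problems; an empty list means none found.
function checkAddPayload({ ids, embeddings, metadatas, documents }) {
  if (!Array.isArray(ids) || ids.length === 0) {
    return ["ids is required and must be a non-empty array"];
  }
  const problems = [];
  const seen = new Set();
  for (const id of ids) {
    if (seen.has(id)) problems.push(`duplicate id: ${id}`);
    seen.add(id);
  }
  // Every optional list that is present must line up one-to-one with ids.
  const lists = { embeddings, metadatas, documents };
  for (const [name, list] of Object.entries(lists)) {
    if (list !== undefined && list.length !== ids.length) {
      problems.push(`${name} has ${list.length} entries, expected ${ids.length}`);
    }
  }
  return problems;
}
```

Calling checkAddPayload({ ids: ["id1", "id1"], documents: ["a", "b"] }) reports the duplicate so it can be fixed before anything is written.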
Alternatively, you can directly provide a list of document-related vector data, and Chroma will use the vector data you provide without automatically calculating the vector.
await collection.add({
ids: ["id1", "id2", "id3", ...],
embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
documents: ["lorem ipsum...", "doc2", "doc3", ...],
})
If the provided vector data dimensions (length) do not match the collection's dimensions, an exception will occur.
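A dimension mismatch can be caught before the request is sent. The following sketch assumes you already know the collection's dimension (for example, from the embedding model you use); findDimensionMismatches is a hypothetical helper, not a client method.

```javascript
// Returns the position and length of every vector whose length differs
// from the expected dimension.
function findDimensionMismatches(embeddings, expectedDimension) {
  const mismatches = [];
  embeddings.forEach((vector, index) => {
    if (vector.length !== expectedDimension) {
      mismatches.push({ index, length: vector.length });
    }
  });
  return mismatches;
}
```

For the three-dimensional vectors used in the examples above, findDimensionMismatches(embeddings, 3) returns an empty array.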
You can also store documents elsewhere and simply provide the vector data and metadata list to Chroma. You can use ids to associate the vectors with the documents stored elsewhere.
await collection.add({
ids: ["id1", "id2", "id3", ...],
embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
})
Note: The core function of a vector database is semantic similarity search over vector data. To keep the vector database small and efficient, you can store only the vectors and a few filterable attributes in it. Other data, such as article content, can be stored in a database such as MySQL, as long as the records are associated through IDs.
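The ID association described in the note can be sketched as follows. A Map stands in for the external database, and the query result is assumed to hold ids and distances as arrays of arrays, with one inner array per query vector; articleStore and attachArticles are hypothetical names.

```javascript
// Stand-in for an external store such as a MySQL table keyed by id.
const articleStore = new Map([
  ["id1", { title: "First article" }],
  ["id2", { title: "Second article" }],
]);

// Joins each returned id with the record kept outside the vector database.
function attachArticles(queryResult, store) {
  return queryResult.ids.map((idsForQuery, queryIndex) =>
    idsForQuery.map((id, rank) => ({
      id,
      distance: queryResult.distances[queryIndex][rank],
      // null marks an id that has no matching record in the external store.
      article: store.get(id) ?? null,
    }))
  );
}
```

In a real application the store.get(id) lookup would be replaced by a query against the external database, ideally batched over all returned IDs.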
Querying Collection Data
The .query method can be used to query a Chroma collection in several ways.
You can query using a set of query vectors (queryEmbeddings).
Tip: In real applications, the user's query is usually first converted into a query vector by a text embedding model, and that vector is then used as queryEmbeddings to search for similar content.
const result = await collection.query({
queryEmbeddings: [[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
nResults: 10,
where: {"metadata_field": "is_equal_to_this"},
})
// Parameter Explanation
// queryEmbeddings - optional (provide either queryEmbeddings or queryTexts)
// nResults - optional (defaults to 10)
// where - optional
// queryTexts - optional
The query returns the top nResults results for each query vector, in order. An optional where filter can be provided to filter results based on the metadata associated with each document. Additionally, an optional whereDocument filter can be provided to filter results based on the document content.
If the dimensions of the provided queryEmbeddings do not match the collection's dimensions, an exception is raised. To keep vectors consistent, use the same text embedding model for stored data and for queries.
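Conceptually, a query ranks stored vectors by their distance to the query vector and keeps the closest ones. The brute-force sketch below illustrates that idea on an in-memory array; it is not how Chroma executes queries (the hnsw:space setting suggests an approximate HNSW index is used), and bruteForceQuery is a hypothetical name.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function squaredDistance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Scores every record against the query vector and returns the nResults
// closest ones, nearest first.
function bruteForceQuery(records, queryEmbedding, nResults = 10) {
  if (records.some((r) => r.embedding.length !== queryEmbedding.length)) {
    throw new Error("query dimension does not match stored vectors");
  }
  return records
    .map((r) => ({ id: r.id, distance: squaredDistance(r.embedding, queryEmbedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, nResults);
}
```

Running this once per query vector mirrors the way the real query returns one ranked list per entry in queryEmbeddings.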
You can also query using a set of query texts. Chroma will first calculate the vector for each query text using the collection's embedding function, and then execute the query using the generated text vectors.
await collection.query({
nResults: 10, // n_results
where: {"metadata_field": "is_equal_to_this"}, // where
queryTexts: ["doc10", "thus spake zarathustra", ...], // query_text
})
You can also use .get to retrieve data from the collection by id.
await collection.get({
ids: ["id1", "id2", "id3", ...], //ids
where: {"style": "style1"} // where
})
.get also supports where and whereDocument filters. If no ids are provided, it returns all data in the collection that matches the where and whereDocument filters.
Specifying Returned Fields
When using get or query, you can use the include parameter to specify which data fields are returned, including vector data, documents, and metadata. By default, Chroma returns documents, metadata, and (for query) vector distances. To change this, pass an array of field names to the include parameter of get or query.
await collection.get({
include: ["documents"],
})
await collection.query({
queryEmbeddings: [[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
include: ["documents"],
})
Using Where Filters
Chroma supports filtering queries based on metadata and document content. The where filter is used to filter metadata, and the whereDocument filter is used to filter document content. Below, we explain how to write filter condition expressions.
Filtering by Metadata
To filter metadata, a where filter dictionary must be provided for the query. The dictionary must have the following structure:
{
"metadata_field": {
<Operator>: <Value>
}
}
Filtering metadata supports the following operators:
- $eq - Equals (string, integer, float)
- $ne - Not equals (string, integer, float)
- $gt - Greater than (integer, float)
- $gte - Greater than or equal to (integer, float)
- $lt - Less than (integer, float)
- $lte - Less than or equal to (integer, float)
Using the $eq operator is equivalent to passing the value directly in the where filter, so the following two filters are equivalent:
{
"metadata_field": "search_string"
}
{
"metadata_field": {
"$eq": "search_string"
}
}
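To show how these operators behave, here is a small local evaluator that applies a where filter to a single metadata object. It is a sketch for understanding the semantics, not Chroma's implementation: matchesWhere is a hypothetical name, logical operators are left out, and the way Chroma treats records that lack the filtered field is not modelled.

```javascript
// One comparison function per operator listed above.
const comparators = {
  $eq: (actual, expected) => actual === expected,
  $ne: (actual, expected) => actual !== expected,
  $gt: (actual, expected) => actual > expected,
  $gte: (actual, expected) => actual >= expected,
  $lt: (actual, expected) => actual < expected,
  $lte: (actual, expected) => actual <= expected,
};

// Returns true when the metadata object satisfies every field condition.
function matchesWhere(metadata, where) {
  return Object.entries(where).every(([field, condition]) => {
    const actual = metadata[field];
    // Shorthand form: { field: value } is treated like { field: { $eq: value } }.
    if (condition === null || typeof condition !== "object") {
      return actual === condition;
    }
    return Object.entries(condition).every(([operator, expected]) => {
      const compare = comparators[operator];
      if (!compare) throw new Error(`unsupported operator: ${operator}`);
      return compare(actual, expected);
    });
  });
}
```

With metadata { chapter: "3", page: 10 } (page is a made-up numeric field), both { chapter: "3" } and { chapter: { "$eq": "3" } } match, and { page: { "$gt": 5 } } matches as well.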
Filtering Document Content
To filter document content, a whereDocument filter dictionary must be provided for the query. The dictionary must have the following structure:
{
"$contains": "search_string"
}
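The effect of $contains can be pictured as a substring test over the stored documents. The sketch below assumes a plain, case-sensitive match, which the text above does not confirm; filterDocuments is a hypothetical helper that runs locally and does not call Chroma.

```javascript
// Keeps only the documents that contain the search string.
function filterDocuments(documents, whereDocument) {
  if (!("$contains" in whereDocument)) {
    throw new Error("only the $contains operator is handled in this sketch");
  }
  // Assumption: plain, case-sensitive substring matching.
  return documents.filter((doc) => doc.includes(whereDocument.$contains));
}
```

For example, filtering ["lorem ipsum", "doc2", "doc3"] with { "$contains": "doc" } keeps the last two entries.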
Using Logical Operators
You can also use the logical operators $and and $or to combine multiple filters.
The $and operator returns results that match all the filters in the list.
{
"$and": [
{
"metadata_field": {
<Operator>: <Value>
}
},
{
"metadata_field": {
<Operator>: <Value>
}
}
]
}
The $or operator returns results that match any of the filters in the list.
{
"$or": [
{
"metadata_field": {
<Operator>: <Value>
}
},
{
"metadata_field": {
<Operator>: <Value>
}
}
]
}
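Nested filter objects are easy to mistype, so small builder functions can help. The helpers field, and, and or below are hypothetical conveniences that only produce plain objects in the structure shown above; nesting $or inside $and is assumed to be accepted, and page is a made-up numeric metadata field.

```javascript
// Builds { name: { operator: value } }.
const field = (name, operator, value) => ({ [name]: { [operator]: value } });
// Builds { $and: [...] } and { $or: [...] } from any number of filters.
const and = (...filters) => ({ $and: filters });
const or = (...filters) => ({ $or: filters });

// chapter equals "3" AND (page < 10 OR page >= 16)
const where = and(
  field("chapter", "$eq", "3"),
  or(field("page", "$lt", 10), field("page", "$gte", 16))
);
```

The resulting where object can be passed as the where parameter of query, get, or delete.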
Updating Data
Chroma also supports the upsert operation, which can update existing data and insert new data if the data does not exist.
await collection.upsert({
ids: ["id1", "id2", "id3"],
embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
metadatas: [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
documents: ["doc1", "doc2", "doc3"]
})
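The difference between add and upsert for an existing ID can be modelled with a Map. This is only an illustration of the semantics described in this tutorial (add keeps the initial value, upsert overwrites it); addRecord and upsertRecord are hypothetical names and nothing here talks to Chroma.

```javascript
// add: an ID that already exists keeps its initial value.
function addRecord(store, id, record) {
  if (store.has(id)) return "ignored";
  store.set(id, record);
  return "inserted";
}

// upsert: an existing ID is overwritten, a new ID is inserted.
function upsertRecord(store, id, record) {
  const existed = store.has(id);
  store.set(id, record);
  return existed ? "updated" : "inserted";
}
```

Starting from an empty Map, adding "id1" twice leaves the first record in place, while upserting "id1" afterwards replaces it.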
Deleting Data
Chroma supports using .delete to delete data from the collection by id, optionally narrowed by a where filter.
await collection.delete({
ids: ["id1", "id2", "id3",...], //ids
where: {"chapter": "20"} //where
})