1. Introduction to Embeddings
1.1. What are Embeddings
In machine learning, and especially in natural language processing (NLP), an embedding is a technique for transforming text into numerical vectors. In human language, the meaning of words and phrases is determined by their context and usage. The goal of embeddings is to capture the semantics of these linguistic units so that computers can understand and process them.
The core idea of embeddings is to represent words as points in a high-dimensional space, mapping words with similar meanings to nearby points. This way, words that are semantically close (e.g., "king" and "queen") end up near each other in the space. Embeddings are typically arrays of floating-point numbers, so even text fragments that look very different on the surface (such as "dog" and "canine") can have similar embedding representations when their meanings are similar.
Tip: As an application developer, you can simply think of it this way: if two sentences have similar meanings, their embedding vectors will have high similarity.
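This notion of "embedding similarity" is usually measured with cosine similarity. The sketch below illustrates the idea using hand-made toy 3-dimensional vectors (real models output hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only; real vectors come from a model.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0 (similar meanings)
print(cosine_similarity(king, banana))  # noticeably lower (unrelated meanings)
```

A score near 1.0 means the two vectors point in nearly the same direction, i.e., the texts are semantically close.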
1.2. Applications of Embeddings
Embeddings are widely used in various scenarios, here are some primary use cases:
- Search: Using embedding features to rank search results based on relevance to the query text.
- Clustering: Embeddings can help identify and categorize semantically similar text fragments.
- Recommendation Systems: Recommending items based on similarity can help discover and recommend other items similar to the known ones.
- Anomaly Detection: Embeddings can be used to identify data points significantly different from the main dataset.
- Diversity Measurement: Embeddings can also be used to analyze the similarity distribution between different texts.
- Classification: Comparing text with a set of known label embeddings to classify it into the most similar category.
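As a concrete illustration of the classification use case above, a minimal sketch might compare a text's embedding against a set of pre-computed label embeddings and pick the most similar one. The label names and vectors here are hypothetical toy values, not real model output:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed label embeddings (toy 3-d vectors for illustration).
label_embeddings = {
    "sports": [0.9, 0.1, 0.2],
    "finance": [0.1, 0.9, 0.3],
    "technology": [0.2, 0.3, 0.9],
}

def classify(text_embedding, labels):
    # Assign the text to the label whose embedding is most similar.
    return max(labels, key=lambda name: cosine_similarity(text_embedding, labels[name]))

doc = [0.15, 0.85, 0.25]  # stands in for the embedding of a finance article
print(classify(doc, label_embeddings))  # finance
```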
2. Introduction to OpenAI Embeddings
2.1. Overview of OpenAI Embeddings Models
OpenAI provides third-generation embedding models: text-embedding-3-small and text-embedding-3-large. These models are built on OpenAI's deep learning technology and aim to deliver strong multilingual performance while keeping costs down.
The models differ in their output. text-embedding-3-small produces 1536-dimensional embedding vectors, while text-embedding-3-large produces 3072-dimensional vectors that can capture more complex text features. Both models also accept a dimensions parameter, so the dimensionality of the embeddings can be controlled to meet the specific needs of an application.
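This dimensionality control can be sketched concretely: the third-generation models accept a dimensions request parameter, and a full-length vector can also be shortened client-side by truncating and re-normalizing to unit length. The example below uses a toy 6-dimensional vector, not real model output:

```python
import math

# A request body asking the API for shortened vectors might look like this:
request_body = {
    "input": "Machine learning is a branch of artificial intelligence.",
    "model": "text-embedding-3-large",
    "dimensions": 256,  # ask for 256-d vectors instead of the full 3072
}

# Equivalently, a full-length vector can be shortened client-side by
# truncating and re-normalizing (toy 6-d vector for illustration):
def shorten(vec, dims):
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

full = [0.4, 0.3, 0.5, 0.1, 0.6, 0.2]
short = shorten(full, 3)
print(short)  # 3 components, re-scaled to unit length
```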
2.2. Model Selection and Usage
Choosing the appropriate embedding model depends on the specific requirements of the application. Here's how to make the selection in different application scenarios:
- Performance-focused scenarios: if you need to capture more detailed semantic information, such as in fine-grained recommendation systems or high-precision text classification, text-embedding-3-large is usually recommended. Although it costs more than the smaller model, it provides a richer representation of text features.
- Cost-sensitive applications: for applications that process large amounts of data but do not require especially high precision, such as initial data exploration or rapid prototyping, text-embedding-3-small is the more economical choice. It maintains relatively high performance while significantly reducing costs.
- Multilingual environments: both models offer strong multilingual performance, making them particularly useful in cross-lingual or multilingual scenarios and an ideal choice for global applications.
Ultimately, the right embedding model depends on your specific requirements, the complexity of your data, and the desired balance between performance and cost.
3. How to Use Embeddings
3.1 Using curl to Call the Embeddings API
curl is a commonly used command-line tool for sending HTTP requests. The following example shows how to use curl to obtain the embedding representation of a piece of text:
curl https://api.openai.com/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"input": "Machine learning is a branch of artificial intelligence.",
"model": "text-embedding-3-small"
}'
In the command above, the $OPENAI_API_KEY variable contains the user's OpenAI API key, which should be replaced with a valid key for actual use.
After executing this command, the OpenAI Embeddings API will return a response containing the text embedding representation. Here's an example of an API call result:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [ // Here is the feature vector
-0.006929283495992422,
-0.005336422007530928,
... // Remaining numbers omitted for display
-4.547132266452536e-05,
-0.024047505110502243
]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
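The response is ordinary JSON, so extracting the vector in a script is straightforward. The sketch below parses a shortened copy of the response above (only two vector components kept) with Python's standard json module:

```python
import json

# A shortened copy of the API response shown above, for illustration.
raw = '''{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0,
     "embedding": [-0.006929283495992422, -0.005336422007530928]}
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}'''

response = json.loads(raw)
vector = response["data"][0]["embedding"]           # the feature vector
tokens = response["usage"]["total_tokens"]          # tokens consumed
print(len(vector), tokens)  # 2 5
```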
3.2 Using Python Client to Call Embeddings API
In addition to using curl
to directly call the API from the command line, you can also use a Python client. This requires first installing the official openai
library. Here's an example of how to get text embeddings using Python:
from openai import OpenAI

# The openai library (v1.0+) uses a client object; replace the key with your own.
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

response = client.embeddings.create(
    input="Artificial intelligence is changing the world.",
    model="text-embedding-3-small"
)

embedding_vector = response.data[0].embedding
print(embedding_vector)
Running this Python script returns an embedding vector similar to the one obtained with curl: a list of floating-point numbers representing the input text as a point in the embedding space.
The call result is as follows:
[-0.0032198824, 0.0022555287, ..., 0.0015886585, -0.0021505365]
3.3 Operating on Embedding Vectors
OpenAI only provides the embedding computation models. To build features such as text similarity search on top of embeddings, you need a vector database, such as Qdrant, Chroma, or Milvus.
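Before reaching for a vector database, the core idea of similarity search can be sketched with a brute-force scan over a small in-memory corpus. The texts and vectors below are toy stand-ins; in practice the vectors would come from the Embeddings API:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy corpus of (text, embedding) pairs for illustration only.
corpus = [
    ("Machine learning is a branch of AI.", [0.9, 0.2, 0.1]),
    ("Stock prices fell sharply today.",    [0.1, 0.9, 0.2]),
    ("Neural networks learn from data.",    [0.85, 0.25, 0.15]),
]

def search(query_embedding, corpus, top_k=2):
    # Rank every document by similarity to the query and keep the best top_k.
    ranked = sorted(corpus,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

query = [0.88, 0.22, 0.12]  # stands in for the embedding of an AI-related question
print(search(query, corpus))
```

This linear scan works for small collections; a vector database does the same ranking efficiently at scale using approximate nearest-neighbor indexes.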
Please refer to the following vector database tutorials: