1. Introduction to Embeddings
1.1. What are Embeddings
In machine learning, and especially in natural language processing (NLP), an embedding is a technique for transforming text into numerical vectors. In human language, the meaning of words and phrases is determined by their context and usage. The goal of embeddings is to capture the semantics of these linguistic units so that computers can understand and process them.
The core idea of embeddings is to represent words as points in a high-dimensional space, mapping words with similar meanings to nearby points. This way, words that are semantically close (e.g., "king" and "queen") end up near each other in the space. Embeddings are typically arrays of floating-point numbers, so even text fragments that look very different on the surface (such as "dog" and "canine") can have similar embedding representations when their meanings are similar.
Tip: As an application developer, you can simply think of it this way: if two sentences have similar meanings, their embedding vectors will have high similarity.
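This notion of "embedding similarity" is usually measured with cosine similarity. The sketch below illustrates the idea using hand-made toy 3-dimensional vectors (real models output hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only; real vectors come from a model.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0 (similar meanings)
print(cosine_similarity(king, banana))  # noticeably lower (unrelated meanings)
```

A score near 1.0 means the two vectors point in nearly the same direction, i.e., the texts are semantically close.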
1.2. Applications of Embeddings
Embeddings are widely used in various scenarios, here are some primary use cases:
- Search: Using embedding features to rank search results based on relevance to the query text.
- Clustering: Embeddings can help identify and categorize semantically similar text fragments.
- Recommendation Systems: Recommending items based on similarity can help discover and recommend other items similar to the known ones.
- Anomaly Detection: Embeddings can be used to identify data points significantly different from the main dataset.
- Diversity Measurement: Embeddings can also be used to analyze the similarity distribution between different texts.
- Classification: Comparing text with a set of known label embeddings to classify it into the most similar category.
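As a concrete illustration of the classification use case above, a minimal sketch might compare a text's embedding against a set of pre-computed label embeddings and pick the most similar one. The label names and vectors here are hypothetical toy values, not real model output:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed label embeddings (toy 3-d vectors for illustration).
label_embeddings = {
    "sports": [0.9, 0.1, 0.2],
    "finance": [0.1, 0.9, 0.3],
    "technology": [0.2, 0.3, 0.9],
}

def classify(text_embedding, labels):
    # Assign the text to the label whose embedding is most similar.
    return max(labels, key=lambda name: cosine_similarity(text_embedding, labels[name]))

doc = [0.15, 0.85, 0.25]  # stands in for the embedding of a finance article
print(classify(doc, label_embeddings))  # finance
```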
2. Introduction to OpenAI Embeddings
2.1. Overview of OpenAI Embeddings Models
OpenAI provides third-generation embedding models: text-embedding-3-small and text-embedding-3-large. These models are built on OpenAI's deep learning technology and aim to deliver strong multilingual performance while keeping costs down.
The models differ in their output. text-embedding-3-small produces 1536-dimensional embedding vectors, while text-embedding-3-large produces 3072-dimensional vectors that can capture more complex text features. Both models also accept a dimensions parameter, so the dimensionality of the embeddings can be controlled to meet the specific needs of an application.
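This dimensionality control can be sketched concretely: the third-generation models accept a dimensions request parameter, and a full-length vector can also be shortened client-side by truncating and re-normalizing to unit length. The example below uses a toy 6-dimensional vector, not real model output:

```python
import math

# A request body asking the API for shortened vectors might look like this:
request_body = {
    "input": "Machine learning is a branch of artificial intelligence.",
    "model": "text-embedding-3-large",
    "dimensions": 256,  # ask for 256-d vectors instead of the full 3072
}

# Equivalently, a full-length vector can be shortened client-side by
# truncating and re-normalizing (toy 6-d vector for illustration):
def shorten(vec, dims):
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

full = [0.4, 0.3, 0.5, 0.1, 0.6, 0.2]
short = shorten(full, 3)
print(short)  # 3 components, re-scaled to unit length
```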
2.2. Model Selection and Usage
Choosing the appropriate embedding model depends on the specific requirements of the application. Here's how to make the selection in different application scenarios:
- Performance-focused scenarios: if you need to capture more detailed semantic information, such as in fine-grained recommendation systems or high-precision text classification, text-embedding-3-large is usually recommended. Although it costs more than the smaller model, it provides a richer representation of text features.
- Cost-sensitive applications: for applications that process large amounts of data but do not require especially high precision, such as initial data exploration or rapid prototyping, text-embedding-3-small is the more economical choice. It maintains relatively high performance while significantly reducing costs.
- Multilingual environments: both models offer strong multilingual performance, making them particularly useful in cross-lingual or multilingual scenarios and an ideal choice for global applications.
Ultimately, the right embedding model depends on your specific requirements, the complexity of your data, and the desired balance between performance and cost.
3. How to Use Embeddings
3.1 Using curl to Call the Embeddings API
curl is a commonly used command-line tool for sending HTTP requests. The following example shows how to use curl to obtain the embedding representation of a piece of text:
curl https://api.openai.com/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"input": "Machine learning is a branch of artificial intelligence.",
"model": "text-embedding-3-small"
}'
In the command above, the $OPENAI_API_KEY variable contains the user's OpenAI API key, which should be replaced with a valid key for actual use.
After executing this command, the OpenAI Embeddings API will return a response containing the text embedding representation. Here's an example of an API call result:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [ // Here is the feature vector
-0.006929283495992422,
-0.005336422007530928,
... // Remaining numbers omitted for display
-4.547132266452536e-05,
-0.024047505110502243
]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
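The response is ordinary JSON, so extracting the vector in a script is straightforward. The sketch below parses a shortened copy of the response above (only two vector components kept) with Python's standard json module:

```python
import json

# A shortened copy of the API response shown above, for illustration.
raw = '''{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0,
     "embedding": [-0.006929283495992422, -0.005336422007530928]}
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}'''

response = json.loads(raw)
vector = response["data"][0]["embedding"]           # the feature vector
tokens = response["usage"]["total_tokens"]          # tokens consumed
print(len(vector), tokens)  # 2 5
```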
3.2 Using Python Client to Call Embeddings API
In addition to using curl
to directly call the API from the command line, you can also use a Python client. This requires first installing the official openai
library. Here's an example of how to get text embeddings using Python:
from openai import OpenAI

# The openai library (v1.0+) uses a client object; replace the key with your own.
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

response = client.embeddings.create(
    input="Artificial intelligence is changing the world.",
    model="text-embedding-3-small"
)

embedding_vector = response.data[0].embedding
print(embedding_vector)
Running this Python script returns an embedding vector similar to the one obtained with curl: a list of floating-point numbers representing the input text as a point in the embedding space.
The call result is as follows:
[-0.0032198824, 0.0022555287, ..., 0.0015886585, -0.0021505365]
3.3 Operating on Embedding Vectors
OpenAI only provides the embedding computation models. To build features such as text similarity search on top of embeddings, you need a vector database, such as Qdrant, Chroma, or Milvus.
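Before reaching for a vector database, the core idea of similarity search can be sketched with a brute-force scan over a small in-memory corpus. The texts and vectors below are toy stand-ins; in practice the vectors would come from the Embeddings API:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy corpus of (text, embedding) pairs for illustration only.
corpus = [
    ("Machine learning is a branch of AI.", [0.9, 0.2, 0.1]),
    ("Stock prices fell sharply today.",    [0.1, 0.9, 0.2]),
    ("Neural networks learn from data.",    [0.85, 0.25, 0.15]),
]

def search(query_embedding, corpus, top_k=2):
    # Rank every document by similarity to the query and keep the best top_k.
    ranked = sorted(corpus,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

query = [0.88, 0.22, 0.12]  # stands in for the embedding of an AI-related question
print(search(query, corpus))
```

This linear scan works for small collections; a vector database does the same ranking efficiently at scale using approximate nearest-neighbor indexes.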
Please refer to the following vector database tutorials: