1. Introduction to GPT-4 Vision Model

The GPT-4 Vision model (GPT-4V) is a multimodal artificial intelligence model introduced by OpenAI that extends GPT-4 with visual understanding. Unlike traditional text-only models, GPT-4V can receive and analyze image content, providing descriptions, answering questions, and engaging in interactions related to the images.

Example Applications:

  • Product Recognition and Classification: E-commerce platforms can use GPT-4V to identify and provide descriptions for product images, helping improve search and recommendation systems.
  • Assisting Medical Decisions: While GPT-4V is not suitable for direct professional medical image diagnosis, it can assist medical personnel in initial image understanding and data organization.
  • Education and Research: In teaching and scientific research, GPT-4V can be used to analyze charts and experimental results and to automatically interpret scientific image data.
  • Traffic Monitoring and Analysis: By analyzing traffic surveillance images, GPT-4V can assist traffic management systems in real-time condition reporting and accident identification.

2. Simple Example

Below is a simple example using a curl request to demonstrate how to use the GPT-4 Vision model to analyze an image:

API Request Parameters:

  • model: Specifies the model version to be used, in this case "gpt-4-vision-preview".
  • messages: Contains role definition and content, where the content can include text and image links.
  • max_tokens: Specifies the maximum number of tokens the model may generate in its reply.

curl Request Example:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4-vision-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "Your Image Link"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

With the above request, we submit an image to the GPT-4 Vision model and ask a simple question: "What is in this image?" The model analyzes the image and returns an answer based on what it sees.
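
The same request can also be sent from Python. This is a minimal sketch, assuming the official openai Python package (v1.x) is installed and the OPENAI_API_KEY environment variable is set; the image URL is a placeholder:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "Your Image Link"}}
      ]
    }
  ],
  max_tokens=300
)

# The answer comes back as an ordinary chat completion message
print(response.choices[0].message.content)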

3. Uploading Images Using Base64 Encoding

In some cases, you may need to send a local image file to the GPT-4 Vision model. You can embed the image data directly in the API request using base64 encoding.

Python Code Example:

import base64
import requests

api_key = "YOUR_OPENAI_API_KEY"

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "path_to_your_image.jpg"

base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

# Build the request body; the image is passed inline as a data URL
payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

# Send the request to the Chat Completions API
response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

In the above code, we first convert a local image file into a base64-encoded string and then send this string as part of the request to the API. The model's response contains a description of the image content.
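
If you only want the generated description rather than the full JSON, you can read it from the choices array; a minimal sketch continuing from the response above:

# Pull just the assistant's description out of the raw JSON response
data = response.json()
print(data["choices"][0]["message"]["content"])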

4. Handling Multiple Image Inputs

Sometimes it's necessary to analyze multiple images at once. The GPT-4 Vision model supports receiving multiple image inputs in a single request and lets users ask questions about them or have the model compare them.

Multiple Image Input Example:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4-vision-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What's different in these images?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "URL of the first image",
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "URL of the second image",
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

In this request, we submit two images to the API. The model processes each image and draws on all of them to answer the question, providing descriptions and comparisons. This approach is well suited to analyzing a set of images as a whole.
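
If the set of images is not fixed in advance, the content list can be built programmatically. The following is a minimal Python sketch, assuming placeholder image URLs; it sends the same kind of request as the curl example above:

import requests

api_key = "YOUR_OPENAI_API_KEY"

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

image_urls = [
  "URL of the first image",
  "URL of the second image"
]

# Start with the question, then append one image_url entry per image
content = [{"type": "text", "text": "What's different in these images?"}]
for url in image_urls:
  content.append({"type": "image_url", "image_url": {"url": url}})

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [{"role": "user", "content": content}],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())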

5. Setting Image Analysis Detail Level

When using the GPT-4 Vision model for image analysis, you can set the level of detail according to your needs. By adjusting the detail parameter, you can choose one of low, high, or auto. Here's a detailed explanation of each option and how to set it:

  • low: Choosing the low detail level disables the "high-resolution" mode. The model receives a low-resolution 512px x 512px version of the image and represents it with a budget of 85 tokens. This is suitable for scenarios that do not require high detail, giving faster responses and consuming fewer input tokens.

  • high: The high detail level first lets the model see the low-resolution image and then creates detailed crops of the input image as 512px x 512px tiles, based on the input image's size. Each detailed crop uses twice the base budget, i.e., 170 tokens per tile, and a fixed 85 tokens is added on top (see the cost calculation in section 7).

  • auto: The automatic detail level will determine whether to use the low or high detail level based on the size of the input image.

The following code example shows how to set the detail level:

import base64
import requests

api_key = "your_OPENAI_API_KEY"

image_path = "path_to_your_image.jpg"

# Read and base64-encode the local image
with open(image_path, "rb") as image_file:
  base64_image = base64.b64encode(image_file.read()).decode('utf-8')

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}",
            "detail": "high"  # Set to high detail level
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())
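
To see what a given detail setting actually costs, you can inspect the usage field returned with each chat completion; a minimal sketch continuing from the response above:

# The usage field reports how many tokens the request consumed,
# which makes it easy to compare detail: low and detail: high
usage = response.json().get("usage", {})
print(usage)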

6. Understanding Model Limitations and Managing Images

6.1. Model Limitations

Despite its powerful capabilities, the GPT-4 Vision model is not without limitations, and understanding these limitations is important when relying on it for image understanding. Here is an overview of some known constraints:

  • Medical Images: The model is not suitable for interpreting professional medical images, such as CT scans, and should not be used for medical advice.

  • Non-English Text: The model may not perform well when processing images containing non-Latin alphabet text, such as Japanese or Korean.

  • Spatial Localization: The model's performance is suboptimal in tasks requiring precise spatial location associations, such as identifying positions of pieces on a chessboard.

  • Image Details: The model may struggle to understand charts or text with color and style variations (e.g., solid lines, dashed lines) in an image.

  • Image Rotation: The model may misinterpret skewed or upside-down text and images.

6.2. Managing Images in Sessions

Since the Chat Completions API is stateless, you need to manage the messages (including images) passed to the model yourself. If you want to use the same image multiple times, you must resend the image data with each API request. The example below reuses the base64_image and headers variables from section 3 to ask a new question about the same image:


additional_payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Based on this image, what suggestions do you have?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

new_response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=additional_payload)

print(new_response.json())
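
A further turn in the conversation must likewise carry everything the model needs to see again: the original question, the image data, and the model's previous reply. This is a minimal sketch, assuming new_response, base64_image, and headers from the code above; the follow-up question is only an illustration:

# Read the model's previous reply so it can be resent as conversation history
previous_answer = new_response.json()["choices"][0]["message"]["content"]

followup_payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Based on this image, what suggestions do you have?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    },
    # The assistant's earlier answer, resent because the API keeps no state
    {"role": "assistant", "content": previous_answer},
    # The new follow-up question
    {"role": "user", "content": "Which of these suggestions should I act on first?"}
  ],
  "max_tokens": 300
}

followup_response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=followup_payload)

print(followup_response.json())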

7. Cost Calculation

Each image sent with the detail: low option consumes a fixed 85 tokens. For images sent with the detail: high option, the image is first scaled proportionally to fit within a 2048px x 2048px square, and then scaled so that its shorter side is 768px. The image is then divided into 512px x 512px tiles, with each tile costing 170 tokens; a further 85 tokens are added to the final total.

For example, if an image has dimensions of 1024px x 1024px and the detail: high option is chosen, the token cost would be:

  • First, since 1024 is less than 2048, there is no initial size adjustment.
  • Then, with the shorter side being 1024, the image is resized to 768 x 768.
  • 4 512px squares are needed to represent the image, so the final token cost will be 170 * 4 + 85 = 765.

For a detailed understanding of the cost calculation method, please refer to the documentation for the GPT-4 Vision model.
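
As a rough illustration of these rules, the following sketch estimates the token cost of an image sent with detail: high from its pixel dimensions. It follows the description above and is an approximation for illustration, not an official pricing tool:

import math

def high_detail_token_cost(width, height):
  """Estimate the token cost of an image sent with detail: high."""
  # Step 1: scale the image to fit within a 2048px x 2048px square
  if max(width, height) > 2048:
    scale = 2048 / max(width, height)
    width, height = int(width * scale), int(height * scale)

  # Step 2: scale so that the shorter side is 768px
  scale = 768 / min(width, height)
  width, height = int(width * scale), int(height * scale)

  # Step 3: count the 512px tiles needed to cover the image
  tiles = math.ceil(width / 512) * math.ceil(height / 512)

  # 170 tokens per tile plus a fixed 85-token base
  return 170 * tiles + 85

# Example from the text: a 1024px x 1024px image is resized to 768px x 768px,
# which takes 4 tiles, so the cost is 170 * 4 + 85 = 765 tokens
print(high_detail_token_cost(1024, 1024))  # 765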

8. Frequently Asked Questions

Below are some common questions users may have when working with the GPT-4 Vision model, along with their answers:

Q: Can I fine-tune the image capabilities of gpt-4?

A: Currently, we do not support fine-tuning the image capabilities of gpt-4.

Q: Can I use gpt-4 to generate images?

A: No, you can use dall-e-3 to generate images and use gpt-4-vision-preview to understand images.

Q: What types of file uploads are supported?

A: We currently support PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif) files.