1. Overview of Speech Models
1.1. Introduction to OpenAI's TTS and STT Models
Text-to-Speech (TTS) Model
OpenAI's TTS model converts written text into spoken audio. The process involves text analysis, speech synthesis, and sound-quality adjustment, enabling a computer to read any written text aloud and making content more accessible. TTS is an important technology for visually impaired users, drivers, and anyone who prefers to consume information by listening.
Speech-to-Text (STT) Model
The counterpart of TTS, the STT model converts spoken audio into written text. Given raw audio input, an STT system first performs speech detection, then feature extraction; it then maps the audio signal to words using acoustic and language models, ultimately generating text output. STT technology is widely used in voice input, meeting transcription, and real-time caption generation.
1.2. Application Scenarios
- Reading blog posts aloud
- Multilingual speech generation
3. Text-to-Speech API
3.1. Quick Start
In this section, we will demonstrate how to quickly convert text into speech using the curl command and a Python client. Whether you are a developer or a non-technical user, you can easily generate speech files by simply sending an API request.
Sending Requests Using Curl
To generate speech using the curl command line tool, follow these steps:
- Ensure that curl is installed on your system and that you have a valid OpenAI API key.
- Use the following curl command to convert text into speech:
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Today is a great day to build products people love!",
    "voice": "alloy"
  }' \
  --output speech.mp3
In the above command, $OPENAI_API_KEY represents your API key, the input field is the text you want to convert, the model field specifies the speech model to use, and the voice parameter selects the voice; here we choose the alloy voice. The final --output option specifies the name and format of the output file.
Using Python Client
If you prefer to use the Python programming language, you can use the following code example:
from openai import OpenAI
client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a great day to build products people love!"
)
response.stream_to_file("output.mp3")
In this code snippet, we first import the openai library and create an OpenAI client instance. We then call the audio.speech.create method, specifying the model, voice, and the text to convert. Finally, we use the stream_to_file method to save the generated speech stream to a file.
3.2. Choosing Audio Quality and Voice
Selecting the right audio quality and voice for your project is a crucial step in ensuring the best user experience. The API offers two audio quality models: tts-1 and tts-1-hd.
- tts-1: provides lower latency, suitable for real-time applications, but with relatively lower audio quality.
- tts-1-hd: delivers higher-quality audio output, suitable for non-real-time, high-quality speech generation needs (see the sketch below).
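Switching between the two models only requires changing the model name in the request; everything else stays the same. A minimal sketch, reusing the example text from the quick start:
from openai import OpenAI

client = OpenAI()

# Identical to the quick-start call, but requesting the higher-quality model.
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Today is a great day to build products people love!"
)
response.stream_to_file("speech_hd.mp3")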
In addition, OpenAI's TTS API offers various voice options:
- Alloy
- Echo
- Fable
- Onyx
- Nova
- Shimmer
Depending on the project requirements and target audience, you can test different voice samples to choose the most suitable voice. Consider factors such as speaking style, speech rate, and intonation to find a voice that conveys appropriate emotions and professionalism.
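One practical way to compare voices is to render the same sentence with each of them and listen side by side. A minimal sketch, assuming the tts-1 model and an arbitrary test sentence:
from openai import OpenAI

client = OpenAI()

voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
sample_text = "The quick brown fox jumped over the lazy dog."  # arbitrary test sentence

# Generate one sample file per voice for side-by-side listening.
for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=sample_text,
    )
    response.stream_to_file(f"sample_{voice}.mp3")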
3.3. Supported Output Formats and Languages
OpenAI's TTS API defaults to the MP3 output format but also supports various other audio formats:
- Opus: Suitable for internet streaming and communication, with low latency.
- AAC: Used for digital audio compression, preferred by platforms like YouTube, Android, and iOS.
- FLAC: Lossless audio compression format used by audio enthusiasts for archiving.
In terms of language support, the TTS API follows the Whisper model and therefore covers a wide range of languages.
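The output format can be selected with the response_format parameter of the speech endpoint. A minimal sketch requesting Opus instead of the default MP3:
from openai import OpenAI

client = OpenAI()

# Request Opus output instead of the default MP3.
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a great day to build products people love!",
    response_format="opus",
)
response.stream_to_file("speech.opus")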
3.4. Real-time Audio Streaming Functionality
To meet the needs of real-time applications, the API supports streaming audio back to the client as it is generated. Below is a Python example:
from openai import OpenAI
client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
)
response.stream_to_file("output.mp3")
4. Speech to Text API
4.1. Quick Start
In this section, we will demonstrate how to transcribe speech to text using OpenAI's API.
First, you need a valid OpenAI API key and an audio file to transcribe.
You can use the curl command to send a POST request containing the audio file. Replace OPENAI_API_KEY with your API key and set the correct file path.
curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header 'Authorization: Bearer OPENAI_API_KEY' \
  --header 'Content-Type: multipart/form-data' \
  --form file=@/path/to/your/audio/file.mp3 \
  --form model=whisper-1
After executing the above command, you will receive a JSON-formatted response containing the transcribed text. For example:
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. ..."
}
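If you prefer the Python client, the equivalent call is audio.transcriptions.create. A minimal sketch, using a placeholder file path:
from openai import OpenAI

client = OpenAI()

# Open the audio file in binary mode and send it for transcription.
with open("/path/to/your/audio/file.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text)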
4.2. Supported File Formats and Sizes
This API supports a variety of common audio file formats to cover different scenarios. Supported formats include, but are not limited to, mp3, mp4, mpeg, mpga, m4a, wav, and webm, so you can easily process audio files from a wide range of sources.
As for file size, the API currently limits uploads to 25MB. If your audio file exceeds 25MB, you will need to split it into segments smaller than 25MB or use a more efficient compression format; mp3 and opus, for example, usually compress efficiently without sacrificing too much audio quality.
If you have a file larger than 25MB, you can use the PyDub library in Python to segment the audio:
from pydub import AudioSegment
from pydub.utils import make_chunks  # make_chunks must be imported explicitly

audio_file = AudioSegment.from_file("your_large_audio_file.mp3")
interval = 10 * 60 * 1000  # chunk length: 10 minutes, in milliseconds

# Split the audio into consecutive 10-minute segments.
chunks = make_chunks(audio_file, interval)

for i, chunk in enumerate(chunks):
    chunk_name = f"audio_chunk{i}.mp3"
    chunk.export(chunk_name, format="mp3")
In the above code, the make_chunks function splits a large audio file into segments of ten minutes each. These segments fall within the API's file size limit and can be uploaded separately to the OpenAI API for transcription, as shown in the sketch below.
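Each segment can then be sent to the transcription endpoint and the partial transcripts joined together. A minimal sketch, reusing the chunk file names produced by the code above:
from openai import OpenAI

client = OpenAI()

texts = []
# Transcribe each 10-minute chunk in order and collect the text.
for i in range(len(chunks)):
    with open(f"audio_chunk{i}.mp3", "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    texts.append(transcription.text)

full_text = " ".join(texts)
print(full_text)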
Please note that while PyDub offers an easy way to handle audio files, you should still pay extra attention to the security and stability of any third-party software you use. OpenAI does not provide any guarantees for third-party software.