Create speech

Convert text to speech

curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id} \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false
}'

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}

Path Parameters

Name	Required	Description
`voice_id`	Yes	The ID of the target voice.

Request Body

Content-Type: application/json

Name	Required	Description
`text`	Yes	The text to convert (max 300 characters).
`language`	Yes	Language code. Supported: `en`, `ko`, `ja`.
`style`	No	Emotional style, e.g. `neutral`, `happy`, `sad`. If not specified, the character’s default style is applied.
`model`	No	TTS model. Default: `sona_speech_1`.
`output_format`	No	Output format. Options: `wav`, `mp3`. Default: `wav`.
`voice_settings`	No	Advanced voice parameters (see below).
`include_phonemes`	No	If `true`, the response is JSON containing Base64-encoded audio and phoneme timing data. Default: `false`.
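
The table above can be turned into a small client-side check before a request is sent. The following is a minimal sketch using only the Python standard library; the function name `build_tts_request` is hypothetical, and it assumes the 300-character limit counts the same units as Python's `len()`, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

API_URL = "https://supertoneapi.com/v1/text-to-speech/{voice_id}"
SUPPORTED_LANGUAGES = {"en", "ko", "ja"}
MAX_TEXT_LENGTH = 300  # assumption: counted like Python's len()


def build_tts_request(voice_id, api_key, text, language, **optional):
    """Return an unsent urllib Request for the text-to-speech endpoint.

    `optional` carries the optional body fields (style, model,
    output_format, voice_settings, include_phonemes) through unchanged.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    body = {"text": text, "language": language, **optional}
    return urllib.request.Request(
        API_URL.format(voice_id=quote(voice_id, safe="")),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so the validation can be exercised without network access.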

Voice Settings (optional)

Name	Range	Default	Description
`pitch_shift`	-24 → 24	0	Pitch adjustment in semitones.
`pitch_variance`	0 → 2	1	Degree of pitch variation.
`speed`	0.5 → 2	1	Speed ratio; makes the generated audio uniformly faster or slower.
`duration`	0 → 60	0	When provided, speech is generated to match the given duration, in seconds.
`similarity`	1 → 5	3	Controls how closely the generated speech matches the original character voice.
`text_guidance`	0 → 4	1	Controls how sensitively speech characteristics adapt to the input text content.
`subharmonic_amplitude_control`	0 → 2	1	Controls the amount of subharmonic amplitude of the generated speech.
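
The ranges in this table lend themselves to a client-side check before sending `voice_settings`. The sketch below is one possible approach, not part of the API: the helper name `validate_voice_settings` is hypothetical, and it assumes both ends of each range are inclusive, which the table does not state explicitly.

```python
# Ranges copied from the Voice Settings table; bounds assumed inclusive.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Return a copy of `settings`, raising ValueError on a bad entry."""
    checked = {}
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            raise ValueError(f"unknown voice setting: {name!r}")
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        checked[name] = value
    return checked
```

Settings that are omitted are left out of the result, so the server-side defaults listed in the table would still apply.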

Response

Depending on `include_phonemes`, the endpoint returns one of the following.

Binary Audio (default, and when `include_phonemes=false`)
`audio/wav` – Raw WAV file.
`audio/mpeg` – Raw MP3 file.

JSON with Phoneme Data (when `include_phonemes=true`)

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
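
A client might unpack this JSON response as sketched below. The helper name `parse_phoneme_response` is hypothetical, and the code assumes the three `phonemes` arrays are parallel (the same index refers to the same phoneme), which the sample suggests but this page does not state.

```python
import base64


def parse_phoneme_response(payload):
    """Return (audio_bytes, timeline) from a decoded JSON response body."""
    audio = base64.b64decode(payload["audio_base64"])

    phonemes = payload["phonemes"]
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # Assumption: the arrays line up index by index.
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme arrays have different lengths")

    timeline = [
        {"symbol": symbol, "start": start, "end": start + duration}
        for symbol, start, duration in zip(symbols, starts, durations)
    ]
    return audio, timeline
```

The returned bytes can be written to a file as they are; each timeline entry gives a phoneme symbol with its start and end time in seconds.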

Headers:

X-Audio-Length (number) – Duration of the audio in seconds.
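
When the response is binary audio, the response headers are enough to pick a file extension and read the clip length. This is a minimal sketch with a hypothetical helper name, `describe_audio_response`; it assumes the `Content-Type` header carries one of the two media types listed under Response and that `X-Audio-Length` parses as a decimal number.

```python
# Media types listed under Response, mapped to file extensions.
EXTENSIONS = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def describe_audio_response(headers):
    """Return (file_extension, length_seconds) from response headers.

    `headers` is any mapping of header names to string values.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Drop any parameters such as "; charset=..." before the lookup.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    extension = EXTENSIONS.get(content_type)
    if extension is None:
        raise ValueError(f"unexpected content type: {content_type!r}")

    raw_length = lowered.get("x-audio-length")
    length = float(raw_length) if raw_length is not None else None
    return extension, length
```

For example, headers of `audio/mpeg` with `X-Audio-Length: 1.25` yield `(".mp3", 1.25)`.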

Notes

A 400 error will occur if the text length exceeds 300 characters.
`speed` is applied after `duration`. (Example: `duration` = 5 seconds, `speed` = 2 → final audio ≈ 10 seconds.)
Requests work without `style`, but the default style varies by character. Call the Get Voices API to check it: the first value in the `styles` array is the default.
The audio in the response can be saved or played directly; the handling required depends on the client.
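
One way to stay under the 300-character limit is to split longer input into several requests. The sketch below is only an illustration of that idea: the helper name `split_text` is hypothetical, the sentence-boundary rule is a simple heuristic, and joining the resulting audio clips is left to the caller.

```python
import re

MAX_TEXT_LENGTH = 300  # limit stated in the Notes


def split_text(text, limit=MAX_TEXT_LENGTH):
    """Split text into chunks of at most `limit` characters.

    Chunks end at sentence boundaries where possible; a single sentence
    longer than `limit` is cut at the limit.
    """
    # Split on whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut a sentence that cannot fit in one chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request; prosody across chunk boundaries may differ from a single-request rendering.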

Authorizations

Name	Type	In	Required
`x-sup-api-key`	string	header	Yes

Path Parameters

Name	Type	Required
`voice_id`	string	Yes

Body

application/json

Name	Type	Required	Default	Description
`text`	string	Yes	-	The text to convert to speech. Maximum length: 300.
`language`	enum<string>	Yes	-	The language code of the text. Available options: `en`, `ko`, `ja`.
`style`	string	No	-	The style of character to use for the text-to-speech conversion.
`model`	string	No	`sona_speech_1`	The model type to use for the text-to-speech conversion.
`output_format`	enum<string>	No	`wav`	The desired output format of the audio file. Available options: `wav`, `mp3`.
`voice_settings`	object	No	-	Advanced voice parameters; see Voice Settings (optional).
`include_phonemes`	boolean	No	`false`	Return phoneme timing data with the audio.

Response

Returns either binary audio or JSON with phoneme data, depending on the `include_phonemes` parameter.

Binary audio file (when `include_phonemes=false` or omitted).
