Endpoint
Path Parameters
| Name | Required | Description |
|---|---|---|
| voice_id | Yes | The ID of the target voice. |
Request Body
Content-Type: application/json

| Name | Required | Description |
|---|---|---|
| text | Yes | The text to convert (max 300 characters). |
| language | Yes | Language code. Supported: en, ko, ja. |
| style | No | Emotional style, e.g. neutral, happy, or sad. If not specified, the character’s default style is applied. |
| model | No | TTS model. Default: sona_speech_1. |
| output_format | No | Output format. Options: wav, mp3. Default: wav. |
| voice_settings | No | Advanced voice parameters (see below). |
| include_phonemes | No | If true, returns phoneme timing data along with the audio (Base64-encoded). Default: false. |
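
As a rough illustration of the table above, the following sketch assembles a request body and checks the documented constraints on the client side before anything is sent. The helper name `build_tts_body` is hypothetical and not part of the API. The length check counts Python string characters, which is an assumption: the server may count characters differently.

```python
import json

# Limits and options taken from the request body table.
MAX_TEXT_LENGTH = 300
SUPPORTED_LANGUAGES = {"en", "ko", "ja"}
OUTPUT_FORMATS = {"wav", "mp3"}


def build_tts_body(text, language, style=None, model=None,
                   output_format=None, voice_settings=None,
                   include_phonemes=None):
    """Hypothetical helper: return the JSON body as a dict.

    Optional fields left as None are omitted so that the documented
    server-side defaults apply.
    """
    if not text:
        raise ValueError("text is required")
    # Assumption: one Python character counts as one character for the API.
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    body = {"text": text, "language": language}
    optional = {
        "style": style,
        "model": model,
        "output_format": output_format,
        "voice_settings": voice_settings,
        "include_phonemes": include_phonemes,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


if __name__ == "__main__":
    print(json.dumps(build_tts_body("Hello there.", "en", output_format="mp3")))
```

Validating locally avoids a round trip that would end in a 400 error for over-long text, but the server remains the authority on what it accepts.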
Voice Settings (optional)
| Name | Range | Default | Description |
|---|---|---|---|
| pitch_shift | -24 → 24 | 0 | Pitch adjustment in semitones. |
| pitch_variance | 0 → 2 | 1 | Degree of pitch variation. |
| speed | 0.5 → 2 | 1 | Adjusts the generated audio uniformly faster or slower (ratio). |
| duration | 0 → 60 | 0 | When provided, speech is generated to match the given duration (seconds). |
| similarity | 1 → 5 | 3 | Controls how closely the generated speech matches the original character voice. |
| text_guidance | 0 → 4 | 1 | Controls how sensitively speech characteristics adapt to the input text content. |
| subharmonic_amplitude_control | 0 → 2 | 1 | Controls the amount of subharmonic amplitude of the generated speech. |
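
The ranges in the table above can be checked before a request is sent. The sketch below is one way to do that. `validate_voice_settings` is a hypothetical helper, and treating both ends of each range as inclusive is an assumption, since the table does not say.

```python
# Ranges copied from the voice settings table: name -> (minimum, maximum).
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Hypothetical helper: return a list of problems (empty if none)."""
    errors = []
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            errors.append(f"unknown setting: {name}")
            continue
        low, high = VOICE_SETTING_RANGES[name]
        # Assumption: the documented bounds are inclusive.
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside {low}..{high}")
    return errors


if __name__ == "__main__":
    print(validate_voice_settings({"speed": 1.5, "pitch_shift": -3}))
    print(validate_voice_settings({"speed": 3}))
```

Returning a list of messages rather than raising on the first problem lets a client report every out-of-range value at once.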
Response
Depending on include_phonemes, returns:
Binary Audio (default, and when include_phonemes=false)
audio/wav – Raw WAV file.
audio/mpeg – Raw MP3 file.
JSON with Phoneme Data (when include_phonemes=true)
Headers:
X-Audio-Length (number) – Duration of the audio in seconds.
Notes
- A 400 error will occur if the text length exceeds 300 characters.
- speed is applied after duration. (Example: duration=5 seconds, speed=2 times → final audio ≈ 10 seconds)
- Calls are possible even without style, but default styles may vary by character, so please call the Get Voices API to check the default style (the first value in the styles array is the default).
- The audio file in the response can be directly saved or played (appropriate handling required depending on client).
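
Because the response is either raw audio or JSON, a client typically branches on the Content-Type header. The following is a minimal sketch of that branching. `classify_response` is a hypothetical helper, and it assumes the phoneme variant is served as `application/json`. The field names inside that JSON are not listed here, so the parsed object is returned untouched.

```python
import json

# File extensions for the two binary audio types named in the Response section.
AUDIO_EXTENSIONS = {"audio/wav": "wav", "audio/mpeg": "mp3"}


def classify_response(content_type, body):
    """Hypothetical helper: return (kind, extension, payload).

    kind is "audio" (payload is the raw bytes, ready to save or play)
    or "json" (payload is the parsed JSON document).
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in AUDIO_EXTENSIONS:
        return ("audio", AUDIO_EXTENSIONS[media_type], body)
    # Assumption: the phoneme response uses the application/json media type.
    if media_type == "application/json":
        return ("json", None, json.loads(body))
    raise ValueError(f"unexpected content type: {content_type!r}")


if __name__ == "__main__":
    kind, ext, payload = classify_response("audio/mpeg", b"\x00\x01")
    print(kind, ext, len(payload))
```

For the audio case, the bytes can be written directly to a file with the returned extension. For the JSON case, the Base64-encoded audio would need decoding first, using whatever field name the actual response carries.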
Authorizations
Path Parameters
Body
application/json
- text: The text to convert to speech. Maximum length: 300.
- language: The language code of the text. Available options: en, ko, ja.
- style: The style of character to use for the text-to-speech conversion.
- model: The model type to use for the text-to-speech conversion.
- output_format: The desired output format of the audio file. Available options: wav, mp3. Default is wav.
- include_phonemes: Return phoneme timing data with the audio.
Response
Returns either binary audio or JSON with phoneme data, based on the include_phonemes parameter.
Binary audio file (when include_phonemes=false or omitted)