Connect: Establish a WebSocket connection to the endpoint with your API key in the Authorization header.
Send Text: Send a JSON payload containing the Text and VoiceId you want to convert, plus any optional settings. Text is limited to 3,000 characters. Large text inputs are automatically split and processed sequentially.
Receive Audio: For every message you send, the server streams back sequential base64-encoded audio chunks formatted as { "success": true, "audio": "..." }.
Completion Signal: The response that carries the final chunk includes isFinal: true. Chunks arrive in order, so use this flag to detect completion, and concatenate the chunks on the client if you need a single audio file.
Timeout: The WebSocket connection automatically closes after one minute of inactivity. Each message you send resets this timeout.
Rate Limit: Supports up to 20 concurrent requests. (Contact support for higher limits.)
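The receive and completion steps above can be sketched on the client side as follows. This is a minimal illustration, not the official client: the function name assemble_audio is hypothetical, and it assumes that a failed response carries "success": false (the documentation above only shows the success case) and that the isFinal message may itself contain audio. It works on any iterable of raw JSON messages, so it is independent of the WebSocket library you use.

```python
import base64
import json
from typing import Iterable


def assemble_audio(messages: Iterable[str]) -> bytes:
    """Decode base64 audio chunks in order and stop at isFinal: true.

    `messages` is any iterable of raw JSON strings as received from
    the socket (hypothetical helper, not part of the API).
    """
    audio = bytearray()
    for raw in messages:
        msg = json.loads(raw)
        # Assumption: failures are reported with a falsy "success" field.
        if not msg.get("success"):
            raise RuntimeError(f"server reported failure: {msg}")
        # A message without an "audio" field contributes no bytes.
        audio.extend(base64.b64decode(msg.get("audio", "")))
        if msg.get("isFinal"):
            return bytes(audio)
    raise RuntimeError("stream ended before isFinal: true was received")
```

Because chunks are delivered in sequence, appending them in arrival order is enough; no reordering logic is needed. The returned bytes can be written directly to a file with the extension matching the requested OutputFormat.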
{
  "Engine": "neural",
  "VoiceId": "ai3-Jony",
  "LanguageCode": "en-US",
  "Text": "Welcome to Voicemaker API.",
  "OutputFormat": "mp3",
  "SampleRate": "48000",
  "Effect": "default",
  "MasterVolume": "0",
  "MasterSpeed": "0",
  "MasterPitch": "0"
}
Text (*): The text content to convert to speech. Supports SSML tags.
VoiceId (*): The ID of the voice to use for speech synthesis. (e.g., ai3-Jony, ai3-Aria)
LanguageCode (*): The language code for the voice. (e.g., en-US, en-GB, multi-lang for Pro voices)
Engine: standard, neural (Default: neural)
OutputFormat: mp3, wav (Default: mp3)
SampleRate: Audio sample rate. Common values: 22050, 24000, 44100, 48000
Effect: Voice effect to apply. (e.g., default, whispered, happy, sad, angry, excited, friendly)
MasterSettings: advanced_v1, advanced_v2 (Default: advanced_v1)
MasterVolume: Volume adjustment: -20 to 20 (Default: 0)
MasterSpeed: Speed adjustment: -100 to 100 (Default: 0)
MasterPitch: Pitch adjustment: -100 to 100 (Default: 0)
AccentCode: Accent code for multilingual voices. (e.g., en-US, en-GB, fr-FR)
CustomFileName: Custom filename for the output audio file.
Stability (ProPlus voices only): Stability setting: 0 to 100 (Default: 50)
Similarity (ProPlus voices only): Similarity setting: 0 to 100 (Default: 80)
ProEngine (ProPlus voices only): turbo, highres, expressive (Default: highres)
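As an illustration of the parameters above, the sketch below builds a request payload and checks the documented numeric ranges before sending. The helper name build_payload and the RANGES table are hypothetical. Numeric settings are serialized as strings to mirror the sample payload; whether the server also accepts JSON numbers is not stated here. How the server reacts to out-of-range values or text over 3,000 characters is also not documented, so rejecting them on the client is an assumption.

```python
import json

# Documented ranges for the numeric adjustment settings.
RANGES = {
    "MasterVolume": (-20, 20),
    "MasterSpeed": (-100, 100),
    "MasterPitch": (-100, 100),
    "Stability": (0, 100),    # ProPlus voices only
    "Similarity": (0, 100),   # ProPlus voices only
}
MAX_TEXT_CHARS = 3000  # documented text limit


def build_payload(text: str, voice_id: str, language_code: str, **options) -> str:
    """Return a JSON request string with the three required fields
    plus any optional settings passed as keyword arguments."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"Text exceeds {MAX_TEXT_CHARS} characters")
    payload = {"Text": text, "VoiceId": voice_id, "LanguageCode": language_code}
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= int(value) <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
            # Assumption: send numbers as strings, as in the sample payload.
            value = str(value)
        payload[name] = value
    return json.dumps(payload)
```

For example, build_payload("Welcome to Voicemaker API.", "ai3-Jony", "en-US", MasterSpeed=10, Effect="happy") produces a string ready to send over the open connection, while MasterVolume=21 raises ValueError locally instead of costing a round trip.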