WebSocket

The Voicemaker API supports WebSocket connections for real-time audio streaming. You send text over the connection and receive the generated audio back as a stream of chunks.

WebSocket Endpoint

wss://developer.voicemaker.in/api/v1/voice/convert

Authentication

Authentication is required via the Authorization header with a Bearer token when establishing the WebSocket connection:

Authorization: Bearer YOUR-API-KEY
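A WebSocket client library normally builds the connection request for you. The sketch below only shows where the Bearer token sits in the standard RFC 6455 opening handshake. It uses the Python standard library. The function name build_handshake is hypothetical, and the host and path are taken from the endpoint above.

```python
import base64
import hashlib
import os
from typing import Optional, Tuple

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def build_handshake(api_key: str, key: Optional[str] = None) -> Tuple[str, str]:
    """Return (request_text, expected_accept) for the opening handshake.

    Hypothetical helper: the Authorization header carries the Bearer token,
    and the remaining headers form the standard RFC 6455 upgrade request.
    """
    if key is None:
        # RFC 6455: the key is 16 random bytes, base64-encoded.
        key = base64.b64encode(os.urandom(16)).decode("ascii")
    request = (
        "GET /api/v1/voice/convert HTTP/1.1\r\n"
        "Host: developer.voicemaker.in\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        f"Authorization: Bearer {api_key}\r\n"
        "\r\n"
    )
    # The server must answer with base64(SHA-1(key + GUID)) in Sec-WebSocket-Accept.
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    expected_accept = base64.b64encode(digest).decode("ascii")
    return request, expected_accept
```

Most client libraries accept extra headers when connecting. Check your library's documentation for the exact parameter name instead of building the request by hand.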

How It Works

Connect: Establish a WebSocket connection to the endpoint with your API key in the Authorization header.

Send Text: Send a JSON message containing Text, VoiceId, LanguageCode, and any optional settings. Each message can contain up to 3,000 characters of text. Large inputs are automatically split and processed sequentially.

Receive Audio: For every message you send, the server streams back a sequence of base64-encoded audio chunks, each formatted as { "success": true, "audio": "..." }.

Completion Signal: The final chunk for a message includes "isFinal": true, so you know the audio is complete. Because chunks are delivered in order, you can use this flag to detect completion and, if you need a single audio file, concatenate the chunks on the client.

Timeout: The WebSocket connection automatically closes after one minute of inactivity. Each message you send resets this timeout.

Rate Limit: Supports up to 20 concurrent requests. (Contact support for higher limits.)
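One way to stay within the concurrency limit is to gate requests with a semaphore. The sketch below assumes the limit counts in-flight conversion requests. Each job would be a coroutine that sends one message and waits for its final chunk. The names run_limited and MAX_CONCURRENT are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")

# Documented default limit; raise it only if support has granted more.
MAX_CONCURRENT = 20


async def run_limited(
    jobs: List[Callable[[], Awaitable[T]]],
    limit: int = MAX_CONCURRENT,
) -> List[T]:
    """Run every job, but never more than `limit` of them at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        # Each job waits here until one of the `limit` slots is free.
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Queuing work client-side like this avoids rejected requests when a batch exceeds the limit. The cost is that later jobs wait for a free slot.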

Send Text (JSON message)

{
    "VoiceId": "ai3-Jony",
    "Text": "Welcome to Voicemaker API.",
    "LanguageCode": "en-US",
    "OutputFormat": "mp3",
    "SampleRate": "48000",
    "MasterVolume": "0",
    "MasterSpeed": "0",
    "MasterPitch": "0"
}
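A message like the one above can be built and checked against the 3,000-character limit before it is sent. This is a minimal sketch. The helper name build_message is hypothetical, and it counts Python characters (code points) because the server's exact counting rule is not stated.

```python
import json
from typing import Any

MAX_TEXT_CHARS = 3000  # per-message text limit stated above


def build_message(text: str, voice_id: str, language_code: str, **options: Any) -> str:
    """Serialize one Send Text message (hypothetical helper).

    Required fields are explicit arguments; optional fields such as
    OutputFormat or SampleRate are passed through under their API names.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Text is required and must be a non-empty string")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"Text exceeds {MAX_TEXT_CHARS} characters")
    # Optional fields first, so they can never overwrite the required ones.
    payload = {**options, "VoiceId": voice_id, "Text": text, "LanguageCode": language_code}
    return json.dumps(payload)


message = build_message(
    "Welcome to Voicemaker API.", "ai3-Jony", "en-US",
    OutputFormat="mp3", SampleRate="48000",
)
```

The example values mirror the JSON message above, including sending SampleRate as a string.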

Request Parameters

Required Fields

Text (*): The text content to convert to speech. Supports SSML tags.

VoiceId (*): The ID of the voice to use for speech synthesis. (e.g., ai3-Jony, ai3-Aria)

LanguageCode (*): The language code for the voice. (e.g., en-US, en-GB, multi-lang for Pro voices)

Optional Fields

Engine: standard, neural (Default: neural)

OutputFormat: mp3, wav, ogg, opus, aac, ulaw, alaw (Default: mp3)

SampleRate: Audio sample rate in Hz. Common values: 22050, 24000, 44100, 48000

Effect: Voice effect to apply. (e.g., default, whispered, happy, sad, angry, excited, friendly)

MasterSettings: advanced_v1, advanced_v2 (Default: advanced_v1)

MasterVolume: Volume adjustment: -20 to 20 (Default: 0)

MasterSpeed: Speed adjustment: -100 to 100 (Default: 0)

MasterPitch: Pitch adjustment: -100 to 100 (Default: 0)

AccentCode: Accent code for multilingual voices. (e.g., en-US, en-GB, fr-FR)

CustomFileName: Custom filename for the output audio file.

Stability (ProPlus voices only): Stability setting: 0 to 100 (Default: 50)

Similarity (ProPlus voices only): Similarity setting: 0 to 100 (Default: 80)

ProEngine (ProPlus voices only): turbo, highres, expressive (Default: highres)

Not Supported

VoxFx: Audio is streamed in chunks, so VoxFx is not supported over WebSocket. To use VoxFx effects, please use the REST API instead.
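The allowed values and ranges listed under Optional Fields, plus the VoxFx restriction, can be checked client-side before a message is sent. This is a sketch only: the server remains the authority, and the function name check_options is hypothetical. Numbers are accepted as strings because the example message sends them that way.

```python
from typing import Any, Dict, List

# Allowed values and numeric ranges copied from the Optional Fields list.
ENUMS = {
    "Engine": {"standard", "neural"},
    "OutputFormat": {"mp3", "wav", "ogg", "opus", "aac", "ulaw", "alaw"},
    "MasterSettings": {"advanced_v1", "advanced_v2"},
    "ProEngine": {"turbo", "highres", "expressive"},
}
RANGES = {
    "MasterVolume": (-20, 20),
    "MasterSpeed": (-100, 100),
    "MasterPitch": (-100, 100),
    "Stability": (0, 100),
    "Similarity": (0, 100),
}


def check_options(options: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    for name, allowed in ENUMS.items():
        if name in options and options[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    for name, (low, high) in RANGES.items():
        if name not in options:
            continue
        try:
            value = float(options[name])  # accepts "0" as well as 0
        except (TypeError, ValueError):
            problems.append(f"{name} must be numeric")
            continue
        if not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    if "VoxFx" in options:
        problems.append("VoxFx is not supported over WebSocket")
    return problems
```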

Response Format

Per-message response

{
    "success": true,
    "audio": "base64-encoded-audio-chunk"
}

Final chunk response

{
    "success": true,
    "audio": "base64-encoded-audio-chunk",
    "isFinal": true
}

Error response

{
    "success": false,
    "message": "Validation error",
    "errors": ["Text is required and must be a non-empty string"]
}
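The responses above can be turned into a single audio file by decoding each chunk in arrival order until the one marked isFinal. This minimal sketch takes the raw text frames as an iterable, so it does not depend on any particular WebSocket library. The names collect_audio and VoicemakerError are hypothetical.

```python
import base64
import json
from typing import Iterable


class VoicemakerError(Exception):
    """Raised when the server returns "success": false (hypothetical name)."""


def collect_audio(messages: Iterable[str]) -> bytes:
    """Decode chunks in arrival order and stop at the chunk marked isFinal."""
    audio = bytearray()
    for raw in messages:
        response = json.loads(raw)
        if not response.get("success"):
            message = response.get("message", "Request failed")
            raise VoicemakerError(f"{message}: {response.get('errors', [])}")
        # Decode each chunk on its own before appending the bytes.
        audio.extend(base64.b64decode(response.get("audio", "")))
        if response.get("isFinal"):
            return bytes(audio)
    raise VoicemakerError("stream ended before the final chunk arrived")
```

Each chunk is decoded before concatenation because joining the base64 strings directly breaks whenever a chunk ends in padding.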
