Daisys API websockets

The Daisys API provides a websocket interface to enable direct communication with a single inference worker node, for applications that require lower latency.

Latency vs. throughput

While the websocket connection provides some convenience for certain applications, it should not be used for tasks where a batch approach is more appropriate, since generation requests through the REST API get distributed over multiple workers and will overall finish faster. However, applications that require real-time or near real-time interaction may benefit from keeping a connection open and receiving the response immediately without making an extra HTTP GET request.

Please do keep in mind that an effect of dedicating a connection is that requests over this connection are effectively serialized, so the decision between the websocket and the REST API is a typical latency vs. throughput tradeoff. For this reason, to help guarantee latency, the websocket system reserves the right to occasionally drop the connection, forcing the client to request a new URL. This has the effect of rebalancing the distribution of connections to workers, helping to ensure lower latency overall.

In this document we describe:

  • Connecting: How to get a websocket URL and make and maintain a connection.

  • Message format: The message format used to send commands and receive responses.

  • Python interface: How to use this Python library to communicate over the websocket.

Examples of using the websocket connection from the Python API as well as communicating with the websocket from JavaScript in a browser application are given in Daisys API websocket examples.

Connecting

In order to connect to a Daisys worker node, you must first be assigned a node through the API. In Python, this is taken care of for you; see Python interface below. However, when using another language or curl, you can get a URL via the Websocket Endpoints using a GET request.

As mentioned there, the websocket may disconnect between requests to rebalance worker load, although this should not happen frequently. Additionally, a websocket connection is assumed to be for interaction with a specific model, which must be included in the URL. In fact any model can be used on a websocket connection, but only the specified model is kept from being unloaded; therefore, if latency is at issue, it is recommended to open one websocket connection per model. A reconnection scheme can be used to immediately request a new worker URL and reconnect if the connection is dropped. The Daisys API shall make every effort to ensure that all current requests are handled and results delivered before dropping any connections.

Message format

In cases where you are not using Python and wish to develop your own client for the websocket, the format is kept rather simple and should be quite approachable for any language for which a websocket library is available.

The websocket supports text and bytes (binary) messages. Commands are sent as text messages; status messages (text) and audio messages (bytes) are received. Both outgoing and incoming text messages are in JSON format.

Outgoing messages have the following format:

{"command": "<command>", "data": {<data>}, "request_id": <request_id>}

where command may be one of /takes/generate or /voices/generate.

The data field corresponds to the same POST body given to the corresponding commands, i.e. the TakeGenerate and VoiceGenerate structures, respectively.

Likewise, the status messages received for each correspond to the responses to those same commands, these being TakeResponse and VoiceInfo, respectively. They are similarly bundled into a response structure,

{"data": {<data>}, "request_id": <request_id>}

Special to the websocket connection is request_id, which is needed to match incoming responses with outgoing requests. Because responses to concurrent requests may be interleaved (a short generation can complete before a longer one issued earlier), and because a take_id is not known until the first status message is received, there is otherwise no way to know which audio goes with which request. The request_id is therefore a user-provided identifier, a string or an integer, which is included with the responses to that command. A simple incrementing integer per connection is recommended, and is what the Python interface implements.
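As a minimal sketch of composing such a command message (the voice_id and text values here are placeholders, not real identifiers):

```python
import json

def make_command(command: str, data: dict, request_id) -> str:
    """Compose an outgoing text message for the websocket.

    command is "/takes/generate" or "/voices/generate"; data holds the same
    fields as the POST body of the corresponding REST endpoint (TakeGenerate
    or VoiceGenerate); request_id is a caller-chosen string or integer that
    is echoed back in all responses to this command.
    """
    return json.dumps({"command": command, "data": data,
                       "request_id": request_id})

# Placeholder TakeGenerate fields, for illustration only:
message = make_command("/takes/generate",
                       {"voice_id": "<voice_id>", "text": "Hello there!"},
                       request_id=0)
```

The resulting string is sent as a single websocket text message.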

Audio response messages are also simple; however, since it is necessary to carry some metadata, they contain two sections, delimited by a length prefix. Audio messages (bytes) are formatted thus:

JSON<length 4 bytes><json metadata>RIFF..

That is, they start with the literal string JSON followed by a 32-bit little endian integer indicating how long the metadata section is. The metadata section can be converted to a string and parsed as JSON. This is immediately followed by a .wav file header, which always starts with the literal string RIFF. Therefore, starting at R, the rest of the bytes can be passed to an audio player or a wav file parsing routine if chunking is not used.

The metadata section consists of the following fields,

{"take_id": "<take_id>", "part_id": int, "chunk_id": int, "request_id": <request_id>}

where part_id and chunk_id are incrementing integers as specified in the next section, and request_id reflects whatever was provided when the associated command was issued.
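Putting the above together, a small sketch of decoding such a binary message, assuming the whole message is available as a bytes object (the example metadata values are fabricated):

```python
import json
import struct

def parse_audio_message(message: bytes):
    """Split a binary audio message into (metadata, audio), following the
    layout JSON<length 4 bytes><json metadata><audio bytes>."""
    if message[:4] != b"JSON":
        raise ValueError("not an audio message")
    # 32-bit little-endian length of the metadata section
    (length,) = struct.unpack("<I", message[4:8])
    metadata = json.loads(message[8:8 + length].decode("utf-8"))
    # A complete wav file (starting with b"RIFF"), or a raw chunk
    audio = message[8 + length:]
    return metadata, audio

# Example with a fabricated message:
meta = json.dumps({"take_id": "t1", "part_id": 0, "chunk_id": 0,
                   "request_id": 0}).encode("utf-8")
packed = b"JSON" + struct.pack("<I", len(meta)) + meta + b"RIFF...audio..."
parsed_meta, parsed_audio = parse_audio_message(packed)
```

The audio portion can then be handed to a wav parser or, for chunked streams, concatenated as described in the next section.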

Parts and chunks

If multiple sentences have been provided, then they are returned with separate part_id values, an incrementing integer, where each part consists of a complete wav file. The end of the stream for a take is indicated by a new part_id that has 0 bytes of audio.

If chunking is enabled, the bytes must be concatenated to an existing part stream, either in real time or before writing the part to a file. Chunks differ from parts in that they are not prepended with a wav header; they are merely the individual pieces of a part that is not yet fully received. Like parts, chunks are identified by an incrementing integer chunk_id, which must be used to put them in order before playback. Likewise, the end of the chunk stream for a part is indicated by a new chunk_id accompanied by 0 bytes of audio.

Finally then, a stream of parts without chunking appears like so:

[part_id=0, audio len=12340]
[part_id=1, audio len=23450]
[part_id=2, audio len=0]

and with chunking,

[part_id=0, chunk_id=0, audio len=4140]
[part_id=0, chunk_id=1, audio len=4140]
[part_id=0, chunk_id=2, audio len=0]
[part_id=1, chunk_id=0, audio len=4140]
[part_id=1, chunk_id=1, audio len=4140]
[part_id=1, chunk_id=2, audio len=0]
[part_id=2, chunk_id=0, audio len=0]

The above is for visual explanation only; in reality the take_id and request_id are also included in the metadata header in order to know which audio is for which stream.
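As an illustration of the bookkeeping this implies, here is a sketch that reassembles a chunked stream into complete parts, where the stream is represented as already-parsed (part_id, chunk_id, audio) tuples (the sample bytes are placeholders):

```python
def assemble_parts(messages):
    """Reassemble chunked audio messages into complete parts.

    messages: iterable of (part_id, chunk_id, audio) tuples in arrival
    order. A zero-length audio marks the end of a part (or, for a fresh
    part_id, the end of the stream). Returns the parts in part_id order,
    each a single bytes object starting with its wav header.
    """
    parts = {}
    for part_id, chunk_id, audio in messages:
        if audio:  # zero-length messages only delimit parts and the stream
            parts.setdefault(part_id, {})[chunk_id] = audio
    # concatenate each part's chunks in chunk_id order
    return [b"".join(chunks[i] for i in sorted(chunks))
            for _, chunks in sorted(parts.items())]

stream = [(0, 0, b"RIFFaa"), (0, 1, b"bb"), (0, 2, b""),
          (1, 0, b"RIFFcc"), (1, 1, b""),
          (2, 0, b"")]  # empty part 2 ends the stream
parts = assemble_parts(stream)  # -> [b"RIFFaabb", b"RIFFcc"]
```

A real-time player would instead feed each chunk to the audio device as it arrives, using the same ordering rules.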

If a /voices/generate message was requested, audio of the associated example take will be sent. However, the status message will be a VoiceInfo object, and the take_id included in the audio messages will correspond with its example_take_id field.

Python interface

The Python interface consists of calling websocket() on the client object (see Daisys API clients) in a with context manager, which returns one of the following objects, depending on whether the synchronous or asynchronous client is used. For example,

Streaming audio, websocket method
from daisys import DaisysAPI
with DaisysAPI('speak', email='user@example.com', password='pw') as speak:
    with speak.websocket(model='theatrical-v2') as ws:
        ....
        request_id = ws.generate_take(...

In each case, you can then issue a command to generate a take or a voice using the returned context object, as demonstrated above. The callbacks you provide are subsequently called whenever messages containing either status information or audio data are received on the websocket.
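As a sketch of what such callbacks might look like (the signatures follow the status_callback and audio_callback parameters documented below; the function names and the `received` bookkeeping are illustrative only):

```python
received = []

def on_status(request_id, take):
    """Hypothetical status callback: take is a TakeResponse object."""
    received.append(("status", request_id))

def on_audio(request_id, take_id, part_id, chunk_id, audio):
    """Hypothetical audio callback: audio is None (or empty) at end of stream."""
    if audio:
        received.append(("audio", part_id, len(audio)))

# With an open connection `ws` as in the example above, these would be
# passed when issuing the command, e.g.:
# request_id = ws.generate_take(voice_id='<voice_id>', text='Hello!',
#                               status_callback=on_status,
#                               audio_callback=on_audio)
```

Alternatively, instead of callbacks, the responses for a given request can be consumed with iter_request(), documented below.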

class daisys.v1.speak.sync_websocket.DaisysSyncSpeakWebsocketV1(client: DaisysSyncSpeakClientV1, model: str)

Wrapper for Daisys v1 API websocket connection, synchronous version.

This class is intended to be used in a with clause.

disconnect()

Disconnect this websocket.

generate_take(voice_id: str, text: str, override_language: str | None = None, style: list[str] | None = None, prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, stream_options: StreamOptions | None = None, status_webhook: str | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None, timeout: float | None = None) → int

Generate a “take”, an audio file containing an utterance of the given text by the given voice.

Parameters:
  • voice_id – The id of the voice to be used for generating audio. The voice is attached to a specific model.

  • text – The text that the voice should say.

  • override_language – Normally a language classifier is used to detect the language of the speech; this allows for multilingual sentences. However, if the language should be enforced, it should be provided here. Currently accepted values are “nl-NL” and “en-GB”.

  • style – A list of styles to enable when speaking. Note that most styles are mutually exclusive, so a list of 1 value should be provided. Accepted styles can be retrieved from the associated voice’s VoiceInfo.styles or the model’s TTSModel.styles field. Note that not all models support styles, thus this can be left empty if specific styles are not desired.

  • prosody – The characteristics of the desired speech not determined by the voice or style. Here you can provide a SimpleProsody or most models also accept the more detailed AffectProsody.

  • stream_options – Configuration for streaming.

  • status_webhook – An optional URL to be called using POST whenever the take’s status changes, with TakeResponse in the body content.

  • done_webhook – An optional URL to be called exactly once using POST when the take is READY, ERROR, or TIMEOUT, with TakeResponse in the body content.

  • status_callback – An optional function to call for status updates regarding this take.

  • audio_callback – An optional function to call to provide the audio parts of the take.

Returns:

A request_id identifying this request, which can be used to match incoming responses, for example with iter_request().

Return type:

int

generate_voice(name: str, model: str, gender: VoiceGender, description: str | None = None, default_style: list[str] | None = None, default_prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, example_take: TakeGenerateWithoutVoice | None = None, stream_options: StreamOptions | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None) → int

Generate a random, novel voice for a given model with desired properties.

Parameters:
  • name – A name to give the voice, may be any string, and does not need to be unique.

  • model – The name of the model for this voice.

  • gender – The gender of this voice.

  • description – The description of this voice.

  • default_style – An optional list of styles to associate with this voice by default. It can be overridden by a take that uses this voice. Note that most styles are mutually exclusive, and not all models support styles.

  • default_prosody – An optional default prosody to associate with this voice. It can be overridden by a take that uses this voice.

  • example_take – Information on the take to generate as an example of this voice.

  • stream_options – Configuration for streaming.

  • done_webhook – An optional URL to call exactly once using POST when the voice is available, with VoiceInfo in the body content.

  • status_callback – An optional function to call for status updates regarding this take.

  • audio_callback – An optional function to call to provide the audio parts of the take.

Returns:

A request_id identifying this request, which can be used to match incoming responses, for example with iter_request().

Return type:

int

iter_request(request_id)

Iterate over incoming text and audio messages for a given request_id.

Parameters:

request_id – The id value associated with the request to be iterated over. Returned by generate_take() and generate_voice().

Returns:

An Iterator yielding tuples (take_id, take, header, audio), where:

  • take_id: the take_id associated with this request

  • take: the TakeResponse information if a text message, otherwise None

  • header: the wav header if any, otherwise None

  • audio: the audio bytes, if a binary message, otherwise None

reconnect()

Reconnect this websocket by first fetching the URL and then opening the connection to it.

update(timeout: int | None = 1)

Retrieve a waiting message on the open websocket connection.

Parameters:

timeout – Number of seconds to wait. Can be 0 if non-blocking usage is desired. If None, wait forever.

class daisys.v1.speak.async_websocket.DaisysAsyncSpeakWebsocketV1(client: DaisysAsyncSpeakClientV1, model: str | None, voice_id: str | None)

Wrapper for Daisys v1 API websocket connection, asynchronous version.

This class is intended to be used in an async with clause.

async disconnect()

Disconnect this websocket.

async generate_take(voice_id: str, text: str, override_language: str | None = None, style: list[str] | None = None, prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, stream_options: StreamOptions | None = None, status_webhook: str | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None, timeout: float | None = None) int

Generate a “take”, an audio file containing an utterance of the given text by the given voice.

Parameters:
  • voice_id – The id of the voice to be used for generating audio. The voice is attached to a specific model.

  • text – The text that the voice should say.

  • override_language – Normally a language classifier is used to detect the language of the speech; this allows for multilingual sentences. However, if the language should be enforced, it should be provided here. Currently accepted values are “nl-NL” and “en-GB”.

  • style – A list of styles to enable when speaking. Note that most styles are mutually exclusive, so a list of 1 value should be provided. Accepted styles can be retrieved from the associated voice’s VoiceInfo.styles or the model’s TTSModel.styles field. Note that not all models support styles, thus this can be left empty if specific styles are not desired.

  • prosody – The characteristics of the desired speech not determined by the voice or style. Here you can provide a SimpleProsody or most models also accept the more detailed AffectProsody.

  • stream_options – Configuration for streaming.

  • status_webhook – An optional URL to be called using POST whenever the take’s status changes, with TakeResponse in the body content.

  • done_webhook – An optional URL to be called exactly once using POST when the take is READY, ERROR, or TIMEOUT, with TakeResponse in the body content.

  • status_callback – An optional function to call for status updates regarding this take.

  • audio_callback – An optional function to call to provide the audio parts of the take.

Returns:

A request_id identifying this request, which can be used to match incoming responses, for example with iter_request().

Return type:

int

async generate_voice(name: str, model: str, gender: VoiceGender, description: str | None = None, default_style: list[str] | None = None, default_prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, example_take: TakeGenerateWithoutVoice | None = None, stream_options: StreamOptions | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None) int

Generate a random, novel voice for a given model with desired properties.

Parameters:
  • name – A name to give the voice, may be any string, and does not need to be unique.

  • model – The name of the model for this voice.

  • gender – The gender of this voice.

  • description – The description of this voice.

  • default_style – An optional list of styles to associate with this voice by default. It can be overridden by a take that uses this voice. Note that most styles are mutually exclusive, and not all models support styles.

  • default_prosody – An optional default prosody to associate with this voice. It can be overridden by a take that uses this voice.

  • example_take – Information on the take to generate as an example of this voice.

  • stream_options – Configuration for streaming.

  • done_webhook – An optional URL to call exactly once using POST when the voice is available, with VoiceInfo in the body content.

  • status_callback – An optional function to call for status updates regarding this take.

  • audio_callback – An optional function to call to provide the audio parts of the take.

Returns:

A request_id identifying this request, which can be used to match incoming responses, for example with iter_request().

Return type:

int

async iter_request(request_id)

Iterate over incoming text and audio messages for a given request_id.

Parameters:

request_id – The id value associated with the request to be iterated over. Returned by generate_take() and generate_voice().

Returns:

An Iterator yielding tuples (take_id, take, header, audio), where:

  • take_id: the take_id associated with this request

  • take: the TakeResponse information if a text message, otherwise None

  • header: the wav header if any, otherwise None

  • audio: the audio bytes, if a binary message, otherwise None

async reconnect()

Reconnect this websocket by first fetching the URL and then opening the connection to it.

async update(timeout: int | None = 1)

Retrieve a waiting message on the open websocket connection.

Parameters:

timeout – Number of seconds to wait. In the async implementation this cannot be 0. If None, wait forever.