Daisys API websockets¶
The Daisys API provides a websocket interface to enable direct communication with a single inference worker node, for applications that require lower latency.
Latency vs. throughput¶
While the websocket connection provides some convenience for certain applications, it should not be used for tasks where a batch approach is more appropriate, since generation requests through the REST API get distributed over multiple workers and will overall finish faster. However, applications that require real-time or near real-time interaction may benefit from keeping a connection open and receiving the response immediately without making an extra HTTP GET request.
Please do keep in mind that dedicating a connection effectively serializes the requests sent over it, so choosing between the websocket and the REST API is a typical latency vs. throughput tradeoff. For this reason, to help guarantee latency, the websocket system reserves the right to occasionally drop the connection, forcing the client to request a new URL; this rebalances the distribution of connections to workers, helping to ensure lower latency overall.
In this document we describe:
Connecting: How to get a websocket URL and make and maintain a connection.
Message format: The message format used to send commands and receive responses.
Python interface: How to use this Python library to communicate over the websocket.
Examples of using the websocket connection from the Python API as well as communicating with the websocket from JavaScript in a browser application are given in Daisys API websocket examples.
Connecting¶
In order to connect to a Daisys worker node, you must first be assigned a node through the API. In Python, this is taken care of for you; see Python interface below. However, when using another language or curl, you can get a URL via the Websocket Endpoints using a GET request.
As mentioned there, the websocket may disconnect between requests to rebalance worker load, although this should not happen frequently. Additionally, a websocket connection is assumed to be for interaction with a specific model, which must be included in the URL. In fact any model can be used on a websocket connection, but only the specified model is kept from being unloaded; therefore, if latency is a concern, it is recommended to open one websocket connection per model. A reconnection scheme can be used to immediately request a new worker URL and reconnect if the connection is dropped. The Daisys API makes every effort to ensure that all current requests are handled and results delivered before dropping any connections.
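The reconnection scheme mentioned above can be sketched as a small retry loop. This is an illustrative sketch only: fetch_url and connect stand in for your HTTP GET to the websocket endpoint and for your websocket library's connect call, and the fakes below merely simulate a dropped connection.

```python
import time

def connect_with_retry(fetch_url, connect, attempts=3, delay=1.0):
    """Fetch a fresh worker URL and connect, retrying on failure.

    fetch_url and connect are placeholders for your HTTP GET to the
    websocket endpoint and your websocket library's connect call.
    """
    last_error = None
    for _ in range(attempts):
        try:
            # Always re-fetch the URL: a dropped connection may mean the
            # worker assignment has been rebalanced to a different node.
            return connect(fetch_url())
        except ConnectionError as error:
            last_error = error
            time.sleep(delay)
    raise last_error

# Simulated transport: the first connect attempt fails, the second succeeds.
state = {"calls": 0}

def fake_fetch_url():
    return "wss://worker.example/ws"  # hypothetical URL for illustration

def fake_connect(url):
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("dropped for rebalancing")
    return f"connected:{url}"

result = connect_with_retry(fake_fetch_url, fake_connect, delay=0.0)
```

Re-fetching the URL on every attempt, rather than reusing the old one, is what gives the server the opportunity to rebalance connections across workers.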
Message format¶
In cases where you are not using Python and wish to develop your own client for the websocket, the format is kept rather simple and should be quite approachable for any language for which a websocket library is available.
Websocket supports text and bytes (binary) messages. Commands are sent using text messages, and status messages (text) and audio messages (bytes) are received. Both outgoing and incoming text messages are in JSON format.
Outgoing messages have the following format:
{"command": "<command>", "data": {<data>}, "request_id": <request_id>}
where command may be one of /takes/generate or /voices/generate. The data field corresponds to the same POST body given to the corresponding commands, i.e. the TakeGenerate and VoiceGenerate structures, respectively.
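A minimal helper for producing such outgoing text messages might look as follows. This is a sketch, not part of the official client; the voice_id value is a made-up placeholder.

```python
import json
from itertools import count

_request_ids = count()  # simple incrementing request_id per connection

def make_command(command, data):
    """Serialize an outgoing websocket command as a JSON text message."""
    request_id = next(_request_ids)
    message = json.dumps(
        {"command": command, "data": data, "request_id": request_id}
    )
    return request_id, message

# "v1_example" is a hypothetical voice_id for illustration only.
request_id, message = make_command(
    "/takes/generate", {"voice_id": "v1_example", "text": "Hello!"}
)
parsed = json.loads(message)
```

The returned request_id is kept by the caller so that later status and audio messages can be matched back to this command.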
Likewise, the status messages received for each correspond to the responses to those same commands, these being TakeResponse and VoiceInfo, respectively. They are similarly bundled into a response structure,
{"data": {<data>}, "request_id": <request_id>}
Special to the websocket connection is request_id, which is needed to track which incoming responses go with which outgoing requests. Because websockets do not guarantee message order (shorter messages may arrive before longer messages), and because a take_id is not known until the first status message is received, there is otherwise no way to know which audio goes with which request. Therefore the request_id is a user-provided identifier, a string or an integer, which is included with the responses to that command. A simple incrementing integer per connection is recommended, and is what the Python interface implements.
Audio response messages are also simple; however, since they must carry some metadata, they contain two sections, delimited by a length prefix. Audio messages (bytes) are formatted thus:
JSON<length 4 bytes><json metadata>RIFF..
That is, they start with the literal string JSON followed by a 32-bit little-endian integer indicating how long the metadata section is. The metadata section can be converted to a string and parsed as JSON. This is immediately followed by a .wav file header, which always starts with the literal string RIFF. Therefore, starting at R, the rest of the bytes can be passed to an audio player or a wav file parsing routine if chunking is not used.
The metadata section consists of the following fields,
{"take_id": "<take_id>", "part_id": int, "chunk_id": int, "request_id": <request_id>}
where part_id and chunk_id are incrementing integers as specified in the next section, and request_id reflects whatever was provided when the associated command was issued.
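The binary layout above can be parsed in a few lines. The following sketch builds a synthetic audio message and splits it back into metadata and wav bytes; field values are placeholders.

```python
import json
import struct

def parse_audio_message(payload: bytes):
    """Split an audio message into its JSON metadata and wav bytes."""
    if payload[:4] != b"JSON":
        raise ValueError("not an audio message")
    # 32-bit little-endian length of the metadata section.
    (length,) = struct.unpack_from("<I", payload, 4)
    metadata = json.loads(payload[8:8 + length].decode("utf-8"))
    # The remainder starts with b"RIFF" for an un-chunked part;
    # chunks after the first carry no wav header, and terminators are empty.
    audio = payload[8 + length:]
    return metadata, audio

# Build a synthetic message to exercise the parser.
meta = json.dumps(
    {"take_id": "t1", "part_id": 0, "chunk_id": 0, "request_id": 7}
).encode("utf-8")
payload = b"JSON" + struct.pack("<I", len(meta)) + meta + b"RIFF" + b"\x00" * 8
metadata, audio = parse_audio_message(payload)
```

Everything from the R of RIFF onward can then be handed to a wav parser or player, as described above.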
Parts and chunks¶
If multiple sentences have been provided, then they are returned with separate part_id values, which are an incrementing integer, where each part consists of a complete wav file. The end of the stream for a take is indicated by a new part_id that has 0 bytes of audio.
If chunking is enabled, the bytes must be concatenated to an existing part stream, either in real time or before writing the part to a file. Chunks differ from parts in that they are not prepended with a wav header, but are merely the individual pieces of a part that is not yet fully received. Similar to parts, chunks are identified with an incrementing integer chunk_id which must be used to put them in order before playback. Also similarly, the end of the chunk stream for a part is indicated by a new chunk_id accompanied by 0 bytes of audio.
Finally then, a stream of parts without chunking appears like so:
[part_id=0, audio len=12340]
[part_id=1, audio len=23450]
[part_id=2, audio len=0]
and with chunking,
[part_id=0, chunk_id=0, audio len=4140]
[part_id=0, chunk_id=1, audio len=4140]
[part_id=0, chunk_id=2, audio len=0]
[part_id=1, chunk_id=0, audio len=4140]
[part_id=1, chunk_id=1, audio len=4140]
[part_id=1, chunk_id=2, audio len=0]
[part_id=2, chunk_id=0, audio len=0]
The above is for visual explanation only; in reality the take_id and request_id are also included in the metadata header in order to know which audio belongs to which stream.
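Reassembling a chunked stream like the one listed above can be sketched as follows. This is an illustrative helper, not part of the library, and it assumes each part's zero-length terminator chunk arrives after that part's data chunks.

```python
def assemble_parts(messages):
    """Collect (part_id, chunk_id, audio) triples into complete parts.

    A zero-length chunk ends its part; a zero-length chunk for a part
    that has no buffered audio marks the end of the whole stream.
    """
    buffers = {}  # part_id -> {chunk_id: bytes}
    parts = []
    for part_id, chunk_id, audio in messages:
        if audio:
            buffers.setdefault(part_id, {})[chunk_id] = audio
        elif part_id in buffers:
            # End of this part: concatenate its chunks in chunk_id order,
            # since message order is not guaranteed.
            chunks = buffers.pop(part_id)
            parts.append(b"".join(chunks[i] for i in sorted(chunks)))
        else:
            break  # empty chunk for an unseen part: end of stream
    return parts

# The chunked listing above, with placeholder audio bytes:
stream = [
    (0, 0, b"aa"), (0, 1, b"bb"), (0, 2, b""),
    (1, 0, b"cc"), (1, 1, b"dd"), (1, 2, b""),
    (2, 0, b""),
]
parts = assemble_parts(stream)
```

Sorting by chunk_id before concatenation is what handles the possibility of out-of-order delivery noted earlier.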
If a /voices/generate message was requested, audio of the associated example take will be sent. However, the status message will be a VoiceInfo object, and the take_id included in the audio messages will correspond with its example_take_id field.
Python interface¶
The Python interface consists of calling, websocket()
of the client
object (see Daisys API clients), in a with
context manager, which returns
respectively one of the following objects. For example,
from daisys import DaisysAPI
with DaisysAPI('speak', email='user@example.com', password='pw') as speak:
    with speak.websocket(model='theatrical-v2') as ws:
        ...
        request_id = ws.generate_take(...
In each case, you can then issue a command to generate a take or a voice using the returned context object, as demonstrated above. Subsequently the callbacks you provide get called whenever messages are received on the websocket containing either status information or audio data.
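The callbacks follow the signatures shown in generate_take below: the status callback receives (request_id, TakeResponse) and the audio callback receives (request_id, take_id, part_id, chunk_id, audio). The sketch below exercises such callbacks with simulated calls and placeholder values rather than a live connection.

```python
statuses = []
audio_parts = []

def status_callback(request_id, take):
    # Called whenever a status (text) message arrives for this request.
    statuses.append((request_id, take))

def audio_callback(request_id, take_id, part_id, chunk_id, audio):
    # Called once per audio (binary) message; audio may be None/empty
    # for the zero-length terminators described above.
    if audio:
        audio_parts.append((part_id, audio))

# Simulate what the websocket layer would do on incoming messages;
# the dict stands in for a TakeResponse, and the bytes are placeholders.
status_callback(0, {"take_id": "t1", "status": "ready"})
audio_callback(0, "t1", 0, None, b"RIFF...")
```

Keeping the callbacks small and pushing data onto queues or lists, as here, makes them easy to test and keeps the websocket receive loop responsive.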
- class daisys.v1.speak.sync_websocket.DaisysSyncSpeakWebsocketV1(client: DaisysSyncSpeakClientV1, model: str)¶
Wrapper for Daisys v1 API websocket connection, synchronous version.
This class is intended to be used in a with clause.
- disconnect()¶
Disconnect this websocket.
- generate_take(voice_id: str, text: str, override_language: str | None = None, style: list[str] | None = None, prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, stream_options: StreamOptions | None = None, status_webhook: str | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None, timeout: float | None = None) int ¶
Generate a “take”, an audio file containing an utterance of the given text by the given voice.
- Parameters:
voice_id – The id of the voice to be used for generating audio. The voice is attached to a specific model.
text – The text that the voice should say.
override_language – Normally a language classifier is used to detect the language of the speech; this allows for multilingual sentences. However, if the language should be enforced, it should be provided here. Currently accepted values are “nl-NL” and “en-GB”.
style – A list of styles to enable when speaking. Note that most styles are mutually exclusive, so a list of 1 value should be provided. Accepted styles can be retrieved from the associated voice’s VoiceInfo.styles or the model’s TTSModel.styles field. Note that not all models support styles, thus this can be left empty if specific styles are not desired.
prosody – The characteristics of the desired speech not determined by the voice or style. Here you can provide a SimpleProsody or most models also accept the more detailed AffectProsody.
stream_options – Configuration for streaming.
status_webhook – An optional URL to be called using POST whenever the take’s status changes, with TakeResponse in the body content.
done_webhook – An optional URL to be called exactly once using POST when the take is READY, ERROR, or TIMEOUT, with TakeResponse in the body content.
status_callback – An optional function to call for status updates regarding this take.
audio_callback – An optional function to call to provide the audio parts of the take.
- Returns:
The request_id associated with this generation request.
- Return type:
int
- generate_voice(name: str, model: str, gender: VoiceGender, description: str | None = None, default_style: list[str] | None = None, default_prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, example_take: TakeGenerateWithoutVoice | None = None, stream_options: StreamOptions | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None) int ¶
Generate a random, novel voice for a given model with desired properties.
- Parameters:
name – A name to give the voice, may be any string, and does not need to be unique.
model – The name of the model for this voice.
gender – The gender of this voice.
description – The description of this voice.
default_style – An optional list of styles to associate with this voice by default. It can be overridden by a take that uses this voice. Note that most styles are mutually exclusive, and not all models support styles.
default_prosody – An optional default prosody to associate with this voice. It can be overridden by a take that uses this voice.
example_take – Information on the take to generate as an example of this voice.
stream_options – Configuration for streaming.
done_webhook – An optional URL to call exactly once using POST when the voice is available, with VoiceInfo in the body content.
status_callback – An optional function to call for status updates regarding this take.
audio_callback – An optional function to call to provide the audio parts of the take.
- Returns:
The request_id associated with this generation request.
- Return type:
int
- iter_request(request_id)¶
Iterate over incoming text and audio messages for a given request_id.
- Parameters:
request_id – The id value associated with the request to be iterated over. Returned by generate_take() and generate_voice().
- Returns:
An iterator yielding tuples (take_id, take, header, audio), where:
take_id: the take_id associated with this request
take: the TakeResponse information if a text message, otherwise None
header: the wav header if any, otherwise None
audio: the audio bytes if a binary message, otherwise None
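A consumer of these tuples can be sketched as follows. The function below takes any iterable of (take_id, take, header, audio) tuples of the shape yielded by iter_request(); the simulated yields and their byte values are placeholders.

```python
def collect_audio(tuples):
    """Fold (take_id, take, header, audio) tuples into a status and a
    single byte string of wav data."""
    last_status = None
    audio = bytearray()
    for take_id, take, header, chunk in tuples:
        if take is not None:      # text message: a status update
            last_status = take
        if header is not None:    # wav header preceding the first audio bytes
            audio += header
        if chunk is not None:     # binary message: audio bytes
            audio += chunk
    return last_status, bytes(audio)

# Simulated yields for one short take (placeholder bytes):
fake = [
    ("t1", {"status": "ready"}, None, None),
    ("t1", None, b"RIFF....WAVE", b"\x01\x02"),
    ("t1", None, None, b"\x03\x04"),
]
status, audio = collect_audio(fake)
```

The resulting bytes begin with the RIFF header, so they can be written directly to a .wav file or passed to a player.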
- reconnect()¶
Reconnect this websocket by first fetching the URL and then opening the connection to it.
- update(timeout: int | None = 1)¶
Retrieve a waiting message on the open websocket connection.
- Parameters:
timeout – Number of seconds to wait. Can be 0 if non-blocking usage is desired. If None, wait forever.
- class daisys.v1.speak.async_websocket.DaisysAsyncSpeakWebsocketV1(client: DaisysAsyncSpeakClientV1, model: str | None, voice_id: str | None)¶
Wrapper for Daisys v1 API websocket connection, asynchronous version.
This class is intended to be used in an async with clause.
- async disconnect()¶
Disconnect this websocket.
- async generate_take(voice_id: str, text: str, override_language: str | None = None, style: list[str] | None = None, prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, stream_options: StreamOptions | None = None, status_webhook: str | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None, timeout: float | None = None) int ¶
Generate a “take”, an audio file containing an utterance of the given text by the given voice.
- Parameters:
voice_id – The id of the voice to be used for generating audio. The voice is attached to a specific model.
text – The text that the voice should say.
override_language – Normally a language classifier is used to detect the language of the speech; this allows for multilingual sentences. However, if the language should be enforced, it should be provided here. Currently accepted values are “nl-NL” and “en-GB”.
style – A list of styles to enable when speaking. Note that most styles are mutually exclusive, so a list of 1 value should be provided. Accepted styles can be retrieved from the associated voice’s VoiceInfo.styles or the model’s TTSModel.styles field. Note that not all models support styles, thus this can be left empty if specific styles are not desired.
prosody – The characteristics of the desired speech not determined by the voice or style. Here you can provide a SimpleProsody or most models also accept the more detailed AffectProsody.
stream_options – Configuration for streaming.
status_webhook – An optional URL to be called using POST whenever the take’s status changes, with TakeResponse in the body content.
done_webhook – An optional URL to be called exactly once using POST when the take is READY, ERROR, or TIMEOUT, with TakeResponse in the body content.
status_callback – An optional function to call for status updates regarding this take.
audio_callback – An optional function to call to provide the audio parts of the take.
- Returns:
The request_id associated with this generation request.
- Return type:
int
- async generate_voice(name: str, model: str, gender: VoiceGender, description: str | None = None, default_style: list[str] | None = None, default_prosody: SimpleProsody | AffectProsody | SignalProsody | None = None, example_take: TakeGenerateWithoutVoice | None = None, stream_options: StreamOptions | None = None, done_webhook: str | None = None, status_callback: Callable[[int, TakeResponse], None] | None = None, audio_callback: Callable[[int, str, int, int | None, bytes | None], None] | None = None) int ¶
Generate a random, novel voice for a given model with desired properties.
- Parameters:
name – A name to give the voice, may be any string, and does not need to be unique.
model – The name of the model for this voice.
gender – The gender of this voice.
description – The description of this voice.
default_style – An optional list of styles to associate with this voice by default. It can be overridden by a take that uses this voice. Note that most styles are mutually exclusive, and not all models support styles.
default_prosody – An optional default prosody to associate with this voice. It can be overridden by a take that uses this voice.
example_take – Information on the take to generate as an example of this voice.
stream_options – Configuration for streaming.
done_webhook – An optional URL to call exactly once using POST when the voice is available, with VoiceInfo in the body content.
status_callback – An optional function to call for status updates regarding this take.
audio_callback – An optional function to call to provide the audio parts of the take.
- Returns:
The request_id associated with this generation request.
- Return type:
int
- async iter_request(request_id)¶
Iterate over incoming text and audio messages for a given request_id.
- Parameters:
request_id – The id value associated with the request to be iterated over. Returned by generate_take() and generate_voice().
- Returns:
An iterator yielding tuples (take_id, take, header, audio), where:
take_id: the take_id associated with this request
take: the TakeResponse information if a text message, otherwise None
header: the wav header if any, otherwise None
audio: the audio bytes if a binary message, otherwise None
- async reconnect()¶
Reconnect this websocket by first fetching the URL and then opening the connection to it.
- async update(timeout: int | None = 1)¶
Retrieve a waiting message on the open websocket connection.
- Parameters:
timeout – Number of seconds to wait. In the async implementation this cannot be 0. If None, wait forever.