Daisys API endpoints

While Daisys recommends the use of the Python client, the Daisys API endpoints are available for use with other languages. In addition to the current document, the FastAPI-generated documentation is available:

Swagger UI: https://api.daisys.ai/v1/speak/docs

Redoc: https://api.daisys.ai/v1/speak/redoc

OpenAPI definition file: https://api.daisys.ai/v1/speak/openapi.json

See also the FastAPI documentation on how to generate clients for other languages.

The “Speak” API provides a REST interface to its three main data structures: models, voices, and takes.

This is best demonstrated in the curl example, where JSON objects are constructed as strings in a shell script. See JSON input structures for more information on JSON input.

Take-related Endpoints

The principal service of the Daisys API is text-to-speech audio synthesis. This is done by generating “takes”, each of which encapsulates a TTS job. Previously generated takes can be retrieved via takes, and the list can be filtered in the same way as voices:

https://api.daisys.ai/v1/speak/takes
https://api.daisys.ai/v1/speak/takes?take_id=<take_id1,take_id2>
https://api.daisys.ai/v1/speak/takes?length=5&page=2
https://api.daisys.ai/v1/speak/takes?newer=1690214050638

with similar semantics to /speak/voices described above. A single take can be retrieved by giving its identifier:

https://api.daisys.ai/v1/speak/takes/<take_id>
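
The query forms above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library. The helper names takes_url and take_url and their arguments are illustrative and are not part of the Daisys client library.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.daisys.ai/v1/speak"


def takes_url(take_ids=None, length=None, page=None, newer=None):
    """Build a URL for listing takes, with optional filters."""
    params = {}
    if take_ids:
        # Several identifiers are passed as one comma-separated value.
        params["take_id"] = ",".join(take_ids)
    if length is not None:
        params["length"] = length
    if page is not None:
        params["page"] = page
    if newer is not None:
        # Assumption: a timestamp in the same form as the example value above.
        params["newer"] = newer
    # Keep commas unescaped so the URL matches the form shown in the text.
    query = urlencode(params, safe=",")
    return f"{BASE_URL}/takes" + (f"?{query}" if query else "")


def take_url(take_id):
    """Build the URL of a single take."""
    return f"{BASE_URL}/takes/{take_id}"


print(takes_url(length=5, page=2))
```

Each of these URLs still has to be requested with the Authorization header described under Authentication Endpoints.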

An audio take can be generated by making a POST request to takes/generate:

https://api.daisys.ai/v1/speak/takes/generate

and providing the TakeGenerate structure as input in the content body.
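
As a rough illustration, the request could be prepared as follows with the Python standard library. The function name build_generate_request is hypothetical, and sending the body with a Content-Type of application/json is an assumption based on the endpoint accepting JSON. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

GENERATE_URL = "https://api.daisys.ai/v1/speak/takes/generate"


def build_generate_request(access_token, take_generate):
    """Return a prepared POST request; pass it to urllib.request.urlopen to send it."""
    body = json.dumps(take_generate).encode("utf-8")
    return Request(
        GENERATE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",  # assumption, see lead-in
        },
    )


req = build_generate_request(
    "<access_token>",
    {
        "text": "This is some text to speak.",
        "prosody": {"pace": -3, "pitch": 0, "expression": 4},
        "voice_id": "01h3anwqdh1q6zhf9s9s239wky",
    },
)
print(req.get_method(), req.full_url)
```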

Finally, the audio can be retrieved by accessing the take’s /wav endpoint. Other formats can be retrieved in the same way; however, wav is the only format that can be retrieved before the take is “ready”, which allows the audio to be downloaded while it is still being generated:

https://api.daisys.ai/v1/speak/takes/<take_id>/wav
https://api.daisys.ai/v1/speak/takes/<take_id>/mp3
https://api.daisys.ai/v1/speak/takes/<take_id>/m4a
https://api.daisys.ai/v1/speak/takes/<take_id>/flac
https://api.daisys.ai/v1/speak/takes/<take_id>/webm

Note that these endpoints return a 307 redirect to the location from which the audio can be streamed or downloaded.

Important: a complication is that S3 presigned URLs must be accessed without the Daisys “Authorization” header, which some HTTP clients will not drop automatically. Therefore the following logic is recommended, and is performed by the Python client library when following the redirect to url:

if 'X-Amz-Signature' in url:
  # Pre-signed URL, no auth needed.
  headers = {}

Note that browsers handle this automatically when changing origins. However, accessing the REST API endpoints directly from the browser is not recommended in any case, since they require the access token. Instead, backend software can access the /wav endpoint, retrieve the URL in the Location header, and forward it to the browser. This URL can be accessed without the Authorization header and has a limited lifetime, so it is convenient and more secure to pass directly to an Audio Player object on the client side.
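
The backend flow described above might look roughly like the following sketch. It uses http.client, which does not follow redirects on its own, so the Location header can be read directly. The function names headers_for and audio_location are hypothetical, and treating any status other than 307 as an error is an assumption.

```python
import http.client
from urllib.parse import urlsplit


def headers_for(url, auth_headers):
    """Drop the Daisys Authorization header when the target is a pre-signed URL."""
    if "X-Amz-Signature" in url:
        # Pre-signed URL, no auth needed.
        return {}
    return dict(auth_headers)


def audio_location(take_id, access_token, fmt="wav"):
    """Request a take's audio endpoint and return the redirect target URL."""
    api = urlsplit(f"https://api.daisys.ai/v1/speak/takes/{take_id}/{fmt}")
    conn = http.client.HTTPSConnection(api.netloc)
    try:
        conn.request(
            "GET", api.path, headers={"Authorization": f"Bearer {access_token}"}
        )
        response = conn.getresponse()
        if response.status != 307:
            raise RuntimeError(f"expected a 307 redirect, got {response.status}")
        return response.getheader("Location")
    finally:
        conn.close()
```

A backend would call audio_location and hand the returned URL to the browser. If the backend downloads the audio itself, headers_for picks the headers to send to the redirect target.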

Websocket Endpoints

A URL for making a direct websocket connection to a worker can be retrieved by issuing a GET request to the following endpoint:

https://api.daisys.ai/v1/speak/websocket?model=<model>

As can be seen, the model to use must be specified when making a request for a worker URL, which allows the Daisys API to better distribute requests to workers with preloaded models.

For the same reason, whenever a websocket is disconnected, a new URL must be requested through the above endpoint. Disconnection may happen from time to time but shall not happen during the processing of a request. The provided URLs expire after 1 hour. A connection may remain open longer than that, but new connections must request a new URL.

The endpoint returns the following JSON body:

{
  "websocket_url": "<url>"
}
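
A small sketch of how a client might build the request URL and read the response is given below. The helper names are hypothetical, the model name eng_base is borrowed from the voices/generate example in this document, and the worker URL used in the example is invented for illustration. Network access is injected as a function so that the logic can be followed without making a request.

```python
import json
from urllib.parse import urlencode

URL_LIFETIME_SECONDS = 3600  # the text states that URLs expire after 1 hour


def websocket_endpoint(model):
    """Build the GET URL that returns a worker URL for the given model."""
    return "https://api.daisys.ai/v1/speak/websocket?" + urlencode({"model": model})


def parse_websocket_url(body):
    """Extract the worker URL from the JSON response body."""
    return json.loads(body)["websocket_url"]


def is_expired(issued_at, now):
    """True if a URL issued at issued_at (in seconds) should no longer be used."""
    return now - issued_at >= URL_LIFETIME_SECONDS


def worker_url(fetch_body, model):
    """Request a fresh worker URL; call this again before every new connection.

    fetch_body is a hypothetical callable(url) -> str that performs the
    authenticated GET request and returns the response body.
    """
    return parse_websocket_url(fetch_body(websocket_endpoint(model)))


# Example with a stand-in for the HTTP call:
example = worker_url(lambda url: '{"websocket_url": "wss://worker.example/ws"}', "eng_base")
print(example)
```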

Authentication Endpoints

To make use of the Daisys API, an access token must first be obtained. This is done with a POST request to the auth/login endpoint:

https://api.daisys.ai/auth/login

The content body should have the form:

{
  "email": "<user@example.com>",
  "password": "<password>"
}

On failure, a 401 HTTP status is returned. (In the client library, an exception is raised.) On success, a JSON object containing access_token and refresh_token fields is provided.

The access_token string should be attached to all GET and POST requests in the HTTP header, in the following form:

Authorization: Bearer <access_token>
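
The login exchange and the resulting header can be sketched as follows. The function names are hypothetical, and the Content-Type header is an assumption. The request object is only prepared here; passing it to urllib.request.urlopen would send it, and a failed login would then surface as an HTTPError with status 401.

```python
import json
from urllib.request import Request

LOGIN_URL = "https://api.daisys.ai/auth/login"


def build_login_request(email, password):
    """Prepare the POST request for auth/login."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return Request(
        LOGIN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumption
    )


def parse_tokens(body):
    """Return (access_token, refresh_token) from a successful login response."""
    data = json.loads(body)
    return data["access_token"], data["refresh_token"]


def auth_header(access_token):
    """Header to attach to all subsequent GET and POST requests."""
    return {"Authorization": f"Bearer {access_token}"}


login = build_login_request("user@example.com", "<password>")
print(login.get_method(), login.full_url)
```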

Furthermore, if the access_token is no longer working, the refresh_token can be used to get a new one without supplying the password:

https://api.daisys.ai/auth/refresh

In this case the POST request should have the form:

{
  "email": "<user@example.com>",
  "refresh_token": "<refresh_token>"
}

The response contains new access_token and refresh_token fields. This makes it possible to keep refreshing an initial token whenever needed, so that the API can be used without providing a password.

Note that this token refresh logic is taken care of automatically by the Python client library. The client can also be initialized with just an email and refresh token rather than an email and password, so that the password need not be provided to the Daisys API client. Alternatively, it is possible to request a permatoken, which does not need to be refreshed.
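
For clients in other languages, the refresh logic could follow a pattern like the one sketched here in Python. This is not the client library's implementation. The class TokenSession and the injected post_json and send callables are hypothetical, and the sketch assumes that a request made with a stale access token is rejected with a 401 status.

```python
REFRESH_URL = "https://api.daisys.ai/auth/refresh"


class TokenSession:
    """Holds the current token pair and refreshes it when a request is rejected."""

    def __init__(self, email, access_token, refresh_token, post_json):
        self.email = email
        self.access_token = access_token
        self.refresh_token = refresh_token
        self._post_json = post_json  # hypothetical: callable(url, payload) -> dict

    def refresh(self):
        tokens = self._post_json(
            REFRESH_URL,
            {"email": self.email, "refresh_token": self.refresh_token},
        )
        # The response carries a new pair; both values replace the old ones.
        self.access_token = tokens["access_token"]
        self.refresh_token = tokens["refresh_token"]

    def call(self, send):
        """Run send(access_token) -> (status, result), retrying once after a refresh."""
        status, result = send(self.access_token)
        if status == 401:  # assumption: a stale access token is rejected with 401
            self.refresh()
            status, result = send(self.access_token)
        return status, result


# Stand-ins for real HTTP calls, so the flow can be followed without a network.
def fake_post_json(url, payload):
    return {"access_token": "new-access", "refresh_token": "new-refresh"}


def fake_send(access_token):
    return (200, "ok") if access_token == "new-access" else (401, None)


session = TokenSession("user@example.com", "old-access", "old-refresh", fake_post_json)
outcome = session.call(fake_send)
print(outcome, session.refresh_token)
```

Storing the new refresh_token after every refresh matters in this sketch, since the response replaces both tokens.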

On the other hand, refresh tokens can be revoked at any time through the following POST endpoint:

https://api.daisys.ai/auth/logout

with content body of the form:

{
  "refresh_token": "<refresh_token>"
}

JSON input structures

POST endpoints, namely takes/generate and voices/generate, take input in their content body in the form of JSON objects.

The structure of all such objects can be inferred by reading the models, since the fields can be translated directly to JSON. Nonetheless some of the embedded structures and optional fields can be confusing, thus we give some examples here.

A minimal example of TakeGenerate:

{
  "text": "This is some text to speak.",
  "prosody": {"pace": -3, "pitch": 0, "expression": 4},
  "voice_id": "01h3anwqdh1q6zhf9s9s239wky"
}

Optional fields such as style, override_language, and done_webhook can be added as desired.

Here is an example of TakeGenerate using all available fields:

{
  "text": "This is some text to speak.",
  "override_language": "en-GB",
  "prosody": {"pace": -3, "pitch": 0, "expression": 4},
  "voice_id": "01h3anwqdh1q6zhf9s9s239wky",
  "style": ["narrator"],
  "status_webhook": "https://myservice.com/daisys_webhooks/take_status/1234",
  "done_webhook": "https://myservice.com/daisys_webhooks/take_done/1234"
}

Note that override_language is provided here as an example, but if it is not provided (is null) then the Daisys API will attempt to pronounce words in the correct language on a per-word basis. If it is provided, then the model may for example mispronounce loan words, since it assumes a single language for the input text. The presence of the style field depends on the model in use, as do the supported prosody types, although all models support the simple prosody type with pace, pitch, and expression being integer values from -10 to 10. Specific information about the model can be retrieved from the /speak/models endpoint.
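
When building these structures in code, it can help to validate the simple prosody range before sending. The following sketch does that and omits optional fields that were not set. The helper names simple_prosody and take_generate are hypothetical, and rejecting out-of-range values on the client side is a design choice of this sketch rather than documented API behavior.

```python
import json


def simple_prosody(pace=0, pitch=0, expression=0):
    """Return a simple prosody object, checking the documented -10..10 range."""
    values = {"pace": pace, "pitch": pitch, "expression": expression}
    for name, value in values.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not -10 <= value <= 10:
            raise ValueError(f"{name} must be from -10 to 10, got {value}")
    return values


def take_generate(text, voice_id, prosody=None, **optional):
    """Assemble a TakeGenerate body; optional fields left as None are omitted."""
    body = {"text": text, "voice_id": voice_id, "prosody": prosody}
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = take_generate(
    "This is some text to speak.",
    "01h3anwqdh1q6zhf9s9s239wky",
    simple_prosody(pace=-3, expression=4),
    style=["narrator"],
    override_language=None,  # omitted, so the language is chosen per word
)
print(json.dumps(body, indent=2))
```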

Finally, here is an example of input for voices/generate:

{
  "name": "Bob",
  "default_prosody": {"pace": 0, "pitch": 0, "expression": 0},
  "model": "eng_base",
  "gender": "male",
  "done_webhook": "https://myservice.com/daisys_webhooks/voice_done/1234"
}

Here, a default prosody is specified for the voice, which is adopted in subsequent takes/generate requests if prosody is not provided (left as null).
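
The fallback can be pictured with a short sketch. The function effective_prosody only mirrors the behavior described above for illustration. It is not the server's code, and the voice_id placeholder stands for the identifier of the generated voice. Note that Python's None is serialized as JSON null.

```python
import json

voice_request = {
    "name": "Bob",
    "default_prosody": {"pace": 0, "pitch": 0, "expression": 0},
    "model": "eng_base",
    "gender": "male",
}


def effective_prosody(take_prosody, voice):
    """A null take prosody defers to the default prosody of the voice."""
    if take_prosody is None:
        return voice["default_prosody"]
    return take_prosody


# A take request that leaves prosody as null and so relies on the voice default.
take_without_prosody = {"text": "Hello.", "voice_id": "<voice_id>", "prosody": None}
print(json.dumps(take_without_prosody))
print(effective_prosody(take_without_prosody["prosody"], voice_request))
```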