.. _v1_speak_endpoints:

Daisys API endpoints
====================

While Daisys recommends the use of the Python client, the Daisys API endpoints
are available for use with other languages.  In addition to the current
document, the FastAPI-generated documentation is available:

- Swagger UI: https://api.daisys.ai/v1/speak/docs
- Redoc: https://api.daisys.ai/v1/speak/redoc
- OpenAPI definition file: https://api.daisys.ai/v1/speak/openapi.json

See also the FastAPI documentation on how to `generate clients`_ for other
languages.

.. _generate clients: https://fastapi.tiangolo.com/advanced/generate-clients

The "Speak" API provides a REST interface to its three main data structures:
models, voices, and takes.  This is best demonstrated in the :ref:`curl
example`, where JSON objects are constructed as strings in a shell script.
See `JSON input structures`_ for more information on JSON input.

.. _v1_speak_model_endpoints:

Model-related Endpoints
-----------------------

A Daisys API user account may have access to one or more *models*.  These
models can be listed by accessing the ``models`` endpoint using a ``GET``
request::

    https://api.daisys.ai/v1/speak/models

Furthermore, a specific model can be accessed by providing its name::

    https://api.daisys.ai/v1/speak/models/<model_name>

.. _v1_speak_voice_endpoints:

Voice-related Endpoints
-----------------------

A Daisys API user account may have access to one or more *voices*.  These
voices can be listed by accessing the ``voices`` endpoint using a ``GET``
request.  The voice listing can also be filtered by providing the fields
``voice_id`` (a comma-separated list of ``voice_id`` values to retrieve),
``length``, ``page``, ``older``, and ``newer``.  In these cases a list is
returned::

    https://api.daisys.ai/v1/speak/voices
    https://api.daisys.ai/v1/speak/voices?voice_id=<voice_id>
    https://api.daisys.ai/v1/speak/voices?length=5&page=2
    https://api.daisys.ai/v1/speak/voices?newer=1690214050638

The argument for ``newer`` and ``older`` must be a timestamp in milliseconds
since the epoch.  This can be computed in Python, for example, using::

    import time
    import requests

    def seconds_ago(seconds: int = 2):
        return int((time.time() - seconds) * 1000)

    response = requests.get(
        f'https://api.daisys.ai/v1/speak/voices?newer={seconds_ago(2)}',
        headers={'Authorization': 'Bearer ' + access_token})

Furthermore, a specific voice can be accessed by providing its identifier.  In
this case a single item is returned instead of a list::

    https://api.daisys.ai/v1/speak/voices/<voice_id>

User accounts may also have access to generate new voices.  This can be done
by making a ``POST`` request to the ``voices/generate`` endpoint::

    https://api.daisys.ai/v1/speak/voices/generate

The body should contain the :class:`VoiceGenerate` structure in JSON format.
Example::

    curl -X POST -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
         -d '{"name": "Bob", "gender": "male", "model": "my_model"}' \
         https://api.daisys.ai/v1/speak/voices/generate

where ``my_model`` should be the name of a model listed by the
``/speak/models`` endpoint.
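For example, the voice list can be walked page by page.  The following is a
minimal sketch, assuming the ``requests`` library, a valid ``access_token``,
that pages are numbered from 1, and that an empty list is returned past the
last page::

    import requests

    headers = {'Authorization': 'Bearer ' + access_token}

    # Walk the voice list five entries at a time.
    page = 1
    while True:
        voices = requests.get('https://api.daisys.ai/v1/speak/voices',
                              params={'length': 5, 'page': page},
                              headers=headers).json()
        if not voices:
            break  # Assumed: an empty page marks the end of the list.
        for voice in voices:
            print(voice['voice_id'])
        page += 1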
.. _v1_speak_take_endpoints:

Take-related Endpoints
----------------------

The principal service of the Daisys API is to perform text-to-speech audio
synthesis.  This is done by generating "takes", which encapsulate a TTS job.

Previously generated takes can be retrieved via ``takes``, and the list can be
filtered similarly to voices::

    https://api.daisys.ai/v1/speak/takes
    https://api.daisys.ai/v1/speak/takes?take_id=<take_id>
    https://api.daisys.ai/v1/speak/takes?length=5&page=2
    https://api.daisys.ai/v1/speak/takes?newer=1690214050638

with similar semantics to ``/speak/voices`` described above.

A single take can be retrieved by giving its identifier::

    https://api.daisys.ai/v1/speak/takes/<take_id>

An audio take can be generated by making a ``POST`` request to
``takes/generate``::

    https://api.daisys.ai/v1/speak/takes/generate

and providing the :class:`TakeGenerate` structure as input in the content
body.

.. _v1_speak_endpoints_retrieving_audio:

Retrieving audio
................

Finally, the audio can be retrieved by accessing the take's ``/wav`` endpoint.
Other formats can be retrieved the same way; however, ``wav`` is the only
format that can be retrieved before the take is "ready", allowing it to be
downloaded as it is generated::

    https://api.daisys.ai/v1/speak/takes/<take_id>/wav
    https://api.daisys.ai/v1/speak/takes/<take_id>/mp3
    https://api.daisys.ai/v1/speak/takes/<take_id>/m4a
    https://api.daisys.ai/v1/speak/takes/<take_id>/flac
    https://api.daisys.ai/v1/speak/takes/<take_id>/webm

Note that these endpoints return a 307 redirect to a location from which the
audio can be streamed or stored.

Important: a complication is that S3 presigned URLs must be accessed without
the Daisys ``Authorization`` header, which some HTTP clients will not drop
automatically.  Therefore the following logic is recommended, and is performed
by the Python client library when following the redirect to ``url``::

    if 'X-Amz-Signature' in url:
        # Pre-signed URL, no auth needed.
        headers = {}
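Expanded into a complete download, the recommended logic might look as
follows.  This is a minimal sketch assuming the ``requests`` library and a
valid ``access_token``, with error handling omitted::

    import requests

    def download_wav(take_id: str, access_token: str, path: str):
        headers = {'Authorization': 'Bearer ' + access_token}

        # Request the audio, but do not follow the 307 redirect
        # automatically, so that the Location header can be inspected.
        r = requests.get(
            f'https://api.daisys.ai/v1/speak/takes/{take_id}/wav',
            headers=headers, allow_redirects=False)
        url = r.headers['Location']

        if 'X-Amz-Signature' in url:
            # Pre-signed URL, no auth needed.
            headers = {}

        # Stream the audio to a file as it becomes available.
        with requests.get(url, headers=headers, stream=True) as audio:
            with open(path, 'wb') as f:
                for chunk in audio.iter_content(chunk_size=8192):
                    f.write(chunk)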
Note that browsers `handle this automatically`_ when changing origins;
however, it is in any case not recommended to access the REST API endpoints
directly from the browser, since they require the access token.  Instead,
backend software can access the ``/wav`` endpoint, retrieve the URL from the
``Location`` header, and forward this URL to the browser; it can be accessed
without the ``Authorization`` header and has a limited lifetime.  This makes
the redirect ``Location`` convenient and more secure to pass directly to an
Audio Player object on the client side.

.. _handle this automatically: https://github.com/whatwg/fetch/pull/1544

.. _websocket_endpoint:

Websocket Endpoints
-------------------

The following endpoint can be used to retrieve a URL for making a direct
websocket connection to a worker, by issuing a ``GET`` request::

    https://api.daisys.ai/v1/speak/websocket?model=<model_name>

As can be seen, the model to use must be specified when requesting a worker
URL, which allows the Daisys API to better distribute requests to workers with
preloaded models.  For the same reason, whenever a websocket is disconnected,
a new URL must be requested through the above endpoint.  Disconnection may
happen from time to time, but shall not happen during the processing of a
request.

The provided URLs expire after 1 hour.  A connection may remain open longer
than that, but new connections must request a new URL.

The endpoint returns the following JSON body::

    {
      "websocket_url": "<url>"
    }

Authentication Endpoints
------------------------

To make use of the Daisys API, first an access token must be granted.  This
can be retrieved by a ``POST`` request to the ``auth/login`` endpoint::

    https://api.daisys.ai/auth/login

The content body should have the form::

    {
      "email": "<email>",
      "password": "<password>"
    }

On failure, a 401 HTTP status is returned.  (In the client library, an
exception is raised.)  On success, a JSON object containing ``access_token``
and ``refresh_token`` fields is provided.  The ``access_token`` string should
be attached to all ``GET`` and ``POST`` requests in the HTTP header, in the
following form::

    Authorization: Bearer <access_token>

Furthermore, if the ``access_token`` is no longer working, the
``refresh_token`` can be used to get a new one without supplying the
password::

    https://api.daisys.ai/auth/refresh

In this case the ``POST`` request should have the form::

    {
      "email": "<email>",
      "refresh_token": "<refresh_token>"
    }

The response contains new ``access_token`` and ``refresh_token`` fields.  This
allows an initial token to be continually refreshed whenever needed, so that
the API can be used without providing a password.

Note that this token refresh logic is taken care of automatically by the
Python client library.  The client can also be initialized with just an email
and refresh token rather than an email and password, so that credentials need
not be provided to the Daisys API client.  Alternatively, it is possible to
request a permatoken, which does not need to be refreshed.  On the other hand,
refresh tokens can be revoked at any time through the following ``POST``
endpoint::

    https://api.daisys.ai/auth/logout

with a content body of the form::

    {
      "refresh_token": "<refresh_token>"
    }
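As a sketch of this flow, assuming the ``requests`` library and with only
minimal error handling (the example credentials are of course placeholders)::

    import requests

    def login(email: str, password: str) -> dict:
        # POST credentials to /auth/login; a 401 status indicates failure.
        r = requests.post('https://api.daisys.ai/auth/login',
                          json={'email': email, 'password': password})
        if r.status_code == 401:
            raise RuntimeError('login failed')
        return r.json()  # contains 'access_token' and 'refresh_token'

    def refresh(email: str, refresh_token: str) -> dict:
        # Exchange a refresh token for new access and refresh tokens.
        r = requests.post('https://api.daisys.ai/auth/refresh',
                          json={'email': email,
                                'refresh_token': refresh_token})
        return r.json()

    tokens = login('me@example.com', 'my_password')
    headers = {'Authorization': 'Bearer ' + tokens['access_token']}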
JSON input structures
---------------------

The ``POST`` endpoints, namely ``takes/generate`` and ``voices/generate``,
take input in their content body in the form of JSON objects.  The structure
of all such objects can be inferred by reading the :ref:`models`, since the
fields translate directly to JSON.  Nonetheless, some of the embedded
structures and optional fields can be confusing, so we give some examples
here.

A minimal example of :class:`TakeGenerate`::

    {
      "text": "This is some text to speak.",
      "prosody": {"pace": -3, "pitch": 0, "expression": 4},
      "voice_id": "01h3anwqdh1q6zhf9s9s239wky"
    }

Optional fields such as ``style``, ``override_language``, and ``done_webhook``
can be added as desired.  Here is an example of :class:`TakeGenerate` using
all available fields::

    {
      "text": "This is some text to speak.",
      "override_language": "en-GB",
      "prosody": {"pace": -3, "pitch": 0, "expression": 4},
      "voice_id": "01h3anwqdh1q6zhf9s9s239wky",
      "style": ["narrator"],
      "status_webhook": "https://myservice.com/daisys_webhooks/take_status/1234",
      "done_webhook": "https://myservice.com/daisys_webhooks/take_done/1234"
    }

Note that ``override_language`` is provided here as an example, but if it is
not provided (is ``null``), the Daisys API will attempt to pronounce words in
the correct language on a per-word basis.  If it is provided, the model may
for example mispronounce loan words, since it assumes a single language for
the input text.

The presence of the ``style`` field depends on the model in use, as do the
supported prosody types, although all models support the simple prosody type
with ``pace``, ``pitch``, and ``expression`` being integer values from -10 to
10.  Specific information about a model can be retrieved from the
``/speak/models`` endpoint.

Finally, here is an example of input for ``voices/generate``::

    {
      "name": "Bob",
      "default_prosody": {"pace": 0, "pitch": 0, "expression": 0},
      "model": "eng_base",
      "gender": "male",
      "done_webhook": "https://myservice.com/daisys_webhooks/voice_done/1234"
    }

Here, a default prosody is specified for the voice, which is adopted in
subsequent ``/takes/generate`` requests if ``prosody`` is not provided (left
as ``null``).
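To illustrate this last point, here is a minimal sketch in Python, assuming
the ``requests`` library, a valid ``access_token``, and that the response to
``voices/generate`` immediately contains the new ``voice_id``::

    import requests

    BASE = 'https://api.daisys.ai/v1/speak'
    headers = {'Authorization': 'Bearer ' + access_token}

    # Generate a voice with a default prosody.
    voice = requests.post(f'{BASE}/voices/generate', headers=headers, json={
        'name': 'Bob',
        'default_prosody': {'pace': 0, 'pitch': 0, 'expression': 0},
        'model': 'eng_base',
        'gender': 'male',
    }).json()

    # Generate a take without specifying "prosody"; the voice's
    # default_prosody is then adopted.
    take = requests.post(f'{BASE}/takes/generate', headers=headers, json={
        'text': 'This is some text to speak.',
        'voice_id': voice['voice_id'],
    }).json()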