Version: 1.0.0

Audio Timestamps

The main purpose of the audio_timestamps JSON object is to tell you when each grapheme or phoneme can be heard in the synthesized audio. All APIs pertaining to clips return an audio_timestamps JSON object that contains the following attributes: graph_chars, graph_times, phon_chars, and phon_times. A detailed description of these attributes can be found below.

A common use case for audio_timestamps is animation: an animated model can synchronize its mouth movements with the synthesized audio.

The `audio_timestamps` Object

Attribute	Description
graph_chars	An array of characters spelling out the text used to synthesize the audio.
graph_times	A 2D array of floats giving the times in the audio, in seconds, at which the grapheme at the same index in the `graph_chars` array starts and ends.
phon_chars	An array of Kirshenbaum IPA (ASCII IPA) characters representing the phonetic equivalent of the text used to synthesize the audio.
phon_times	A 2D array of floats giving the times in the audio, in seconds, at which the phoneme at the same index in the `phon_chars` array starts and ends.
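
Because each characters array is read in parallel with its times array, a quick shape check before use can catch malformed data early. The following is a minimal sketch in Python, not part of the API: the function name `validate_audio_timestamps` is hypothetical, the key names follow the sample response shown in this document, and the check that each start time does not exceed its end time is an assumption about the data.

```python
from typing import Any


def validate_audio_timestamps(ts: dict[str, Any]) -> None:
    """Raise ValueError if a chars array and its times array are not parallel."""
    # Key names are taken from the sample response in this document.
    pairs = (("graph_chars", "graph_times"), ("phon_chars", "phon_times"))
    for chars_key, times_key in pairs:
        chars, times = ts[chars_key], ts[times_key]
        if len(chars) != len(times):
            raise ValueError(f"{chars_key} and {times_key} differ in length")
        for i, pair in enumerate(times):
            # Assumption: every entry is a [start, end] pair with start <= end.
            if len(pair) != 2 or pair[0] > pair[1]:
                raise ValueError(f"{times_key}[{i}] is not a [start, end] pair")
```

Calling `validate_audio_timestamps(response["audio_timestamps"])` on a parsed response (where `response` is whatever dictionary your HTTP client produced) returns nothing on success and raises on a mismatch.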

How to use the `audio_timestamps` Object

Given the following input text:

"This is a test."

The following audio_timestamps object will be returned (times may vary depending on the voice you're using):

"audio_timestamps": {
    "graph_chars": [
        "T",
        "h",
        "i",
        "s",
        " ",
        "i",
        "s",
        " ",
        "a",
        " ",
        "t",
        "e",
        "s",
        "t",
        "."
    ],
    "graph_times": [
        [
            0.0116,
            0.0464
        ],
        [
            0.0116,
            0.0464
        ],
        [
            0.0464,
            0.0929
        ],
        [
            0.0929,
            0.1509
        ],
        [
            0.1509,
            0.1974
        ],
        [
            0.1974,
            0.2438
        ],
        [
            0.2438,
            0.2902
        ],
        [
            0.2902,
            0.3367
        ],
        [
            0.3367,
            0.3831
        ],
        [
            0.3831,
            0.418
        ],
        [
            0.418,
            0.476
        ],
        [
            0.476,
            0.6502
        ],
        [
            0.6502,
            0.7663
        ],
        [
            0.7663,
            0.8707
        ],
        [
            0.8707,
            0.952
        ]
    ],
    "phon_chars": [
        "ð",
        "ɪ",
        "s",
        " ",
        "ɪ",
        "z",
        " ",
        "ɐ",
        " ",
        "t",
        "ˈ",
        "ɛ",
        "s",
        "t",
        "."
    ],
    "phon_times": [
        [
            0.0116,
            0.0464
        ],
        [
            0.0464,
            0.0929
        ],
        [
            0.0929,
            0.1509
        ],
        [
            0.1509,
            0.1974
        ],
        [
            0.1974,
            0.2438
        ],
        [
            0.2438,
            0.2902
        ],
        [
            0.2902,
            0.3367
        ],
        [
            0.3367,
            0.3831
        ],
        [
            0.3831,
            0.418
        ],
        [
            0.418,
            0.476
        ],
        [
            0.476,
            0.5457
        ],
        [
            0.5457,
            0.6502
        ],
        [
            0.6502,
            0.7663
        ],
        [
            0.7663,
            0.8707
        ],
        [
            0.8707,
            0.952
        ]
    ]
}

Each index in the graph_chars array has an equivalent index in the graph_times array, which holds the times at which the grapheme starts and finishes being spoken. For example, at index 0 in the graph_chars array, we have:

"T"

And at index 0 in the graph_times array, we have:

[0.0116, 0.0464]

Therefore, the grapheme T started being spoken at 0.0116s and finished being spoken at 0.0464s.
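
The index-based pairing described here can be sketched in a few lines of Python. The helper name `char_spans` is hypothetical; the data is the first four graphemes of the sample response above.

```python
def char_spans(chars, times):
    """Pair each character with its (start, end) times in seconds."""
    return [(ch, start, end) for ch, (start, end) in zip(chars, times)]


# First four entries of graph_chars and graph_times from the sample response.
graph_chars = ["T", "h", "i", "s"]
graph_times = [
    [0.0116, 0.0464],
    [0.0116, 0.0464],
    [0.0464, 0.0929],
    [0.0929, 0.1509],
]

for ch, start, end in char_spans(graph_chars, graph_times):
    print(f"{ch!r}: {start:.4f}s to {end:.4f}s")
```

Note that "T" and "h" share the same span in the sample. This appears to be because both letters map to the single phoneme ð, so graphemes should not be assumed to have distinct, non-overlapping times.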

The same concept applies to the phon_chars and phon_times attributes.
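
For animation, a typical need is to find which phoneme is being spoken at a given playback time. Below is a minimal sketch in Python using binary search from the standard library. The function name `phoneme_at` is hypothetical, and the code assumes the phoneme spans are sorted by start time and do not overlap, which holds for the sample response above but is not stated as a guarantee.

```python
from bisect import bisect_right


def phoneme_at(t, phon_chars, phon_times):
    """Return the phoneme whose [start, end) span contains t, else None."""
    # Assumption: spans are sorted by start time and do not overlap.
    starts = [start for start, _ in phon_times]
    i = bisect_right(starts, t) - 1  # last span starting at or before t
    if i >= 0 and t < phon_times[i][1]:
        return phon_chars[i]
    return None


# First three entries of phon_chars and phon_times from the sample response.
phon_chars = ["ð", "ɪ", "s"]
phon_times = [[0.0116, 0.0464], [0.0464, 0.0929], [0.0929, 0.1509]]

print(phoneme_at(0.05, phon_chars, phon_times))  # prints ɪ
```

An animation loop could call this once per frame with the current audio position and map the returned phoneme to a mouth shape. A `None` result means no phoneme is active at that time, for example before the first span begins.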
