Overview
We will assume that the audio to be transcribed is available via a URL, e.g., hosted on AWS S3.
The process will consist of 3 steps:
- make the async transcription request in OFF-LINE mode
- wait for the offline transcription task to finish
- retrieve the result of transcription
Sample python script
Sample Python code that accomplishes this can be found here:
The request
The body of the request looks like this:
{
  "sessions": [
    {
      "asyncMode": "OFF-LINE",
      "poll": {
        "persist": 600000
      },
      "content": {
        "incremental": ["progress"],
        "full": ["transcript", "words"]
      }
    }
  ],
  "audio": {
    "source": {
      "fromUrl": {
        "url": audio_url
      }
    }
  }
}
The sessions.poll.persist value is set to 600,000 milliseconds (10 minutes). The persist time is counted from the moment the transcription finishes. Set your polling interval to less than sessions.poll.persist, usually several times smaller.
The sessions.content parameter specifies what the polling responses should contain.
- content.incremental is for responses returned from polling before the transcription is finished
- content.full is for responses returned from polling after the transcription has completed
In this example we do not care about incremental content except for "progress". The full content is set to "transcript" and "words":
- "transcript" means the plain text of the transcript without any timing information
- "words" means every single word annotated with time and confidence information. Note that you should use "words" even if you do not care about per-word timing and confidence, because turning it on enables more accurate punctuation in the transcript text.
We make the request to the async transcribe API: /v1/asr/transcribe/async
From the response we get the full polling URL which includes the session id:
polling_url = init_response["sessions"][0]["poll"]["url"]
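The request step can be sketched roughly as follows. This is a minimal illustration, not the sample script itself: the base URL is taken from the GET example in this article, while the use of POST, the Bearer authorization header, and the helper names (build_request_body, extract_polling_url, submit_transcription) are assumptions made for the example.

```python
import json
import urllib.request

# Base URL as it appears in the GET example in this article.
API_BASE = "https://api.voicegain.ai/v1"


def build_request_body(audio_url, persist_ms=600000):
    """Build the OFF-LINE transcription request body shown above."""
    return {
        "sessions": [{
            "asyncMode": "OFF-LINE",
            "poll": {"persist": persist_ms},
            "content": {
                "incremental": ["progress"],
                "full": ["transcript", "words"],
            },
        }],
        "audio": {"source": {"fromUrl": {"url": audio_url}}},
    }


def extract_polling_url(init_response):
    """Return the polling URL, which includes the session id."""
    return init_response["sessions"][0]["poll"]["url"]


def submit_transcription(jwt, audio_url):
    """Send the request. POST and the Bearer header are assumptions."""
    request = urllib.request.Request(
        API_BASE + "/asr/transcribe/async",
        data=json.dumps(build_request_body(audio_url)).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + jwt,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With these helpers, polling_url would be obtained as extract_polling_url(submit_transcription(JWT, audio_url)).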
The polling
Notice that polling is done in two phases.
- with ?full=false parameter - this will return the progress information
- with ?full=true parameter - this will return the final transcript
What we care about in the incremental (full=false) polling is just an indication that the transcription has completed (note that it may complete without success; more about that next). So we inspect the result.final parameter:
is_final = poll_response["result"]["final"]
Once the result.final parameter turns to true, we can retrieve the full result.
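The first polling phase might look like the sketch below. It assumes a hypothetical fetch_json callable (URL in, parsed JSON dict out) so that the loop itself stays independent of any HTTP library; the helper names, the 5-second default interval, and the max_polls cap are illustrative choices, not part of the API.

```python
import time


def with_full_flag(polling_url, full):
    """Append the full=true or full=false query parameter to the polling URL."""
    separator = "&" if "?" in polling_url else "?"
    return polling_url + separator + "full=" + ("true" if full else "false")


def wait_until_final(fetch_json, polling_url, interval_s=5.0, max_polls=1000,
                     sleep=time.sleep):
    """Poll with full=false until result.final is true.

    fetch_json is a hypothetical callable: it takes a URL and returns the
    parsed JSON response as a dict.
    Returns True once final is seen, False if max_polls is exhausted first.
    """
    url = with_full_flag(polling_url, full=False)
    for _ in range(max_polls):
        poll_response = fetch_json(url)
        if poll_response["result"]["final"]:
            return True
        sleep(interval_s)
    return False
```

Once wait_until_final returns True, a single request to with_full_flag(polling_url, full=True) retrieves the full result. Keep interval_s well below the poll.persist value from the initial request.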
In the final polling request we get the full response. In the example, we extract the transcript from it:
tr_text = poll_response["result"]["transcript"]
However, in production code you should first check the result.status field. It may have the following values:
- NOINPUT: processing completed with NOINPUT outcome
- NOMATCH: speech in audio input could not be recognized with confidence
- MATCH: speech from audio input was recognized with confidence above threshold -- only if status is MATCH you will get the transcript text in result.transcript
- ERROR: there was a processing error
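A status check along these lines could guard the transcript extraction. This is only a sketch: the function name is hypothetical, and returning None for NOINPUT and NOMATCH while raising on ERROR is one possible policy, not something the API prescribes.

```python
def transcript_if_matched(poll_response):
    """Return result.transcript only when result.status is MATCH."""
    result = poll_response["result"]
    status = result.get("status")
    if status == "MATCH":
        return result["transcript"]
    if status in ("NOINPUT", "NOMATCH"):
        # Processing finished, but there is no transcript text to return.
        return None
    # ERROR, or a value not listed in this article.
    raise RuntimeError("transcription failed with status: " + str(status))
```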
Using sessionId to get formatted result
So far we have shown how to get the raw transcript text. If you included "words" in the content.full request parameter, you can also retrieve a formatted result, e.g. using
GET https://api.voicegain.ai/v1/asr/transcribe/{sessionId}/transcript?format=text
This transcript will be available for retrieval for the amount of time specified by the poll.persist value (in milliseconds) in the initial request.
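As a small illustration, the URL for the formatted transcript can be assembled from the session id found in a polling response (the session.sessionId field). The helper names below are hypothetical; only the URL pattern and the format=text parameter come from this article.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the GET example in this article.
API_BASE = "https://api.voicegain.ai/v1"


def session_id_from(poll_response):
    """Read the session id from a polling response."""
    return poll_response["session"]["sessionId"]


def formatted_transcript_url(session_id, fmt="text"):
    """Build the GET URL for the formatted transcript of one session."""
    return "{}/asr/transcribe/{}/transcript?{}".format(
        API_BASE, quote(session_id, safe=""), urlencode({"format": fmt}))
```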
Running the sample code
To run the code you need to put your JWT token in the indicated place:
JWT = "<Your JWT here>"
You also need to provide the URL of the source audio to be transcribed:
audio_url = "https://s3.us-east-2.amazonaws.com/files.public.voicegain.ai/3sec.wav"
We suggest you run the script once with the audio file we provide to verify that the script works for you. If it works correctly, the transcript you will see will be:
She had no doubt in the world of it's being a very fine day.
As the code runs, it polls about every 5 seconds to see if the transcription is finished.
The intermediate results may contain output like this:
{
  "session": {
    "sessionId": "O-0-0kgv84axk082w25jzjddro0m6tzj",
    "asyncMode": "OFF-LINE"
  },
  "result": {"final": false},
  "responseType": "AsyncResultIncremental",
  "progress": {"phase": "PROCESSING", "audioStartTime": 0}
}
The final result response will be much larger and will contain fields like:
"final": true
"responseType": "AsyncResultFull"
"words": [ ... ]
"transcript": " ... "
You can use the transcript or words from this final result, or you can use the formatted transcript retrieved using the
GET https://api.voicegain.ai/v1/asr/transcribe/{sessionId}/transcript?format=text
request at the end of the Python script.
Processing multiple files
Note: if you have multiple files to process, you need to take Rate Limits into account. The rate limit that matters most for OFF-LINE transcription is offlineThroughputLimitPerHour.
We suggest two approaches for the code that submits the audio data for offline transcription:
Fixed pool of processing threads
In this case you would have a pool of e.g. 10 processing threads that look very much like the python example presented here. Each thread would independently run the 3 steps:
- make the async transcription request in OFF-LINE mode
- wait for the offline transcription task to finish
- retrieve the result of transcription
Voicegain offline processing handles transcription of a single audio file in multiple threads, so fine-tuning the number of threads in your pool is not really important; a number around 10-20 will generally work fine.
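A fixed pool of this kind could be sketched with the standard library's ThreadPoolExecutor. Here transcribe_one is a hypothetical function that performs the three steps for a single audio URL and returns its transcript; collecting exceptions per URL instead of stopping at the first failure is a design choice of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_all(audio_urls, transcribe_one, pool_size=10):
    """Run transcribe_one for each URL on a fixed pool of worker threads.

    transcribe_one is a hypothetical function that submits one audio URL,
    waits for the offline task to finish, and retrieves the result.
    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {url: pool.submit(transcribe_one, url) for url in audio_urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:  # keep going if one file fails
                results[url] = error
    return results
```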
Relying on offlineQueueSizeLimit
This may be more complicated to implement and not really better, but here is how it would work:
- The first process continually submits audio for transcription. When it gets a 429 response, it backs off using the Retry-After value. Successfully submitted sessions are passed to the second process, which does the polling.
- The second process keeps polling the session ids it received from the submitting process. Polling can be done in one or more threads. For sessions that have completed, the transcripts are downloaded and the session ids are removed from the pool of sessions being polled.
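The back-off part of the submitting process might look like the sketch below. It assumes a hypothetical submit callable that returns a (status_code, headers, body) tuple, and it assumes Retry-After carries a number of seconds; if the header is missing or not numeric, the sketch falls back to a default wait.

```python
import time


def submit_with_backoff(submit, audio_url, max_attempts=10, default_wait_s=5.0,
                        sleep=time.sleep):
    """Submit one audio URL, retrying while the API answers 429.

    submit is a hypothetical callable returning (status_code, headers, body).
    Returns (status_code, body) for the first non-429 response.
    """
    for _ in range(max_attempts):
        status_code, headers, body = submit(audio_url)
        if status_code != 429:
            return status_code, body
        # Assumption: Retry-After is given in seconds.
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            wait_s = default_wait_s
        sleep(wait_s)
    raise RuntimeError("still rate limited after {} attempts".format(max_attempts))
```

The body returned for a successful submission would then be handed to the polling process, for example through a queue.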