Context
When running real-time transcription, you can specify a websocket as the output.
It can be either a predefined named websocket or an ad hoc websocket with a randomly generated name.
An example request is shown below:
{
  "sessions": [{
    "asyncMode": "REAL-TIME",
    "websocket": {
      "adHoc": true,
      "useSTOMP": false,
      "minimumDelay": 100
    }
  }],
  "audio": {
    "source": { "stream": { "protocol": "WEBSOCKET" } },
    "format": "L16",
    "rate": 16000,
    "channels": "mono"
  },
  "settings": {
    "asr": {
      "noInputTimeout": 59999,
      "incompleteTimeout": 3599999
    }
  }
}
The response to this request will contain one of the following:
- if useSTOMP is false, the wss URL of the websocket over which the results will be sent.
- if useSTOMP is true, the wss URL and the name of the topic to subscribe to in order to receive the STOMP messages, sent over the websocket, that contain the results.
Format of the websocket messages - examples
Websocket messages will contain JSON with transcription results formatted as follows:
Simple utterance, e.g.:
{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}
The fields are:
- utt - the utterance/word recognized
- spk - speaker index - only present if diarization has been turned on
- conf - confidence of the recognition
- gap - gap in milliseconds between the start of this word/utterance and the end of the previous word/utterance
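As a minimal sketch of reading such a message, the Python snippet below parses the simple utterance shown above into a small record. The names Utterance and parse_utterance are hypothetical and not part of any API. Treating gap as optional is an assumption based on the edit examples in this article, where some items carry no gap; spk is optional because it is only present when diarization is on.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """One recognized word/utterance (hypothetical helper type)."""
    utt: str
    conf: float
    spk: Optional[int] = None  # only present when diarization is turned on
    gap: Optional[int] = None  # milliseconds; assumed optional (see lead-in)


def parse_utterance(raw: str) -> Utterance:
    """Parse the JSON text of a simple utterance message."""
    data = json.loads(raw)
    return Utterance(
        utt=data["utt"],
        conf=data["conf"],
        spk=data.get("spk"),
        gap=data.get("gap"),
    )


word = parse_utterance('{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}')
print(word.utt, word.spk, word.conf, word.gap)
```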
An edit payload would look like this:
{
"edit":[
{"utt":"w","conf":0.7101,"ins":true},
{"utt":"ae","conf":0.3757,"gap":40,"ins":true},
{"utt":"two","conf":0.7304,"gap":300,"add":true}
],
"del":1
}
The fields are:
- del - indicates how many previously output words need to be deleted before applying the edits
- edit - contains all the edits
- ins - indicates that this is an insert before the time point of the deleted utterance
- add - indicates that this is an addition after the time point of the deleted utterance
Another example:
{
"edit":[
{"utt":"a","conf":0.2334,"gap":40,"repl":["ae"]},
{"utt":"e","conf":0.4526,"gap":40,"repl":[]},
{"utt":"two","conf":0.7304,"gap":300},
{"utt":"ree","conf":0.7833,"gap":520,"add":true}],
"del":2
}
The new fields in this example are:
- repl - indicates which words the given utterance replaces; it is an array because more than one word can sometimes be replaced, and the array may be empty
Note: the words without any "ins", "add", or "repl" annotations are repeats of the unchanging words that were deleted to make this edit possible.
One more example:
{
"edit":[
{"utt":"we","conf":0.5407,"ins":true},
{"utt":"","conf":0.2334,"del":["a","e"]},
{"utt":"two","conf":0.7304,"gap":340},
{"utt":"ree","conf":0.7833,"gap":520},
{"utt":"four","conf":0.8794,"gap":620,"add":true}],
"del":4
}
The new fields in this example are:
- del - note that this is the "del" within "edit", which is different from the "del" at the outermost level; it lists the words that are to be deleted, i.e. replaced with "" (the empty string)
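To illustrate how the four annotations can be told apart, here is a small Python sketch that labels each item of an "edit" array. The function name edit_kind and the "repeat" label for unannotated items are assumptions made for this example; only the field names come from the messages above.

```python
def edit_kind(item: dict) -> str:
    """Return a label describing what kind of edit a single item represents."""
    if item.get("ins"):
        return "ins"      # inserted before the time point of the deleted utterance
    if item.get("add"):
        return "add"      # added after the time point of the deleted utterance
    if "repl" in item:
        return "repl"     # replaces earlier words (the array may be empty)
    if "del" in item:
        return "del"      # inner "del": listed words are replaced with ""
    return "repeat"       # unchanged word re-sent after the outer "del"


# Items taken from the last example above.
items = [
    {"utt": "we", "conf": 0.5407, "ins": True},
    {"utt": "", "conf": 0.2334, "del": ["a", "e"]},
    {"utt": "two", "conf": 0.7304, "gap": 340},
    {"utt": "ree", "conf": 0.7833, "gap": 520},
    {"utt": "four", "conf": 0.8794, "gap": 620, "add": True},
]
kinds = [edit_kind(item) for item in items]
print(kinds)
```

A label like this could drive color coding in a user interface; the order of the checks only matters if a message were to carry more than one annotation, which the examples in this article do not show.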
Notes
- gap - the purpose of outputting the gap is, e.g., to display words on screen with spacing that depends on the gap in the audio. If the words will be displayed with a standard space between them, this value can be ignored
- ins, repl, del, add - the purpose of these is to enable easy highlighting (e.g. color coding) of some of the edits, or to show the replacements (e.g. when hovering). If you do not care about the types of edits, these values can be ignored.
- conf - can be used to e.g. provide different word shading based on confidence
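The gap and conf notes above could be put to use roughly as follows. This is only a sketch: the functions spacing and shade, the 250 ms per space scale, the cap of 8 spaces, and the confidence thresholds are all arbitrary choices for illustration, not values defined by the API.

```python
from typing import Optional


def spacing(gap_ms: Optional[int], ms_per_space: int = 250, max_spaces: int = 8) -> str:
    """Map the audio gap to a run of spaces (one space when gap is missing or 0)."""
    if not gap_ms:
        return " "
    return " " * min(max_spaces, 1 + gap_ms // ms_per_space)


def shade(conf: float) -> str:
    """Bucket a confidence value into a coarse shading level (hypothetical thresholds)."""
    if conf >= 0.75:
        return "high"
    if conf >= 0.5:
        return "medium"
    return "low"


print(repr(spacing(480)), shade(0.7518))
```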
The simplest processor of websocket messages needs only to use the outer "del", "edit" and "utt" values.
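A minimal sketch of such a processor in Python is shown below. It keeps a running list of words, removes the last "del" words when an edit arrives, and appends the "utt" of every item in "edit". The function names are hypothetical, and feeding the example messages from this article one after another is an assumption made purely for illustration.

```python
import json


def apply_message(words: list, raw: str) -> list:
    """Apply one websocket message (JSON text) to the running list of words."""
    msg = json.loads(raw)
    if "edit" in msg:
        count = msg.get("del", 0)
        if count:
            # Guard against count == 0: words[-0:] would select the whole list.
            del words[-count:]
        words.extend(item["utt"] for item in msg["edit"])
    elif "utt" in msg:
        words.append(msg["utt"])
    return words


def render(words: list) -> str:
    """Join the words for display, skipping words that were deleted (empty strings)."""
    return " ".join(w for w in words if w)


transcript = []
apply_message(transcript, '{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}')
apply_message(transcript, json.dumps({
    "edit": [
        {"utt": "w", "conf": 0.7101, "ins": True},
        {"utt": "ae", "conf": 0.3757, "gap": 40, "ins": True},
        {"utt": "two", "conf": 0.7304, "gap": 300, "add": True},
    ],
    "del": 1,
}))
print(render(transcript))  # prints "w ae two"
```

Empty "utt" values are kept in the list so that later outer "del" counts still line up with the words the server has sent; they are only filtered out at display time.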