Context

When running real-time transcription it is possible to specify a websocket as the output.

It can be either a predefined, named websocket or an ad hoc websocket with a random name.

An example request is shown below:

{
  "sessions": [{
    "asyncMode": "REAL-TIME",
    "websocket": {
      "adHoc": true,
      "useSTOMP": false,
      "minimumDelay": 100
    }
  }],
  "audio": {
    "source": { "stream": { "protocol": "WEBSOCKET" } },
    "format": "L16",
    "rate": 16000,
    "channels": "mono"
  },
  "settings": {
    "asr": {
      "noInputTimeout": 59999,
      "incompleteTimeout": 3599999
    }
  }
}
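
As a rough sketch, the request body above could be assembled and serialized in Python as follows. The helper name build_request and its parameters are hypothetical; only the field names and values come from the example, and the endpoint that receives the body is not covered here.

```python
import json


def build_request(use_stomp: bool = False, minimum_delay: int = 100) -> dict:
    """Assemble a request body mirroring the example above (hypothetical helper)."""
    return {
        "sessions": [{
            "asyncMode": "REAL-TIME",
            "websocket": {
                "adHoc": True,            # ad hoc websocket with a random name
                "useSTOMP": use_stomp,    # controls what the response contains
                "minimumDelay": minimum_delay,
            },
        }],
        "audio": {
            "source": {"stream": {"protocol": "WEBSOCKET"}},
            "format": "L16",
            "rate": 16000,
            "channels": "mono",
        },
        "settings": {
            "asr": {"noInputTimeout": 59999, "incompleteTimeout": 3599999},
        },
    }


# Serialize to a JSON string ready to be sent as the request body.
body = json.dumps(build_request())
```

Building the body as a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as unquoted keys or trailing commas.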

The response to this request will contain one of the following:

if useSTOMP==false, the wss URL of the websocket over which the results will be sent.
if useSTOMP==true, the wss URL and the name of the topic to subscribe to in order to receive the STOMP messages, sent over the websocket, that contain the results.

Format of the websocket messages - examples

Websocket messages will contain JSON with transcription results formatted as follows:

Simple utterance, e.g.:

{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}

The fields are:

utt - the utterance/word recognized
spk - speaker index - only present if diarization has been turned on
conf - confidence of the recognition
gap - gap in milliseconds between the start of this word/utterance and the end of the previous one
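
The fields above could be read into a small typed record along these lines. This is a minimal sketch: the Word class and parse_word function are hypothetical names, and treating every field except utt as optional is an assumption (spk is documented as optional, and gap is absent from some entries in the examples).

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    utt: str                      # the utterance/word recognized
    conf: Optional[float] = None  # confidence of the recognition
    gap: Optional[int] = None     # gap in milliseconds to the previous word
    spk: Optional[int] = None     # speaker index, only with diarization on


def parse_word(raw: str) -> Word:
    """Parse one simple-utterance message into a Word (hypothetical helper)."""
    data = json.loads(raw)
    return Word(
        utt=data["utt"],
        conf=data.get("conf"),
        gap=data.get("gap"),
        spk=data.get("spk"),
    )


word = parse_word('{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}')
```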

An edit payload would look like this:

{
  "edit":[
    {"utt":"w","conf":0.7101,"ins":true},
    {"utt":"ae","conf":0.3757,"gap":40,"ins":true},
    {"utt":"two","conf":0.7304,"gap":300,"add":true}
  ],
  "del":1
}

The fields are:

del - indicates how many previously output words will need to be deleted before applying the edits
edit - contains all the edits
ins - indicates that this is an insert before the time point of the deleted utterance
add - indicates that this is an addition after the time point of the deleted utterance

Another example:

{
  "edit":[
    {"utt":"a","conf":0.2334,"gap":40,"repl":["ae"]},
    {"utt":"e","conf":0.4526,"gap":40,"repl":[]},
    {"utt":"two","conf":0.7304,"gap":300},
    {"utt":"ree","conf":0.7833,"gap":520,"add":true}],
  "del":2
}

The new field in this example is:

repl - lists the words that the given utterance replaces. It is an array because one utterance can replace more than one word; the array may be empty.

Note: the words without any "ins", "add", or "repl" annotation are repeats of unchanged words that had to be deleted to make this edit possible.

One more example:

{
  "edit":[
    {"utt":"we","conf":0.5407,"ins":true},
    {"utt":"","conf":0.2334,"del":["a","e"]},
    {"utt":"two","conf":0.7304,"gap":340},
    {"utt":"ree","conf":0.7833,"gap":520},
    {"utt":"four","conf":0.8794,"gap":620,"add":true}],
  "del":4
}

The new field in this example is:

del - note that this is the "del" inside an "edit" entry, which is different from the "del" at the outermost level. It lists the words that are to be deleted, i.e. replaced with the empty string "".
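
The annotations seen in the three examples could be told apart with a small classifier such as the one below. It is only a sketch: the function name edit_kind and the returned labels are hypothetical, and the order of the checks is an assumption, since the examples never combine two annotations on one entry.

```python
def edit_kind(entry: dict) -> str:
    """Return a label for one entry of the "edit" array (hypothetical labels)."""
    if "del" in entry:
        return "delete"    # inner "del": list of words being removed
    if "repl" in entry:
        return "replace"   # key presence matters; the array may be empty
    if entry.get("ins"):
        return "insert"
    if entry.get("add"):
        return "add"
    return "unchanged"     # repeat of a word that was deleted and re-sent


# Entries taken from the last example above.
edits = [
    {"utt": "we", "conf": 0.5407, "ins": True},
    {"utt": "", "conf": 0.2334, "del": ["a", "e"]},
    {"utt": "two", "conf": 0.7304, "gap": 340},
    {"utt": "ree", "conf": 0.7833, "gap": 520},
    {"utt": "four", "conf": 0.8794, "gap": 620, "add": True},
]
kinds = [edit_kind(e) for e in edits]
```

Checking for the presence of the "repl" key, rather than its truthiness, keeps entries with an empty array classified as replacements.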

Notes

gap - the gap is output so that, for example, words can be displayed on screen with spacing that reflects the gap in the audio. If the words will be displayed with a standard single space between them, this value in the messages can be ignored.
ins, repl, del, add - these make it easy to highlight some of the edits (e.g. with color coding) or to show the replacements (e.g. on hover). If the types of edits do not matter, these values can be ignored.
conf - can be used, for example, to shade words differently based on confidence.
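
One way the gap and conf values might drive a display is sketched below. The function names, the shade labels, the confidence thresholds (0.75 and 0.5), and the 200 ms-per-space scale are all arbitrary assumptions chosen for illustration; none of them come from the API.

```python
from typing import Optional


def shade(conf: float) -> str:
    """Map a confidence value to a display shade (thresholds are assumptions)."""
    if conf >= 0.75:
        return "dark"
    if conf >= 0.5:
        return "medium"
    return "light"


def spaces_for_gap(gap_ms: Optional[int], ms_per_space: int = 200,
                   max_spaces: int = 4) -> int:
    """Turn an audio gap into a number of spaces, between 1 and max_spaces."""
    if gap_ms is None:
        return 1  # no gap reported: fall back to a standard single space
    return max(1, min(max_spaces, gap_ms // ms_per_space))


def render(words: list) -> str:
    """Join words using gap-dependent spacing; empty utterances are skipped."""
    out = ""
    for w in words:
        if not w["utt"]:
            continue
        if out:
            out += " " * spaces_for_gap(w.get("gap"))
        out += w["utt"]
    return out
```

For instance, render([{"utt": "seven", "gap": 480}, {"utt": "two", "gap": 620}]) puts three spaces between the two words, because 620 // 200 is 3.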

The simplest processor of websocket messages needs only to use the outer "del", "edit" and "utt" values.
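
Such a minimal processor might look like the sketch below. The function name apply_message is hypothetical, and two behaviors are assumptions rather than documented facts: a message without "edit" is treated as a simple utterance to append, and entries with an empty "utt" are kept in the list (and skipped only when joining the text) so that later "del" counts still line up.

```python
import json
from typing import List


def apply_message(words: List[str], raw: str) -> List[str]:
    """Apply one websocket message to the running list of words, in place."""
    msg = json.loads(raw)
    if "edit" in msg:
        count = msg.get("del", 0)
        if count > 0:
            # Guarded because words[-0:] would select the whole list.
            del words[-count:]
        words.extend(entry["utt"] for entry in msg["edit"])
    else:
        words.append(msg["utt"])  # simple utterance message
    return words


def transcript(words: List[str]) -> str:
    """Join the words, leaving out empty (deleted) utterances."""
    return " ".join(w for w in words if w)


words: List[str] = []
apply_message(words, '{"utt":"seven", "spk":1, "conf":0.7518, "gap":480}')
apply_message(words, '{"edit":[{"utt":"w","conf":0.7101,"ins":true},'
                     '{"utt":"ae","conf":0.3757,"gap":40,"ins":true},'
                     '{"utt":"two","conf":0.7304,"gap":300,"add":true}],"del":1}')
```

In this run the edit message first removes the single previously output word and then appends the three utterances from the "edit" array.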

Websocket payload for real-time transcription results
