Using MRCP for large vocabulary transcription, i.e. without grammars – Voicegain

MRCP was designed to be used with grammars, but the Voicegain platform also allows our large vocabulary (non-grammar) recognizer to be used over MRCP.

There are two ways to tell the recognizer to do large vocabulary transcription:

Pass the URI of a special built-in grammar. Note that this may not work on some VXML platforms or MRCP clients that check whether the grammar is a valid grammar. The built-in grammar names that enable large vocabulary recognition are:
1. builtin:speech/transcribe
2. builtin:grammar/transcribe
3. builtin:none

Pass a GRXML grammar whose root rule name attribute is set to "__TRANSCRIBE__". Apart from that, the content of the grammar does not matter, but it must be a valid, parseable GRXML grammar, such as the one shown below.
NOTE: This grammar must be passed inline within the MRCP request; it cannot be referenced via a URL.

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" 
     version="1.0" xml:lang="en-US" tag-format="semantics/1.0" 
     root="__TRANSCRIBE__">
  <rule id="__TRANSCRIBE__">
    <item>transcribe</item>
  </rule>
</grammar>

Results from large vocabulary transcription will be returned as follows, e.g.:

<result>
  <interpretation grammar="session:request1@form-level" confidence="0.94">
    <input mode="speech">make a payment</input>
  </interpretation>
</result>
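
A client typically needs only the transcript text and its confidence from this result. The following is a minimal Python sketch, assuming the result has the shape shown above; the function names are illustrative and not part of any Voicegain SDK. It matches element names with any XML namespace stripped, in case the result carries one.

```python
import xml.etree.ElementTree as ET

# Sample result copied from the example above.
SAMPLE_RESULT = """<result>
  <interpretation grammar="session:request1@form-level" confidence="0.94">
    <input mode="speech">make a payment</input>
  </interpretation>
</result>"""


def local_name(tag):
    """Strip an XML namespace, e.g. '{uri}input' -> 'input'."""
    return tag.rsplit("}", 1)[-1]


def parse_transcription(result_xml):
    """Return a list of (transcript, confidence) pairs, one per interpretation."""
    root = ET.fromstring(result_xml)
    results = []
    for interp in root.iter():
        if local_name(interp.tag) != "interpretation":
            continue
        text = ""
        # Take the first <input> element found under this interpretation.
        for child in interp.iter():
            if local_name(child.tag) == "input":
                text = (child.text or "").strip()
                break
        confidence = interp.get("confidence")
        results.append(
            (text, float(confidence) if confidence is not None else None)
        )
    return results


print(parse_transcription(SAMPLE_RESULT))  # [('make a payment', 0.94)]
```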

MRCP Parameters that apply

Here are some of the parameters that control large vocabulary recognition over MRCP:

No-Input-Timeout - stops recognition if there was no speech detected for the specified time starting from the start of recognition command. Default value is 5000 msec.
Recognition-Timeout - maximum duration of recognition. Default value is 60000 msec (1 minute).
Speech-Complete-Timeout and Speech-Incomplete-Timeout - because in the absence of grammars it is difficult to define when speech is complete or incomplete, these two parameters are treated the same, and the actual speech timeout used by the ASR is the minimum of the two. The speech timeout is counted from the end of the last recognized word (which means that if no speech has been detected, No-Input-Timeout will be in effect instead).
The defaults for these values come from the grammar recognition use case, which is why they differ: the Speech-Complete-Timeout default is 2000 msec and the Speech-Incomplete-Timeout default is 5000 msec. This means that for large vocabulary transcription the default speech timeout in practice is 2000 msec.
Confidence-Threshold - float value between 0.0 and 1.0. If the confidence of the recognized utterance is below this threshold, a NO-MATCH will be returned. Default value is 0.0, which means that all recognitions will be MATCH.
Sensitivity-Level - determines the threshold for rejection of low-volume background speech or noise.
1 is most sensitive (least rejection) and corresponds to -90 dBFS
0 is least sensitive (most rejection) and corresponds to -50 dBFS

This setting affects start-of-speech and end-of-speech detection. If background noise or speech triggers start-of-speech too early, try a lower sensitivity level.
Default value is 0.5

Unless mentioned above, there are no maximum values set for those parameters.
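
Two of the rules above can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the timeout values in the usage lines are arbitrary examples rather than defaults, and the mapping from Sensitivity-Level to dBFS assumes linear interpolation between the two documented endpoints, which the text above does not state.

```python
def effective_speech_timeout_ms(speech_complete_ms, speech_incomplete_ms):
    """Speech timeout applied in large vocabulary mode.

    Both parameters are treated the same, so the smaller value wins.
    """
    return min(speech_complete_ms, speech_incomplete_ms)


def sensitivity_to_dbfs(level):
    """Map Sensitivity-Level (0.0 to 1.0) to an approximate dBFS threshold.

    ASSUMPTION: linear interpolation between the documented endpoints
    (0 -> -50 dBFS, 1 -> -90 dBFS). The real mapping may differ.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("Sensitivity-Level must be between 0.0 and 1.0")
    return -50.0 + level * (-90.0 - (-50.0))


# Arbitrary example values, not defaults.
print(effective_speech_timeout_ms(3000, 1500))  # 1500
print(sensitivity_to_dbfs(0.5))                 # -70.0 under the linear assumption
```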

Passing hints

You can also use the "fake" __TRANSCRIBE__ grammar, which enables large vocabulary transcription, to pass hints. An example of how to do this is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
    version="1.0" xml:lang="en-US" tag-format="semantics/1.0"
    root="__TRANSCRIBE__" hints="press_one:10,press_two">
  <rule id="__TRANSCRIBE__">
    <item>transcribe</item>
  </rule>
</grammar>

See Using Hints for more details on how to format hints strings.
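
If the hints change from call to call, the grammar can be generated rather than hand-edited. The Python sketch below is one possible approach, not an official helper: `format_hints` and `build_transcribe_grammar` are hypothetical names, and the hint string only reproduces the comma-separated `name:weight` shape visible in the example above. The Using Hints article remains the authority on the hint syntax.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Same grammar as in the example above, with the language and hints filled in.
GRAMMAR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
    version="1.0" xml:lang={lang} tag-format="semantics/1.0"
    root="__TRANSCRIBE__" hints={hints}>
  <rule id="__TRANSCRIBE__">
    <item>transcribe</item>
  </rule>
</grammar>
"""


def format_hints(hints):
    """Join (phrase, weight) pairs as 'phrase:weight,phrase'.

    A weight of None emits the phrase alone, as 'press_two' does above.
    """
    parts = []
    for phrase, weight in hints:
        parts.append(phrase if weight is None else f"{phrase}:{weight}")
    return ",".join(parts)


def build_transcribe_grammar(hints, lang="en-US"):
    """Return the __TRANSCRIBE__ grammar with an XML-escaped hints attribute."""
    return GRAMMAR_TEMPLATE.format(
        lang=quoteattr(lang),
        hints=quoteattr(format_hints(hints)),
    )


grammar = build_transcribe_grammar([("press_one", 10), ("press_two", None)])
# Sanity check: raises xml.etree.ElementTree.ParseError if not well formed.
ET.fromstring(grammar.encode("utf-8"))
print(grammar)
```

Because `quoteattr` escapes the attribute value, a hint containing a quote or ampersand still produces a well-formed grammar. As noted earlier, the resulting grammar has to be sent inline in the MRCP request.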

Notes about using /asr/recognize API with the above grammars.

The /asr/recognize API would not normally be used with the above grammars for large vocabulary transcription, because that is what the /asr/transcribe API is for.

However, this is supported, and it is equivalent to using builtin:grammar/transcribe under MRCP ASR. For example:

"settings": {
  "asr": {
    "grammars": [
      {
        "type": "BUILT-IN",
        "name": "transcribe"
      }
    ],
    "noInputTimeout": 6000,
    "completeTimeout": 2000,
    "incompleteTimeout": 2000,
    "maxAlternatives": 5,
    "confidenceThreshold": 0.0001,
    "languages": ["en"]
  }
}
