MRCP was designed to be used with grammars, but the Voicegain platform also allows the use of our large vocabulary (non-grammar) recognizer over MRCP.
There are two ways to tell the recognizer to do large vocabulary transcription:
- Pass the URI of a special built-in grammar. Note that this may not work on some VXML platforms or MRCP clients that check whether the grammar is valid. The built-in grammar names that enable large vocabulary recognition are:
- builtin:speech/transcribe
- builtin:grammar/transcribe
- builtin:none
- Pass a GRXML grammar with the root rule name attribute set to "__TRANSCRIBE__". Apart from that, the content of the grammar does not matter, but it should be a valid, parseable GRXML grammar, e.g. the one shown below.
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
version="1.0" xml:lang="en-US" tag-format="semantics/1.0"
root="__TRANSCRIBE__">
<rule id="__TRANSCRIBE__">
<item>transcribe</item>
</rule>
</grammar>
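Because the grammar has to be parseable, it can help to check it locally before sending it over MRCP. The following is a minimal Python sketch of such a check, using only the standard library. The function name is_transcribe_grammar is hypothetical, and the checks it performs (well-formed XML, root attribute, matching rule id) are an assumption about what is worth verifying, not a description of how the Voicegain platform validates grammars.

```python
import xml.etree.ElementTree as ET

GRXML_NS = "http://www.w3.org/2001/06/grammar"
TRANSCRIBE_RULE = "__TRANSCRIBE__"

# The example grammar from this article, as a Python string.
EXAMPLE_GRAMMAR = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<grammar xmlns="http://www.w3.org/2001/06/grammar"\n'
    ' version="1.0" xml:lang="en-US" tag-format="semantics/1.0"\n'
    ' root="__TRANSCRIBE__">\n'
    '<rule id="__TRANSCRIBE__">\n'
    '<item>transcribe</item>\n'
    '</rule>\n'
    '</grammar>\n'
)


def is_transcribe_grammar(grxml_text: str) -> bool:
    """Hypothetical local pre-check: well-formed XML whose root rule is __TRANSCRIBE__."""
    try:
        root = ET.fromstring(grxml_text.encode("utf-8"))
    except ET.ParseError:
        return False  # not parseable XML
    if root.tag != f"{{{GRXML_NS}}}grammar":
        return False  # top-level element is not a GRXML <grammar>
    if root.get("root") != TRANSCRIBE_RULE:
        return False  # root attribute does not request transcription
    # The rule named by the root attribute should actually exist.
    rule_ids = [rule.get("id") for rule in root.findall(f"{{{GRXML_NS}}}rule")]
    return TRANSCRIBE_RULE in rule_ids


if __name__ == "__main__":
    print(is_transcribe_grammar(EXAMPLE_GRAMMAR))
```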
Results from large vocabulary transcription are returned in the following form, e.g.:
<result>
<interpretation grammar="session:request1@form-level" confidence="0.94">
<input mode="speech">make a payment</input>
</interpretation>
</result>
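On the client side, the transcript and confidence can be pulled out of a result like the one above with a few lines of code. This is a minimal Python sketch that assumes the result has the same shape as the example (a single interpretation element containing one input element). The function name parse_transcription_result and the returned dictionary keys are hypothetical. Element names are matched by local name so that the sketch also tolerates a result that carries an XML namespace.

```python
import xml.etree.ElementTree as ET

# The example result from this article.
EXAMPLE_RESULT = """<result>
<interpretation grammar="session:request1@form-level" confidence="0.94">
<input mode="speech">make a payment</input>
</interpretation>
</result>"""


def _local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def parse_transcription_result(result_xml: str):
    """Return text, confidence and grammar of the first interpretation, or None."""
    root = ET.fromstring(result_xml)
    for element in root.iter():
        if _local_name(element.tag) != "interpretation":
            continue
        text = ""
        for child in element:
            if _local_name(child.tag) == "input":
                text = (child.text or "").strip()
                break
        confidence = element.get("confidence")
        return {
            "text": text,
            "confidence": float(confidence) if confidence is not None else None,
            "grammar": element.get("grammar"),
        }
    return None  # no interpretation element found


if __name__ == "__main__":
    print(parse_transcription_result(EXAMPLE_RESULT))
```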
MRCP Parameters that apply
Here are some of the parameters that control large vocabulary recognition over MRCP:
- No-Input-Timeout - stops recognition if no speech is detected within the specified time, counted from the start of the recognition command. Default value is 5000 msec.
- Recognition-Timeout - maximum duration of recognition. Default value is 60000 msec (1 minute).
- Speech-Complete-Timeout and Speech-Incomplete-Timeout - because in the absence of grammars it is difficult to define when speech is complete or incomplete, these two parameters are treated the same, and the actual speech timeout used by the ASR is the minimum of the two. The speech timeout is counted from the end of the last recognized word (which means that if no speech has been detected, No-Input-Timeout is in effect instead).
The defaults for these values come from the grammar recognition use case, which is why they differ: the Speech-Complete-Timeout default is 2000 msec and the Speech-Incomplete-Timeout default is 5000 msec. This means that for large vocabulary transcription the default speech timeout in practice is 2000 msec.
- Confidence-Threshold - float value between 0.0 and 1.0. If the confidence of the recognized utterance is below this threshold, a NO-MATCH is returned. Default value is 0.0, which means that all recognitions will be MATCH.
- Sensitivity-Level - determines the threshold for rejection of low-volume background speech or noise.
1 is most sensitive (least rejection) and corresponds to -90 dBFS.
0 is least sensitive (most rejection) and corresponds to -50 dBFS.
This setting affects start-of-speech and end-of-speech detection. If background noise or speech triggers start-of-speech too early, try a lower sensitivity level.
Default value is 0.5.
Unless mentioned above, there are no maximum values set for those parameters.
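The timeout and threshold rules described above can be summarized in a short Python sketch. It only restates the rules from this section: the speech timeout is the minimum of the two speech timeout parameters, No-Input-Timeout applies until speech has been detected, and a confidence below Confidence-Threshold yields NO-MATCH. All function and parameter names are hypothetical, and the sketch is an illustration of the documented behaviour, not the recognizer's actual implementation.

```python
def effective_speech_timeout_ms(speech_complete_ms: int, speech_incomplete_ms: int) -> int:
    """Both parameters are treated the same; the smaller one is used."""
    return min(speech_complete_ms, speech_incomplete_ms)


def silence_timeout_ms(
    speech_detected: bool,
    no_input_ms: int,
    speech_complete_ms: int,
    speech_incomplete_ms: int,
) -> int:
    """Timeout that applies to the current stretch of silence.

    Before any speech is detected, No-Input-Timeout is in effect (counted
    from the start of the recognition command). Afterwards the speech
    timeout is in effect (counted from the end of the last recognized word).
    """
    if not speech_detected:
        return no_input_ms
    return effective_speech_timeout_ms(speech_complete_ms, speech_incomplete_ms)


def match_outcome(confidence: float, confidence_threshold: float = 0.0) -> str:
    """NO-MATCH if the confidence is below the threshold, MATCH otherwise."""
    return "NO-MATCH" if confidence < confidence_threshold else "MATCH"


if __name__ == "__main__":
    # Speech timeouts of 2000 and 5000 msec give an effective timeout of 2000 msec.
    print(effective_speech_timeout_ms(2000, 5000))
    print(silence_timeout_ms(False, 5000, 2000, 5000))
    print(match_outcome(0.94))
```

With a threshold of 0.0 every recognition is a MATCH, because no confidence value can fall below it.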
Passing hints
You can also pass hints through the "fake" __TRANSCRIBE__ grammar that enables large vocabulary transcription. An example of how to do this is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
version="1.0" xml:lang="en-US" tag-format="semantics/1.0"
root="__TRANSCRIBE__" hints="press_one:10,press_two">
<rule id="__TRANSCRIBE__">
<item>transcribe</item>
</rule>
</grammar>
See Using Hints for more details on how to format hints strings.
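If the hints change from call to call, the grammar can be generated rather than stored as a static file. The following minimal Python sketch builds the same grammar as in the example, with an optional hints attribute. The function name build_transcribe_grammar is hypothetical, and the hints string is passed through unchanged: the sketch does not validate it against the hints format, which is described in Using Hints.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def build_transcribe_grammar(hints: str = "", lang: str = "en-US") -> str:
    """Return a __TRANSCRIBE__ GRXML grammar, optionally carrying a hints attribute."""
    # quoteattr escapes the value and wraps it in quotes, so the result
    # stays well-formed even if the hints contain characters such as & or <.
    hints_attr = f" hints={quoteattr(hints)}" if hints else ""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<grammar xmlns="http://www.w3.org/2001/06/grammar"\n'
        f' version="1.0" xml:lang={quoteattr(lang)} tag-format="semantics/1.0"\n'
        f' root="__TRANSCRIBE__"{hints_attr}>\n'
        '<rule id="__TRANSCRIBE__">\n'
        '<item>transcribe</item>\n'
        '</rule>\n'
        '</grammar>\n'
    )


if __name__ == "__main__":
    grammar = build_transcribe_grammar("press_one:10,press_two")
    # Parse the generated text back to confirm it is well-formed XML.
    parsed = ET.fromstring(grammar.encode("utf-8"))
    print(parsed.get("hints"))
```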