platypush.plugins.stt.deepspeech

class platypush.plugins.stt.deepspeech.SttDeepspeechPlugin(model_file: str, lm_file: str, trie_file: str, lm_alpha: float = 0.75, lm_beta: float = 1.85, beam_width: int = 500, *args, **kwargs)[source]

This plugin performs speech-to-text and speech detection using the Mozilla DeepSpeech engine.

Requires:

  • deepspeech (pip install 'deepspeech>=0.6.0')
  • numpy (pip install numpy)
  • sounddevice (pip install sounddevice)
__init__(model_file: str, lm_file: str, trie_file: str, lm_alpha: float = 0.75, lm_beta: float = 1.85, beam_width: int = 500, *args, **kwargs)[source]

In order to run the speech-to-text engine you'll need to download the model files matching the version of the DeepSpeech engine that you have installed:

# Create the working folder for the models
export MODELS_DIR=~/models
mkdir -p $MODELS_DIR
cd $MODELS_DIR

# Download and extract the model files for your version of DeepSpeech. This may take a while.
export DEEPSPEECH_VERSION=0.6.1
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
tar -xvzf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
# Expected output:
# x deepspeech-0.6.1-models/
# x deepspeech-0.6.1-models/lm.binary
# x deepspeech-0.6.1-models/output_graph.pbmm
# x deepspeech-0.6.1-models/output_graph.pb
# x deepspeech-0.6.1-models/trie
# x deepspeech-0.6.1-models/output_graph.tflite
Parameters:
  • model_file – Path to the model file (usually named output_graph.pb or output_graph.pbmm). Note that .pbmm files usually perform better and are smaller.
  • lm_file – Path to the language model binary file (usually named lm.binary).
  • trie_file – Path to the trie file built from the same vocabulary as the language model binary (usually named trie).
  • lm_alpha – The alpha hyperparameter of the CTC decoder - Language Model weight. See <https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0>.
  • lm_beta – The beta hyperparameter of the CTC decoder - Word Insertion weight. See <https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0>.
  • beam_width – Decoder beam width (see beam scoring in KenLM language model).
  • input_device – PortAudio device index or name that will be used for recording speech (default: default system audio input device).
  • hotword – When this word is detected, the plugin will trigger a platypush.message.event.stt.HotwordDetectedEvent instead of a platypush.message.event.stt.SpeechDetectedEvent event. You can use these events to hook up other assistants.
  • hotwords – Use a list of hotwords instead of a single one.
  • conversation_timeout – If hotword or hotwords are set together with conversation_timeout, the next speech detected within the timeout window will trigger a platypush.message.event.stt.ConversationDetectedEvent instead of a platypush.message.event.stt.SpeechDetectedEvent event. You can attach custom hooks to this event to run arbitrary logic depending on the detected speech - this emulates an "OK, Google. Turn on the lights" style of interaction without using an external assistant.
  • block_duration – Duration of the acquired audio blocks (default: 1 second).
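Given the parameters above, a configuration sketch for platypush's config.yaml might look like the following (the model paths and hotword values are illustrative assumptions, not defaults):

```yaml
stt.deepspeech:
    model_file: ~/models/deepspeech-0.6.1-models/output_graph.pbmm
    lm_file: ~/models/deepspeech-0.6.1-models/lm.binary
    trie_file: ~/models/deepspeech-0.6.1-models/trie
    lm_alpha: 0.75
    lm_beta: 1.85
    beam_width: 500
    hotword: computer          # illustrative value
    conversation_timeout: 10.0 # illustrative value
```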
static convert_frames(frames: Union[numpy.ndarray, bytes]) → numpy.ndarray[source]

Conversion method for raw audio frames. The default implementation returns the input frames as passed; override it if your logic requires a different format.

Parameters:frames – Input audio frames, as bytes.
Returns:The audio frames as passed on the input. Override if required.
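As an illustration of the kind of conversion involved, the sketch below (not the plugin's actual implementation) reinterprets raw 16-bit PCM bytes as a numpy int16 array, which is the input format DeepSpeech's stt() API expects:

```python
import numpy as np

def convert_frames(frames: bytes) -> np.ndarray:
    # Interpret raw 16-bit PCM audio bytes as an int16 sample array.
    return np.frombuffer(frames, dtype=np.int16)

# Example: 4 bytes = two little-endian int16 samples
samples = convert_frames(b'\x00\x01\x00\x02')
print(samples.tolist())  # → [256, 512]
```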
detect(audio_file: str) → platypush.message.response.stt.SpeechDetectedResponse[source]

Perform speech-to-text analysis on an audio file.

Parameters:audio_file – Path to the audio file.
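A rough sketch of what such an analysis involves is shown below. read_audio is a hypothetical helper, and the commented-out lines show hedged usage of the DeepSpeech 0.6 Python API (Model and enableDecoderWithLM), wired with the lm_alpha, lm_beta and beam_width defaults from the constructor:

```python
import wave
import numpy as np

def read_audio(audio_file: str) -> np.ndarray:
    # Load a 16-bit mono WAV file into the int16 buffer that
    # DeepSpeech's Model.stt() expects (models are trained on 16 kHz audio).
    with wave.open(audio_file, 'rb') as f:
        return np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

# Hypothetical usage (requires deepspeech>=0.6.0 and downloaded model files):
# from deepspeech import Model
# model = Model('output_graph.pbmm', 500)                     # beam_width
# model.enableDecoderWithLM('lm.binary', 'trie', 0.75, 1.85)  # lm_alpha, lm_beta
# print(model.stt(read_audio('speech.wav')))
```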
detect_speech(frames) → str[source]

Method called within the detection_thread when new audio frames have been captured. Must be implemented by the derived classes.

Parameters:frames – Audio frames, as returned by convert_frames.
Returns:Detected text, as a string. Returns an empty string if no text has been detected.
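A minimal sketch of how a derived class might satisfy this contract, assuming a DeepSpeech-style model object exposing an stt() method (MySttPlugin and the model wiring are illustrative, not part of the platypush API):

```python
class MySttPlugin:
    def __init__(self, model):
        self.model = model  # e.g. a deepspeech.Model instance

    def detect_speech(self, frames) -> str:
        # Run inference on the converted frames; normalize None or empty
        # results to the empty string, as the contract above requires.
        return self.model.stt(frames) or ''
```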
on_detection_ended()[source]

Method called when the detection_thread stops. Clean up your context variables and models here.

on_detection_started()[source]

Method called when the detection_thread starts. Initialize your context variables and models here if required.

on_speech_detected(speech: str) → None[source]

Hook called when speech is detected. Triggers the right event depending on the current context.

Parameters:speech – Detected speech.
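The event-selection logic described in the hotword and conversation_timeout parameters above can be sketched as follows. The event names come from this document; the SpeechContext class and its classify method are illustrative, not the plugin's internals:

```python
import time

class SpeechContext:
    # A detected hotword opens a "conversation window"; speech detected
    # within conversation_timeout seconds is classified as a conversation.
    def __init__(self, hotword: str, conversation_timeout: float):
        self.hotword = hotword
        self.conversation_timeout = conversation_timeout
        self._hotword_time = None

    def classify(self, speech: str) -> str:
        now = time.time()
        if speech.strip().lower() == self.hotword:
            self._hotword_time = now
            return 'HotwordDetectedEvent'
        if self._hotword_time and now - self._hotword_time <= self.conversation_timeout:
            self._hotword_time = None
            return 'ConversationDetectedEvent'
        return 'SpeechDetectedEvent'
```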