Difference between revisions of "Speech recognition"

From ArchWiki
m (Dragon Naturally Speaking in Wine: Fix Typo.)
(List of text to speech applications: Added a link to a site where the different engines can be listen to)
 
(43 intermediate revisions by 14 users not shown)
Line 1: Line 1:
 
[[Category:Accessibility]]
 
[[Category:Accessibility]]
[[Category:Audio/Video]]
+
[[Category:Multimedia]]
Speech recognition is any means by which you can interface with your computer via spoken word.   This page is designed to identify applications that can facilitate speech recognition and to serve as a guide in installing and using this software in Arch.
+
[[ja:音声認識]]
 +
Speech recognition is any means by which you can interface with your computer via spoken word. This page is designed to identify applications that can facilitate speech recognition and to serve as a guide in installing and using this software in Arch.
  
'''A note to newcomers:''' Speech recognition is something that traditionally has not been well supported in Linux. If you become interested and choose to dig below the immediate surface, you can expect difficulty in finding documentation or help from the community.
+
'''A note to newcomers:''' speech recognition is something that traditionally has not been well supported in Linux. If you become interested and choose to dig below the immediate surface, you can expect difficulty in finding documentation or help from the community.
 +
 
 +
== Types of speech recognition ==
  
==Types of Speech Recognition==
 
 
Speech recognition can mean several things:
 
Speech recognition can mean several things:
* Text-To-Speech:
 
*:As it sounds, Text-To-Speech (or TTS)  will manipulate a string of text into an audio clip.  There are several programs available that perform TTS, some of which are command-line based (ideal for scripting) and others which provide a handy GUI. 
 
*Simple Voice Control/Commands:
 
*:This is the most basic form of Speech-To-Text application.  These are designed to recognize a small number of specific, typically one-word commands and then perform an action.  This is often used as an alternative to an application launcher, allowing the user for instance to say the word “firefox” and have his OS open a new browser window. 
 
*Full dictation/recognition:
 
*:Full dictation/recognition software allows the user to read full sentences or paragraphs and translates that data into text on the fly.  This could be used, for instance, to dictate an entire letter into the window of an email client.  In some cases, these types of applications need to be trained to your voice and can improve in accuracy the more they are used. 
 
 
==Development Status==
 
Several years ago there was a push to implement speech recognition in Linux.  Since then, many of those projects have stagnated. 
 
  
==Text-To-Speech==
+
;Text-To-Speech
The two major players in text-to-speech applications are Festival and eSpeak.  Comparison available [http://braille.uwo.ca/pipermail/speakup/2008-July/046755.html here]
+
:As it sounds, Text-To-Speech (or TTS) will manipulate a string of text into an audio clip. It is useful for blind people to be able to use computers but can also be used to simply improve computer experience. There are several programs available that perform TTS, some of which are command-line based (ideal for scripting) and others which provide a handy GUI.
  
===Festival===
+
;Simple Voice Control/Commands
[[Festival]] offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number APIs: from shell level, though a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently English (British and American), and Spanish) though English is the most advanced.  
+
:This is the most basic form of Speech-To-Text application. These are designed to recognize a small number of specific, typically one-word commands and then perform an action. This is often used as an alternative to an application launcher, allowing the user for instance to say the word “firefox” and have his OS open a new browser window.
  
* Free
+
;Full dictation/recognition
* Can install several different voices/accents.
+
:Full dictation/recognition software allows the user to read full sentences or paragraphs and translates that data into text on the fly. This could be used, for instance, to dictate an entire letter into the window of an email client. In some cases, these types of applications need to be trained to your voice and can improve in accuracy the more they are used.
* Available in Extra
 
  
[http://www.cstr.ed.ac.uk/projects/festival/ Site Link]
+
== List of text to speech applications ==
  
===eSpeak===
+
Two text-to-speech applications are Festival and eSpeak, a small feature comparison is available in a mailing list [http://web.archive.org/web/20090924193011/http://braille.uwo.ca/pipermail/speakup/2008-July/046756.html thread]. You can find a listening comparison of the different engines [https://tools.wmflabs.org/tts-comparison/ here].
[http://espeak.sourceforge.net/ eSpeak] is "a compact open source software speech synthesizer for English and other languages, for Linux and Windows".
 
  
*Open source
+
;Engines: TTS-Engines that can be used as commandline tools or embedded in other applications:
*Lightweight
+
* {{App|[[Wikipedia:eSpeak|eSpeak]]|Compact open source software speech synthesizer for more than 50 languages.|http://espeak.sourceforge.net/|{{Pkg|espeak}}}}
*Available in the community repository
+
* {{App|[[Wikipedia:eSpeakNG|eSpeakNG]]|Fork of eSpeak (due to inactivity of original maintainer).|https://github.com/espeak-ng/espeak-ng|{{AUR|espeak-ng-git}}}}
*Excellent language support
+
* {{App|[[Festival]]|General framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech.|http://www.cstr.ed.ac.uk/projects/festival/|{{Pkg|festival}}}}
 +
* {{App|[[mbrola|MBROLA]]|Non-free phonemes-to-audio program which supports more than 70 languages. Mbrola-voices can also be used with eSpeak.|http://tcts.fpms.ac.be/synthesis/mbrola.html|{{AUR|mbrola}}}}
 +
* {{App|Flite|A lighweight speech synthesis engine|http://www.festvox.org/flite/|{{Pkg|flite}}}}
 +
* {{App|SVOX Pico|The text-to-speech engine used on Android phones. (Available languages are en-US, en-GB, de-DE, es-ES, fr-FR and it-IT)|-|{{AUR|svox-pico-bin}}}}
 +
* {{App|Mimic|Text-to-speech voice synthesis from the Mycroft project (based on flite)|https://mimic.mycroft.ai/|{{Aur|mimic-git}}}}
 +
* {{App|Marytts|An open-source, multilingual Text-to-Speech Synthesis platform written in Java|http://mary.dfki.de/|{{Aur|marytts}}}}
  
=====Installing eSpeak=====
+
;language-specific Engines:
To install eSpeak:
+
* {{App|Ekho|Chinese text-to-speech (TTS) software for Cantonese, Mandarin, Zhaoan Hakka, Tibetan, Ngangien and Korean|http://www.eguidedog.net/ekho.php|{{Aur|ekho}}}}
:{{bc| pacman -S espeak}}
+
* {{App|Open-jtalk|Japanese text-to-speech synthesis system|https://sourceforge.net/projects/open-jtalk/|{{Aur|open-jtalk}}}}
=====Testing eSpeak=====
 
:{{bc| echo "Hello. This is a test." <nowiki>|</nowiki> espeak}}
 
  
=====eSpeak Usage/Configuration=====
+
;User Applications: Mostly Graphical Applications using one of above engines:
The Documents page on the eSpeak website [http://espeak.sourceforge.net/docindex.html here] provides an excellent guide for using different voices, adjusting pronunciation, etc. There are many different accents included in this install that are worth trying out.
+
* {{App|Gnome speech|API to incorporate speech into GNOME application menus|}}
 +
* {{App|Jovie|KDE Text To Speech Daemon.|https://userbase.kde.org/Jovie|{{Pkg|kdeaccessibility-jovie}}}}
 +
* {{App|Orca|Screen reader for individuals who are blind or visually impaired, using eSpeak (via speech-dispatcher)|http://www.gnome.org/projects/orca|{{Pkg|orca}}}}
 +
* {{App|[[Simple Orca Plugin System]]|Plug-in extension for the Orca screen reader|https://stormdragon.tk/orca-plugins/index.php|{{AUR|simpleorcapluginsystem-git}}}}
 +
* {{App|Speech Dispatcher|Common interface to speech synthesis. It has backends for eSpeak, Festival, and a few other speech synthesizers.|http://www.freebsoft.org/speechd|{{Pkg|speech-dispatcher}}}}
 +
* {{App|Gespeaker|Gespeaker is a GTK+ frontend for espeak|http://www.muflone.com/gespeaker/|{{AUR|gespeaker-git}}}}
  
 +
== List of voiced commands applications ==
  
==Voiced Commands==
+
{{Out of date|Replace the list with available software, e.g. {{AUR|voximp}}.}}
===Gnome-Voice-Control===
 
Gnome-Voice-Control is a dialogue system to control the GNOME Desktop. It is developed on Google Summer of Code 2007.  
 
  
Available in AUR
+
=== VEDICS ===
  
===VEDICS===
 
 
VEDICS (Voice Enabled Desktop Interaction and Control System) is an assistive software which lets the user to interact with the OS using voice commands.
 
VEDICS (Voice Enabled Desktop Interaction and Control System) is an assistive software which lets the user to interact with the OS using voice commands.
  
 
Note:
 
Note:
Not yet tested
+
* Last updated in 2011.
 +
* Not yet tested.
  
 
[http://vedics.sourceforge.net/ Site Link]
 
[http://vedics.sourceforge.net/ Site Link]
  
 
Features:
 
Features:
#Perform common window operations like close, minimize, maximize etc.
+
*Perform common window operations like close, minimize, maximize etc.
#Invoke default applications like browsers, mail clients etc.
+
*Invoke default applications like browsers, mail clients etc.
#Access any element on the desktop just by saying its name.
+
*Access any element on the desktop just by saying its name.
#Supports GNOME3, GNOME2
+
*Supports GNOME3, GNOME2
  
 
===Perlbox-Voice===
 
===Perlbox-Voice===
Line 72: Line 70:
 
Note:
 
Note:
 
*Last updated in 2005
 
*Last updated in 2005
*Package is in AUR, but missing festival-don dependency.
 
  
 
[http://perlbox.sourceforge.net/pbtk/ Site Link]
 
[http://perlbox.sourceforge.net/pbtk/ Site Link]
  
 
Features:
 
Features:
#Text to speech (Thanks to the Festival speech synthesizer)
+
*Text to speech (Thanks to the Festival speech synthesizer)
#Voice control to open user specified applications. For example, if you say "Web", the Perlbox-Voice Control will open the browser of your choice.
+
*Voice control to open user specified applications. For example, if you say "Web", the Perlbox-Voice Control will open the browser of your choice.
#Desktop plugins to control your Linux desktop using only your voice. You can switch virtual screens, cycle through desktops, invoke the run dialog, quick lock the screen.  
+
*Desktop plugins to control your Linux desktop using only your voice. You can switch virtual screens, cycle through desktops, invoke the run dialog, quick lock the screen.
#Custom commands are fully supported, and you can add commands on the fly.
+
*Custom commands are fully supported, and you can add commands on the fly.
#Pseudo Commands' allow you to enter commands that the speaker should say. For example, if you say "Good morning", the computer voice could say "And good morning to you".  
+
*Pseudo Commands' allow you to enter commands that the speaker should say. For example, if you say "Good morning", the computer voice could say "And good morning to you".
 +
 
 +
== List of speech recognition applications ==
 +
=== Free Speech Recognition Engines ===
 +
 
 +
;CMU Sphinx: See http://cmusphinx.sourceforge.net/ and [[wikipedia:CMU_Sphinx|Wikipedia]].
 +
 
 +
;Simon: http://sourceforge.net/projects/speech2text/ - Simon is a QT interface to Julius that will replace the mouse and keyboard with your voice. Works with X11 and Windows
 +
 
 +
;Speech: [https://github.com/andre-luiz-dos-santos/speech-app Speech] is a Chrome App for dictation, using Google's speech recognition engine.
 +
 
 +
;Julius: Julius is a large vocabulary continuous speech recognition decoder, their project page is located on http://julius.sourceforge.jp/en_index.php
 +
 
 +
;XVoice: Uses ViaVoice to pass text to X applications. http://xvoice.sourceforge.net/
 +
 
 +
;ViaVoice: [[wikipedia:IBM ViaVoice]]
 +
 
 +
;sphinxkeys: http://code.google.com/p/sphinxkeys/ - You can essentially type keyboard keys and mouse clicks by speaking into your microphone
 +
 
 +
;VoxForge: http://www.voxforge.org/ - a project that collects speech transcriptions for use in open source speech recognition engines
 +
 
 +
=== Proprietary Speech Recognition Engines ===
 +
 
 +
;Dragon Naturally Speaking in Wine: Dragon Naturally Speaking software by Nuance is a well-functioning and popular implementation of speech dictation. It is developed for Windows, but has been run successfully in a Linux enviornment using wine. It can be used independently for dictation into other wine programs such as notepad or it can be paired with Platypus to interface with any native linux program. Platypus also provides a feature to control of your OS using voice commands, similar to the programs described in [[#List of voiced commands applications]].
 +
:Nuance's software is non-free, so you will have to purchase a copy. Note that Dragon provides you with the ability to install it on a set number of machines.Installing/Reinstalling in wine may use up some of these licenses.
 +
:[http://thenerdshow.com/platypus.html Platypus Project]
 +
 
 +
;Wizzscribe SI
  
 +
;Verbio ASR
  
==Speech Recognition==
+
;DynaSpeak from SRI International
===Free Speech Recognition Engines===
 
====CMU Sphinx====
 
See http://cmusphinx.sourceforge.net/ and [http://en.wikipedia.org/wiki/CMU_Sphinx Wikipedia].
 
====Simon====
 
====Julius====
 
====XVoice====
 
====ViaVoice====
 
====sphinxkeys====
 
====VoxForge====
 
  
===Proprietary Speech Recognition Engines===
+
;LumenVox Speech Engine
====Dragon Naturally Speaking in Wine====
 
Dragon Naturally Speaking software by Nuance is a well-functioning and popular implementation of speech dictation.  It is developed for Windows, but has been run sucsessfully in a a linux enviornment using wine.  It can be used independently for dictation into other wine programs such as notepad or it can be paired with Platypus to interface with any native linux program.  Platypus also provides a feature to control of your OS using voice commands, similar to the programs described in the [[Speech_Recognition#Voiced_Commands | Voiced Commands]] section.
 
  
Nuance's software is non-free, so you will have to purchase a copy. Note that Dragon provides you with the ability to install it on a set number of machines.  Installing/Reinstalling in wine may use up some of these licenses.  
+
;VoxSigma: http://www.vocapia.com and [[wikipedia:VoxSigma|Wikipedia]]
 +
:VoxSigma is a speech-to-text software suite by Vocapia Research. It is well suited for broadcast monitoring, audio visual archive indexing, telephone speech analytics, transcription of business conference calls, and video subtitling.
  
[http://thenerdshow.com/platypus.html Platypus Project]
+
== See also ==
  
====Wizzscribe SI====
+
* [http://kubuntu.free.fr/blog/index.php/2006/09/24/121-synthese-vocale-en-francais-sous-linux Synthèse vocale en français sous Linux - KubuntuBlog (french)]
====Verbio ASR====
 
====DynaSpeak from SRI International====
 
====LumenVox Speech Engine====
 

Latest revision as of 19:44, 4 November 2017

Speech recognition is any means by which you can interface with your computer via spoken word. This page is designed to identify applications that can facilitate speech recognition and to serve as a guide in installing and using this software in Arch.

A note to newcomers: speech recognition is something that traditionally has not been well supported in Linux. If you become interested and choose to dig below the immediate surface, you can expect difficulty in finding documentation or help from the community.

Types of speech recognition

Speech recognition can mean several things:

Text-To-Speech
As it sounds, Text-To-Speech (or TTS) converts a string of text into an audio clip. It helps blind people use computers, but it can also simply improve the computing experience. There are several programs available that perform TTS, some of which are command-line based (ideal for scripting) and others which provide a handy GUI.
Simple Voice Control/Commands
This is the most basic form of Speech-To-Text application. These are designed to recognize a small number of specific, typically one-word commands and then perform an action. This is often used as an alternative to an application launcher, allowing the user for instance to say the word “firefox” and have the OS open a new browser window.
Full dictation/recognition
Full dictation/recognition software lets the user speak full sentences or paragraphs and transcribes them into text on the fly. This could be used, for instance, to dictate an entire letter into the window of an email client. In some cases, these types of applications need to be trained to your voice and can improve in accuracy the more they are used.
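The simple voice control case described above amounts to a lookup from a recognized word to a program to launch. The following is a minimal Python sketch of that idea, assuming some recognizer has already produced the spoken word as text. The VOICE_COMMANDS table and the function names are hypothetical and are not taken from any application listed on this page.

```python
import shlex
import subprocess

# Hypothetical vocabulary: spoken word -> command line to launch.
VOICE_COMMANDS = {
    "firefox": "firefox",
    "terminal": "xterm",
}


def resolve_command(recognized_word, commands=VOICE_COMMANDS):
    """Return the argv list for a recognized word, or None if it is unknown."""
    key = recognized_word.strip().lower()
    command_line = commands.get(key)
    if command_line is None:
        return None
    return shlex.split(command_line)


def launch(recognized_word, commands=VOICE_COMMANDS):
    """Start the program mapped to the word; return True if one was started."""
    argv = resolve_command(recognized_word, commands)
    if argv is None:
        return False
    subprocess.Popen(argv)  # do not wait for the launched program to exit
    return True
```

Keeping the lookup (resolve_command) separate from the launch step makes it possible to check the vocabulary without starting any program.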

List of text to speech applications

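Two text-to-speech applications are Festival and eSpeak; a small feature comparison is available in a mailing list thread (http://web.archive.org/web/20090924193011/http://braille.uwo.ca/pipermail/speakup/2008-July/046756.html). A listening comparison of the different engines is available at https://tools.wmflabs.org/tts-comparison/.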

Engines
TTS engines that can be used as command-line tools or embedded in other applications:
  • eSpeak — Compact open source software speech synthesizer for more than 50 languages.
http://espeak.sourceforge.net/ || espeak
  • eSpeakNG — Fork of eSpeak (due to inactivity of the original maintainer).
https://github.com/espeak-ng/espeak-ng || espeak-ng-git (AUR)
  • Festival — General framework for building speech synthesis systems, including examples of various modules. As a whole it offers full text to speech.
http://www.cstr.ed.ac.uk/projects/festival/ || festival
  • MBROLA — Non-free phonemes-to-audio program which supports more than 70 languages. MBROLA voices can also be used with eSpeak.
http://tcts.fpms.ac.be/synthesis/mbrola.html || mbrola (AUR)
  • Flite — A lightweight speech synthesis engine.
http://www.festvox.org/flite/ || flite
  • SVOX Pico — The text-to-speech engine used on Android phones. (Available languages are en-US, en-GB, de-DE, es-ES, fr-FR and it-IT.)
- || svox-pico-bin (AUR)
  • Mimic — Text-to-speech voice synthesis from the Mycroft project (based on Flite).
https://mimic.mycroft.ai/ || mimic-git (AUR)
  • MaryTTS — An open-source, multilingual text-to-speech synthesis platform written in Java.
http://mary.dfki.de/ || marytts (AUR)
Language-specific Engines
  • Ekho — Chinese text-to-speech (TTS) software for Cantonese, Mandarin, Zhaoan Hakka, Tibetan, Ngangien and Korean.
http://www.eguidedog.net/ekho.php || ekho (AUR)
  • Open-jtalk — Japanese text-to-speech synthesis system.
https://sourceforge.net/projects/open-jtalk/ || open-jtalk (AUR)
User Applications
Mostly graphical applications using one of the above engines:
  • Gnome speech — API to incorporate speech into GNOME application menus.
|| not packaged (search the AUR)
  • Jovie — KDE text-to-speech daemon.
https://userbase.kde.org/Jovie || kdeaccessibility-jovie
  • Orca — Screen reader for individuals who are blind or visually impaired, using eSpeak (via speech-dispatcher).
http://www.gnome.org/projects/orca || orca
  • Simple Orca Plugin System — Plug-in extension for the Orca screen reader.
https://stormdragon.tk/orca-plugins/index.php || simpleorcapluginsystem-git (AUR)
  • Speech Dispatcher — Common interface to speech synthesis. It has backends for eSpeak, Festival, and a few other speech synthesizers.
http://www.freebsoft.org/speechd || speech-dispatcher
  • Gespeaker — GTK+ frontend for eSpeak.
http://www.muflone.com/gespeaker/ || gespeaker-git (AUR)
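As a rough illustration of scripting one of the command-line engines listed above, the Python sketch below assembles an espeak invocation. The helper names build_espeak_command and speak are hypothetical. The options -v (voice), -s (speed in words per minute) and -w (write output to a WAV file) are espeak command-line options. The sketch assumes the espeak package is installed and does nothing if the binary is not found on PATH.

```python
import shutil
import subprocess


def build_espeak_command(text, voice=None, words_per_minute=None, wav_path=None):
    """Build the argv list for an espeak call (hypothetical helper)."""
    command = ["espeak"]
    if voice is not None:
        command += ["-v", voice]                  # select a voice, e.g. "en"
    if words_per_minute is not None:
        command += ["-s", str(words_per_minute)]  # speaking rate
    if wav_path is not None:
        command += ["-w", wav_path]               # write a WAV file instead of playing
    command.append(text)
    return command


def speak(text, **options):
    """Run espeak if it is installed; return False when it is missing."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text, **options), check=True)
    return True
```

For example, speak("Hello. This is a test.", voice="en") would read the sentence aloud on a system where espeak is installed.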

List of voiced commands applications

This article or section is out of date.

Reason: Replace the list with available software, e.g. voximp (AUR). (Discuss in Talk:Speech recognition)

VEDICS

VEDICS (Voice Enabled Desktop Interaction and Control System) is assistive software that lets the user interact with the OS using voice commands.

Note:

  • Last updated in 2011.
  • Not yet tested.

Site link: http://vedics.sourceforge.net/

Features:

  • Perform common window operations like close, minimize, maximize etc.
  • Invoke default applications like browsers, mail clients etc.
  • Access any element on the desktop just by saying its name.
  • Supports GNOME 3 and GNOME 2.

Perlbox-Voice

Perlbox Voice is a voice-enabled application that brings your desktop under your command.

Note:

  • Last updated in 2005

Site link: http://perlbox.sourceforge.net/pbtk/

Features:

  • Text to speech (thanks to the Festival speech synthesizer).
  • Voice control to open user-specified applications. For example, if you say "Web", the Perlbox-Voice Control will open the browser of your choice.
  • Desktop plugins to control your Linux desktop using only your voice. You can switch virtual screens, cycle through desktops, invoke the run dialog, and quickly lock the screen.
  • Custom commands are fully supported, and you can add commands on the fly.
  • "Pseudo commands" allow you to enter phrases that the speaker should say in response. For example, if you say "Good morning", the computer voice could say "And good morning to you".

List of speech recognition applications

Free Speech Recognition Engines

CMU Sphinx
See http://cmusphinx.sourceforge.net/ and Wikipedia.
Simon
http://sourceforge.net/projects/speech2text/ - Simon is a Qt interface to Julius that will replace the mouse and keyboard with your voice. It works with X11 and Windows.
Speech
Speech is a Chrome App for dictation, using Google's speech recognition engine.
Julius
Julius is a large-vocabulary continuous speech recognition decoder. Its project page is located at http://julius.sourceforge.jp/en_index.php
XVoice
Uses ViaVoice to pass text to X applications. http://xvoice.sourceforge.net/
ViaVoice
See IBM ViaVoice on Wikipedia.
sphinxkeys
http://code.google.com/p/sphinxkeys/ - Lets you trigger keyboard keys and mouse clicks by speaking into your microphone.
VoxForge
http://www.voxforge.org/ - a project that collects speech transcriptions for use in open source speech recognition engines

Proprietary Speech Recognition Engines

Dragon Naturally Speaking in Wine
Dragon Naturally Speaking software by Nuance is a well-functioning and popular implementation of speech dictation. It is developed for Windows, but has been run successfully in a Linux environment using Wine. It can be used independently for dictation into other Wine programs such as Notepad, or it can be paired with Platypus to interface with any native Linux program. Platypus also provides a feature to control your OS using voice commands, similar to the programs described in the List of voiced commands applications section.
Nuance's software is non-free, so you will have to purchase a copy. Note that Dragon allows installation on a set number of machines. Installing or reinstalling in Wine may use up some of these licenses.
Platypus Project: http://thenerdshow.com/platypus.html
Wizzscribe SI
Verbio ASR
DynaSpeak from SRI International
LumenVox Speech Engine
VoxSigma
http://www.vocapia.com and Wikipedia
VoxSigma is a speech-to-text software suite by Vocapia Research. It is well suited for broadcast monitoring, audio visual archive indexing, telephone speech analytics, transcription of business conference calls, and video subtitling.

See also

  • Synthèse vocale en français sous Linux - KubuntuBlog (in French): http://kubuntu.free.fr/blog/index.php/2006/09/24/121-synthese-vocale-en-francais-sous-linux