| .TH POCKETSPHINX 1 "2022-09-27" | |
| .SH NAME | |
| pocketsphinx \- Run speech recognition on audio data | |
| .SH SYNOPSIS | |
| .B pocketsphinx | |
| [ \fIoptions\fR... ] | |
| [ \fBlive\fR | | |
| \fBsingle\fR | | |
| \fBalign\fR | | |
| \fBconfig\fR | | |
| \fBhelp\fR | | |
| \fBsoxflags\fR ] | |
| \fIINPUTS\fR... | |
| .SH DESCRIPTION | |
| .PP | |
| The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel | |
| 16-bit PCM audio from one or more input files (or ‘\f[CR]-\fP’ to read from | |
| standard input), and attempts to recognize speech in it using the | |
| default acoustic and language model. The input files can be raw audio, | |
| WAV, or NIST Sphere files, though some of these may not be recognized | |
| properly. It accepts a large number of options which you probably | |
| don't care about, and a \fIcommand\fP which defaults to | |
| ‘\f[CR]live\fP’. The commands are as follows: | |
| .TP | |
| .B help | |
| Print usage and help text, including a long list of those options you don't care about. | |
| .TP | |
| .B config | |
| Dump configuration as JSON to standard output (can be loaded with the | |
| ‘\f[CR]-config\fP’ option). | |
| .TP | |
| .B live | |
| Detect speech segments in input files, run recognition on them (using | |
| those options you don't care about), and write the results to standard | |
| output in line-delimited JSON. I realize this isn't the prettiest | |
| format, but it sure beats XML. Each line contains a JSON object with | |
| these fields, which have short names to make the lines more readable: | |
| .IP | |
| "b": Start time in seconds, from the beginning of the stream | |
| .IP | |
| "d": Duration in seconds | |
| .IP | |
| "p": Estimated probability of the recognition result, i.e. a number between | |
| 0 and 1 which may be used as a confidence score | |
| .IP | |
| "t": Full text of recognition result | |
| .IP | |
| "w": List of segments (usually words), each of which in turn contains the | |
| ‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for | |
| start, duration, probability, and the text of the word. In the future we | |
| may also support hierarchical results, in which case ‘\f[CR]w\fP’ could | |
| also be present within a segment. | |
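| .IP | |
| As an illustrative sketch (not part of ‘\f[CR]pocketsphinx\fP’ itself), the | |
| line-delimited output can be filtered with \fBjq\fP. The function name | |
| ‘\f[CR]confident_results\fP’, the threshold value, and the file name | |
| ‘\f[CR]audio.wav\fP’ are hypothetical, and \fBjq\fP is assumed to be installed: | |
| .EX | |
```shell
# Hypothetical helper: read line-delimited JSON results on standard input
# and print start time, duration, and text (tab-separated) for each result
# whose "p" field is at least the threshold given as the first argument.
confident_results() {
    jq -r --argjson min "$1" 'select(.p >= $min) | [.b, .d, .t] | @tsv'
}

# Possible usage, assuming an input file named audio.wav:
#   pocketsphinx live audio.wav | confident_results 0.5
```
| .EE | |
| .IP | |
| \fBjq\fP treats each input line as an independent JSON value, so no | |
| surrounding array is needed. | |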
| .TP | |
| .B single | |
| Recognize the input as a single utterance, and write a JSON object in the same format described above. | |
| .TP | |
| .B align | |
| Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word | |
| sequence, and write a JSON object in the same format described above. | |
| The first positional argument is the input, and all subsequent ones | |
| are concatenated to make the text, to avoid surprises if you forget to | |
| quote it. You are responsible for normalizing the text to remove | |
| punctuation, uppercase, centipedes, etc. For example: | |
| .EX | |
| pocketsphinx align goforward.wav "go forward ten meters" | |
| .EE | |
| By default, only word-level alignment is done. To get phone | |
| alignments, pass ‘\f[CR]-phone_align yes\fP’ in the flags, e.g.: | |
| .EX | |
| pocketsphinx -phone_align yes align audio.wav $text | |
| .EE | |
| This produces output that is not particularly readable, but you can use | |
| .B jq | |
| (https://stedolan.github.io/jq/) to clean it up. For example, | |
| you can get just the word names and start times like this: | |
| .EX | |
| pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]' | |
| .EE | |
| Or you could get the phone names and durations like this: | |
| .EX | |
| pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]' | |
| .EE | |
| There are many, many other possibilities, of course. | |
| .TP | |
| .B soxflags | |
| Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate | |
| input format. Note that because the ‘\f[CR]sox\fP’ command line is | |
| slightly quirky, these must always come \fIafter\fP the filename or | |
| ‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the | |
| microphone). You can run live recognition like this: | |
| .EX | |
| sox -d $(pocketsphinx soxflags) | pocketsphinx - | |
| .EE | |
| or decode from a file named "audio.mp3" like this: | |
| .EX | |
| sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx - | |
| .EE | |
| .PP | |
| By default only errors are printed to standard error, but if you want | |
| more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial results | |
| are not printed; maybe they will be in the future, but don't hold your | |
| breath. Force-alignment is available through the ‘\f[CR]align\fP’ | |
| command described above. | |
| .SH OPTIONS | |
| .\" ### ARGUMENTS ### | |
| .SH AUTHOR | |
| Written by numerous people at CMU from 1994 onwards. This manual page | |
| by David Huggins-Daines <dhdaines@gmail.com>. | |
| .SH COPYRIGHT | |
| Copyright \(co 1994-2016 Carnegie Mellon University. See the file | |
| \fILICENSE\fR included with this package for more information. | |
| .br | |
| .SH "SEE ALSO" | |
| .BR pocketsphinx_batch (1), | |
| .BR sphinx_fe (1). | |
| .br | |