| .TH POCKETSPHINX 1 "2022-09-27" |
| .SH NAME |
| pocketsphinx \- Run speech recognition on audio data |
| .SH SYNOPSIS |
| .B pocketsphinx |
| [ \fIoptions\fR... ] |
| [ \fBlive\fR | |
| \fBsingle\fR | |
| \fBalign\fR | |
| \fBconfig\fR | |
| \fBhelp\fR | |
| \fBsoxflags\fR ] |
| \fIINPUTS\fR... |
| .SH DESCRIPTION |
| .PP |
| The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel |
| 16-bit PCM audio from one or more input files (or ‘\f[CR]-\fP’ to read from |
| standard input), and attempts to recognize speech in it using the |
| default acoustic and language model. The input files can be raw audio, |
| WAV, or NIST Sphere files, though some of these may not be recognized |
| properly. It accepts a large number of options which you probably |
| don't care about, and a \fIcommand\fP which defaults to |
| ‘\f[CR]live\fP’. The commands are as follows: |
| .TP |
| .B help |
| Print a long list of those options you don't care about. |
| .TP |
| .B config |
| Dump configuration as JSON to standard output (can be loaded with the |
| ‘\f[CR]-config\fP’ option). |
| .TP |
| .B live |
| Detect speech segments in input files, run recognition on them (using |
| those options you don't care about), and write the results to standard |
| output in line-delimited JSON. I realize this isn't the prettiest |
| format, but it sure beats XML. Each line contains a JSON object with |
| these fields, which have short names to make the lines more readable: |
| .IP |
| "b": Start time in seconds, from the beginning of the stream |
| .IP |
| "d": Duration in seconds |
| .IP |
| "p": Estimated probability of the recognition result, i.e. a number between |
| 0 and 1 which may be used as a confidence score |
| .IP |
| "t": Full text of recognition result |
| .IP |
| "w": List of segments (usually words), each of which in turn contains the |
| ‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for |
| start, duration, probability, and the text of the word. In the future we |
| may also support hierarchical results, in which case a nested ‘\f[CR]w\fP’ could |
| also be present in each segment. |
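| .IP |
| As a rough illustration of consuming this format, the following Python |
| sketch reads line-delimited JSON and keeps only the segments whose |
| ‘\f[CR]p\fP’ is at least a threshold. The function name |
| ‘\f[CR]confident_segments\fP’, the 0.5 threshold, and the sample data are |
| assumptions made for illustration, not real recognizer output: |
| .IP |
| .EX |
```python
import json

def confident_segments(lines, min_p=0.5):
    """Yield (start, end, text) for segments with probability >= min_p."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between results
        result = json.loads(line)
        for seg in result.get("w", []):
            if seg["p"] >= min_p:
                # "b" is the start time and "d" the duration, so end = b + d
                yield seg["b"], seg["b"] + seg["d"], seg["t"]

# Invented sample in the documented format (not real recognizer output).
sample = [json.dumps({
    "b": 0.0, "d": 1.0, "p": 0.9, "t": "go forward",
    "w": [
        {"b": 0.0, "d": 0.5, "p": 0.95, "t": "go"},
        {"b": 0.5, "d": 0.5, "p": 0.25, "t": "forward"},
    ],
})]
for start, end, text in confident_segments(sample):
    print(f"{start:.2f} {end:.2f} {text}")
```
| .EE |
| .IP |
| In practice the lines would come from the standard output of |
| ‘\f[CR]pocketsphinx\fP’, for instance by iterating over standard input |
| at the end of a pipeline. |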
| .TP |
| .B single |
| Recognize the input as a single utterance, and write a JSON object in the same format described above. |
| .TP |
| .B align |
|
|
| Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word |
| sequence, and write a JSON object in the same format described above. |
| The first positional argument is the input, and all subsequent ones |
| are concatenated to make the text, to avoid surprises if you forget to |
| quote it. You are responsible for normalizing the text to remove |
| punctuation, uppercase, centipedes, etc. For example: |
|
|
| .EX |
| pocketsphinx align goforward.wav "go forward ten meters" |
| .EE |
|
|
| By default, only word-level alignment is done. To get phone |
| alignments, pass ‘\f[CR]-phone_align yes\fP’ in the flags, e.g.: |
|
|
| .EX |
| pocketsphinx -phone_align yes align audio.wav $text |
| .EE |
|
|
| This produces output that is not particularly readable, but you can use |
| .B jq |
| (https://stedolan.github.io/jq/) to clean it up. For example, |
| you can get just the word names and start times like this: |
|
|
| .EX |
| pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]' |
| .EE |
|
|
| Or you could get the phone names and durations like this: |
|
|
| .EX |
| pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]' |
| .EE |
|
|
| There are many, many other possibilities, of course. |
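| .IP |
| If ‘\f[CR]jq\fP’ is not at hand, the same nested structure can be walked |
| in a few lines of Python. This is only a sketch: the function name |
| ‘\f[CR]phone_durations\fP’ is hypothetical, and the sample times and |
| phone labels are invented rather than taken from real output: |
| .IP |
| .EX |
```python
def phone_durations(result):
    """Return (word, phone, duration) triples from one alignment result."""
    triples = []
    for word in result.get("w", []):
        # The nested "w" list is only expected with -phone_align yes.
        for phone in word.get("w", []):
            triples.append((word["t"], phone["t"], phone["d"]))
    return triples

# Invented sample with made-up times and phone labels, not real output.
sample = {
    "b": 0.0, "d": 0.4, "p": 1.0, "t": "go",
    "w": [{
        "b": 0.0, "d": 0.4, "p": 1.0, "t": "go",
        "w": [
            {"b": 0.0, "d": 0.1, "p": 1.0, "t": "G"},
            {"b": 0.1, "d": 0.3, "p": 1.0, "t": "OW"},
        ],
    }],
}
for word, phone, dur in phone_durations(sample):
    print(word, phone, dur)
```
| .EE |
| .IP |
| Words without a nested ‘\f[CR]w\fP’ list (the default word-level |
| alignment) simply contribute no triples. |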
| .TP |
| .B soxflags |
| Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate |
| input format. Note that because the ‘\f[CR]sox\fP’ command-line is |
| slightly quirky these must always come \fIafter\fP the filename or |
| ‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the |
| microphone). You can run live recognition like this: |
|
|
| .EX |
| sox -d $(pocketsphinx soxflags) | pocketsphinx - |
| .EE |
|
|
| or decode from a file named "audio.mp3" like this: |
|
|
| .EX |
| sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx - |
| .EE |
| .PP |
| By default only errors are printed to standard error, but if you want |
| more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial |
| results are not printed; maybe they will be in the future, but don't |
| hold your breath. Forced alignment is available through the |
| \fBalign\fP command described above. |
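| .PP |
| Because the program expects single-channel 16-bit PCM input, it can be |
| useful to check a WAV file before decoding it. The following Python |
| sketch does so with the standard ‘\f[CR]wave\fP’ module. The helper |
| name ‘\f[CR]is_mono_16bit\fP’ is hypothetical, and the check covers |
| only WAV files, not raw audio or NIST Sphere: |
| .PP |
| .EX |
```python
import wave

def is_mono_16bit(path):
    """Return True if the WAV file at path is single-channel 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() is in bytes, so 2 means 16-bit samples
        return wav.getnchannels() == 1 and wav.getsampwidth() == 2
```
| .EE |
| .PP |
| A file that fails this check can be converted with ‘\f[CR]sox\fP’ and |
| the ‘\f[CR]soxflags\fP’ command as shown above. |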
| .SH OPTIONS |
| .TP |
| .B \-agc |
| Automatic gain control for c0 ('max', 'emax', 'noise', or 'none') |
| .TP |
| .B \-agcthresh |
| Initial threshold for automatic gain control |
| .TP |
| .B \-allphone |
| Perform phoneme decoding with phonetic lm (given here) |
| .TP |
| .B \-allphone_ci |
| Perform phoneme decoding with phonetic lm and context-independent units only |
| .TP |
| .B \-alpha |
| Preemphasis parameter |
| .TP |
| .B \-ascale |
| Inverse of acoustic model scale for confidence score calculation |
| .TP |
| .B \-aw |
| Inverse weight applied to acoustic scores. |
| .TP |
| .B \-backtrace |
| Print results and backtraces to log. |
| .TP |
| .B \-beam |
| Beam width applied to every frame in Viterbi search (smaller values mean wider beam) |
| .TP |
| .B \-bestpath |
| Run bestpath (Dijkstra) search over word lattice (3rd pass) |
| .TP |
| .B \-bestpathlw |
| Language model probability weight for bestpath search |
| .TP |
| .B \-ceplen |
| Number of components in the input feature vector |
| .TP |
| .B \-cmn |
| Cepstral mean normalization scheme ('live', 'batch', or 'none') |
| .TP |
| .B \-cmninit |
| Initial values (comma-separated) for cepstral mean when 'live' is used |
| .TP |
| .B \-compallsen |
| Compute all senone scores in every frame (can be faster when there are many senones) |
| .TP |
| .B \-dict |
| Main pronunciation dictionary (lexicon) input file |
| .TP |
| .B \-dictcase |
| Dictionary is case sensitive (NOTE: case insensitivity applies to ASCII characters only) |
| .TP |
| .B \-dither |
| Add 1/2-bit noise |
| .TP |
| .B \-doublebw |
| Use double bandwidth filters (same center freq) |
| .TP |
| .B \-ds |
| Frame GMM computation downsampling ratio |
| .TP |
| .B \-fdict |
| Noise word pronunciation dictionary input file |
| .TP |
| .B \-feat |
| Feature stream type, depends on the acoustic model |
| .TP |
| .B \-featparams |
| File containing feature extraction parameters. |
| .TP |
| .B \-fillprob |
| Filler word transition probability |
| .TP |
| .B \-frate |
| Frame rate |
| .TP |
| .B \-fsg |
| Sphinx format finite state grammar file |
| .TP |
| .B \-fsgusealtpron |
| Add alternate pronunciations to FSG |
| .TP |
| .B \-fsgusefiller |
| Insert filler words at each state. |
| .TP |
| .B \-fwdflat |
| Run forward flat-lexicon search over word lattice (2nd pass) |
| .TP |
| .B \-fwdflatbeam |
| Beam width applied to every frame in second-pass flat search |
| .TP |
| .B \-fwdflatefwid |
| Minimum number of end frames for a word to be searched in fwdflat search |
| .TP |
| .B \-fwdflatlw |
| Language model probability weight for flat lexicon (2nd pass) decoding |
| .TP |
| .B \-fwdflatsfwin |
| Window of frames in lattice to search for successor words in fwdflat search |
| .TP |
| .B \-fwdflatwbeam |
| Beam width applied to word exits in second-pass flat search |
| .TP |
| .B \-fwdtree |
| Run forward lexicon-tree search (1st pass) |
| .TP |
| .B \-hmm |
| Directory containing acoustic model files. |
| .TP |
| .B \-input_endian |
| Endianness of input data ('big' or 'little'); ignored for NIST Sphere and MS WAV files |
| .TP |
| .B \-jsgf |
| JSGF grammar file |
| .TP |
| .B \-keyphrase |
| Keyphrase to spot |
| .TP |
| .B \-kws |
| File with keyphrases to spot, one per line |
| .TP |
| .B \-kws_delay |
| Delay to wait for best detection score |
| .TP |
| .B \-kws_plp |
| Phone loop probability for keyphrase spotting |
| .TP |
| .B \-kws_threshold |
| Threshold for p(hyp)/p(alternatives) ratio |
| .TP |
| .B \-latsize |
| Initial backpointer table size |
| .TP |
| .B \-lda |
| File containing transformation matrix to be applied to features (single-stream features only) |
| .TP |
| .B \-ldadim |
| Dimensionality of output of feature transformation (0 to use entire matrix) |
| .TP |
| .B \-lifter |
| Length of sin-curve for liftering, or 0 for no liftering. |
| .TP |
| .B \-lm |
| Word trigram language model input file |
| .TP |
| .B \-lmctl |
| Specify a set of language models |
| .TP |
| .B \-lmname |
| Which language model in \fB\-lmctl\fR to use by default |
| .TP |
| .B \-logbase |
| Base in which all log-likelihoods are calculated |
| .TP |
| .B \-logfn |
| File to write log messages in |
| .TP |
| .B \-loglevel |
| Minimum level of log messages (DEBUG, INFO, WARN, ERROR) |
| .TP |
| .B \-logspec |
| Write out logspectral files instead of cepstra |
| .TP |
| .B \-lowerf |
| Lower edge of filters |
| .TP |
| .B \-lpbeam |
| Beam width applied to last phone in words |
| .TP |
| .B \-lponlybeam |
| Beam width applied to last phone in single-phone words |
| .TP |
| .B \-lw |
| Language model probability weight |
| .TP |
| .B \-maxhmmpf |
| Maximum number of active HMMs to maintain at each frame (or \fB\-1\fR for no pruning) |
| .TP |
| .B \-maxwpf |
| Maximum number of distinct word exits at each frame (or \fB\-1\fR for no pruning) |
| .TP |
| .B \-mdef |
| Model definition input file |
| .TP |
| .B \-mean |
| Mixture gaussian means input file |
| .TP |
| .B \-mfclogdir |
| Directory to log feature files to |
| .TP |
| .B \-min_endfr |
| Nodes ignored in lattice construction if they persist for fewer than N frames |
| .TP |
| .B \-mixw |
| Senone mixture weights input file (uncompressed) |
| .TP |
| .B \-mixwfloor |
| Senone mixture weights floor (applied to data from \fB\-mixw\fR file) |
| .TP |
| .B \-mllr |
| MLLR transformation to apply to means and variances |
| .TP |
| .B \-mmap |
| Use memory-mapped I/O (if possible) for model files |
| .TP |
| .B \-ncep |
| Number of cepstral coefficients |
| .TP |
| .B \-nfft |
| Size of FFT, or 0 to set automatically (recommended) |
| .TP |
| .B \-nfilt |
| Number of filter banks |
| .TP |
| .B \-nwpen |
| New word transition penalty |
| .TP |
| .B \-pbeam |
| Beam width applied to phone transitions |
| .TP |
| .B \-pip |
| Phone insertion penalty |
| .TP |
| .B \-pl_beam |
| Beam width applied to phone loop search for lookahead |
| .TP |
| .B \-pl_pbeam |
| Beam width applied to phone loop transitions for lookahead |
| .TP |
| .B \-pl_pip |
| Phone insertion penalty for phone loop |
| .TP |
| .B \-pl_weight |
| Weight for phoneme lookahead penalties |
| .TP |
| .B \-pl_window |
| Phoneme lookahead window size, in frames |
| .TP |
| .B \-rawlogdir |
| Directory to log raw audio files to |
| .TP |
| .B \-remove_dc |
| Remove DC offset from each frame |
| .TP |
| .B \-remove_noise |
| Remove noise using spectral subtraction |
| .TP |
| .B \-round_filters |
| Round mel filter frequencies to DFT points |
| .TP |
| .B \-samprate |
| Sampling rate |
| .TP |
| .B \-seed |
| Seed for random number generator; if less than zero, pick our own |
| .TP |
| .B \-sendump |
| Senone dump (compressed mixture weights) input file |
| .TP |
| .B \-senlogdir |
| Directory to log senone score files to |
| .TP |
| .B \-senmgau |
| Senone to codebook mapping input file (usually not needed) |
| .TP |
| .B \-silprob |
| Silence word transition probability |
| .TP |
| .B \-smoothspec |
| Write out cepstral-smoothed logspectral files |
| .TP |
| .B \-svspec |
| Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38) |
| .TP |
| .B \-tmat |
| HMM state transition matrix input file |
| .TP |
| .B \-tmatfloor |
| HMM state transition probability floor (applied to \fB\-tmat\fR file) |
| .TP |
| .B \-topn |
| Maximum number of top Gaussians to use in scoring. |
| .TP |
| .B \-topn_beam |
| Beam width used to determine top-N Gaussians (or a list, per-feature) |
| .TP |
| .B \-toprule |
| Start rule for JSGF (first public rule is default) |
| .TP |
| .B \-transform |
| Which type of transform to use to calculate cepstra (legacy, dct, or htk) |
| .TP |
| .B \-unit_area |
| Normalize mel filters to unit area |
| .TP |
| .B \-upperf |
| Upper edge of filters |
| .TP |
| .B \-uw |
| Unigram weight |
| .TP |
| .B \-var |
| Mixture gaussian variances input file |
| .TP |
| .B \-varfloor |
| Mixture gaussian variance floor (applied to data from \fB\-var\fR file) |
| .TP |
| .B \-varnorm |
| Variance normalize each utterance (only if \fB\-cmn\fR is 'batch') |
| .TP |
| .B \-verbose |
| Show input filenames |
| .TP |
| .B \-warp_params |
| Parameters defining the warping function |
| .TP |
| .B \-warp_type |
| Warping function type (or shape) |
| .TP |
| .B \-wbeam |
| Beam width applied to word exits |
| .TP |
| .B \-wip |
| Word insertion penalty |
| .TP |
| .B \-wlen |
| Hamming window length |
| .SH AUTHOR |
| Written by numerous people at CMU from 1994 onwards. This manual page |
| by David Huggins-Daines <dhdaines@gmail.com> |
| .SH COPYRIGHT |
| Copyright \(co 1994-2016 Carnegie Mellon University. See the file |
| \fILICENSE\fR included with this package for more information. |
| .br |
| .SH "SEE ALSO" |
| .BR pocketsphinx_batch (1), |
| .BR sphinx_fe (1). |
| .br |
|
|