| .TH POCKETSPHINX 1 "2022-09-27" | |
| .SH NAME | |
| pocketsphinx \- Run speech recognition on audio data | |
| .SH SYNOPSIS | |
| .B pocketsphinx | |
| [ \fIoptions\fR... ] | |
| [ \fBlive\fR | | |
| \fBsingle\fR | | |
| \fBhelp\fR | | |
| \fBsoxflags\fR ] | |
| \fIINPUTS\fR... | |
| .SH DESCRIPTION | |
| .PP | |
| The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel | |
| 16-bit PCM audio one or more input files (or ‘\f[CR]-\fP’ to read from | |
| standard input), and attemps to recognize speech in it using the | |
| default acoustic and language model. The input files can be raw audio, | |
| WAV, or NIST Sphere files, though some of these may not be recognized | |
| properly. It accepts a large number of options which you probably | |
| don't care about, and a \fIcommand\fP which defaults to | |
| ‘\f[CR]live\fP’. The commands are as follows: | |
| .TP | |
| .B help | |
| Print a long list of those options you don't care about. | |
| .TP | |
| .B config | |
| Dump configuration as JSON to standard output (can be loaded with the | |
| ‘\f[CR]-config\fP’ option). | |
| .TP | |
| .B live | |
| Detect speech segments in input files, run recognition on them (using | |
| those options you don't care about), and write the results to standard | |
| output in line-delimited JSON. I realize this isn't the prettiest | |
| format, but it sure beats XML. Each line contains a JSON object with | |
| these fields, which have short names to make the lines more readable: | |
| .IP | |
| "b": Start time in seconds, from the beginning of the stream | |
| .IP | |
| "d": Duration in seconds | |
| .IP | |
| "p": Estimated probability of the recognition result, i.e. a number between | |
| 0 and 1 which may be used as a confidence score | |
| .IP | |
| "t": Full text of recognition result | |
| .IP | |
| "w": List of segments (usually words), each of which in turn contains the | |
| ‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for | |
| start, end, probability, and the text of the word. In the future we | |
| may also support hierarchical results in which case ‘\f[CR]w\fP’ could | |
| be present. | |
| .TP | |
| .B single | |
| Recognize the input as a single utterance, and write a JSON object in the same format described above. | |
| .TP | |
| .B align | |
| Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word | |
| sequence, and write a JSON object in the same format described above. | |
| The first positional argument is the input, and all subsequent ones | |
| are concatenated to make the text, to avoid surprises if you forget to | |
| quote it. You are responsible for normalizing the text to remove | |
| punctuation, uppercase, centipedes, etc. For example: | |
| .EX | |
| pocketsphinx align goforward.wav "go forward ten meters" | |
| .EE | |
| By default, only word-level alignment is done. To get phone | |
| alignments, pass `-phone_align yes` in the flags, e.g.: | |
| .EX | |
| pocketsphinx -phone_align yes align audio.wav $text | |
| .EE | |
| This will make not particularly readable output, but you can use | |
| .B jq | |
| (https://stedolan.github.io/jq/) to clean it up. For example, | |
| you can get just the word names and start times like this: | |
| .EX | |
| pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]' | |
| .EE | |
| Or you could get the phone names and durations like this: | |
| .EX | |
| pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]' | |
| .EE | |
| There are many, many other possibilities, of course. | |
| .TP | |
| .B help | |
| Print a usage and help text with a list of possible arguments. | |
| .TP | |
| .B soxflags | |
| Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate | |
| input format. Note that because the ‘\f[CR]sox\fP’ command-line is | |
| slightly quirky these must always come \fIafter\fP the filename or | |
| ‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the | |
| microphone). You can run live recognition like this: | |
| .EX | |
| sox -d $(pocketsphinx soxflags) | pocketsphinx - | |
| .EE | |
| or decode from a file named "audio.mp3" like this: | |
| .EX | |
| sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx - | |
| .EE | |
| .PP | |
| By default only errors are printed to standard error, but if you want more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial results are not printed, maybe they will be in the future, but don't hold your breath. Force-alignment is likely to be supported soon, however. | |
| .SH OPTIONS | |
| .TP | |
| .B \-agc | |
| Automatic gain control for c0 ('max', 'emax', 'noise', or 'none') | |
| .TP | |
| .B \-agcthresh | |
| Initial threshold for automatic gain control | |
| .TP | |
| .B \-allphone | |
| phoneme decoding with phonetic lm (given here) | |
| .TP | |
| .B \-allphone_ci | |
| Perform phoneme decoding with phonetic lm and context-independent units only | |
| .TP | |
| .B \-alpha | |
| Preemphasis parameter | |
| .TP | |
| .B \-ascale | |
| Inverse of acoustic model scale for confidence score calculation | |
| .TP | |
| .B \-aw | |
| Inverse weight applied to acoustic scores. | |
| .TP | |
| .B \-backtrace | |
| Print results and backtraces to log. | |
| .TP | |
| .B \-beam | |
| Beam width applied to every frame in Viterbi search (smaller values mean wider beam) | |
| .TP | |
| .B \-bestpath | |
| Run bestpath (Dijkstra) search over word lattice (3rd pass) | |
| .TP | |
| .B \-bestpathlw | |
| Language model probability weight for bestpath search | |
| .TP | |
| .B \-ceplen | |
| Number of components in the input feature vector | |
| .TP | |
| .B \-cmn | |
| Cepstral mean normalization scheme ('live', 'batch', or 'none') | |
| .TP | |
| .B \-cmninit | |
| Initial values (comma-separated) for cepstral mean when 'live' is used | |
| .TP | |
| .B \-compallsen | |
| Compute all senone scores in every frame (can be faster when there are many senones) | |
| .TP | |
| .B \-dict | |
| pronunciation dictionary (lexicon) input file | |
| .TP | |
| .B \-dictcase | |
| Dictionary is case sensitive (NOTE: case insensitivity applies to ASCII characters only) | |
| .TP | |
| .B \-dither | |
| Add 1/2-bit noise | |
| .TP | |
| .B \-doublebw | |
| Use double bandwidth filters (same center freq) | |
| .TP | |
| .B \-ds | |
| Frame GMM computation downsampling ratio | |
| .TP | |
| .B \-fdict | |
| word pronunciation dictionary input file | |
| .TP | |
| .B \-feat | |
| Feature stream type, depends on the acoustic model | |
| .TP | |
| .B \-featparams | |
| containing feature extraction parameters. | |
| .TP | |
| .B \-fillprob | |
| Filler word transition probability | |
| .TP | |
| .B \-frate | |
| Frame rate | |
| .TP | |
| .B \-fsg | |
| format finite state grammar file | |
| .TP | |
| .B \-fsgusealtpron | |
| Add alternate pronunciations to FSG | |
| .TP | |
| .B \-fsgusefiller | |
| Insert filler words at each state. | |
| .TP | |
| .B \-fwdflat | |
| Run forward flat-lexicon search over word lattice (2nd pass) | |
| .TP | |
| .B \-fwdflatbeam | |
| Beam width applied to every frame in second-pass flat search | |
| .TP | |
| .B \-fwdflatefwid | |
| Minimum number of end frames for a word to be searched in fwdflat search | |
| .TP | |
| .B \-fwdflatlw | |
| Language model probability weight for flat lexicon (2nd pass) decoding | |
| .TP | |
| .B \-fwdflatsfwin | |
| Window of frames in lattice to search for successor words in fwdflat search | |
| .TP | |
| .B \-fwdflatwbeam | |
| Beam width applied to word exits in second-pass flat search | |
| .TP | |
| .B \-fwdtree | |
| Run forward lexicon-tree search (1st pass) | |
| .TP | |
| .B \-hmm | |
| containing acoustic model files. | |
| .TP | |
| .B \-input_endian | |
| Endianness of input data, big or little, ignored if NIST or MS Wav | |
| .TP | |
| .B \-jsgf | |
| grammar file | |
| .TP | |
| .B \-keyphrase | |
| to spot | |
| .TP | |
| .B \-kws | |
| file with keyphrases to spot, one per line | |
| .TP | |
| .B \-kws_delay | |
| Delay to wait for best detection score | |
| .TP | |
| .B \-kws_plp | |
| Phone loop probability for keyphrase spotting | |
| .TP | |
| .B \-kws_threshold | |
| Threshold for p(hyp)/p(alternatives) ratio | |
| .TP | |
| .B \-latsize | |
| Initial backpointer table size | |
| .TP | |
| .B \-lda | |
| containing transformation matrix to be applied to features (single-stream features only) | |
| .TP | |
| .B \-ldadim | |
| Dimensionality of output of feature transformation (0 to use entire matrix) | |
| .TP | |
| .B \-lifter | |
| Length of sin-curve for liftering, or 0 for no liftering. | |
| .TP | |
| .B \-lm | |
| trigram language model input file | |
| .TP | |
| .B \-lmctl | |
| a set of language model | |
| .TP | |
| .B \-lmname | |
| language model in \fB\-lmctl\fR to use by default | |
| .TP | |
| .B \-logbase | |
| Base in which all log-likelihoods calculated | |
| .TP | |
| .B \-logfn | |
| to write log messages in | |
| .TP | |
| .B \-loglevel | |
| Minimum level of log messages (DEBUG, INFO, WARN, ERROR) | |
| .TP | |
| .B \-logspec | |
| Write out logspectral files instead of cepstra | |
| .TP | |
| .B \-lowerf | |
| Lower edge of filters | |
| .TP | |
| .B \-lpbeam | |
| Beam width applied to last phone in words | |
| .TP | |
| .B \-lponlybeam | |
| Beam width applied to last phone in single-phone words | |
| .TP | |
| .B \-lw | |
| Language model probability weight | |
| .TP | |
| .B \-maxhmmpf | |
| Maximum number of active HMMs to maintain at each frame (or \fB\-1\fR for no pruning) | |
| .TP | |
| .B \-maxwpf | |
| Maximum number of distinct word exits at each frame (or \fB\-1\fR for no pruning) | |
| .TP | |
| .B \-mdef | |
| definition input file | |
| .TP | |
| .B \-mean | |
| gaussian means input file | |
| .TP | |
| .B \-mfclogdir | |
| to log feature files to | |
| .TP | |
| .B \-min_endfr | |
| Nodes ignored in lattice construction if they persist for fewer than N frames | |
| .TP | |
| .B \-mixw | |
| mixture weights input file (uncompressed) | |
| .TP | |
| .B \-mixwfloor | |
| Senone mixture weights floor (applied to data from \fB\-mixw\fR file) | |
| .TP | |
| .B \-mllr | |
| transformation to apply to means and variances | |
| .TP | |
| .B \-mmap | |
| Use memory-mapped I/O (if possible) for model files | |
| .TP | |
| .B \-ncep | |
| Number of cep coefficients | |
| .TP | |
| .B \-nfft | |
| Size of FFT, or 0 to set automatically (recommended) | |
| .TP | |
| .B \-nfilt | |
| Number of filter banks | |
| .TP | |
| .B \-nwpen | |
| New word transition penalty | |
| .TP | |
| .B \-pbeam | |
| Beam width applied to phone transitions | |
| .TP | |
| .B \-pip | |
| Phone insertion penalty | |
| .TP | |
| .B \-pl_beam | |
| Beam width applied to phone loop search for lookahead | |
| .TP | |
| .B \-pl_pbeam | |
| Beam width applied to phone loop transitions for lookahead | |
| .TP | |
| .B \-pl_pip | |
| Phone insertion penalty for phone loop | |
| .TP | |
| .B \-pl_weight | |
| Weight for phoneme lookahead penalties | |
| .TP | |
| .B \-pl_window | |
| Phoneme lookahead window size, in frames | |
| .TP | |
| .B \-rawlogdir | |
| to log raw audio files to | |
| .TP | |
| .B \-remove_dc | |
| Remove DC offset from each frame | |
| .TP | |
| .B \-remove_noise | |
| Remove noise using spectral subtraction | |
| .TP | |
| .B \-round_filters | |
| Round mel filter frequencies to DFT points | |
| .TP | |
| .B \-samprate | |
| Sampling rate | |
| .TP | |
| .B \-seed | |
| Seed for random number generator; if less than zero, pick our own | |
| .TP | |
| .B \-sendump | |
| dump (compressed mixture weights) input file | |
| .TP | |
| .B \-senlogdir | |
| to log senone score files to | |
| .TP | |
| .B \-senmgau | |
| to codebook mapping input file (usually not needed) | |
| .TP | |
| .B \-silprob | |
| Silence word transition probability | |
| .TP | |
| .B \-smoothspec | |
| Write out cepstral-smoothed logspectral files | |
| .TP | |
| .B \-svspec | |
| specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38) | |
| .TP | |
| .B \-tmat | |
| state transition matrix input file | |
| .TP | |
| .B \-tmatfloor | |
| HMM state transition probability floor (applied to \fB\-tmat\fR file) | |
| .TP | |
| .B \-topn | |
| Maximum number of top Gaussians to use in scoring. | |
| .TP | |
| .B \-topn_beam | |
| Beam width used to determine top-N Gaussians (or a list, per-feature) | |
| .TP | |
| .B \-toprule | |
| rule for JSGF (first public rule is default) | |
| .TP | |
| .B \-transform | |
| Which type of transform to use to calculate cepstra (legacy, dct, or htk) | |
| .TP | |
| .B \-unit_area | |
| Normalize mel filters to unit area | |
| .TP | |
| .B \-upperf | |
| Upper edge of filters | |
| .TP | |
| .B \-uw | |
| Unigram weight | |
| .TP | |
| .B \-var | |
| gaussian variances input file | |
| .TP | |
| .B \-varfloor | |
| Mixture gaussian variance floor (applied to data from \fB\-var\fR file) | |
| .TP | |
| .B \-varnorm | |
| Variance normalize each utterance (only if CMN == current) | |
| .TP | |
| .B \-verbose | |
| Show input filenames | |
| .TP | |
| .B \-warp_params | |
| defining the warping function | |
| .TP | |
| .B \-warp_type | |
| Warping function type (or shape) | |
| .TP | |
| .B \-wbeam | |
| Beam width applied to word exits | |
| .TP | |
| .B \-wip | |
| Word insertion penalty | |
| .TP | |
| .B \-wlen | |
| Hamming window length | |
| .SH AUTHOR | |
| Written by numerous people at CMU from 1994 onwards. This manual page | |
| by David Huggins-Daines <dhdaines@gmail.com> | |
| .SH COPYRIGHT | |
| Copyright \(co 1994-2016 Carnegie Mellon University. See the file | |
| \fILICENSE\fR included with this package for more information. | |
| .br | |
| .SH "SEE ALSO" | |
| .BR pocketsphinx_batch (1), | |
| .BR sphinx_fe (1). | |
| .br | |