.TH POCKETSPHINX 1 "2022-09-27"
.SH NAME
pocketsphinx \- Run speech recognition on audio data
.SH SYNOPSIS
.B pocketsphinx
[ \fIoptions\fR... ]
[ \fBconfig\fR | \fBlive\fR | \fBsingle\fR | \fBalign\fR | \fBhelp\fR | \fBsoxflags\fR ]
\fIINPUTS\fR...
.SH DESCRIPTION
.PP
The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
16-bit PCM audio from one or more input files (or ‘\f[CR]-\fP’ to read
from standard input), and attempts to recognize speech in it using the
default acoustic and language model. The input files can be raw audio,
WAV, or NIST Sphere files, though some of these may not be recognized
properly. It accepts a large number of options which you probably
don't care about, and a \fIcommand\fP which defaults to
‘\f[CR]live\fP’. The commands are as follows:
.TP
.B help
Print a long list of those options you don't care about.
.TP
.B config
Dump configuration as JSON to standard output (can be loaded with the
‘\f[CR]-config\fP’ option).
.TP
.B live
Detect speech segments in input files, run recognition on them (using
those options you don't care about), and write the results to standard
output in line-delimited JSON. I realize this isn't the prettiest
format, but it sure beats XML. Each line contains a JSON object with
these fields, which have short names to make the lines more readable:
.IP "b":
Start time in seconds, from the beginning of the stream
.IP "d":
Duration in seconds
.IP "p":
Estimated probability of the recognition result, i.e. a number between
0 and 1 which may be used as a confidence score
.IP "t":
Full text of recognition result
.IP "w":
List of segments (usually words), each of which in turn contains the
‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
start, duration, probability, and the text of the word. In the future
we may also support hierarchical results, in which case
‘\f[CR]w\fP’ could also be present inside each segment.
.TP
.B single
Recognize the input as a single utterance, and write a JSON object in
the same format described above.
.TP
.B align
Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a
word sequence, and write a JSON object in the same format described
above. The first positional argument is the input, and all subsequent
ones are concatenated to make the text, to avoid surprises if you
forget to quote it. You are responsible for normalizing the text to
remove punctuation, uppercase, centipedes, etc. For example:
.EX
pocketsphinx align goforward.wav "go forward ten meters"
.EE
By default, only word-level alignment is done. To get phone
alignments, pass ‘\f[CR]-phone_align yes\fP’ in the flags, e.g.:
.EX
pocketsphinx -phone_align yes align audio.wav $text
.EE
This produces output that is not particularly readable, but you can use
.B jq
(https://stedolan.github.io/jq/) to clean it up. For example, you can
get just the word names and start times like this:
.EX
pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'
.EE
Or you could get the phone names and durations like this:
.EX
pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'
.EE
There are many, many other possibilities, of course.
.TP
.B soxflags
Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate
input format. Note that because the ‘\f[CR]sox\fP’ command-line is
slightly quirky, these must always come \fIafter\fP the filename or
‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
microphone). You can run live recognition like this:
.EX
sox -d $(pocketsphinx soxflags) | pocketsphinx -
.EE
or decode from a file named "audio.mp3" like this:
.EX
sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -
.EE
.PP
By default only errors are printed to standard error, but if you want
more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial
results are not printed; maybe they will be in the future, but don't
hold your breath. Force-alignment is available through the
‘\f[CR]align\fP’ command described above.
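.PP
Because the results are written as one JSON object per line, they are
easy to post-process in a script. The following is only a minimal
Python sketch and is not part of pocketsphinx: the function names and
the 0.5 threshold are illustrative assumptions, and only the
‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, ‘\f[CR]t\fP’, and
‘\f[CR]w\fP’ fields described above are relied upon. It reads results
from standard input and lists the segments whose estimated probability
reaches the threshold:
.EX
import json
import sys


def parse_results(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def confident_words(result, min_p=0.5):
    """Return (text, start, duration) for segments with p >= min_p."""
    # min_p is an arbitrary illustrative threshold, not a recommended value.
    return [(seg["t"], seg["b"], seg["d"])
            for seg in result.get("w", [])
            if seg["p"] >= min_p]


if __name__ == "__main__":
    for result in parse_results(sys.stdin):
        # Start time, duration, and full text of each recognized segment.
        print(result["b"], result["d"], result["t"])
        for text, start, dur in confident_words(result):
            print("  ", text, start, dur)
.EE
Assuming the sketch is saved under the hypothetical name
‘\f[CR]filter_words.py\fP’, it could be used like this:
.EX
pocketsphinx live audio.wav | python3 filter_words.py
.EE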
.SH OPTIONS
.TP
.B \-agc
Automatic gain control for c0 ('max', 'emax', 'noise', or 'none')
.TP
.B \-agcthresh
Initial threshold for automatic gain control
.TP
.B \-allphone
Perform phoneme decoding with phonetic lm (given here)
.TP
.B \-allphone_ci
Perform phoneme decoding with phonetic lm and context-independent units only
.TP
.B \-alpha
Preemphasis parameter
.TP
.B \-ascale
Inverse of acoustic model scale for confidence score calculation
.TP
.B \-aw
Inverse weight applied to acoustic scores.
.TP
.B \-backtrace
Print results and backtraces to log.
.TP
.B \-beam
Beam width applied to every frame in Viterbi search (smaller values mean wider beam)
.TP
.B \-bestpath
Run bestpath (Dijkstra) search over word lattice (3rd pass)
.TP
.B \-bestpathlw
Language model probability weight for bestpath search
.TP
.B \-ceplen
Number of components in the input feature vector
.TP
.B \-cmn
Cepstral mean normalization scheme ('live', 'batch', or 'none')
.TP
.B \-cmninit
Initial values (comma-separated) for cepstral mean when 'live' is used
.TP
.B \-compallsen
Compute all senone scores in every frame (can be faster when there are many senones)
.TP
.B \-dict
Main pronunciation dictionary (lexicon) input file
.TP
.B \-dictcase
Dictionary is case sensitive (NOTE: case insensitivity applies to ASCII characters only)
.TP
.B \-dither
Add 1/2-bit noise
.TP
.B \-doublebw
Use double bandwidth filters (same center freq)
.TP
.B \-ds
Frame GMM computation downsampling ratio
.TP
.B \-fdict
Noise word pronunciation dictionary input file
.TP
.B \-feat
Feature stream type, depends on the acoustic model
.TP
.B \-featparams
File containing feature extraction parameters.
.TP
.B \-fillprob
Filler word transition probability
.TP
.B \-frate
Frame rate
.TP
.B \-fsg
Sphinx format finite state grammar file
.TP
.B \-fsgusealtpron
Add alternate pronunciations to FSG
.TP
.B \-fsgusefiller
Insert filler words at each state.
.TP
.B \-fwdflat
Run forward flat-lexicon search over word lattice (2nd pass)
.TP
.B \-fwdflatbeam
Beam width applied to every frame in second-pass flat search
.TP
.B \-fwdflatefwid
Minimum number of end frames for a word to be searched in fwdflat search
.TP
.B \-fwdflatlw
Language model probability weight for flat lexicon (2nd pass) decoding
.TP
.B \-fwdflatsfwin
Window of frames in lattice to search for successor words in fwdflat search
.TP
.B \-fwdflatwbeam
Beam width applied to word exits in second-pass flat search
.TP
.B \-fwdtree
Run forward lexicon-tree search (1st pass)
.TP
.B \-hmm
Directory containing acoustic model files.
.TP
.B \-input_endian
Endianness of input data, big or little, ignored if NIST or MS Wav
.TP
.B \-jsgf
JSGF grammar file
.TP
.B \-keyphrase
Keyphrase to spot
.TP
.B \-kws
A file with keyphrases to spot, one per line
.TP
.B \-kws_delay
Delay to wait for best detection score
.TP
.B \-kws_plp
Phone loop probability for keyphrase spotting
.TP
.B \-kws_threshold
Threshold for p(hyp)/p(alternatives) ratio
.TP
.B \-latsize
Initial backpointer table size
.TP
.B \-lda
File containing transformation matrix to be applied to features (single-stream features only)
.TP
.B \-ldadim
Dimensionality of output of feature transformation (0 to use entire matrix)
.TP
.B \-lifter
Length of sin-curve for liftering, or 0 for no liftering.
.TP
.B \-lm
Word trigram language model input file
.TP
.B \-lmctl
Specify a set of language models
.TP
.B \-lmname
Which language model in \fB\-lmctl\fR to use by default
.TP
.B \-logbase
Base in which all log-likelihoods are calculated
.TP
.B \-logfn
File to write log messages in
.TP
.B \-loglevel
Minimum level of log messages (DEBUG, INFO, WARN, ERROR)
.TP
.B \-logspec
Write out logspectral files instead of cepstra
.TP
.B \-lowerf
Lower edge of filters
.TP
.B \-lpbeam
Beam width applied to last phone in words
.TP
.B \-lponlybeam
Beam width applied to last phone in single-phone words
.TP
.B \-lw
Language model probability weight
.TP
.B \-maxhmmpf
Maximum number of active HMMs to maintain at each frame (or \fB\-1\fR for no pruning)
.TP
.B \-maxwpf
Maximum number of distinct word exits at each frame (or \fB\-1\fR for no pruning)
.TP
.B \-mdef
Model definition input file
.TP
.B \-mean
Mixture Gaussian means input file
.TP
.B \-mfclogdir
Directory to log feature files to
.TP
.B \-min_endfr
Nodes ignored in lattice construction if they persist for fewer than N frames
.TP
.B \-mixw
Senone mixture weights input file (uncompressed)
.TP
.B \-mixwfloor
Senone mixture weights floor (applied to data from \fB\-mixw\fR file)
.TP
.B \-mllr
MLLR transformation to apply to means and variances
.TP
.B \-mmap
Use memory-mapped I/O (if possible) for model files
.TP
.B \-ncep
Number of cep coefficients
.TP
.B \-nfft
Size of FFT, or 0 to set automatically (recommended)
.TP
.B \-nfilt
Number of filter banks
.TP
.B \-nwpen
New word transition penalty
.TP
.B \-pbeam
Beam width applied to phone transitions
.TP
.B \-pip
Phone insertion penalty
.TP
.B \-pl_beam
Beam width applied to phone loop search for lookahead
.TP
.B \-pl_pbeam
Beam width applied to phone loop transitions for lookahead
.TP
.B \-pl_pip
Phone insertion penalty for phone loop
.TP
.B \-pl_weight
Weight for phoneme lookahead penalties
.TP
.B \-pl_window
Phoneme lookahead window size, in frames
.TP
.B \-rawlogdir
Directory to log raw audio files to
.TP
.B \-remove_dc
Remove DC offset from each frame
.TP
.B \-remove_noise
Remove noise using spectral subtraction
.TP
.B \-round_filters
Round mel filter frequencies to DFT points
.TP
.B \-samprate
Sampling rate
.TP
.B \-seed
Seed for random number generator; if less than zero, pick our own
.TP
.B \-sendump
Senone dump (compressed mixture weights) input file
.TP
.B \-senlogdir
Directory to log senone score files to
.TP
.B \-senmgau
Senone to codebook mapping input file (usually not needed)
.TP
.B \-silprob
Silence word transition probability
.TP
.B \-smoothspec
Write out cepstral-smoothed logspectral files
.TP
.B \-svspec
Subvector specification (e.g., 24,0-11/25,12-23/26-38 or 0-12/13-25/26-38)
.TP
.B \-tmat
HMM state transition matrix input file
.TP
.B \-tmatfloor
HMM state transition probability floor (applied to \fB\-tmat\fR file)
.TP
.B \-topn
Maximum number of top Gaussians to use in scoring.
.TP
.B \-topn_beam
Beam width used to determine top-N Gaussians (or a list, per-feature)
.TP
.B \-toprule
Start rule for JSGF (first public rule is default)
.TP
.B \-transform
Which type of transform to use to calculate cepstra (legacy, dct, or htk)
.TP
.B \-unit_area
Normalize mel filters to unit area
.TP
.B \-upperf
Upper edge of filters
.TP
.B \-uw
Unigram weight
.TP
.B \-var
Mixture Gaussian variances input file
.TP
.B \-varfloor
Mixture Gaussian variance floor (applied to data from \fB\-var\fR file)
.TP
.B \-varnorm
Variance normalize each utterance (only if CMN == batch)
.TP
.B \-verbose
Show input filenames
.TP
.B \-warp_params
Parameters defining the warping function
.TP
.B \-warp_type
Warping function type (or shape)
.TP
.B \-wbeam
Beam width applied to word exits
.TP
.B \-wip
Word insertion penalty
.TP
.B \-wlen
Hamming window length
.SH AUTHOR
Written by numerous people at CMU from 1994 onwards. This manual page
by David Huggins-Daines.
.SH COPYRIGHT
Copyright \(co 1994-2016 Carnegie Mellon University. See the file
\fILICENSE\fR included with this package for more information.
.br
.SH "SEE ALSO"
.BR pocketsphinx_batch (1),
.BR sphinx_fe (1).
.br