pocketsphinx-20.04-t4 / doxygen /pocketsphinx.1.in

pocketsphinx-20.04-t4

72aed27 over 3 years ago

4.45 kB

	.TH POCKETSPHINX 1 "2022-09-27"
	.SH NAME
	pocketsphinx \- Run speech recognition on audio data
	.SH SYNOPSIS
	.B pocketsphinx
	[ \fIoptions\fR... ]
	[ \fBlive\fR \|
	\fBsingle\fR \|
	\fBhelp\fR \|
	\fBsoxflags\fR ]
	\fIINPUTS\fR...
	.SH DESCRIPTION
	.PP
	The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
	16-bit PCM audio one or more input files (or ‘\f[CR]-\fP’ to read from
	standard input), and attemps to recognize speech in it using the
	default acoustic and language model. The input files can be raw audio,
	WAV, or NIST Sphere files, though some of these may not be recognized
	properly. It accepts a large number of options which you probably
	don't care about, and a \fIcommand\fP which defaults to
	‘\f[CR]live\fP’. The commands are as follows:
	.TP
	.B help
	Print a long list of those options you don't care about.
	.TP
	.B config
	Dump configuration as JSON to standard output (can be loaded with the
	‘\f[CR]-config\fP’ option).
	.TP
	.B live
	Detect speech segments in input files, run recognition on them (using
	those options you don't care about), and write the results to standard
	output in line-delimited JSON. I realize this isn't the prettiest
	format, but it sure beats XML. Each line contains a JSON object with
	these fields, which have short names to make the lines more readable:
	.IP
	"b": Start time in seconds, from the beginning of the stream
	.IP
	"d": Duration in seconds
	.IP
	"p": Estimated probability of the recognition result, i.e. a number between
	0 and 1 which may be used as a confidence score
	.IP
	"t": Full text of recognition result
	.IP
	"w": List of segments (usually words), each of which in turn contains the
	‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
	start, end, probability, and the text of the word. In the future we
	may also support hierarchical results in which case ‘\f[CR]w\fP’ could
	be present.
	.TP
	.B single
	Recognize the input as a single utterance, and write a JSON object in the same format described above.
	.TP
	.B align

	Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word
	sequence, and write a JSON object in the same format described above.
	The first positional argument is the input, and all subsequent ones
	are concatenated to make the text, to avoid surprises if you forget to
	quote it. You are responsible for normalizing the text to remove
	punctuation, uppercase, centipedes, etc. For example:

	.EX
	pocketsphinx align goforward.wav "go forward ten meters"
	.EE

	By default, only word-level alignment is done. To get phone
	alignments, pass `-phone_align yes` in the flags, e.g.:

	.EX
	pocketsphinx -phone_align yes align audio.wav $text
	.EE

	This will make not particularly readable output, but you can use
	.B jq
	(https://stedolan.github.io/jq/) to clean it up. For example,
	you can get just the word names and start times like this:

	.EX
	pocketsphinx align audio.wav $text \| jq '.w[]\|[.t,.b]'
	.EE

	Or you could get the phone names and durations like this:

	.EX
	pocketsphinx -phone_align yes align audio.wav $text \| jq '.w[]\|.w[]\|[.t,.d]'
	.EE

	There are many, many other possibilities, of course.
	.TP
	.B help
	Print a usage and help text with a list of possible arguments.
	.TP
	.B soxflags
	Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate
	input format. Note that because the ‘\f[CR]sox\fP’ command-line is
	slightly quirky these must always come \fIafter\fP the filename or
	‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
	microphone). You can run live recognition like this:

	.EX
	sox -d $(pocketsphinx soxflags) \| pocketsphinx -
	.EE

	or decode from a file named "audio.mp3" like this:

	.EX
	sox audio.mp3 $(pocketsphinx soxflags) \| pocketsphinx -
	.EE
	.PP
	By default only errors are printed to standard error, but if you want more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial results are not printed, maybe they will be in the future, but don't hold your breath. Force-alignment is likely to be supported soon, however.
	.SH OPTIONS
	.\" ### ARGUMENTS ###
	.SH AUTHOR
	Written by numerous people at CMU from 1994 onwards. This manual page
	by David Huggins-Daines <dhdaines@gmail.com>
	.SH COPYRIGHT
	Copyright \(co 1994-2016 Carnegie Mellon University. See the file
	\fILICENSE\fR included with this package for more information.
	.br
	.SH "SEE ALSO"
	.BR pocketsphinx_batch (1),
	.BR sphinx_fe (1).
	.br