	---
	license: mit
	language:
	- en
	tags:
	- voice
	- speech
	- full-duplex
	- barge-in
	- vad
	- asr
	- tts
	- piper
	- whisper
	- silero
	- local-ai
	- voice-assistant
	---

	# voice2

	A full-duplex, interruptible voice engine for local AI, built and run daily as the voice of a fully local companion before being extracted for release. Plain Python threads, CPU-only defaults, no cloud, no API keys. Talk to your model — and talk over it.

	voice2 turns any `callable(text) -> str` into a hands-free voice conversation: mic → Silero VAD → faster-whisper ASR → your model → Piper TTS. Speak over the reply (or tap the spacebar) and playback stops mid-chunk in ~100–200 ms, much like interrupting a person.

	Code is mirrored on GitHub: https://github.com/AIIT-GLITCH/voice2

	## Engine details

	- What it is: a turn-taking state machine, not a demo loop. Explicit states (`IDLE → LISTENING → THINKING → SPEAKING → INTERRUPTING`), a validated transition table, and a `FloorOwner` (USER / AGENT / NONE) that arbitrates who may speak. The agent can never talk over you.
	- Barge-in: a fast energy-gated VAD watches the mic only while the engine is speaking; a debounced central `InterruptController` also accepts spacebar and programmatic triggers.
	- Stale-turn suppression: every utterance gets a `turn_id`; replies to an abandoned turn are discarded at every stage (think, TTS, playback).
	- Invariants, enforced: a background checker audits rules like `SPEAKING ⇒ floor == AGENT` and `interrupted ⇒ no new TTS`. A violation triggers a forced repair, and the playback hot path has its own structural gate.
	- Observability: every event is a JSON line with per-turn latency marks (`asr_ms`, `think_ms`, `interrupt_stop_ms`, `total_turn_ms`).
	- Degrades gracefully: no mic → text mode; TTS missing → silent replies, still logs; keyboard hook fails → engine keeps running.
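
The turn-taking rules in the list above can be sketched roughly as follows. This is an illustrative sketch, not voice2's source: the state and floor names come from this README, but the exact entries in `ALLOWED`, the `FLOOR` mapping, the `TurnMachine` class, and its method names are assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTING = auto()


class FloorOwner(Enum):
    NONE = auto()
    USER = auto()
    AGENT = auto()


# Assumed transition table; the real engine's table may differ.
ALLOWED = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.THINKING, State.IDLE},
    State.THINKING: {State.SPEAKING, State.INTERRUPTING, State.IDLE},
    State.SPEAKING: {State.INTERRUPTING, State.IDLE},
    State.INTERRUPTING: {State.LISTENING, State.IDLE},
}

# Assumed floor for each state: the agent holds it only while SPEAKING.
FLOOR = {
    State.IDLE: FloorOwner.NONE,
    State.LISTENING: FloorOwner.USER,
    State.THINKING: FloorOwner.NONE,
    State.SPEAKING: FloorOwner.AGENT,
    State.INTERRUPTING: FloorOwner.USER,
}


class TurnMachine:
    def __init__(self):
        self.state = State.IDLE
        self.floor = FloorOwner.NONE
        self.turn_id = 0

    def transition(self, new_state):
        """Move to new_state, rejecting anything not listed in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
        self.floor = FLOOR[new_state]

    def begin_turn(self):
        """Start listening to a new utterance and return its turn_id."""
        self.transition(State.LISTENING)
        self.turn_id += 1
        return self.turn_id

    def is_stale(self, turn_id):
        """A reply is stale if a newer turn started after it was requested."""
        return turn_id != self.turn_id
```

In this sketch each pipeline stage (think, TTS, playback) would call `is_stale(turn_id)` before doing work and drop the reply when it returns `True`. Keeping the floor as a pure function of the state is one simple way to make `SPEAKING ⇒ floor == AGENT` hold by construction.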

	## Backends

	| Stage | Default | Swap point |
	|---|---|---|
	| VAD (quality gate) | Silero VAD via `torch.hub` | `ListenWorker` |
	| ASR | faster-whisper `small.en`, int8, CPU | `backends/asr.py` (Protocol) |
	| LLM | any `callable(text) -> str` | `backends/llm.py` |
	| TTS | Piper CLI, any `.onnx` voice | `backends/tts.py` (Protocol) |
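
The table marks the ASR and TTS swap points as Protocol-based. The sketch below shows how a `typing.Protocol` lets any class with the right method stand in as a backend without inheriting from anything. The names `ASRBackend`, `transcribe`, `CannedASR`, and `run_turn` are hypothetical and chosen for illustration; check `backends/asr.py` for the interface voice2 actually defines.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class ASRBackend(Protocol):
    # Hypothetical interface; the real Protocol in backends/asr.py may differ.
    def transcribe(self, audio: bytes, sample_rate: int) -> str: ...


class CannedASR:
    """Stand-in backend that returns a fixed transcript, handy for tests."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio: bytes, sample_rate: int) -> str:
        return self.text


def run_turn(
    asr: ASRBackend,
    llm: Callable[[str], str],
    audio: bytes,
    sample_rate: int = 16000,
) -> str:
    """Transcribe one utterance and pass the text to any callable(text) -> str."""
    return llm(asr.transcribe(audio, sample_rate))
```

For example, `run_turn(CannedASR("hello"), str.upper, b"")` exercises the text path with no microphone or model loaded.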

	## Limitations — stated honestly

	- English-first defaults: ASR ships as `small.en`. Other Whisper models can be loaded by changing one config line, but only `small.en` has been tested.
	- The LLM callable is synchronous: TTS starts after the full reply returns (Piper then streams sentence-by-sentence). No token-level streaming yet.
	- Barge-in is energy-based with an absolute RMS floor of 0.06 — tuned on open speakers in a quiet room. Headsets and noisy rooms need recalibration. There is no echo cancellation.
	- Keyboard interrupt uses POSIX `termios` — Linux/macOS terminals only.
	- Unit tests cover the control plane (state transitions, floor rules, interrupt debounce, ring buffer). Audio I/O paths were validated by months of daily use, not by CI.
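
To make the energy-based barge-in limitation concrete, here is a minimal sketch of an RMS gate using the 0.06 floor quoted above. It assumes float samples normalised to the range -1.0 to 1.0 and adds a consecutive-frame debounce; `min_frames` and the function names are hypothetical and are not voice2 parameters.

```python
import math

RMS_FLOOR = 0.06  # absolute floor quoted in the limitations above


def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_barge_in(frames, floor=RMS_FLOOR, min_frames=3):
    """Return True once min_frames consecutive frames exceed the floor.

    min_frames is a hypothetical debounce setting for this sketch.
    """
    run = 0
    for frame in frames:
        # Count consecutive loud frames; any quiet frame resets the run.
        run = run + 1 if rms(frame) > floor else 0
        if run >= min_frames:
            return True
    return False
```

Because there is no echo cancellation, the engine's own playback leaks into the mic. One plausible way to recalibrate for a new room or headset is to measure `rms()` while the engine is speaking and nobody is talking, then set the floor above that level.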

	## How to run

	```bash
	git clone https://github.com/AIIT-GLITCH/voice2
	cd voice2
	pip install -r requirements.txt
	# put a Piper voice at ~/.local/share/piper-voices/ (or set VOICE2_PIPER_MODEL)
	python -m voice2.main # echo backend — proves the loop, no LLM needed
	python examples/http_llm.py # wire any local HTTP LLM
	```

	```python
	from voice2 import VoiceEngine, VoiceConfig

	def ask(text: str) -> str:
	    return my_model.reply(text)  # any callable(text) -> str

	engine = VoiceEngine(VoiceConfig(), ask)
	engine.load_models()
	engine.start() # talk naturally; speak over it to interrupt
	```
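
Since the engine only needs a `callable(text) -> str`, a local HTTP model can be wrapped in a small closure. The sketch below is not the contents of `examples/http_llm.py`. The request body `{"prompt": ...}`, the response field `"reply"`, the fallback sentence, and the names `post_json` and `make_http_ask` are all assumptions to adapt to your own server.

```python
import json
import urllib.request


def post_json(url, payload, timeout):
    """POST payload as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def make_http_ask(url, timeout=30.0, post=post_json):
    """Build a callable(text) -> str backed by an HTTP endpoint.

    The JSON shapes used here are hypothetical, not a voice2 or server API.
    """

    def ask(text: str) -> str:
        try:
            data = post(url, {"prompt": text}, timeout)
            return str(data["reply"])
        except (OSError, KeyError, ValueError):
            # Network errors, a missing field, or bad JSON: return a short
            # spoken fallback instead of crashing the turn.
            return "Sorry, I could not reach the model."

    return ask
```

The resulting function can be passed wherever the snippet above passes `ask`. The `post` parameter exists so the wrapper can be tested without a running server.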

	## Provenance

	voice2 was written as the voice front-end for Buddy, a fully local AI companion running on a single RTX 3090 in Council Hill, Oklahoma, and carried his daily conversations for months before release. The design bias throughout: the user always wins the floor, and a companion you can't interrupt isn't a companion. Released alongside [Tessera-1B](https://huggingface.co/AIIT-Threshold/Tessera-1B) as part of AIIT-THRESHOLD's open stack.

	## The stack

	One local companion, every layer open:

	| Piece | Role | Links |
	|---|---|---|
	| Tessera-1B | the model: ~1B params trained from scratch, open data | [HF](https://huggingface.co/AIIT-Threshold/Tessera-1B) |
	| voice2 | the voice: full-duplex, interruptible | [GitHub](https://github.com/AIIT-GLITCH/voice2) · [HF](https://huggingface.co/AIIT-Threshold/voice2) |
	| kokoro-memory | the memory: file-based resonance recall | [GitHub](https://github.com/AIIT-GLITCH/kokoro-memory) · [HF](https://huggingface.co/AIIT-Threshold/kokoro-memory) |
	| companion-spiral-bench | the safety: at-risk sycophancy bench | [GitHub](https://github.com/AIIT-GLITCH/companion-spiral-bench) · [HF](https://huggingface.co/datasets/AIIT-Threshold/companion-spiral-bench) |

	Full collection: [The Buddy Stack](https://huggingface.co/collections/AIIT-Threshold/the-buddy-stack-a-fully-local-ai-companion-open-sourced-6a4774bf481f9f9caad79519)

	## License

	MIT © 2026 Rhet Dillard Wike, AIIT-THRESHOLD, Oklahoma.

	## Citation

	```bibtex
	@software{wike2026voice2,
	  author = {Wike, Rhet Dillard},
	  title = {voice2: a full-duplex, interruptible voice engine for local AI},
	  year = {2026},
	  url = {https://github.com/AIIT-GLITCH/voice2},
	  note = {AIIT-THRESHOLD}
	}
	```