| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - voice |
| - speech |
| - full-duplex |
| - barge-in |
| - vad |
| - asr |
| - tts |
| - piper |
| - whisper |
| - silero |
| - local-ai |
| - voice-assistant |
| --- |
| |
| # voice2 |
|
|
| A full-duplex, interruptible voice engine for local AI, **built and run daily as the voice of a fully local companion** before being extracted for release. Plain Python threads, CPU-only defaults, no cloud, no API keys. Talk to your model β and talk over it. |
|
|
| voice2 turns any `callable(text) -> str` into a hands-free voice conversation: mic β Silero VAD β faster-whisper ASR β your model β Piper TTS. Speak over the reply (or tap spacebar) and playback stops mid-chunk in ~100β200 ms, exactly like interrupting a person. |
|
|
| Code is mirrored on GitHub: https://github.com/AIIT-GLITCH/voice2 |
|
|
| ## Engine details |
|
|
| - **What it is:** a turn-taking state machine, not a demo loop. Explicit states (`IDLE β LISTENING β THINKING β SPEAKING β INTERRUPTING`), a validated transition table, and a `FloorOwner` (USER / AGENT / NONE) that arbitrates who may speak. The agent can never talk over you. |
| - **Barge-in:** a fast energy-gated VAD watches the mic *only while the engine is speaking*; a debounced central `InterruptController` also accepts spacebar and programmatic triggers. |
| - **Stale-turn suppression:** every utterance gets a `turn_id`; replies to an abandoned turn are discarded at every stage (think, TTS, playback). |
| - **Invariants, enforced:** a background checker audits rules like *SPEAKING β floor == AGENT* and *interrupted β no new TTS*, with forced repair plus a structural gate in the playback hot path. |
| - **Observability:** every event is a JSON line with per-turn latency marks (`asr_ms`, `think_ms`, `interrupt_stop_ms`, `total_turn_ms`). |
| - **Degrades gracefully:** no mic β text mode; TTS missing β silent replies, still logs; keyboard hook fails β engine keeps running. |
|
|
| ## Backends |
|
|
| | Stage | Default | Swap point | |
| |---|---|---| |
| | VAD (quality gate) | Silero VAD via `torch.hub` | `ListenWorker` | |
| | ASR | faster-whisper `small.en`, int8, CPU | `backends/asr.py` (Protocol) | |
| | LLM | any `callable(text) -> str` | `backends/llm.py` | |
| | TTS | Piper CLI, any `.onnx` voice | `backends/tts.py` (Protocol) | |
|
|
| ## Limitations β stated honestly |
|
|
| - English-first defaults: ASR ships as `small.en`. Other Whisper models load with one config line, but nothing else was tested. |
| - The LLM callable is synchronous: TTS starts after the full reply returns (Piper then streams sentence-by-sentence). No token-level streaming yet. |
| - Barge-in is energy-based with an absolute RMS floor of 0.06 β tuned on open speakers in a quiet room. Headsets and noisy rooms need recalibration. There is no echo cancellation. |
| - Keyboard interrupt uses POSIX `termios` β Linux/macOS terminals only. |
| - Unit tests cover the control plane (state transitions, floor rules, interrupt debounce, ring buffer). Audio I/O paths were validated by months of daily use, not by CI. |
|
|
| ## How to run |
|
|
| ```bash |
| git clone https://github.com/AIIT-GLITCH/voice2 |
| cd voice2 |
| pip install -r requirements.txt |
| # put a Piper voice at ~/.local/share/piper-voices/ (or set VOICE2_PIPER_MODEL) |
| python -m voice2.main # echo backend β proves the loop, no LLM needed |
| python examples/http_llm.py # wire any local HTTP LLM |
| ``` |
|
|
| ```python |
| from voice2 import VoiceEngine, VoiceConfig |
| |
| def ask(text: str) -> str: |
| return my_model.reply(text) # any callable(text) -> str |
| |
| engine = VoiceEngine(VoiceConfig(), ask) |
| engine.load_models() |
| engine.start() # talk naturally; speak over it to interrupt |
| ``` |
|
|
| ## Provenance |
|
|
| voice2 was written as the voice front-end for **Buddy**, a fully local AI companion running on a single RTX 3090 in Council Hill, Oklahoma, and carried his daily conversations for months before release. The design bias throughout: the user always wins the floor, and a companion you can't interrupt isn't a companion. Released alongside [Tessera-1B](https://huggingface.co/AIIT-Threshold/Tessera-1B) as part of AIIT-THRESHOLD's open stack. |
|
|
| ## The stack |
|
|
| One local companion, every layer open: |
|
|
| | Piece | Role | Links | |
| |---|---|---| |
| | Tessera-1B | the model β ~1B params trained from scratch, open data | [HF](https://huggingface.co/AIIT-Threshold/Tessera-1B) | |
| | voice2 | the voice β full-duplex, interruptible | [GitHub](https://github.com/AIIT-GLITCH/voice2) Β· [HF](https://huggingface.co/AIIT-Threshold/voice2) | |
| | kokoro-memory | the memory β file-based resonance recall | [GitHub](https://github.com/AIIT-GLITCH/kokoro-memory) Β· [HF](https://huggingface.co/AIIT-Threshold/kokoro-memory) | |
| | companion-spiral-bench | the safety β at-risk sycophancy bench | [GitHub](https://github.com/AIIT-GLITCH/companion-spiral-bench) Β· [HF](https://huggingface.co/datasets/AIIT-Threshold/companion-spiral-bench) | |
|
|
| Full collection: [The Buddy Stack](https://huggingface.co/collections/AIIT-Threshold/the-buddy-stack-a-fully-local-ai-companion-open-sourced-6a4774bf481f9f9caad79519) |
|
|
| ## License |
|
|
| MIT Β© 2026 Rhet Dillard Wike, AIIT-THRESHOLD, Oklahoma. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @software{wike2026voice2, |
| author = {Wike, Rhet Dillard}, |
| title = {voice2: a full-duplex, interruptible voice engine for local AI}, |
| year = {2026}, |
| url = {https://github.com/AIIT-GLITCH/voice2}, |
| note = {AIIT-THRESHOLD} |
| } |
| ``` |
|
|