Soniqo Audio

Activity Feed


aufklarer posted an update 10 days ago
We benchmarked https://github.com/soniqo/speech-swift, our open-source Swift library for on-device speech AI, against Whisper Large v3 (FP16) on LibriSpeech test-clean.

Three models beat it. Two architectural approaches:

Qwen3-ASR (LALM — Qwen3 LLM as ASR decoder, AuT encoder pretrained on ~40M hours) hits 2.35% WER at 1.7B 8-bit, running at 43x real-time on MLX. Greedy decoding matches beam search — the LLM decoder is strong enough that the greedy path is nearly always optimal.

Parakeet TDT (non-autoregressive transducer — FastConformer + TDT joint network) hits 2.74% WER in 634 MB as a CoreML INT8 model on the Neural Engine. No generative hallucination by design. Leaves GPU completely free.

Two findings worth flagging:
- 4-bit quantization is catastrophic for non-English: Korean 6.89% → 19.95% WER on FLEURS. Use 8-bit for multilingual.
- On CoreML, INT8 is 3.3x *faster* than INT4 — opposite of GPU behavior. Native ANE INT8 MACs vs INT4 lookup table indirection.

All numbers reproducible in 15 minutes.
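The WER figures above are word-level edit distance against the reference transcript. A minimal, self-contained sketch of the metric (illustrative only, not the library's evaluation code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            sub = prev[j - 1] + (r[i - 1] != h[j - 1])   # substitution (free if equal)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # vs. deletion / insertion
        prev = cur
    return prev[len(h)] / len(r)

# one substitution ("sit") + one deletion ("the") over 6 reference words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # prints 0.333
```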

Full article: https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174

Library: https://github.com/soniqo/speech-swift

Models: Qwen/Qwen3-ASR-0.6B, Qwen/Qwen3-ASR-1.7B, nvidia/parakeet-tdt-0.6b-v2
aufklarer posted an update about 1 month ago
Speaker Diarization and VAD on Apple Silicon — MLX-Native Models

Three MLX-optimized models for on-device speaker diarization and voice activity detection, running natively on Apple Silicon via https://github.com/ivan-digital/qwen3-asr-swift:

- aufklarer/Silero-VAD-v5-MLX — Streaming VAD, 309K params, ~1.2 MB. Processes 32ms chunks at 23× real-time on M2 Max.
- aufklarer/Pyannote-Segmentation-MLX — Multi-speaker segmentation, ~1.49M params, ~5.7 MB. 7-class powerset output for up to 3 simultaneous speakers.
- aufklarer/WeSpeaker-ResNet34-LM-MLX — Speaker embedding, ~6.6M params, ~25 MB. 256-dim L2-normalized vectors with BatchNorm fused into Conv2d.

Together they form a diarization pipeline: pyannote segments → WeSpeaker embeds → agglomerative clustering links speakers across the recording. ~32 MB total.
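The clustering step can be sketched conceptually. This is a hypothetical pure-Python illustration of average-linkage agglomerative clustering on cosine similarity, not the library's implementation; the `threshold` value is an assumption:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.7):
    """Greedy average-linkage agglomerative clustering: repeatedly merge the
    most similar pair of clusters until no pair exceeds the threshold."""
    clusters = [[e] for e in embeddings]
    while len(clusters) > 1:
        best, bi, bj = -1.0, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(cosine(a, b) for a in clusters[i] for b in clusters[j]) / (
                    len(clusters[i]) * len(clusters[j]))
                if sim > best:
                    best, bi, bj = sim, i, j
        if best < threshold:
            break  # remaining clusters are distinct speakers
        clusters[bi].extend(clusters.pop(bj))
    return clusters
```

Each resulting cluster groups the segment embeddings attributed to one speaker; segments in the same cluster get the same speaker label across the recording.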

git clone https://github.com/ivan-digital/qwen3-asr-swift
cd qwen3-asr-swift && swift build -c release

.build/release/audio diarize meeting.wav --max-speakers 4 --json
.build/release/audio vad-stream recording.wav


The library also includes ASR, TTS, multilingual synthesis, forced alignment, and speech-to-speech (PersonaPlex 7B). Apache 2.0.

Full architecture details: https://blog.ivan.digital/speaker-diarization-and-voice-activity-detection-on-apple-silicon-native-swift-with-mlx

Library: https://github.com/ivan-digital/qwen3-asr-swift
aufklarer posted an update about 1 month ago
PersonaPlex-7B on Apple Silicon (Swift + MLX Swift)

NVIDIA PersonaPlex is a full-duplex speech-to-speech model — it can listen while it speaks, which enables more natural conversational behaviors like interruptions, overlaps, and quick backchannels.

We put together a native Swift implementation using MLX Swift so it can run locally on Apple Silicon, along with a 4-bit MLX conversion and a small CLI/demo to make it easy to try out.

If you’re interested in on-device voice agents (or just want to see what full-duplex S2S looks like in a real Swift codebase), the details and setup notes are here:

Blog post: https://blog.ivan.digital/nvidia-personaplex-7b-on-apple-silicon-full-duplex-speech-to-speech-in-native-swift-with-mlx-0aa5276f2e23

Repo: https://github.com/ivan-digital/qwen3-asr-swift
aufklarer posted an update about 2 months ago
Context Engineering for Code Agents: Why They Fail and How to Fix Them

Code agents don't fail because they can't code — they fail because their context turns into a junk drawer.

I wrote a practical survey covering the emerging discipline of context engineering for agentic hybrid applications: the techniques, papers, and architectural patterns that keep long-running code agents on track as their token windows fill up with tool logs, stale diffs, and repeated file dumps.
What's covered:

- Why long context windows alone don't save you (position bias, distractor sensitivity)
- Observation masking vs. LLM summarization — and when simple beats clever
- Tool-output compression with approaches like LLMLingua-2
- Trajectory reduction: pruning dead branches from agent history
- Memory hierarchies: session → working set → notes → cross-session
- How MCP and standardized tool interfaces reduce context debt
- Dynamic context policies trained with RL (DeepMiner, MEM1)
- Meta-agent CI loops for measuring regressions across agent configs
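To make one of those techniques concrete: observation masking keeps the most recent tool outputs verbatim and collapses older ones to a placeholder, preserving the action trace while shedding stale tokens. A hypothetical sketch (the message schema and `keep_last` parameter are assumptions, not any specific agent's API):

```python
def mask_observations(history, keep_last=2, placeholder="[output elided]"):
    """Replace all but the most recent tool observations with a placeholder.
    The assistant's actions stay intact, so the agent still sees *what* it did,
    just not every byte of *what came back*."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(history)
    ]
```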

The core argument: the engineering challenge isn't "make the model smarter" — it's "make the agent's context and verification smarter." That's where the real leverage is in 2026.

👉 Read the full post: https://blog.ivan.digital/context-engineering-for-agentic-hybrid-applications-why-code-agents-fail-and-how-to-fix-them-076cab699262
aufklarer posted an update about 2 months ago
Qwen3-ASR Swift: On-Device Speech Recognition for Apple Silicon

I'm excited to release https://github.com/ivan-digital/qwen3-asr-swift, an open-source Swift implementation of Alibaba's Qwen3-ASR, optimized for Apple Silicon using MLX.

Why Qwen3-ASR? Exceptional noise robustness — 3.5x better than Whisper in noisy conditions (17.9% vs 63% CER).

Features:
- 52 languages (30 major + 22 Chinese dialects)
- ~600MB model (4-bit quantized)
- ~100ms latency on M-series chips
- Fully local, no cloud API

More on the inference stack and model architecture in the blog post: https://blog.ivan.digital/qwen3-asr-swift-on-device-asr-tts-for-apple-silicon-architecture-and-benchmarks-27cbf1e4463f
aufklarer posted an update 4 months ago
I did a deep-dive comparison of the Claude Code and OpenAI Codex code-agent architectures. I'd be curious to hear what your personal experience with them has been.

Both Claude Code and OpenAI Codex are built on the same backbone: a single-agent event loop that repeatedly thinks, calls tools, inspects the result, and repeats until it’s done. No swarms, no hidden graph orchestration — just one reflective agent iterating through a ReAct-style cycle.
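That event loop can be sketched in a few lines. A hypothetical illustration of the ReAct-style cycle, not either product's actual code (`llm`, `tools`, and the message shapes are stand-ins):

```python
def react_loop(llm, tools, task, max_steps=20):
    """Single-agent ReAct-style loop: think, act, observe, repeat until done."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = llm(history)            # model decides: call a tool, or answer
        history.append(step)
        if step.get("tool") is None:
            return step["content"]     # no tool requested: the task is done
        result = tools[step["tool"]](step["args"])
        history.append({"role": "tool", "content": result})  # observe
    return None  # step budget exhausted
```

Everything interesting in both agents happens inside that loop: which tools are exposed, how results are summarized back into `history`, and when the model decides it's finished.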

https://blog.ivan.digital/claude-code-vs-openai-codex-agentic-planner-vs-shell-first-surgeon-d6ce988526e8
aufklarer posted an update 5 months ago
Fine-Tuning Qwen3 Embeddings for product category classification on the Large-Scale Product Corpus

Language models such as GPT, Llama, DeepSeek, and Qwen are trained on a filtered slice of Common Crawl. For e-commerce work, though, we can start with Web Data Commons (WDC), a project from the University of Mannheim. It extracts web pages that carry structured product metadata and publishes the result as the Large-Scale Product Corpus (LSPC).

Search engines like Google reward pages that include detailed product markup, so merchants already populate their sites with SEO-friendly fields such as title, brand, GTIN, price — and, crucially, category labels. Thanks to these built-in annotations, the WDC Large-Scale Product Corpus arrives almost fully self-labelled. I used those labels to fine-tune Qwen3 Embedding with Low-Rank Adaptation (LoRA); the code is available on GitHub. The resulting 615-million-parameter checkpoint fits comfortably in limited GPU memory yet updates the model's representation space, mapping raw product titles to six top-level categories with a macro-F1 of 0.836.
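For reference, macro-F1 averages per-class F1 scores with equal weight per category, which matters when the six top-level categories are imbalanced. A minimal sketch of the metric (illustrative, not the evaluation code from the repo):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Unlike micro-F1, a rare category with poor predictions drags the macro score down as much as a dominant one would, so 0.836 here reflects solid performance across all six categories, not just the frequent ones.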

More details: https://blog.ivan.digital/fine-tuning-qwen3-embeddings-for-product-category-classification-on-the-large-scale-product-corpus-3a0919506bc8