---
title: LocalDuo
emoji: 🔥
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 6.16.0
python_version: '3.12'
app_file: app.py
pinned: true
short_description: 🇰🇷✨ LocalDuo - Learn Korean from Documents
preload_from_hub:
  - Qwen/Qwen3.5-2B
models:
  - Qwen/Qwen3.5-2B
  - CohereLabs/cohere-transcribe-03-2026
  - Supertone/supertonic-3
thumbnail: >-
  https://raw.githubusercontent.com/ShayekhBinIslam/file-host/main/thumbnail.png

tags:
  - track:backyard
  - achievement:offgrid
  - achievement:fieldnotes
---

# LocalDuo: Build Small Hackathon Field Notes

**Author:** Shayekh Bin Islam, KAIST, South Korea  
**Date:** June 2026  
**Stack:** Gradio · Qwen 3.5-9B VLM · Cohere ASR · Supertonic TTS · HuggingFace Spaces (ZeroGPU)

**Live Demo:** https://huggingface.co/spaces/build-small-hackathon/LocalDuo/  
**Recorded Demo:** https://youtu.be/PoZs9ltbdik  
**Social:** https://www.linkedin.com/posts/shayekhbinislam_hi-everyone-i-have-built-this-app-localduo-share-7472275977369210880--Q6i/  
**Field Note:** https://huggingface.co/blog/build-small-hackathon/localduo  

---

## What I Built

**LocalDuo** is an end-to-end Korean language learning application that takes *any* Korean-language content (a PDF textbook, a live website, an audio recording, or a YouTube video) and automatically transforms it into interactive vocabulary flashcards with native-quality audio pronunciation.

The core idea: **instead of studying from generic word lists, learn vocabulary from content you actually care about.** Upload a chapter from your Korean textbook, paste a BBC Korean news article, or drop in a K-drama YouTube clip, and the app extracts the most useful Korean vocabulary, transliterates it into your native script, explains the grammar, generates TTS pronunciation audio, and packages everything into swipeable flashcards with a built-in quiz mode.

### Feature Overview

| Feature | Description |
|---|---|
| **Multi-Source Input** | Website URLs, PDF uploads, audio file uploads, YouTube links, and pre-saved deck imports: five distinct input pipelines unified into one interface |
| **Vision-Language Extraction** | Qwen 3.5-9B processes both text *and* page images simultaneously, enabling vocabulary extraction from visual content (handwritten notes, textbook diagrams, infographics) |
| **Speech-to-Text Pipeline** | Cohere ASR (`cohere-transcribe-03-2026`) transcribes Korean audio from YouTube videos and uploaded audio files, with Korean-only filtering to strip English artifacts |
| **Text-to-Speech Pronunciation** | Supertonic-3 TTS generates natural Korean pronunciation for every extracted word, embedded as base64 audio data URIs directly in the flashcard HTML |
| **Interactive Flashcard SPA** | A full single-page application embedded via `<iframe srcdoc>` with card flipping, navigation, audio playback, and clipboard copy, all in vanilla JS/CSS |
| **5-Question Quiz Mode** | Auto-generated multiple-choice quizzes from the current deck with animated scoring and progress tracking |
| **Multilingual Transliteration** | Supports 200+ target languages organized by language family (Indo-European, Sino-Tibetan, Afro-Asiatic, etc.) with native script transliteration |
| **Export to Anki & JSON** | One-click export to `.apkg` (via `genanki`) for Anki spaced repetition, or `.json` for programmatic use |
| **Think/Non-Think Toggle** | User control over the model's reasoning chain: enable deep thinking for accuracy, or disable it for instant JSON output |
| **Korean-Themed UI** | Custom dark theme inspired by Korean aesthetics: warm gold (금) accents, ink-wash animated backgrounds, Noto Serif KR typography |

---

## Architecture

```
┌───────────────────────────────────────────────────────┐
│                    INPUT LAYER                        │
│  Website URL │ PDF Upload │ Audio │ YouTube │ Import  │
└──────┬───────┬────────────┬───────┬─────────┬─────────┘
       │       │            │       │         │
       ▼       ▼            │       │         ▼
  Playwright  PyMuPDF       │       │    JSON/Anki
  + BS4       (fitz)        │       │    Parser
  Scraper     Extract       │       │
       │       │            ▼       ▼
       │       │     Cohere ASR (cohere-transcribe-03-2026)
       │       │         Korean audio → text
       │       │            │       │
       ▼       ▼            ▼       ▼
┌───────────────────────────────────────────────────────┐
│              EXTRACTION LAYER (GPU)                   │
│                                                       │
│  Qwen 3.5-9B VLM (AutoModelForImageTextToText)        │
│  • Multimodal: text + page images → structured JSON   │
│  • Streaming via TextIteratorStreamer                 │
│  • Think/Non-think mode (enable_thinking flag)        │
│  • Auto-force JSON after configurable char threshold  │
│  • 3-attempt retry with partial JSON salvaging        │
└───────────────────────┬───────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────┐
│               POST-PROCESSING LAYER                   │
│  • JSON parsing with jiter (partial_mode=True)        │
│  • Supertonic-3 TTS → base64 audio data URIs          │
│  • Flashcard SPA builder (iframe srcdoc)              │
│  • Quiz generator (randomized MCQ)                    │
│  • Anki .apkg export (genanki)                        │
└───────────────────────┬───────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────┐
│                 PRESENTATION LAYER                    │
│  Gradio Blocks UI with Korean-inspired dark theme     │
│  • Streaming model output (live generation view)      │
│  • Interactive flashcard carousel (flip, nav, audio)  │
│  • 5-question multiple-choice quiz                    │
│  • Export buttons (JSON, Anki)                        │
│  • Generation controls (stop thinking, kill gen)      │
└───────────────────────────────────────────────────────┘
```

---

## Technical Deep Dives

### 1. Taming the Thinking Model

The biggest engineering challenge was using Qwen 3.5-9B in production. This model emits a `<think>...</think>` reasoning chain before generating output, which is great for accuracy but catastrophic for latency: the model would sometimes think for 10,000+ characters before producing any JSON.

**Solution: A multi-layered forcing mechanism.**

```
User clicks "Generate"
        │
        ▼
Model starts thinking (<think> block)
        │
        ├── Thinking chars > auto_force_chars (default: 4000)?
        │       YES → Kill generation thread
        │             Append "</think>\n```json\n[\n"
        │             Restart generation with partial context
        │
        ├── User clicks "⚑ Stop thinking, Generate now"?
        │       YES → Same forced restart
        │
        ├── Total output > 10,000 chars (hard limit)?
        │       YES → Hard force, kill and restart again
        │
        └── "Non-Think" mode toggled?
                YES → apply_chat_template(enable_thinking=False)
                      Appends empty <think>\n\n</think> block
                      Forces ```json prefix immediately
```
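The decision flow above can be sketched as a small pure function. This is a minimal illustration rather than the app's actual code: `force_reason` and `forced_prefix` are hypothetical names, the thresholds are the defaults quoted in the diagram, and it assumes the streamed output contains the literal `<think>` tag.

```python
from typing import Optional

AUTO_FORCE_CHARS = 4000     # default auto-force threshold from the flow above
HARD_LIMIT_CHARS = 10_000   # hard cap on total output from the flow above
FENCE = "`" * 3             # built at runtime to keep this block fence-safe
FORCE_SUFFIX = "</think>\n" + FENCE + "json\n[\n"


def force_reason(output: str, user_clicked_stop: bool = False,
                 auto_force_chars: int = AUTO_FORCE_CHARS) -> Optional[str]:
    """Return why generation should be force-restarted, or None to keep streaming."""
    still_thinking = "<think>" in output and "</think>" not in output
    thinking_chars = len(output.split("<think>", 1)[1]) if still_thinking else 0
    if still_thinking and user_clicked_stop:
        return "manual"
    if still_thinking and thinking_chars > auto_force_chars:
        return "auto"
    if len(output) > HARD_LIMIT_CHARS:
        return "hard"
    return None


def forced_prefix(output: str) -> str:
    """Partial assistant text to resume from: close the think block, open a JSON array."""
    return output + FORCE_SUFFIX
```

A streaming loop would call `force_reason` on the accumulated text after each chunk and, on a non-`None` result, stop the worker and restart generation from `forced_prefix(output)`.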

This required:
- **Thread management**: `model.generate()` runs in a separate `Thread` with a `TextIteratorStreamer`. Killing generation means setting a `StoppingCriteria` flag, draining the streamer queue, joining the thread with timeout, then spawning a new thread with extended context.
- **Partial context stitching**: When forcing, the entire output so far (including partial thinking) is appended to the chat template as partial assistant text, so the model has context for what it was about to generate.
- **Global flag coordination**: `global_stop_thinking` and `global_kill_threads` are module-level single-element lists (`[False]`), so any thread can flip a flag in place without rebinding a global name, and the generation thread sees the change on its next check.
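The stop, drain, and join sequence described above can be shown with the standard library alone. In this sketch `fake_generate` is a hypothetical stand-in for `model.generate()` feeding a `TextIteratorStreamer`, and the single-element list plays the role of the flag a `StoppingCriteria` would read; none of these names come from the app.

```python
import queue
import threading

_SENTINEL = object()  # marks end-of-stream, like a streamer's end signal


def fake_generate(tokens, out_q, stop_flag):
    """Stand-in for model.generate(): push tokens until told to stop."""
    try:
        for tok in tokens:
            if stop_flag[0]:      # the check a StoppingCriteria would perform
                break
            out_q.put(tok)
    finally:
        out_q.put(_SENTINEL)      # always signal the end, even on early exit


def stop_and_drain(thread, out_q, stop_flag, timeout=5.0):
    """Set the flag, drain whatever is still queued, then join the worker."""
    stop_flag[0] = True
    drained = []
    while True:
        try:
            item = out_q.get(timeout=timeout)
        except queue.Empty:
            break                 # worker never signalled; give up waiting
        if item is _SENTINEL:
            break
        drained.append(item)
    thread.join(timeout=timeout)
    return drained
```

Draining before joining matters because a producer blocked on a full or unread queue may never reach its exit path; the drained tokens are also the partial output that gets stitched into the restarted context.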

### 2. Robust JSON Extraction

LLMs are unreliable JSON producers. The extraction pipeline has 4 fallback layers:

1. **Regex extraction**: Search for ` ```json ... ``` ` fenced blocks (last match preferred)
2. **Raw JSON detection**: Regex for `[...]` or `{...}` patterns
3. **`json.loads()`**: Standard parsing
4. **`jiter.from_json(partial_mode=True)`**: Rust-based partial JSON parser that can handle truncated arrays, missing closing brackets, and other malformations left by a killed generation

Additionally, if the generation is killed mid-stream, the app attempts to **salvage partial JSON** from whatever the model produced before being interrupted.
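A minimal sketch of the four layers, assuming model output arrives as a single string. `extract_json` is a hypothetical helper name, the unterminated-fence branch is an illustrative guess at what "salvaging" could look like, and `jiter` is treated as optional so the earlier layers still work without it.

```python
import json
import re

try:
    import jiter  # optional third-party partial parser
except ImportError:
    jiter = None

FENCE = "`" * 3  # built at runtime to keep this block fence-safe
FENCED_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
RAW_RE = re.compile(r"\[.*\]|\{.*\}", re.DOTALL)


def extract_json(text):
    """Return parsed JSON found in model output, or None if nothing parses."""
    fenced = FENCED_RE.findall(text)
    if fenced:
        candidate = fenced[-1]                           # layer 1: last fenced block
    elif FENCE + "json" in text:
        candidate = text.rsplit(FENCE + "json", 1)[1]    # fence opened, never closed
    else:
        raw = RAW_RE.search(text)                        # layer 2: raw [...] or {...}
        if raw is None:
            return None
        candidate = raw.group(0)
    try:
        return json.loads(candidate)                     # layer 3: strict parse
    except json.JSONDecodeError:
        pass
    if jiter is not None:                                # layer 4: partial parse
        try:
            return jiter.from_json(candidate.encode("utf-8"), partial_mode=True)
        except ValueError:
            return None
    return None
```

Preferring the last fenced block means a draft emitted earlier in the output is superseded by the final answer.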

### 3. Audio as Data URIs

A design constraint of deploying on HuggingFace Spaces: you can't easily serve dynamic audio files from disk to the frontend. 

**Solution**: Convert all TTS output to base64-encoded WAV data URIs (`data:audio/wav;base64,...`) and embed them directly in the flashcard HTML. Each card's audio is a self-contained data URI that plays via `new Audio(uri).play()` in the browser. This eliminates all file-serving concerns at the cost of a larger HTML payload: a single deck with 10 cards and audio is roughly 2-5 MB of base64-encoded HTML.
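As a rough illustration of the encoding step, the sketch below packs samples into WAV bytes with the standard library and wraps them in a data URI. The sine tone, the 44.1 kHz sample rate, and the helper names are assumptions for the example; real input would be whatever the TTS model returns.

```python
import base64
import io
import math
import struct
import wave


def wav_bytes(samples, sample_rate=44100):
    """Pack float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(pcm)
    return buf.getvalue()


def to_data_uri(wav):
    """Wrap raw WAV bytes in a data URI that an <audio> element can play."""
    return "data:audio/wav;base64," + base64.b64encode(wav).decode("ascii")


# One second of a 440 Hz tone stands in for real TTS output.
tone = [0.2 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
uri = to_data_uri(wav_bytes(tone))
```

Base64 turns every 3 bytes into 4 characters, so the embedded payload is about a third larger than the raw audio, which is worth keeping in mind when estimating deck size.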

### 4. Dual-Environment Architecture (`IS_HF`)

The app runs in two modes:
- **Local development**: Full debug logging to `./log/`, GPU on CUDA device, fixed ports
- **HuggingFace Spaces**: `tempfile.gettempdir()` for file I/O, `@spaces.GPU` decorators for ZeroGPU allocation, Playwright installed at runtime, all debug file writes disabled to prevent file descriptor exhaustion

The `IS_HF` flag is detected by trying to `import spaces`; if it succeeds, we're on HF Spaces.
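A minimal sketch of that detection and of guarding debug writes behind it. `LOG_DIR` and `debug_write` are hypothetical names used only for this example.

```python
import os
import tempfile

try:
    import spaces  # noqa: F401  (assumed to be importable only on HF Spaces)
    IS_HF = True
except ImportError:
    IS_HF = False

# Spaces: write under the system temp dir. Local: keep logs next to the app.
LOG_DIR = tempfile.gettempdir() if IS_HF else os.path.join(".", "log")


def debug_write(name, text, log_dir=LOG_DIR, enabled=not IS_HF):
    """Write a debug file when enabled; do nothing (return None) otherwise."""
    if not enabled:
        return None
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

One caveat with an import-based check: it reports a false positive on any machine where the `spaces` package happens to be installed locally.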

### 5. YouTube → Flashcards Pipeline

This was the most complex input pipeline:

```
YouTube URL → yt-dlp (first 5 min, WAV) → Cohere ASR → Korean text
                                                          │
                                                          ▼
                                                Korean-only filtering
                                                (regex: [가-힣ㄱ-ㅎㅏ-ㅣ])
                                                          │
                                                          ▼
                                              Qwen VLM (text-only mode)
                                                          │
                                                          ▼
                                                  Flashcards + TTS
```

Challenges:
- YouTube bot detection required optional cookies.txt support
- Cohere ASR sometimes returns English-only lines (song lyrics, UI text), which are filtered out using Korean Unicode range detection
- Audio extraction is limited to first 5 minutes to stay within GPU time limits (180s `@spaces.GPU` duration)
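The Korean-only filtering step can be sketched as a line-level filter: keep a transcript line only if it contains at least one Hangul syllable or jamo. The function name is hypothetical, and treating the transcript as newline-separated text is an assumption.

```python
import re

# Hangul syllables plus compatibility jamo (consonants and vowels).
HANGUL_RE = re.compile(r"[가-힣ㄱ-ㅎㅏ-ㅣ]")


def keep_korean_lines(transcript):
    """Drop lines with no Hangul at all; mixed Korean/English lines are kept."""
    kept = [ln.strip() for ln in transcript.splitlines() if HANGUL_RE.search(ln)]
    return "\n".join(kept)
```

Filtering by line rather than by character keeps loanwords and proper nouns inside otherwise Korean sentences, which still carry useful context for vocabulary extraction.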

### 6. The Flashcard SPA

The flashcard interface is a complete single-page application (~800 lines of HTML/CSS/JS) embedded via `<iframe srcdoc>`. It features:

- **Card flipping animation**: CSS `transform: rotateY(180deg)` with `backface-visibility: hidden`
- **Swipe navigation**: Previous/Next buttons with card counter
- **Audio playback**: One-click pronunciation via base64 data URI
- **Copy to clipboard**: Click-to-copy Korean text with visual feedback
- **Dark theme**: Fully self-contained styling that matches the parent Gradio theme
- **Responsive layout**: Works on desktop and mobile viewports

The entire SPA is generated server-side as a Python f-string with the vocabulary JSON baked in. This avoids any client-server communication for card data.
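A much-reduced sketch of the server-side build step, assuming each card is a dict with a `korean` field (a hypothetical schema). It shows the two escaping concerns that come with this approach: JSON placed inside a `<script>` tag, and a whole HTML page placed inside an attribute.

```python
import html
import json


def build_flashcard_iframe(cards, height=480):
    """Return an <iframe srcdoc> string with the deck baked into the page."""
    # Escape "</" so card text can never terminate the <script> element early.
    payload = json.dumps(cards, ensure_ascii=False).replace("</", "<\\/")
    page = f"""<!doctype html>
<html><body>
<div id="card"></div>
<script>
const CARDS = {payload};
document.getElementById("card").textContent = CARDS.length ? CARDS[0].korean : "";
</script>
</body></html>"""
    # The page lives inside an HTML attribute, so quotes and angle brackets
    # must be entity-escaped; the browser decodes them when loading srcdoc.
    escaped = html.escape(page, quote=True)
    return f'<iframe srcdoc="{escaped}" height="{height}" style="width:100%;border:0"></iframe>'
```

The returned string could then be handed to a Gradio HTML component; the real SPA adds the flip, navigation, audio, and quiz logic on top of the same baked-in `CARDS` array.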

---

## Challenges & Solutions

| Challenge | Solution |
|---|---|
| **Qwen thinks for 10K+ chars** | Multi-layered auto-force mechanism with configurable thresholds and manual override button |
| **LLM outputs malformed JSON** | 4-layer fallback: regex extraction → raw JSON detection → `json.loads` → `jiter` partial parser |
| **White backgrounds in Firefox dark theme** | Aggressive CSS `!important` overrides on every Gradio internal class including `.file-preview *`, `[data-testid="file"] *`, `.wrap.default`, etc. |
| **File descriptor errors on HF Spaces** | Disabled all debug file I/O behind `if not IS_HF:` guards |
| **Export downloads stuck at "processing"** | Switched from hardcoded file paths to `tempfile.mkstemp()` with unique names per export |
| **No audio on initial demo cards** | Added TTS generation loop at startup for `BOOTSTRAP_VOCAB` before `create_demo()` |
| **YouTube bot detection** | Optional `cookies.txt` upload support via yt-dlp's `cookiefile` parameter |
| **Gradio checkboxes invisible in dark theme** | Custom `appearance: none` checkbox CSS with gold gradient fill and ✓ pseudo-element |
| **GPU timeout on Spaces (ZeroGPU)** | `@spaces.GPU(duration=180)` decorators, 5-minute YouTube audio limit, configurable auto-force threshold |

---

## What I Learned

### On LLM Engineering

1. **Thinking models need leashes.** Qwen 3.5's `<think>` block is powerful but unpredictable. In production, you *must* have kill switches, auto-force thresholds, and timeout mechanisms. Unbounded thinking is a denial-of-service on your own GPU.

2. **Partial JSON parsing is essential.** If you're generating structured output from an LLM, invest in a robust partial parser (`jiter` was excellent). You will kill generation mid-token, and you need to salvage whatever was produced.

3. **`enable_thinking=False` is underrated.** The Qwen chat template has a built-in mechanism to skip the thinking chain entirely: it emits an empty `<think>\n\n</think>` block. For structured extraction tasks where you've already provided clear examples, non-think mode is 5-10x faster with comparable quality.

4. **Thread management for streaming is tricky.** The `TextIteratorStreamer` + threading model from `transformers` works but requires careful cleanup: drain the queue before joining, use stopping-criteria flags, and always call `streamer.end()` in a `finally` block.

### On Multimodal Pipelines

5. **VLMs handle mixed text+image extraction surprisingly well.** Qwen 3.5-9B can simultaneously read Korean text from a page render *and* understand the visual context (diagrams, tables, images) to produce better vocabulary than text-only extraction.

6. **ASR output needs post-filtering.** Korean ASR models (even good ones like Cohere's) produce mixed-language output on music, UI sounds, and English speech. A simple regex filter for Korean Unicode characters (`[가-힣]`) dramatically improves downstream quality.

7. **Audio as data URIs is a viable architecture.** For small-to-medium audio clips (1-3 seconds of TTS), base64 data URIs embedded directly in HTML eliminate all file-serving complexity. The payload size is manageable and the UX is seamless.

### On Gradio & HuggingFace Spaces

8. **Gradio's CSS is a battlefield.** The framework injects deeply nested internal styles that are nearly impossible to override cleanly. The only reliable approach is aggressive `!important` declarations on specific element selectors. Firefox is especially stubborn: it requires targeting `*` descendants of containers that Chrome handles transitively.

9. **ZeroGPU has hard time limits.** The `@spaces.GPU(duration=180)` decorator means your entire pipeline (extraction, retry logic, TTS generation) must complete in 3 minutes. This forces architectural decisions: limit input size, cap retry attempts, use auto-force thresholds.

10. **File I/O on Spaces is fragile.** Debug logging, temporary files, and export paths all need special handling. I used `tempfile.mkstemp()` for exports and `tempfile.gettempdir()` for log directories, and guarded all debug writes behind the `IS_HF` flag.
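For the export case, a minimal sketch of the `tempfile.mkstemp()` pattern: each call gets a fresh, uniquely named file, so concurrent users or repeated exports never collide on a hardcoded path. The function name and filename prefix are hypothetical.

```python
import json
import os
import tempfile


def export_deck_json(cards):
    """Write the deck to a uniquely named temp file and return its path."""
    fd, path = tempfile.mkstemp(prefix="localduo_deck_", suffix=".json")
    # mkstemp returns an already-open OS-level descriptor; wrap it so it is
    # closed when the block exits instead of leaking a file descriptor.
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(cards, fh, ensure_ascii=False, indent=2)
    return path
```

The returned path can be passed to a download component; closing the descriptor promptly also helps with the file descriptor exhaustion mentioned earlier.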

### On UI/UX

11. **Demo-ready defaults matter enormously.** Pre-loading a PDF example, an audio example, a BBC Korean URL, and a YouTube link, plus generating TTS audio for the bootstrap flashcards at startup, means the app is *immediately* impressive on first load. The 30 seconds of startup TTS generation pays for itself in user engagement.

12. **Dark themes need obsessive attention.** Every single Gradio component (file previews, checkboxes, tab navs, dropdowns, sliders, progress bars, scrollbars) needs explicit dark styling. Miss one and you get a jarring white rectangle in your otherwise polished UI.

---

## Models Used

| Model | Purpose | Notes |
|---|---|---|
| **Qwen/Qwen3.5-9B** | Vision-Language extraction & translation | Custom chat template with `enable_thinking` support. Multimodal (text + images). |
| **CohereLabs/cohere-transcribe-03-2026** | Korean speech-to-text | Used for YouTube and audio upload transcription. Runs on CPU. |
| **Supertonic-3** | Korean text-to-speech | Generates natural pronunciation audio. F1 voice style, 0.7 speed, 12-step denoising. |


---

## Final Reflection

Building LocalDuo taught me that the "last mile" of LLM applications, the gap between a working model and a polished product, is where most of the engineering happens. The model inference itself is perhaps 20% of the code. The other 80% is: input parsing across 5 formats, thread management for streaming, JSON robustness, CSS battles with Gradio, file I/O gymnastics on cloud platforms, and the thousand small UX decisions that make the difference between a demo and a tool someone would actually use to learn Korean.

The most satisfying moment was watching the full pipeline work end-to-end: paste an audio input → Cohere transcribes Korean speech → Qwen extracts vocabulary → Supertonic speaks each word → flashcards appear with audio playback. Three models, three modalities, one click.