Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.18.0
Architecture
TinyPress is built modular β each concern lives in its own place, nothing bleeds into something it shouldn't.
How a compression request flows
User Input (Gradio UI)
β
βΌ
core/compressor.py β builds the prompt, calls the model, trims if it overshoots
β
βΌ
models/model_loader.py β Qwen2.5-1.5B-Instruct, loaded once and reused
β
βΌ
core/scorer.py β checks how much meaning survived using all-MiniLM-L6-v2
β
βΌ
db/store.py β saves the run to SQLite
β
βΌ
ui/compress_tab.py β shows the result and metrics back to the user
What each module does
| Module | Responsibility |
|---|---|
app.py |
Starts everything β DB init, model load, Gradio launch |
config.py |
One place for all settings β model names, token limits, DB path, port |
ui/compress_tab.py |
The compression interface β input, slider, output, metrics |
ui/history_tab.py |
History view β past runs, averages, trends |
core/compressor.py |
Builds the compression prompt, runs generation, hard-trims if needed |
core/scorer.py |
Cosine similarity between original and compressed text |
core/tokenizer_utils.py |
Token counting and per-token string extraction using the LLM's own tokenizer |
core/diff.py |
Word-level SequenceMatcher diff β produces annotated HTML for the history side-by-side view |
models/model_loader.py |
Singleton model store β loads LLM + embedder on demand, supports hot-swapping both via switch_llm / switch_embedder |
db/store.py |
SQLite operations β init, save a run, fetch history, delete a run; auto-migrates older DBs |
db/schema.sql |
The compression_runs table definition |
A few decisions worth knowing
Models load once at startup. This matters on a laptop β you don't want to reload a 1.5B model on every request. Both the LLM and the embedder are held in memory after the first load.
Model hot-swapping without a restart. The Model Settings accordion in the UI lets you pick a different compression model or scoring embedder mid-session. Both switch_llm and switch_embedder in model_loader.py unload the current model (deletes the references, calls gc.collect, and flushes the CUDA cache if a GPU is present) before loading the new one β so you don't end up with two large models in memory at once.
Hard token trim as a safety net. If the model overshoots the target budget, the output gets trimmed at the tokenizer level. It's a fallback, not the primary path β the prompt already asks the model to stay within budget.
Thin UI layer. The Gradio handlers in ui/ don't contain logic. They take inputs, call into core/, and return outputs. All the real work happens in core/ and db/.
DB auto-migration. store.py runs ALTER TABLE β¦ ADD COLUMN for tokenizer, duration_ms, feedback, and feedback_comment on startup β so existing databases from earlier builds upgrade silently rather than crashing. feedback is nullable (INTEGER): NULL = no rating, 1 = π, -1 = π. feedback_comment holds the optional text note.
π README.md