nano-case
Converts a messy identifier to a target case style (snake, kebab, camel,
pascal, const) and, crucially, segments boundary-destroyed inputs (no
separators, one global case: sdkmodel, HTTPREQUESTHANDLER) that a regular
expression provably cannot split. The model is a ~1M-parameter (1,016,960)
byte-level transformer.
const | sdkmodel => SDK_MODEL
camel | usertablehandler => userTableHandler
kebab | sqlqueryname => sql-query-name
snake | md5cache => md5_cache
pascal | rendertokenerror => RenderTokenError
It runs on a CPU in milliseconds and was trained entirely on code-generated data: no scraping, no labelling, no distillation. Weights and a self-contained inference file are here and on Hugging Face.
Technical report (PDF) · Full writeup (gist)
Why a model and not a regex
Converting a clean identifier between cases is a solved, free problem: a regex
splits on separators and camel-humps and re-renders. So that slice has no value.
The value is the regex-killer slice: inputs where the separators and casing are
gone (userprofilecache, HTTPREQUESTHANDLER), leaving nothing to split on. The
only way back to the intended words is a learned vocabulary. That is what
nano-case has, and what a script cannot have.
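For reference, the kind of script baseline described above can be sketched in a few lines. This is a minimal illustration, not the converter used in the benchmark: `split_words`, `render`, and `regex_to_case` are hypothetical names, and the exact splitting rules of the benchmarked script may differ.

```python
import re

def split_words(identifier: str) -> list[str]:
    """Split on separators and camel humps; return lowercase words."""
    # lower/digit -> upper transition: "userTable" -> "user Table"
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    # acronym followed by a capitalised word: "HTTPRequest" -> "HTTP Request"
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", s)
    # any run of non-alphanumeric characters counts as a separator
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", s) if w]

def render(words: list[str], style: str) -> str:
    """Re-render a word list in the requested case style."""
    if not words:
        return ""
    if style == "snake":
        return "_".join(words)
    if style == "kebab":
        return "-".join(words)
    if style == "const":
        return "_".join(w.upper() for w in words)
    if style == "pascal":
        return "".join(w.capitalize() for w in words)
    if style == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    raise ValueError(f"unknown style: {style}")

def regex_to_case(style: str, text: str) -> str:
    return render(split_words(text), style)

print(regex_to_case("snake", "userTableHandler"))  # user_table_handler
print(regex_to_case("const", "sdkmodel"))          # SDKMODEL (one word: nothing to split on)
```

The second call shows the failure mode: with no separator and no case change, the script sees a single word and can only change its casing.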
Benchmark (held-out, seed 987654321, N=4000)
Exact-match accuracy, model vs a standard regex case-converter. Mean ± std over 3 training seeds (0/1/2).
| | model | regex script |
|---|---|---|
| overall | 99.8% ± 0.0% | 61.8% |
| smushed slice (N=1410) | 99.7% ± 0.0% | 8.2% |
The smushed slice is the regex-killer: boundary-destroyed, single-case, multi-word inputs. 8.2% for the script vs 99.7% for the model is the "you genuinely need a model here" result.
Reproduce: python eval_nano_case.py --n 4000.
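The metric itself is simple enough to sketch. The helpers below are hypothetical and are not taken from eval_nano_case.py. They assume the smushed slice means "no separators, one global case, more than one gold word" and that the reported spread is the sample standard deviation; the actual script may define either differently.

```python
from statistics import mean, stdev

def exact_match(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that equal the gold target character for character."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty evaluation set")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def is_smushed(text: str, n_gold_words: int) -> bool:
    """Assumed slice definition: multi-word, no separators, a single global case."""
    return n_gold_words > 1 and text.isalnum() and (text.islower() or text.isupper())

def summarize(per_seed_accuracy: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over training seeds."""
    return mean(per_seed_accuracy), stdev(per_seed_accuracy)

# Toy inputs for illustration only, not benchmark data.
print(exact_match(["sql_query", "userTable"], ["sql_query", "user_table"]))  # 0.5
print(is_smushed("HTTPREQUESTHANDLER", 3))  # True
print(is_smushed("user_table", 2))          # False
```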
Where it breaks (out-of-distribution)
nano-case's segmentation prior is its ~120-token training vocabulary. Honest limits, measured:
| input type | accuracy |
|---|---|
| in-vocab smushed (the trained slice) | 100% |
| out-of-vocabulary words smushed | 2% |
| chains longer than trained (5–6 words) | 33% |
So it nails smushing of known words and degrades on unknown words and very long chains: the expected ceiling of a 1M-parameter model on a vocabulary task, reported rather than hidden.
Use it
pip install -r requirements.txt
python modeling_nano_case.py # demo
from modeling_nano_case import load, to_case
m = load("model.safetensors", "config.json")
to_case(m, "const", "sdkmodel") # -> "SDK_MODEL"
to_case(m, "camel", "user_table_handler")  # -> "userTableHandler"
How it was trained
The data is code-generated: sample words from a fixed vocabulary → render the gold target canonically → corrupt a copy into a messy input (~45% boundary-destroyed). Training is SFT with the prompt masked, so only the target and its newline EOS are supervised. The model is a ~1M-parameter byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU), trained for 12k steps with AdamW and a cosine LR schedule. Full recipe and exact config are in TRAINING.md.
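The generate-then-mask recipe can be sketched roughly as below. This is not the code in data_cases.py. The word list is a handful of words borrowed from the examples in this README rather than the real vocabulary, the `style | input => ` prompt layout is assumed from those examples, the 4-word cap is a guess, and the non-smushed corruption is deliberately simplistic. The `-100` label is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

# Hypothetical stand-in for the fixed training vocabulary.
VOCAB = ["user", "table", "handler", "sql", "query", "name",
         "cache", "render", "token", "error", "sdk", "model"]
STYLES = ("snake", "kebab", "camel", "pascal", "const")

def render(words, style):
    """Render a word list canonically in one case style."""
    if style == "snake":
        return "_".join(words)
    if style == "kebab":
        return "-".join(words)
    if style == "const":
        return "_".join(w.upper() for w in words)
    if style == "pascal":
        return "".join(w.capitalize() for w in words)
    return words[0] + "".join(w.capitalize() for w in words[1:])  # camel

def corrupt(words, rng, p_smush=0.45):
    """Messy input: with probability p_smush, destroy every boundary."""
    if rng.random() < p_smush:
        joined = "".join(words)
        return joined.upper() if rng.random() < 0.5 else joined
    # Otherwise keep boundaries but use a random style (simplified).
    return render(words, rng.choice(STYLES))

def make_example(rng, max_words=4):
    """Return (prompt, completion); the completion ends in the newline EOS."""
    words = rng.sample(VOCAB, rng.randint(1, max_words))
    style = rng.choice(STYLES)
    prompt = f"{style} | {corrupt(words, rng)} => "
    return prompt, render(words, style) + "\n"

def supervised_labels(prompt, completion, ignore_index=-100):
    """Byte-level ids plus labels that mask out every prompt byte."""
    p, c = prompt.encode("utf-8"), completion.encode("utf-8")
    return list(p + c), [ignore_index] * len(p) + list(c)

rng = random.Random(0)
prompt, completion = make_example(rng)
input_ids, labels = supervised_labels(prompt, completion)
print(repr(prompt), repr(completion))
```

The usual one-position shift between inputs and labels for next-token prediction is left to the training loop.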
Files
- modeling_nano_case.py: self-contained model + load() / to_case() (torch + safetensors only).
- data_cases.py: the code data generator (shared by train and eval).
- eval_nano_case.py: the model-vs-regex benchmark.
- test_nano_case.py: labels-correct / no-leakage / determinism / published-weights regression.
- model.safetensors, config.json: weights + architecture.
- report/nano-case-report.pdf: the technical report.
- TRAINING.md: reproduction recipe.
License
MIT. Built by Vuk Rosić.