nano-case

Converts a messy identifier to a target case style: snake, kebab, camel, pascal, or const. Crucially, it also segments boundary-destroyed inputs, which have no separators and a single global case (sdkmodel, HTTPREQUESTHANDLER) and which a regular expression cannot split because no boundary signal is left in the characters. The model is a ~1M-parameter (1,016,960) byte-level transformer.

const  | sdkmodel          => SDK_MODEL
camel  | usertablehandler  => userTableHandler
kebab  | sqlqueryname      => sql-query-name
snake  | md5cache          => md5_cache
pascal | rendertokenerror  => RenderTokenError

It runs on a CPU in milliseconds and was trained entirely on code-generated data: no scraping, no labelling, no distillation. The weights and a self-contained inference file are here and on Hugging Face.

📄 Technical report (PDF) · 📝 Full writeup (gist)

Why a model and not a regex

Converting a clean identifier between cases is a solved, free problem: a regex splits on separators and camel humps, then re-renders. That slice therefore has no value. The value is the regex-killer slice: inputs where the separators and casing are gone (userprofilecache, HTTPREQUESTHANDLER), leaving nothing to split on. The only way back to the intended words is a learned vocabulary. That is what nano-case has, and what a script cannot have.
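To make the contrast concrete, here is a minimal sketch of the kind of regex converter described above. It is an illustration written for this README, not the baseline script used in the benchmark; the names `split_words` and `render` are hypothetical.

```python
import re

# Word pattern: an acronym run before a capitalised word, a (capitalised)
# lowercase word, a leftover uppercase run, or a digit run.
# Separators such as "_" and "-" match nothing and are skipped.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_words(identifier: str) -> list[str]:
    """Split on separators and camel humps; return lowercase words."""
    return [w.lower() for w in _WORD.findall(identifier)]


def render(words: list[str], style: str) -> str:
    """Re-render a word list in one of the five target styles."""
    if not words:
        return ""
    if style == "snake":
        return "_".join(words)
    if style == "kebab":
        return "-".join(words)
    if style == "const":
        return "_".join(w.upper() for w in words)
    caps = [w.capitalize() for w in words]
    if style == "pascal":
        return "".join(caps)
    if style == "camel":
        return words[0] + "".join(caps[1:])
    raise ValueError(f"unknown style: {style}")


# Clean input: boundaries are visible, so the regex recovers the words.
print(render(split_words("userTableHandler"), "kebab"))  # user-table-handler
# Smushed input: one undivided run, so the "conversion" is only a case change.
print(render(split_words("sdkmodel"), "const"))          # SDKMODEL
```

On the smushed input the splitter returns a single word, so the const rendering is SDKMODEL rather than the intended SDK_MODEL.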

Benchmark (held-out, seed 987654321, N=4000)

Exact-match accuracy, model vs a standard regex case-converter. Mean ± std over 3 training seeds (0/1/2).

                        | model         | regex script
overall                 | 99.8% ± 0.0%  | 61.8%
smushed slice (N=1410)  | 99.7% ± 0.0%  | 8.2%

The smushed slice is the regex-killer: boundary-destroyed, single-case, multi-word inputs. 8.2% for the script vs 99.7% for the model is the "you genuinely need a model here" result.

Reproduce: python eval_nano_case.py --n 4000.
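For readers who want to check the metric by hand, the sketch below shows how exact-match accuracy and a mean ± std over seeds can be computed. It is not taken from eval_nano_case.py; the function names are hypothetical, and using the population standard deviation is an assumption, since the text above does not say which variant is reported.

```python
from statistics import mean, pstdev


def exact_match(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that equal the gold string character for character."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(p == g for p, g in zip(predictions, golds))
    return hits / len(golds)


def summarize(per_seed_accuracy: list[float]) -> tuple[float, float]:
    """Mean and population std over training seeds (assumed variant)."""
    return mean(per_seed_accuracy), pstdev(per_seed_accuracy)


# Toy data, not benchmark results: one of two predictions matches.
acc = exact_match(["SDK_MODEL", "sql_queryname"], ["SDK_MODEL", "sql_query_name"])
print(acc)  # 0.5
```

Exact match is deliberately strict: a single misplaced boundary, as in sql_queryname, counts as a full miss.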

Where it breaks (out-of-distribution)

nano-case's segmentation prior is its ~120-token training vocabulary. Honest limits, measured:

input type                              | accuracy
in-vocab smushed (the trained slice)    | 100%
out-of-vocabulary words smushed         | 2%
chains longer than trained (5–6 words)  | 33%

So it nails smushing of known words and degrades on unknown tokens and very long chains. This is the expected ceiling of a 1M model on a vocabulary task, reported rather than hidden.
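Given these limits, callers may want a cheap sanity check on outputs for inputs that could be out of distribution. The sketch below is a suggestion, not part of the repository. It relies on one invariant of case conversion: ignoring separators and case, the output must contain exactly the input's characters. The names `is_plausible` and `convert_or_keep` are hypothetical, and `convert` is any callable taking `(style, text)`.

```python
from typing import Callable


def _letters(s: str) -> str:
    """Lowercased letters and digits only; separators are dropped."""
    return "".join(ch.lower() for ch in s if ch.isalnum())


def is_plausible(messy: str, converted: str) -> bool:
    """True if the conversion changed only case and separators."""
    return _letters(messy) == _letters(converted)


def convert_or_keep(convert: Callable[[str, str], str], style: str, messy: str) -> str:
    """Return the converted name, or the input unchanged if the check fails."""
    out = convert(style, messy)
    return out if is_plausible(messy, out) else messy


print(is_plausible("sdkmodel", "SDK_MODEL"))  # True
print(is_plausible("sdkmodel", "SDK_MODE"))   # False: a character was dropped
```

With the API shown under "Use it", the callable could be `lambda style, s: to_case(m, style, s)`. The check catches outputs that add, drop, or alter characters; it cannot catch a boundary placed between the wrong letters, so it narrows the failure modes rather than removing them.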

Use it

From a shell:

pip install -r requirements.txt
python modeling_nano_case.py             # demo

From Python:

from modeling_nano_case import load, to_case
m = load("model.safetensors", "config.json")
to_case(m, "const", "sdkmodel")            # -> "SDK_MODEL"
to_case(m, "camel", "user_table_handler")  # -> "userTableHandler"

How it was trained

The data is code-generated: sample words from a fixed vocabulary → render the gold target canonically → corrupt a copy into a messy input (~45% boundary-destroyed). Training is SFT with the prompt masked, so only the target and the newline EOS are supervised. The model is a ~1M-param byte-level transformer (RMSNorm, RoPE, GQA, SwiGLU), trained for 12k steps with AdamW and a cosine LR schedule. Full recipe and exact config are in TRAINING.md.
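The generation pipeline and the prompt masking can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in data_cases.py: the vocabulary, the separator set, the 2-to-4 word range, and the 0.3 and 0.5 probabilities are all hypothetical. Only the three-step structure, the ~45% boundary-destroyed rate, and the supervised target plus newline EOS come from the description above.

```python
import random

STYLES = ["snake", "kebab", "camel", "pascal", "const"]
# Hypothetical stand-in for the real training vocabulary.
VOCAB = ["user", "table", "handler", "sql", "query", "name", "cache", "token"]


def render_gold(words: list[str], style: str) -> str:
    """Step 2: render the canonical target for a word list."""
    if style == "snake":
        return "_".join(words)
    if style == "kebab":
        return "-".join(words)
    if style == "const":
        return "_".join(w.upper() for w in words)
    caps = [w.capitalize() for w in words]
    if style == "pascal":
        return "".join(caps)
    return words[0] + "".join(caps[1:])  # camel


def corrupt(words: list[str], rng: random.Random, p_smush: float = 0.45) -> str:
    """Step 3: produce a messy input; with probability p_smush destroy all boundaries."""
    if rng.random() < p_smush:
        joined = "".join(words)  # no separators, one global case
        return joined.upper() if rng.random() < 0.5 else joined
    sep = rng.choice(["_", "-", " ", "."])  # assumed separator set
    return sep.join(w.upper() if rng.random() < 0.3 else w for w in words)


def make_example(rng: random.Random) -> tuple[str, str, str]:
    """Step 1 plus steps 2 and 3: returns (style, messy input, gold target)."""
    words = rng.sample(VOCAB, k=rng.randint(2, 4))
    style = rng.choice(STYLES)
    return style, corrupt(words, rng), render_gold(words, style)


def loss_mask(prompt: bytes, target: bytes) -> list[int]:
    """0 for prompt bytes, 1 for target bytes and the trailing newline EOS."""
    return [0] * len(prompt) + [1] * (len(target) + 1)


rng = random.Random(0)
style, messy, gold = make_example(rng)
print(style, repr(messy), "=>", gold)
```

Because the gold target is rendered from the sampled words before corruption, every label is correct by construction, which is what makes training without scraping or manual labelling possible.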

Files

  • modeling_nano_case.py: self-contained model + load() / to_case() (torch + safetensors only).
  • data_cases.py: the code data generator (shared by train and eval).
  • eval_nano_case.py: the model-vs-regex benchmark.
  • test_nano_case.py: labels-correct / no-leakage / determinism / published-weights regression.
  • model.safetensors, config.json: weights + architecture.
  • report/nano-case-report.pdf: the technical report.
  • TRAINING.md: reproduction recipe.

License

MIT. Built by Vuk Rosić.
