Escarda-86M-Base
Escarda-86M-Base is a ~86M-parameter, from-scratch decoder-only language model, the
base sibling of Quazim0t0/Escarda-86M
(the chat-tuned model). It shares the same SpikeWhaleLM architecture (Multi-head Latent
Attention, an n-gram "engram" memory, hash-lookup layers, hyper-connections, an HRM
refinement step, and JEPA / multi-token-prediction training objectives) and the same
custom ChatML-aware tokenizer.
This checkpoint is a JEPA-distilled base. It is best used as a starting point for continued pretraining / fine-tuning rather than as a chat assistant.
Related models: SFT / chat model: Quazim0t0/Escarda-86M · live demo: Escarda-86M-Chat Space
Trained using Modal's credits during the Small Models, Big Adventures Hackathon.
Model summary
| Property | Value |
|---|---|
| Parameters | ~85.7M (tie_word_embeddings=True) |
| Type | Decoder-only LM (SpikeWhaleLM, model_type: spike_whale) |
| Hidden size / layers | 640 / 16 |
| Attention | 10 heads (head_dim=64), 1 KV head (MQA), MLA low-rank Q/O, decoupled RoPE(16)+NoPE(48), QK-norm |
| Context length | 4096 tokens |
| Vocab | 16,512 (custom length-max tokenizer) |
| License | Apache-2.0 |
For the full architecture description see the chat model's card.
Evaluation
Zero-shot, scored by continuation log-likelihood in the lm-eval-harness style over full
splits. byte_ppl is exp(sum_NLL_nats / total_UTF8_bytes) on WikiText-2 test
(tokenizer-independent). BLiMP is the fraction of minimal pairs with logprob(good) > logprob(bad)
(12 paradigms × 150). Stderr is binomial sqrt(p(1-p)/n).
⚠️ Produced with a local harness that approximates lm-eval-harness (same scoring method; prompt/normalization differ slightly). Treat sub-0.02 gaps as noise.
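The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for this card: the function names and the toy inputs are assumptions, and only the formulas themselves come from the text.

```python
import math

def byte_perplexity(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """exp(sum_NLL_nats / total_UTF8_bytes); independent of the tokenizer."""
    return math.exp(total_nll_nats / total_utf8_bytes)

def blimp_accuracy(pairs) -> float:
    """Fraction of (logprob_good, logprob_bad) pairs where the good sentence wins."""
    wins = sum(1 for good, bad in pairs if good > bad)
    return wins / len(pairs)

def binomial_stderr(p: float, n: int) -> float:
    """Binomial standard error sqrt(p(1-p)/n) of an accuracy p over n items."""
    return math.sqrt(p * (1.0 - p) / n)

# Toy inputs (hypothetical log-probabilities, for illustration only).
toy_pairs = [(-10.2, -11.0), (-8.5, -8.1), (-7.0, -9.3), (-12.4, -12.9)]
toy_acc = blimp_accuracy(toy_pairs)  # 3 of 4 pairs prefer the good sentence

# Reproducing a stderr from the card: ArithMark acc = 0.2536 over n = 2,500.
arith_se = binomial_stderr(0.2536, 2500)
print(round(arith_se, 4))  # 0.0087
```

The byte count is taken over the UTF-8 encoding of the raw text, which is what makes byte_ppl comparable across models with different vocabularies.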
Language modeling
| Metric | Value |
|---|---|
| WikiText-2 byte_ppl ↓ | 2.2228 |
| BLiMP acc ↑ | 0.7144 |
Multiple-choice suite
| Task | acc | ± | acc_norm | ± |
|---|---|---|---|---|
| arc_easy | 0.3801 | 0.0100 | 0.3615 | 0.0099 |
| arc_challenge | 0.1886 | 0.0114 | 0.2235 | 0.0122 |
| hellaswag | 0.2759 | 0.0045 | 0.2832 | 0.0045 |
| winogrande | 0.5162 | 0.0140 | n/a | n/a |
| piqa | 0.5843 | 0.0115 | 0.5631 | 0.0116 |
| openbookqa | 0.1300 | 0.0150 | 0.2500 | 0.0194 |
| boolq | 0.5138 | 0.0087 | n/a | n/a |
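The acc and acc_norm columns can disagree because they pick the answer differently. The sketch below assumes the common lm-eval-harness convention: acc takes the choice with the highest summed continuation log-likelihood, while acc_norm divides each log-likelihood by the length of the choice first. Normalizing by UTF-8 byte length is an assumption here; the local harness used for this card may normalize slightly differently. The function names and toy values are hypothetical.

```python
def pick_acc(logprobs):
    """Index of the choice with the highest raw continuation log-likelihood."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])

def pick_acc_norm(logprobs, choices):
    """Index of the choice with the highest log-likelihood per UTF-8 byte."""
    return max(
        range(len(logprobs)),
        key=lambda i: logprobs[i] / len(choices[i].encode("utf-8")),
    )

# Toy example: the longer answer has a lower total log-likelihood
# but a higher per-byte log-likelihood.
choices = ["yes", "absolutely not"]
logprobs = [-4.0, -7.0]

raw_choice = pick_acc(logprobs)                 # selects index 0 ("yes")
norm_choice = pick_acc_norm(logprobs, choices)  # selects index 1 ("absolutely not")
```

Length normalization removes the bias toward short answers, which is why tasks with uneven answer lengths (such as openbookqa above) can show a large gap between the two columns.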
ArithMark-2.0 (AxiomicLabs)
| Metric | Value |
|---|---|
| acc | 0.2536 ± 0.0087 |
| acc_norm | 0.2348 ± 0.0085 |
n = 2,500 · chance = 0.25.
Note: as a distilled base, this checkpoint has the lowest byte-perplexity of the Escarda family but trades off downstream task accuracy, a good reminder that perplexity alone is not a reliable capability ranking. For the strongest chat behaviour use Escarda-86M; use this model when you want a low-loss base to continue pretraining or fine-tune.
Training & token budget
- Tokens: ~20B (from-scratch pretraining of the SpikeWhale base, ~28k steps); this checkpoint is a JEPA-distilled snapshot of that base.
- Token/param ratio: ~233 tokens/param (20B / 85.7M), roughly 11-12× the Chinchilla ~20-tokens/param compute-optimal heuristic, i.e. a deliberately over-trained small model (the inference-efficient trade-off).
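The ratio quoted above is simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the inputs are the ~20B token and ~85.7M parameter figures from this card.

```python
tokens = 20e9            # ~20B training tokens (from the card)
params = 85.7e6          # ~85.7M parameters (from the card)
chinchilla_ratio = 20.0  # ~20 tokens/param compute-optimal heuristic

tokens_per_param = tokens / params
multiple_of_chinchilla = tokens_per_param / chinchilla_ratio

print(f"{tokens_per_param:.0f} tokens/param")           # 233 tokens/param
print(f"{multiple_of_chinchilla:.1f}x the heuristic")   # 11.7x the heuristic
```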
Fitting the Chinchilla data term to this model's own pretraining loss curve gives:
L(D) ≈ 2.611 + 77,715 · D^(-0.537) (nats/token, R² = 0.92)
From that fit:
- Compute-optimal tokens for this 86M size ≈ 4.3B; the 20B run is ~4.6× past compute-optimal.
- Diminishing-returns knee ≈ 22.5B tokens (where +1B tokens buys < 0.005 nats); the 20B stopping point lands right at the knee, a well-judged budget.
- The model is parameter-bound, not data-bound at 20B: the capacity term (~0.82 nats) exceeds the data term (0.54), so extra tokens help little. Doubling to 40B is projected to lower loss only ~0.07 nats (7% perplexity) with negligible downstream gain; the lever for better quality is more parameters, not more tokens. (This is also why, as a distilled base, it reaches the lowest perplexity of the family without the best downstream scores: it is already at its data-term floor.)
Caveats: single-size fit (folds irreducible loss + capacity floor into one constant); the cosine-LR decay inflates the fitted exponent, so treat β as an upper bound; token counts are anchored to the ~20B figure and scale linearly if that differs.
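The projections drawn from the fit can be checked by evaluating the fitted curve directly. The sketch below assumes D is measured in raw tokens (the reading under which the doubling projection reproduces); the constant and function names are illustrative, and only the three fitted numbers come from the card.

```python
import math

FLOOR = 2.611   # fitted constant (irreducible loss + capacity floor), nats/token
B = 77_715.0    # fitted data-term coefficient
BETA = 0.537    # fitted data-term exponent

def fitted_loss(d_tokens: float) -> float:
    """L(D) = FLOOR + B * D^(-BETA), with D assumed to be in raw tokens."""
    return FLOOR + B * d_tokens ** (-BETA)

def marginal_gain(d_tokens: float, step: float = 1e9) -> float:
    """Loss reduction (nats/token) bought by `step` extra tokens at D."""
    return fitted_loss(d_tokens) - fitted_loss(d_tokens + step)

# Projected effect of doubling the budget from 20B to 40B tokens.
gain_doubling = fitted_loss(20e9) - fitted_loss(40e9)
ppl_reduction = 1.0 - math.exp(-gain_doubling)

# Marginal value of +1B tokens at the reported knee.
gain_at_knee = marginal_gain(22.5e9)

print(f"doubling gain: {gain_doubling:.3f} nats, ppl -{ppl_reduction:.1%}")
print(f"+1B tokens at 22.5B: {gain_at_knee:.4f} nats")
```

Because the data term decays as a power law, each additional billion tokens buys less than the one before it, which is the sense in which the knee marks diminishing returns.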
Usage
Custom architecture: load with trust_remote_code=True (the modeling code ships in this
repo via auto_map):

    from transformers import AutoModelForCausalLM

    # trust_remote_code=True is required because the SpikeWhaleLM classes ship in this repo
    model = AutoModelForCausalLM.from_pretrained(
        "Quazim0t0/Escarda-86M-Base", trust_remote_code=True)

The tokenizer is the custom SpikeTokenizer (tokenizer.json, algorithm: length-max);
load it with the spike_tokenizer.py helper from the project rather than AutoTokenizer.
Acknowledgements
Built with Modal credits during the Small Models, Big Adventures Hackathon, and released to the community as a base to build on.