---
base_model: schneewolflabs/A0i-12B
datasets:
- schneewolflabs/BigDenker-SFT
library_name: transformers
pipeline_tag: text-generation
tags:
- reasoning
- thinking
- chain-of-thought
- mistral
- sft
- lora
license: apache-2.0
---
# A1

A1 is a reasoning-tuned version of [`schneewolflabs/A0i-12B`](https://huggingface.co/schneewolflabs/A0i-12B) (Mistral Nemo–class, 12B). It was trained with supervised fine-tuning (SFT) on [`schneewolflabs/BigDenker-SFT`](https://huggingface.co/datasets/schneewolflabs/BigDenker-SFT) to produce an explicit `<think>…</think>` chain of thought before its final answer.

## What's different from the base model

The base A0i-12B does not produce a reasoning trace: given a prompt, it answers directly. A1 first writes its reasoning inside `<think></think>` and then gives the answer, following the Qwen3 thinking convention.

The reasoning tokens were added **without resizing the vocabulary**. A0i's tokenizer ships with 986 unused reserved slots (`<SPECIAL_14>`…`<SPECIAL_999>`); ten of these were repurposed in place. Their token IDs are unchanged, so the embedding matrices were *not* resized:

| ID | token | ID | token |
|----|-------|----|-------|
| 14 | `<think>` | 19 | `</tool_response>` |
| 15 | `</think>` | 20 | `<\|vision_start\|>` |
| 16 | `<tool_call>` | 21 | `<\|vision_end\|>` |
| 17 | `</tool_call>` | 22 | `<\|image_pad\|>` |
| 18 | `<tool_response>` | 23 | `<\|video_pad\|>` |

Before training, these rows (zero or untrained in the base model) were initialized to the mean of the embeddings of the sub-tokens that each new token's surface string splits into, plus a small symmetry-breaking perturbation. The mean was computed separately for `embed_tokens` and `lm_head`, which are untied. The rows were then trained as part of the fine-tune.
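
The initialization described above can be illustrated without any framework. This is a minimal sketch, not the code used to train A1: the function name `mean_init_row`, the noise scale, the seed, and the toy 4-dimensional vectors are all assumptions made for illustration.

```python
import random


def mean_init_row(sub_token_rows, noise_std=1e-3, seed=0):
    """Return the element-wise mean of the given rows plus small Gaussian noise.

    sub_token_rows: embedding vectors of the sub-tokens that a new token's
    surface string (for example "<think>") splits into under the original
    tokenizer. noise_std and seed are hypothetical illustration values.
    """
    if not sub_token_rows:
        raise ValueError("need at least one sub-token row")
    rng = random.Random(seed)
    dim = len(sub_token_rows[0])
    count = len(sub_token_rows)
    mean = [sum(row[i] for row in sub_token_rows) / count for i in range(dim)]
    # Symmetry-breaking perturbation so repurposed rows do not start out identical.
    return [m + rng.gauss(0.0, noise_std) for m in mean]


# Toy rows standing in for the sub-token embeddings of one surface string.
toy_rows = [
    [0.2, -0.1, 0.0, 0.4],
    [0.0, 0.3, -0.2, 0.0],
    [0.4, 0.1, 0.2, -0.1],
]
new_row = mean_init_row(toy_rows)
print(new_row)  # close to [0.2, 0.1, 0.0, 0.1], with a small perturbation
```

Because `embed_tokens` and `lm_head` are untied, a routine like this would be applied once per matrix, each time using that matrix's own sub-token rows.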

> Note: the vision and tool tokens exist for chat-template completeness. A1 is a **text-only** model and was not trained on vision or tool-calling data.

## Usage

A1 uses a Qwen3-style chat template (bundled as `chat_template.jinja`). The template injects `<think>\n` as the assistant prefix, so the model continues with its reasoning, emits `</think>`, and then gives the answer. **Always use the chat template**: without it, the model is not prompted to reason.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("schneewolflabs/A1")
model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A1", dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "A bat and ball cost $1.10. The bat is $1.00 more than the ball. How much is the ball?"},
]

# The bundled chat template appends the assistant prefix, including the opening <think>.
enc = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Greedy decoding; reasoning traces can be long, so allow a generous token budget.
out = model.generate(**enc, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens and keep markers such as </think> visible.
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=False))
```
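
The output has the form `…reasoning…</think>\n\n…final answer…<|im_end|>`. The opening `<think>` does not appear in it because the template supplies that tag as part of the prompt.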
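
Given that output form, the reasoning and the final answer can be separated with plain string handling. The helper below is a sketch: the name `split_reasoning` and the sample string are hypothetical, and it assumes the decoded text starts inside the reasoning block, as described above.

```python
def split_reasoning(generated, end_tag="</think>", eos="<|im_end|>"):
    """Split decoded completion text into (reasoning, answer)."""
    # Drop the end-of-turn marker and anything after it.
    text = generated.split(eos, 1)[0]
    reasoning, sep, answer = text.partition(end_tag)
    if not sep:
        # No closing tag, e.g. generation stopped at max_new_tokens mid-reasoning.
        return text.strip(), ""
    return reasoning.strip(), answer.strip()


# Hypothetical completion text in the documented output form.
sample = "The ball costs x, the bat x + 1.00.</think>\n\nThe ball costs $0.05.<|im_end|>"
reasoning, answer = split_reasoning(sample)
print(answer)  # The ball costs $0.05.
```

Returning an empty answer when `</think>` is missing makes truncated generations easy to detect, which matters because reasoning traces can exhaust `max_new_tokens`.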

## Training

- **Method:** SFT, 1 epoch, via the Merlina training system (grimoire `SFTLoss`, prompt-masked completion).
- **Adaptation:** LoRA (r=64, α=128, dropout 0.05) on attention + MLP projections, plus `embed_tokens` and `lm_head` as fully trained `modules_to_save` (required so the repurposed token rows actually learn).
- **Hyperparameters:** lr 2e-5 (cosine, 5% warmup), effective batch 16 (bs 1 × grad-accum 16), max sequence length 4096, bf16, seed 42.
- **Split:** 90% train / 10% held-out eval (random, seed 42).
- **Result:** train loss 1.22 → 0.687; held-out eval loss ≈ 0.653 (eval loss is below train loss, so there is no sign of overfitting after 1 epoch).
- **Design rationale:** the conservative learning rate and the semantic embedding initialization were chosen to add reasoning while limiting drift of the base model's general token representations, since the full `embed_tokens` and `lm_head` matrices were trainable.
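
The arithmetic behind these hyperparameters can be sketched in a few lines. The values come from the list above; the exact schedule shape (linear warmup followed by cosine decay to zero), the function name `lr_at`, and the 1000-step horizon are assumptions, since the card only states "cosine, 5% warmup" and does not give the step count.

```python
import math

PEAK_LR = 2e-5
WARMUP_FRACTION = 0.05
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 16
LORA_R, LORA_ALPHA = 64, 128

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # 16 sequences per optimizer step
lora_scale = LORA_ALPHA / LORA_R                 # conventional LoRA scaling alpha / r = 2.0


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical 1000-step run: warmup covers the first 50 steps.
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at(step, 1000))
```

With α/r = 2 the adapter update is applied at twice its raw magnitude, which is one reason a low peak learning rate such as 2e-5 is a cautious pairing.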

## Evaluation notes

Behavioral checks show coherent step-by-step reasoning that **generalizes beyond the training distribution**. For example, the model solves the bat-and-ball problem correctly ($0.05) and explicitly rejects the common intuitive-trap answer.
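
For reference, the expected answer can be checked by hand or with a few lines of arithmetic. The sketch below works in integer cents to avoid floating-point rounding; the variable and function names are illustrative only.

```python
TOTAL_CENTS = 110       # bat + ball
DIFFERENCE_CENTS = 100  # bat - ball

# ball + (ball + difference) = total  =>  ball = (total - difference) / 2
ball = (TOTAL_CENTS - DIFFERENCE_CENTS) // 2
bat = ball + DIFFERENCE_CENTS


def satisfies(ball_cents):
    """True if a candidate ball price meets both constraints of the puzzle."""
    bat_cents = ball_cents + DIFFERENCE_CENTS
    return ball_cents + bat_cents == TOTAL_CENTS


print(ball, bat)                    # 5 105
print(satisfies(5), satisfies(10))  # True False
```

The intuitive answer of 10 cents fails the check because the bat would then cost $1.10 and the pair $1.20.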

## Limitations

- **Always-on thinking:** the template starts every assistant turn with `<think>`, so the model reasons even on trivial prompts. A non-thinking path exists via the template (`enable_thinking=False` injects an empty `<think></think>`), but the model was not specifically tuned for it.
- **Single-source SFT:** trained on one dataset and style (BigDenker), so reasoning phrasing is fairly homogeneous.
- **One epoch / conservative LR:** a deliberate, safe first pass that was not exhaustively tuned.
- **Inherited limitations:** the model inherits all limitations and biases of the base model and the SFT data.
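
The non-thinking path mentioned in the first item can be requested when rendering the prompt. The sketch below relies on `apply_chat_template` forwarding extra keyword arguments to the Jinja template; the wrapper name `render_prompt` is hypothetical, and the exact text the bundled template emits for `enable_thinking=False` has not been verified here.

```python
def render_prompt(tokenizer, messages, thinking=True):
    """Render the chat prompt as text, optionally requesting the non-thinking path.

    Extra keyword arguments to apply_chat_template are passed through to the
    chat template, which is how `enable_thinking` reaches it.
    """
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
        enable_thinking=thinking,
    )


# Usage (downloads the tokenizer, so it is left commented out here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("schneewolflabs/A1")
# print(render_prompt(tok, [{"role": "user", "content": "Hi"}], thinking=False))
```

Printing the rendered prompt for both settings is a quick way to confirm what the template injects before relying on the untuned non-thinking path.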

## Provenance

Base: `schneewolflabs/A0i-12B` · Data: `schneewolflabs/BigDenker-SFT` · Tokenizer/template: repurposed reserved tokens + Qwen3 chat template.