Instructions to use paiml/albor-370m-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use paiml/albor-370m-v1 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="paiml/albor-370m-v1",
	filename="albor-370m-v1-q4k.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use paiml/albor-370m-v1 with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf paiml/albor-370m-v1
# Run inference directly in the terminal:
llama cli -hf paiml/albor-370m-v1

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf paiml/albor-370m-v1
# Run inference directly in the terminal:
llama cli -hf paiml/albor-370m-v1

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf paiml/albor-370m-v1
# Run inference directly in the terminal:
./llama-cli -hf paiml/albor-370m-v1

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf paiml/albor-370m-v1
# Run inference directly in the terminal:
./build/bin/llama-cli -hf paiml/albor-370m-v1

Use Docker

docker model run hf.co/paiml/albor-370m-v1

vLLM

How to use paiml/albor-370m-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "paiml/albor-370m-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "paiml/albor-370m-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
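
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the base URL assumes the vLLM default port 8000 (the llama.cpp examples elsewhere on this page use port 8080 instead).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server, for example the `vllm serve` command above):
# print(chat("http://localhost:8000/v1", "paiml/albor-370m-v1",
#            "What is the capital of France?"))
```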

Use Docker

docker model run hf.co/paiml/albor-370m-v1

Ollama
How to use paiml/albor-370m-v1 with Ollama:
```
ollama run hf.co/paiml/albor-370m-v1
```

Unsloth Studio

How to use paiml/albor-370m-v1 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for paiml/albor-370m-v1 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for paiml/albor-370m-v1 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for paiml/albor-370m-v1 to start chatting

Pi

How to use paiml/albor-370m-v1 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf paiml/albor-370m-v1

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "paiml/albor-370m-v1"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use paiml/albor-370m-v1 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf paiml/albor-370m-v1

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default paiml/albor-370m-v1

Run Hermes

hermes

OpenClaw

How to use paiml/albor-370m-v1 with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf paiml/albor-370m-v1

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "paiml/albor-370m-v1" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use paiml/albor-370m-v1 with Docker Model Runner:
```
docker model run hf.co/paiml/albor-370m-v1
```

Lemonade

How to use paiml/albor-370m-v1 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull paiml/albor-370m-v1

Run and chat with the model

lemonade run user.albor-370m-v1-{{QUANT_TAG}}

List all available models

lemonade list

albor-370m-v1

Stack-existence-proof model for the Sovereign AI Stack. This is a 494M-parameter Qwen2 architecture trained end-to-end with the Rust-only Sovereign AI Stack — no PyTorch, no Python training loop, no HuggingFace transformers — purely aprender + entrenar + trueno + realizar.

Intended use: stack capability proof, NOT a production code-completion model. See §88 framing below.

Model description

albor-370m-v1 is a 494M-parameter Qwen2-architecture transformer trained on a 49.6B-token Python corpus using the aprender Sovereign AI Stack. The training pipeline (aprender-train ≈ entrenar) was used end-to-end:

Tokenization: apr tokenize encode-corpus (BPE NFC, Qwen2 vocab, 151,936 tokens)
Architecture: 24 layers × 14 attention heads × 2 KV heads × 896 hidden dim
Initialization: Qwen/Qwen2.5-Coder-0.5B-Instruct (via apr convert)
Training: apr pretrain --device cuda on RTX 4090 (cuBLAS TF32 forward + custom backward kernels)
Checkpoint format: .apr (aprender native, row-major, no PyTorch dependency)
Inference: apr run model.apr "..." (realizar inference engine)
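
The 494M figure can be sanity-checked from the architecture numbers. The sketch below assumes the usual Qwen2 layout (biases on the q/k/v projections only, a three-matrix gated MLP, two RMSNorm weights per layer, tied input/output embeddings) and takes the 4864 intermediate size from the training table; `Qwen2Shape` and `count_parameters` are illustrative names, not part of the apr tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Qwen2Shape:
    layers: int = 24
    hidden: int = 896
    heads: int = 14
    kv_heads: int = 2
    intermediate: int = 4864  # from the training table
    vocab: int = 151_936


def count_parameters(s: Qwen2Shape, tied_embeddings: bool = True) -> int:
    """Count weights under the assumed Qwen2 layout described above."""
    head_dim = s.hidden // s.heads          # 64
    kv_dim = s.kv_heads * head_dim          # 128 (grouped-query attention)
    attn = (
        s.hidden * s.hidden + s.hidden      # q_proj weight + bias
        + 2 * (s.hidden * kv_dim + kv_dim)  # k_proj and v_proj weight + bias
        + s.hidden * s.hidden               # o_proj weight, no bias
    )
    mlp = 3 * s.hidden * s.intermediate     # gate, up and down projections
    norms = 2 * s.hidden                    # two RMSNorm weights per layer
    embed = s.vocab * s.hidden
    total = s.layers * (attn + mlp + norms) + embed + s.hidden  # + final norm
    if not tied_embeddings:
        total += embed                      # separate output projection
    return total


print(f"{count_parameters(Qwen2Shape()):,}")
```

Under these assumptions the count lands at roughly 494M, of which the embedding matrix alone accounts for about 136M.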

Training procedure

Parameter	Value
Architecture	Qwen2 (24L, 14H, 2KV, 896d, 4864 intermediate)
Parameters	~494M
Tokenizer	Qwen2 BPE (151,936 vocab)
Init	`Qwen/Qwen2.5-Coder-0.5B-Instruct` (fine-tuned starting from the Instruct checkpoint)
Optimizer	AdamW (β₁=0.9, β₂=0.95, ε=1e-8, wd=0.01)
LR schedule	Cosine with warmup: 500 warmup steps to a 1.5e-5 peak, then decay to a 1.0e-7 floor
Batch	16
Seq length	512
Total steps	5,000 (50 epochs × 100 steps)
Total tokens consumed	40.96M
Hardware	NVIDIA RTX 4090 (sm_89), 24 GB
Wall time	53 minutes
Throughput	15,460 tok/s (pure training) / 12,880 tok/s (with 2.5 GB checkpoint per epoch)
GPU utilization	99-100% sustained, 10.4 GB / 24 GB used, 57°C
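
As an illustration of the LR schedule row, here is a minimal sketch of a warmup-plus-cosine schedule with the listed peak, floor, warmup length, and step count. The table does not state the warmup shape, so the linear ramp is an assumption, and `learning_rate` is a hypothetical helper rather than the function used by apr pretrain.

```python
import math


def learning_rate(
    step: int,
    peak: float = 1.5e-5,
    floor: float = 1.0e-7,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Warmup to `peak`, then cosine decay to `floor` (linear warmup assumed)."""
    if step < warmup_steps:
        # Linear ramp that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cosine


for step in (0, 499, 2750, 5000):
    print(step, f"{learning_rate(step):.3e}")
```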

Evaluation

Best val_loss: 4.6227 @ epoch 49 (smooth monotonic descent across 50 epochs, no early-stop)
Best val_perplexity: 101.78
Inference throughput: 315.6 tok/s (epoch-020 apr bench on RTX 4090)
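
The two validation numbers are linked: perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats per token. A short check, with `perplexity` as an illustrative helper:

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss is in nats per token."""
    return math.exp(cross_entropy_nats)


best_val_loss = 4.6227
print(f"val_loss {best_val_loss} -> perplexity {perplexity(best_val_loss):.2f}")
```

The result lands close to the reported 101.78; a small difference is expected because the loss shown here is rounded to four decimals.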

Trajectory (every 5 epochs)

ep  0:  7.43 (init eval)
ep  5:  5.91
ep 10:  5.54
ep 15:  5.18
ep 20:  5.02
ep 25:  4.95
ep 30:  4.83
ep 35:  4.77
ep 40:  4.71
ep 45:  4.70
ep 49:  4.62  ← BEST
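
The listed checkpoints can be checked for monotonic descent and shrinking gains with a few lines of Python. This is only a sketch over the sampled values above (one point every five epochs), so it says nothing about behaviour between the sampled epochs.

```python
# Validation loss at the sampled epochs, copied from the trajectory above.
trajectory = {
    0: 7.43, 5: 5.91, 10: 5.54, 15: 5.18, 20: 5.02, 25: 4.95,
    30: 4.83, 35: 4.77, 40: 4.71, 45: 4.70, 49: 4.62,
}

epochs = sorted(trajectory)
losses = [trajectory[e] for e in epochs]

# Strictly decreasing from one sampled checkpoint to the next.
is_monotonic = all(later < earlier for earlier, later in zip(losses, losses[1:]))

# Loss improvement between consecutive sampled checkpoints.
improvements = [round(earlier - later, 2) for earlier, later in zip(losses, losses[1:])]

print("monotonic:", is_monotonic)
print("improvements:", improvements)
```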

The descent is smooth and monotonic — the §85 P2-E run demonstrates that the Sovereign AI Stack's training loop reaches the expected loss floor for this architecture and corpus given the compute budget. Marginal-gain decay analysis predicts ~4.4 floor at 100 epochs and ~3.5 floor at 1.2M steps (Chinchilla compute-optimal D ≈ 20·N).
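
The 1.2M-step figure follows from the Chinchilla rule of thumb D ≈ 20·N when combined with the batch size and sequence length from the training table. A sketch of that arithmetic, using hypothetical helper names:

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the D = 20 * N rule of thumb."""
    return tokens_per_param * n_params


def steps_for_tokens(tokens: float, batch_size: int = 16, seq_length: int = 512) -> float:
    """Optimizer steps needed when each step consumes batch_size * seq_length tokens."""
    return tokens / (batch_size * seq_length)


target_tokens = chinchilla_tokens(494e6)        # about 9.88 billion tokens
target_steps = steps_for_tokens(target_tokens)  # about 1.2 million steps

print(f"{target_tokens / 1e9:.2f}B tokens, {target_steps / 1e6:.2f}M steps")
```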

§88 framing — "stack-existence-proof"

Per SPEC-SHIP-TWO-001 §88, this model is shipped as a stack capability proof, not a production code-completion model. The original AC-SHIP2-003 strict target (val_loss ≤ 2.2) requires ~213 GPU-hours (~9 days continuous) of training compute, which exceeds the project's 48-GPU-hour single-shot iteration budget. The §88 amendment introduces a compute-bounded target (val_loss ≤ 4.7) that this model satisfies.
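
The GPU-hour estimate appears consistent with a Chinchilla-style budget of 20 tokens per parameter processed at the checkpoint-inclusive throughput of 12,880 tok/s from the training table. That pairing is an assumption made here for illustration, since the card does not spell out the calculation; `gpu_hours` is a hypothetical helper.

```python
def gpu_hours(tokens: float, tokens_per_second: float) -> float:
    """Single-GPU hours needed to process `tokens` at a fixed throughput."""
    return tokens / tokens_per_second / 3600.0


chinchilla_budget = 20 * 494e6                 # D = 20 * N, about 9.88B tokens
hours = gpu_hours(chinchilla_budget, 12_880)   # throughput including checkpoint writes
days = hours / 24
iteration_budget_hours = 48                    # single-shot budget quoted above

print(f"{hours:.0f} GPU-hours, {days:.1f} days, "
      f"{hours / iteration_budget_hours:.1f}x the iteration budget")
```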

The primary purpose this model serves is to demonstrate end-to-end stack capability:

✅ Pure-Rust tokenization (apr tokenize encode-corpus)
✅ Pure-Rust training (apr pretrain, no PyTorch)
✅ Pure-Rust checkpointing (.apr format, no safetensors dependency for the training path)
✅ Pure-Rust inference (apr run)
✅ Pure-Rust quantization + export (apr export --format gguf)
✅ Cross-stack interop (GGUF export loads in llama.cpp; SafeTensors round-trip works)

If you need a production-quality 0.5B code-completion model, use Qwen/Qwen2.5-Coder-0.5B-Instruct directly. The next iteration of albor (distillation epic PMAT-683/684) is the planned route to stricter quality targets.

Intended uses

✅ Sovereign AI Stack demonstrations — show the Rust-only training pipeline working end-to-end
✅ Inference infrastructure validation — drop-in test artifact for realizar / apr run / apr serve
✅ Tokenization round-trip testing — exercise the BPE NFC + chat-template pipeline
✅ Quantization research — Q4_K / Q6_K conversion via apr quantize benchmarks against this checkpoint
⚠️ NOT recommended for:
- Production code completion (use Qwen2.5-Coder-0.5B-Instruct or larger)
- Zero-shot reasoning (val_perplexity ≈ 102 is too high for this to work reliably)
- Long-context generation (max_position_embeddings = 32,768, but the model was not trained beyond seq=512)
- HumanEval / MBPP submission as a competitive code-LM (that is the goal of the planned distillation epic)
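
For the tokenization round-trip use case, the NFC step can be reproduced independently of the stack. The sketch below uses Python's standard `unicodedata` module to show what NFC normalization does to a decomposed character; it illustrates the normalization form only and is not the apr tokenizer itself.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


decomposed = "cafe\u0301"      # 'e' followed by a combining acute accent
composed = to_nfc(decomposed)  # the accent is folded into a single code point

print(len(decomposed), len(composed))
print(to_nfc(composed) == composed)  # NFC is idempotent
```

Normalizing before BPE means visually identical strings map to the same token sequence, which is what a round-trip test would rely on.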

Limitations

Compute-bounded training: 40.96M tokens consumed of a 49.6B-token corpus (0.083% sampling). Compute-optimal Chinchilla target (D=20·N) would require ~9 days continuous GPU; this model trades depth-of-fit for iteration speed on the stack.
Plateau evidence: doubling the training compute (P2-G with 10k steps vs P2-E's 5k) at the same LR/warmup produces a WORSE result (val_loss 4.6497 vs 4.6227, EARLY_STOP). This is a known marginal-gain-decay regime — more-of-the-same-recipe doesn't help. See §87 Chinchilla 20·N hard gate.
Init lineage: weights inherit from Qwen/Qwen2.5-Coder-0.5B-Instruct (Apache-2.0 license). The fine-tuning pass shifts the model's distribution toward the codeparrot+the-stack-dedup-Python distribution but does NOT fully replace the Instruct prior. Expect chat-formatted outputs to occasionally surface.
Validation set drift: P2-E held-out val batches were drawn from the first 16 batches of the qwen-v3 shard iterator — a mixed codeparrot + the-stack-dedup distribution. The new apr pretrain --val-shard flag (PR #1744) supports independent val sets for future runs.
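
The sampling fraction quoted in the first limitation can be reproduced from the step count, batch size, and sequence length in the training table. A short check:

```python
steps, batch_size, seq_length = 5_000, 16, 512
corpus_tokens = 49.6e9

tokens_consumed = steps * batch_size * seq_length  # tokens seen during training
fraction = tokens_consumed / corpus_tokens         # share of the corpus sampled

print(f"{tokens_consumed / 1e6:.2f}M tokens, {fraction * 100:.3f}% of the corpus")
```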

Training data

Source	Size	License	Role
`codeparrot/codeparrot-clean`	12.8 GB	Apache-2.0 (permissive subset)	~25% of mix
`bigcode/the-stack-dedup` (Python)	28.6 GB	Permissive licenses (filtered, dedup'd)	~75% of mix
Combined corpus	49.6B tokens	Permissive (filtered + dedup'd)	qwen-v3

The corpus is tokenized at ingest time via apr tokenize encode-corpus --num-workers 48 and saved to disk as .bin shards (little-endian u32 tokens). The apr-corpus-ingest binary handles license filtering + minhash deduplication upstream.
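
A shard in this format can be inspected from Python with the standard `struct` module. The sketch assumes a shard is a flat sequence of little-endian u32 token ids with no header or footer, which the description above suggests but does not guarantee; the function names and the demo file are hypothetical.

```python
import struct
import tempfile
from pathlib import Path


def write_u32_le_shard(path: Path, token_ids: list[int]) -> None:
    """Write token ids as consecutive little-endian unsigned 32-bit integers."""
    path.write_bytes(struct.pack(f"<{len(token_ids)}I", *token_ids))


def read_u32_le_shard(path: Path) -> list[int]:
    """Read a headerless shard of little-endian u32 token ids (assumed layout)."""
    raw = path.read_bytes()
    if len(raw) % 4 != 0:
        raise ValueError(f"{path} is not a whole number of u32 values")
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


# Round-trip a tiny demo shard in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "demo.bin"
    sample = [0, 1, 42, 151_935]  # 151,935 is the largest id in a 151,936-entry vocab
    write_u32_le_shard(shard, sample)
    restored = read_u32_le_shard(shard)

print(restored)
```

For real multi-gigabyte shards, a memory-mapped or chunked reader would be a better fit than reading the whole file at once.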

How to use

Inference (recommended path)

# Install the aprender CLI
cargo install aprender

# Pull the model
apr pull paiml/albor-370m-v1

# Generate
apr run paiml/albor-370m-v1 "def fibonacci(n):"

# Benchmark
apr bench paiml/albor-370m-v1 --iterations 100

Direct .apr load (Rust, no Python)

use realizar::Model;
let model = Model::load_apr("albor-370m-v1.apr")?;
let output = model.generate(&input_ids, generation_config)?;

HuggingFace Transformers (cross-stack compat)

The repo includes model.safetensors + config.json + tokenizer.json + tokenizer_config.json + generation_config.json, so the model is directly loadable with HuggingFace Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("paiml/albor-370m-v1")
model = AutoModelForCausalLM.from_pretrained("paiml/albor-370m-v1", torch_dtype="auto")
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

This path is provided for compatibility only; the model itself was trained with the Rust-only Sovereign AI Stack, not PyTorch.

Export to other formats

# Quantize to Q4_K for llama.cpp (--quantize int4 selects Q4_K internally)
apr export albor-370m-v1.apr --format gguf --quantize int4 -o albor-370m-q4k.gguf

# Export to SafeTensors for transformers (round-trip works but loses some metadata)
apr export albor-370m-v1.apr --format safetensors -o albor-370m.safetensors

Reproduce the training run

# Pull the source init
apr pull Qwen/Qwen2.5-Coder-0.5B-Instruct -o qwen-init.apr

# Pull the corpus (dataset asset-type discriminator)
apr pull dataset bigcode/the-stack-dedup --filter python --subset deduped -o the-stack-py/
apr pull dataset codeparrot/codeparrot-clean -o codeparrot-clean/

# Tokenize (multi-source, NFC normalized, Qwen2 vocab)
apr tokenize encode-corpus \
  --corpus the-stack-py/ \
  --corpus codeparrot-clean/ \
  --tokenizer qwen-init.apr \
  -o qwen-v3-shards/

# Train (the P2-E recipe)
apr pretrain \
  --dataset qwen-v3-shards/ \
  --tokenizer qwen-tokenizer/ \
  --run-dir runs/albor-370m-v1/ \
  --init qwen-init.apr \
  --mode finetune \
  --lr 1.5e-5 \
  --num-steps 5000 \
  --warmup-steps 500 \
  --batch-size 16 \
  --seq-length 512 \
  --target-val-loss 3.0 \
  --vocab-size 151936 \
  --device cuda \
  --seed 42 \
  --force-under-provisioned

Expected wall time: 53 minutes on RTX 4090.
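
The expected wall time can be estimated from the recipe flags and the throughput figures in the training table. A sketch of the arithmetic, with `wall_minutes` as an illustrative helper:

```python
def wall_minutes(steps: int, batch_size: int, seq_length: int, tokens_per_second: float) -> float:
    """Minutes needed to process steps * batch_size * seq_length tokens."""
    return steps * batch_size * seq_length / tokens_per_second / 60.0


# Throughput including the per-epoch checkpoint writes.
with_checkpoints = wall_minutes(5_000, 16, 512, 12_880)
# Pure training throughput, without checkpoint overhead.
pure_training = wall_minutes(5_000, 16, 512, 15_460)

print(f"{with_checkpoints:.0f} min with checkpoints, {pure_training:.0f} min pure training")
```

Actual time on other hardware will differ; the throughput numbers are specific to the RTX 4090 run described above.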

Citation

@misc{albor-370m-v1,
  title  = {albor-370m-v1: Stack-Existence-Proof for the Sovereign AI Stack},
  author = {PAIML Engineering},
  year   = 2026,
  url    = {https://huggingface.co/paiml/albor-370m-v1},
  note   = {494M-parameter Qwen2 architecture trained end-to-end with the Rust-only aprender Sovereign AI Stack (no PyTorch). See SPEC-SHIP-TWO-001 §88 for the framing.}
}

License + Provenance

Field	Value
Model weights license	Apache-2.0 (inherits from base)
Init checkpoint	`Qwen/Qwen2.5-Coder-0.5B-Instruct` (Apache-2.0)
Training data sources	`codeparrot/codeparrot-clean` + `bigcode/the-stack-dedup` (Python, permissive-licensed subset)
Training data license	Aggregated permissive (Apache-2.0 / MIT / BSD via license filtering)
Training stack	aprender + entrenar + trueno + realizar (all Apache-2.0)
Training hardware	NVIDIA RTX 4090 (sm_89)
Training framework	Rust-native (no PyTorch, no HF transformers)

Acknowledgments

Base model: Qwen team for the Qwen2.5-Coder-0.5B-Instruct base.
Training data: BigCode (the-stack-dedup) + codeparrot (codeparrot-clean).
Stack components: aprender (ML framework), entrenar (training, in-tree), trueno (SIMD/GPU compute, in-tree), realizar (inference, in-tree).

Changelog

v1.0.0 (2026-05-17): Initial release. Stack-existence-proof model per SPEC §88. Best val_loss = 4.6227. Trained from Qwen/Qwen2.5-Coder-0.5B-Instruct init on codeparrot/codeparrot-clean + bigcode/the-stack-dedup Python permissive subset.

For the full training methodology and lessons learned (Class 3 packaging cascades, audit hypothesis bounds, upstream metadata masquerade), see docs/specifications/aprender-train/ship-model-2-spec.md §81–§88.

Safetensors

Model size: 0.6B params
Tensor type: F32

Model tree for paiml/albor-370m-v1

Base model: Qwen/Qwen2.5-0.5B
Finetuned: Qwen/Qwen2.5-Coder-0.5B
Finetuned: Qwen/Qwen2.5-Coder-0.5B-Instruct
Quantized: this model

Evaluation results

Self-reported, on codeparrot-thestack-python-permissive-shards-qwen-v3:

Metric	Value
Validation Cross-Entropy	4.623
Validation Perplexity	101.780
Inference Throughput (tok/s, RTX 4090)	315.600