ATLAS-OLMo-3-7B-Think-v4

OLMo-3-7B-Think served by the ATLAS pure-Rust inference engine (v4.1.0) — zero external crate dependencies, BF16 GPU inference at 61.7 tok/s on A100-SXM4-40GB.

Latest Release — v4.1.0

v4.1.0 — Full GPU Attention + StigmergicHook · 600 tests

  • 4× throughput: 61.7 tok/s BF16 on A100 (up from 15.4 tok/s) — zero intra-layer PCIe transfers during decode
  • Full GPU attention path: custom CUDA kernels handle the entire decode step on-device
  • New CUDA kernels: decode_attention_kernel, qk_norm_inplace_kernel, rope_precomputed_kernel, kv_cache_write_kernel, atlas_gpu_argmax — call order sketched after this list
  • New crate atlas-infer: exposes the StigmergicHook trait — a GraphPalace bridge that lets inference hooks read/write stigmergic memory in real time during token generation
  • Issue #18 CLOSED: GPU attention correctness verified against CPU reference (max diff < 1e-5)
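
To make "zero intra-layer PCIe transfers" concrete, here is a minimal sketch of the decode-step call order implied by the kernel list above. The signatures are hypothetical stand-ins, not the actual ATLAS kernel APIs; the point is that every stage operates on device-resident buffers and only the sampled token id returns to the host.

// Illustrative decode step (hypothetical signatures, not ATLAS source).
// MLP and lm_head stages are elided; every buffer lives on the GPU.
struct DeviceBuf; // opaque handle to device memory

fn qk_norm_inplace(_q: &mut DeviceBuf, _k: &mut DeviceBuf) {}             // qk_norm_inplace_kernel
fn rope_precomputed(_q: &mut DeviceBuf, _k: &mut DeviceBuf, _pos: u32) {} // rope_precomputed_kernel
fn kv_cache_write(_k: &DeviceBuf, _v: &DeviceBuf, _pos: u32) {}           // kv_cache_write_kernel
fn decode_attention(_q: &DeviceBuf, _out: &mut DeviceBuf) {}              // decode_attention_kernel
fn gpu_argmax(_logits: &DeviceBuf) -> u32 { 0 }                           // atlas_gpu_argmax

fn decode_step(q: &mut DeviceBuf, k: &mut DeviceBuf, v: &DeviceBuf,
               attn_out: &mut DeviceBuf, logits: &DeviceBuf, pos: u32) -> u32 {
    qk_norm_inplace(q, k);         // per-head RMSNorm on Q and K
    rope_precomputed(q, k, pos);   // rotary embeddings from precomputed tables
    kv_cache_write(k, v, pos);     // append K/V to the on-device cache
    decode_attention(q, attn_out); // single-token attention over the cache
    gpu_argmax(logits)             // only this u32 crosses PCIe back to the host
}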

Previous Release — v4.0.9 (Think Suppression + Output Quality)

  • Think suppression: Logit masking prevents <think> tokens from appearing in final output — users see clean responses without internal reasoning artifacts (masking sketched after this list)
  • Anti-repetition defaults: SamplingConfig::olmo3() now sets rep_penalty=1.15, top_k=50, min_p=0.05 — prevents degenerate looping on OLMo-3-7B-Think
  • Think budget: Configurable max tokens for <think> block (default: 200 tokens) — prevents runaway chain-of-thought consuming full context
  • Output quality: Filler token cleanup, model auto-detection improvements
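
The logit-masking approach behind think suppression is simple to sketch. Below is a minimal illustration, assuming a dense f32 logit buffer and a precomputed set of <think>-related token ids (the function is illustrative, not the ATLAS API):

// Think suppression via logit masking (illustrative sketch, not ATLAS source).
// Banned ids can never be sampled because their logits become -inf
// before the softmax/sampling stage runs.
fn suppress_think_tokens(logits: &mut [f32], think_token_ids: &[u32]) {
    for &id in think_token_ids {
        logits[id as usize] = f32::NEG_INFINITY;
    }
}

Masking runs before sampling, so the v4.0.9 defaults (rep_penalty=1.15, top_k=50, min_p=0.05) then operate on a vocabulary that simply no longer contains the reasoning markers.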

Model Details

Property           Value
-----------------  ------------------------------------------------------------
Base model         allenai/Olmo-3-7B-Think
Architecture       Olmo3ForCausalLM — post-norm + QK-norm, SWA + YaRN RoPE
Parameters         7.3B
Precision          BF16 (bfloat16)
Context            65,536 tokens (YaRN factor=8, original max=8,192)
Vocab              100,278 tokens
License            Apache 2.0
Inference engine   ATLAS v4.1.0 — pure Rust, zero external crate dependencies
ATLAS tests        600/600 passing

Note on model weights: The weights in this repository are the unmodified allenai/Olmo-3-7B-Think base weights. ATLAS is the inference + memory palace engine — not a fine-tune. Use this repo to benefit from the ATLAS ecosystem (stigmergic memory, ASTRA discovery, ZK proofs).
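
The context figures in the Model Details table are internally consistent with the standard YaRN formulas. A quick sanity check (this computation is illustrative; the 0.1·ln(s) + 1 attention-scaling rule is the standard YaRN recipe, not ATLAS internals):

// Sanity check of the YaRN numbers in the table above (illustrative only).
fn main() {
    let original_max = 8_192u32;
    let factor = 8u32;
    assert_eq!(original_max * factor, 65_536); // extended context window
    // Standard YaRN attention scaling: mscale = 0.1 * ln(s) + 1
    let attn_factor = 0.1 * (factor as f64).ln() + 1.0;
    println!("attn_factor = {attn_factor:.4}"); // 1.2079, matching the engine config
}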

ATLAS Inference Engine

This model is verified to run correctly with ATLAS, a pure-Rust LLM inference framework with zero external crate dependencies. ATLAS implements the full OLMo-2/3 architecture from scratch:

  • Post-norm layer ordering — x = residual + rmsnorm(output), matching the HuggingFace Olmo2DecoderLayer reference
  • QK-norm — per-head RMSNorm on Q and K before RoPE (32 heads × 128 head_dim)
  • Sliding Window Attention — 24/32 layers with window=4,096
  • YaRN RoPE — factor=8, original_max_seq_len=8,192, attn_factor=1.2079
  • BF16 W16A32 — weights in BF16 (14 GB VRAM), activations in f32
  • ChatML — <|im_start|>/<|im_end|> chat template with auto-detection
  • Full sampling pipeline — repetition penalty, temperature, top-p, top-k, min-p, frequency/presence penalty
  • Think suppression (v4.0.9) — logit masking for clean API output
  • Full GPU attention path (v4.1.0) — five custom CUDA kernels, zero PCIe transfers during decode
  • StigmergicHook trait (v4.1.0, atlas-infer crate) — GraphPalace bridge for live stigmergic memory reads/writes during token generation
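
As a rough picture of the hook surface, the sketch below shows a StigmergicHook-style trait with one callback per decoded token; the actual atlas-infer trait definition may differ (the method name and signature here are assumptions):

// Hedged sketch of a per-token inference hook (the real atlas-infer
// trait may define different methods). The engine calls the hook as
// tokens are generated, so it can read/write GraphPalace state live.
trait StigmergicHook {
    // Called after each sampled token; may deposit pheromone or
    // consult memory to influence later steps.
    fn on_token(&mut self, step: usize, token_id: u32);
}

struct PheromoneLogger {
    trail: Vec<u32>, // stand-in for a GraphPalace write handle
}

impl StigmergicHook for PheromoneLogger {
    fn on_token(&mut self, step: usize, token_id: u32) {
        self.trail.push(token_id);
        if step % 64 == 0 {
            // e.g. flush a pheromone update into GraphPalace here
        }
    }
}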

Performance (A100-SXM4-40GB)

Metric                    Value
------------------------  --------------------------------
Throughput                61.7 tok/s (BF16 GPU, decode)
VRAM                      ~14 GB
CPU/GPU logit agreement   max diff 0.000015
Model load time           ~108 s (3 shards, 14 GB)

Quick Start with ATLAS

# Build ATLAS from source
git clone https://github.com/web3guru888/ATLAS.git
cd ATLAS
cargo build --release -p atlas-cli

# Download model weights
# (or use huggingface-cli: hf download openhubresearch/ATLAS-OLMo-3-7B-Think-v4)

# Start OpenAI-compatible API server
./target/release/atlas api serve \
  --weights /path/to/ATLAS-OLMo-3-7B-Think-v4 \
  --model olmo3-7b \
  --port 8080

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olmo3-7b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100,
    "temperature": 0.6,
    "repetition_penalty": 1.15,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05
  }'

Quick Start with HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the unmodified OLMo-3-7B-Think weights in BF16 (~14 GB)
model = AutoModelForCausalLM.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4"
)

# Build a ChatML prompt via the model's chat template
messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    return_tensors="pt", return_dict=True
).to(model.device)

output = model.generate(
    **inputs, max_new_tokens=200,
    temperature=0.6, top_p=0.95, do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

About ATLAS

ATLAS (Active-inference Training with Learned Adaptive Stigmergy) is a next-generation LLM framework built in pure Rust with zero external crate dependencies — the SQLite principle applied to AI infrastructure. It fuses:

  • GraphPalace — Stigmergic memory palace with pheromone-guided navigation
  • ASTRA — Live discovery engine hitting NASA, WHO, World Bank APIs
  • TRM-CausalValidator — 7M-param recursive validator
  • Champagnat n-Morphic Framework — biologically-grounded training dynamics

22 crates. 600 tests. One coherent system. Zero external Rust dependencies.

Website: atlasagi.org · Observatory: Interactive Demo · Organization: OpenHub Research · Author: Robin Dey

Citation

@software{atlas2026,
  title       = {ATLAS: Active-inference Training with Learned Adaptive Stigmergy},
  author      = {Robin Dey},
  year        = {2026},
  institution = {OpenHub Research, Thailand},
  url         = {https://github.com/web3guru888/ATLAS},
  note        = {Pure Rust LLM framework. v4.1.0: 22 crates, 600 tests,
                 OLMo-3-7B-Think 61.7 tok/s on A100 (BF16, full GPU attention).
                 StigmergicHook GraphPalace bridge. Think suppression + anti-repetition defaults.}
}