Instructions for using openhubresearch/ATLAS-OLMo-3-7B-Think-v4 with libraries, inference providers, and local apps.

Transformers

Use a pipeline as a high-level helper, or load the model directly:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openhubresearch/ATLAS-OLMo-3-7B-Think-v4")

# Or load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openhubresearch/ATLAS-OLMo-3-7B-Think-v4")
model = AutoModelForCausalLM.from_pretrained("openhubresearch/ATLAS-OLMo-3-7B-Think-v4")
```

vLLM

Install from pip and serve the model:

```shell
# Install vLLM from pip
pip install vllm

# Start the vLLM server
vllm serve "openhubresearch/ATLAS-OLMo-3-7B-Think-v4"

# Call the server using curl (OpenAI-compatible API)
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "openhubresearch/ATLAS-OLMo-3-7B-Think-v4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Or use Docker:

```shell
docker model run hf.co/openhubresearch/ATLAS-OLMo-3-7B-Think-v4
```

SGLang

Install from pip and serve the model:

```shell
# Install SGLang from pip
pip install sglang

# Start the SGLang server
python3 -m sglang.launch_server \
  --model-path "openhubresearch/ATLAS-OLMo-3-7B-Think-v4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API)
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "openhubresearch/ATLAS-OLMo-3-7B-Think-v4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Or use the Docker image:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "openhubresearch/ATLAS-OLMo-3-7B-Think-v4" \
    --host 0.0.0.0 \
    --port 30000
```

Docker Model Runner

```shell
docker model run hf.co/openhubresearch/ATLAS-OLMo-3-7B-Think-v4
```
ATLAS-OLMo-3-7B-Think-v4
OLMo-3-7B-Think served by the ATLAS pure-Rust inference engine (v4.1.0) — zero external crate dependencies, BF16 GPU inference at 61.7 tok/s on A100-SXM4-40GB.
Latest Release — v4.1.0
v4.1.0 — Full GPU Attention + StigmergicHook · 600 tests
- 4× throughput: 61.7 tok/s BF16 on A100 (up from 15.4 tok/s) — zero intra-layer PCIe transfers during decode
- Full GPU attention path: custom CUDA kernels handle the entire decode step on-device
- New CUDA kernels: `decode_attention_kernel`, `qk_norm_inplace_kernel`, `rope_precomputed_kernel`, `kv_cache_write_kernel`, `atlas_gpu_argmax`
- New crate `atlas-infer`: exposes the `StigmergicHook` trait — a GraphPalace bridge that lets inference hooks read/write stigmergic memory in real time during token generation
- Issue #18 CLOSED: GPU attention correctness verified against the CPU reference (max diff < 1e-5)
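The decode-step attention that these kernels move on-device is the standard single-query form: one new query vector attends over all cached K/V entries. A minimal CPU sketch in plain Python (an illustration of the computation, not the ATLAS CUDA implementation):

```python
import math

def decode_attention(q, k_cache, v_cache):
    """Single-token decode attention: one query over the KV cache.

    q: list[float] of length head_dim
    k_cache, v_cache: lists of vectors, one per cached position
    """
    d = len(q)
    # Scaled dot-product score against every cached key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    # Numerically stable softmax over the cache positions
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output: probability-weighted sum of cached value vectors
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]
```

With a cache of length 1 this reduces to returning the single cached value vector, which makes a quick sanity check; the GPU path computes the same thing per head, for all 32 heads, without leaving the device.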
Previous release — v4.0.9 (Think Suppression + Output Quality)
- Think suppression: logit masking prevents `<think>` tokens from appearing in the final output — users see clean responses without internal reasoning artifacts
- Anti-repetition defaults: `SamplingConfig::olmo3()` now sets `rep_penalty=1.15`, `top_k=50`, `min_p=0.05` — prevents degenerate looping on OLMo-3-7B-Think
- Think budget: configurable max tokens for the `<think>` block (default: 200 tokens) — prevents runaway chain-of-thought from consuming the full context
- Output quality: filler-token cleanup and model auto-detection improvements
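The think-suppression and think-budget mechanics combine naturally: once the reasoning budget is spent, the sampler masks the logits of the `<think>` token ids to negative infinity so they can never be sampled. A simplified sketch of that idea (the token ids and budget handling here are illustrative, not ATLAS's actual internals):

```python
def mask_think_logits(logits, think_token_ids, tokens_generated, think_budget=200):
    """Set <think>-token logits to -inf once the think budget is exhausted,
    so neither greedy nor sampled decoding can emit them."""
    if tokens_generated >= think_budget:
        logits = list(logits)  # copy so the caller's logits are untouched
        for tid in think_token_ids:
            logits[tid] = float("-inf")
    return logits
```

Because the mask is applied to logits rather than to decoded text, it composes with any downstream sampling strategy.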
Model Details
| Property | Value |
|---|---|
| Base model | allenai/Olmo-3-7B-Think |
| Architecture | Olmo3ForCausalLM — post-norm + QK-norm, SWA + YaRN RoPE |
| Parameters | 7.3B |
| Precision | BF16 (bfloat16) |
| Context | 65,536 tokens (YaRN factor=8, original max=8,192) |
| Vocab | 100,278 tokens |
| License | Apache 2.0 |
| Inference engine | ATLAS v4.1.0 — pure Rust, zero external crate dependencies |
| ATLAS tests | 600/600 passing |
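As a consistency check, the listed attn_factor matches the standard YaRN attention-temperature ("mscale") formula, 0.1 · ln(s) + 1 for scale factor s; with s = 8 (8,192 → 65,536 context) this gives approximately 1.2079. A quick verification, assuming that standard formula:

```python
import math

def yarn_attn_factor(scale):
    # Standard YaRN attention temperature scaling: 0.1 * ln(s) + 1
    return 0.1 * math.log(scale) + 1.0

scale = 65536 / 8192  # YaRN factor = 8
print(round(yarn_attn_factor(scale), 4))  # 1.2079
```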
Note on model weights: The weights in this repository are the unmodified allenai/Olmo-3-7B-Think base weights. ATLAS is the inference + memory palace engine — not a fine-tune. Use this repo to benefit from the ATLAS ecosystem (stigmergic memory, ASTRA discovery, ZK proofs).
ATLAS Inference Engine
This model is verified to run correctly with ATLAS, a pure-Rust LLM inference framework with zero external crate dependencies. ATLAS implements the full OLMo-2/3 architecture from scratch:
- Post-norm layer ordering — `x = residual + rmsnorm(output)`, matching the HuggingFace `Olmo2DecoderLayer` reference
- QK-norm — per-head RMSNorm on Q and K before RoPE (32 heads × 128 head_dim)
- Sliding Window Attention — 24 of 32 layers with window=4,096
- YaRN RoPE — factor=8, original_max_seq_len=8,192, attn_factor=1.2079
- BF16 W16A32 — weights in BF16 (14 GB VRAM), activations in f32
- ChatML — `<|im_start|>`/`<|im_end|>` chat template with auto-detection
- Full sampling pipeline — repetition penalty, temperature, top-p, top-k, min-p, frequency/presence penalty
- Think suppression (v4.0.9) — logit masking for clean API output
- Full GPU attention path (v4.1.0) — five custom CUDA kernels, zero PCIe transfers during decode
- `StigmergicHook` trait (v4.1.0, `atlas-infer` crate) — GraphPalace bridge for live stigmergic memory reads/writes during token generation
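Two stages of the sampling pipeline above are easy to show concretely: repetition penalty (divide positive logits, multiply negative ones, for tokens already generated) followed by top-k filtering. A minimal pure-Python sketch, with greedy argmax at the end so the result is deterministic (parameters and helper names are illustrative, not ATLAS's Rust API):

```python
def apply_repetition_penalty(logits, prev_tokens, penalty=1.15):
    """Penalize tokens that already appeared: shrink positive logits,
    push negative ones further down."""
    logits = list(logits)
    for t in set(prev_tokens):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits

def top_k_filter(logits, k=50):
    """Keep the k largest logits; mask the rest to -inf."""
    if k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [l if l >= cutoff else float("-inf") for l in logits]

def sample_greedy(logits, prev_tokens):
    logits = apply_repetition_penalty(logits, prev_tokens)
    logits = top_k_filter(logits, k=2)  # tiny k for the demo
    return max(range(len(logits)), key=lambda i: logits[i])
```

With the 1.15 default, a token that narrowly leads but was already emitted loses the argmax to the runner-up, which is exactly how the penalty breaks degenerate loops.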
Performance (A100-SXM4-40GB)
| Metric | Value |
|---|---|
| Throughput | 61.7 tok/s (BF16 GPU, decode) |
| VRAM | ~14 GB |
| CPU/GPU logit agreement | max diff 0.000015 |
| Model load time | ~108s (3 shards, 14 GB) |
Quick Start with ATLAS
```shell
# Build ATLAS from source
git clone https://github.com/web3guru888/ATLAS.git
cd ATLAS
cargo build --release -p atlas-cli

# Download model weights
# (or use the Hugging Face CLI: hf download openhubresearch/ATLAS-OLMo-3-7B-Think-v4)

# Start the OpenAI-compatible API server
./target/release/atlas api serve \
  --weights /path/to/ATLAS-OLMo-3-7B-Think-v4 \
  --model olmo3-7b \
  --port 8080

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olmo3-7b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100,
    "temperature": 0.6,
    "repetition_penalty": 1.15,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05
  }'
```
Quick Start with HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4"
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
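When serving the unmodified base weights through Transformers, ATLAS's think suppression is not in play, so the model may emit its reasoning inside `<think>`…`</think>` tags. Downstream code often strips that block before display. A small post-processing helper, assuming the OLMo-3-Think tag convention described above:

```python
import re

def strip_think(text):
    """Remove <think>...</think> reasoning blocks, plus any dangling
    unclosed <think> tail left by a truncated generation."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()
```

The second substitution matters in practice: if generation hits max_new_tokens mid-reasoning, the closing tag never arrives.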
About ATLAS
ATLAS (Active-inference Training with Learned Adaptive Stigmergy) is a next-generation LLM framework built in pure Rust with zero external crate dependencies — the SQLite principle applied to AI infrastructure. It fuses:
- GraphPalace — Stigmergic memory palace with pheromone-guided navigation
- ASTRA — Live discovery engine hitting NASA, WHO, World Bank APIs
- TRM-CausalValidator — 7M-param recursive validator
- Champagnat n-Morphic Framework — biologically-grounded training dynamics
22 crates. 600 tests. One coherent system. Zero external Rust dependencies.
Website: atlasagi.org · Observatory: Interactive Demo · Organization: OpenHub Research · Author: Robin Dey
Citation
```bibtex
@software{atlas2026,
  title       = {ATLAS: Active-inference Training with Learned Adaptive Stigmergy},
  author      = {Robin Dey},
  year        = {2026},
  institution = {OpenHub Research, Thailand},
  url         = {https://github.com/web3guru888/ATLAS},
  note        = {Pure Rust LLM framework. v4.1.0: 22 crates, 600 tests,
                 OLMo-3-7B-Think 61.7 tok/s on A100 (BF16, full GPU attention).
                 StigmergicHook GraphPalace bridge. Think suppression + anti-repetition defaults.}
}
```