ATLAS-OLMo-3-7B-Think-v4

OLMo-3-7B-Think served by the ATLAS pure-Rust inference engine (v4.1.0) — zero external crate dependencies, BF16 GPU inference at 61.7 tok/s on A100-SXM4-40GB.

Latest Release — v4.1.0

v4.1.0 — Full GPU Attention + StigmergicHook · 600 tests

  • 4× throughput: 61.7 tok/s BF16 on A100 (up from 15.4 tok/s) — zero intra-layer PCIe transfers during decode
  • Full GPU attention path: custom CUDA kernels handle the entire decode step on-device
  • New CUDA kernels: decode_attention_kernel, qk_norm_inplace_kernel, rope_precomputed_kernel, kv_cache_write_kernel, atlas_gpu_argmax — call order sketched after this list
  • New crate atlas-infer: exposes the StigmergicHook trait — a GraphPalace bridge that lets inference hooks read/write stigmergic memory in real time during token generation
  • Issue #18 CLOSED: GPU attention correctness verified against CPU reference (max diff < 1e-5)
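
To make "zero intra-layer PCIe transfers" concrete, here is a minimal sketch of the decode-step call order implied by the kernel list above. The signatures are hypothetical stand-ins, not the actual ATLAS kernel APIs; the point is that every stage operates on device-resident buffers and only the sampled token id returns to the host.

// Illustrative decode step (hypothetical signatures, not ATLAS source).
// MLP and lm_head stages are elided; every buffer lives on the GPU.
struct DeviceBuf; // opaque handle to device memory

fn qk_norm_inplace(_q: &mut DeviceBuf, _k: &mut DeviceBuf) {}             // qk_norm_inplace_kernel
fn rope_precomputed(_q: &mut DeviceBuf, _k: &mut DeviceBuf, _pos: u32) {} // rope_precomputed_kernel
fn kv_cache_write(_k: &DeviceBuf, _v: &DeviceBuf, _pos: u32) {}           // kv_cache_write_kernel
fn decode_attention(_q: &DeviceBuf, _out: &mut DeviceBuf) {}              // decode_attention_kernel
fn gpu_argmax(_logits: &DeviceBuf) -> u32 { 0 }                           // atlas_gpu_argmax

fn decode_step(q: &mut DeviceBuf, k: &mut DeviceBuf, v: &DeviceBuf,
               attn_out: &mut DeviceBuf, logits: &DeviceBuf, pos: u32) -> u32 {
    qk_norm_inplace(q, k);         // per-head RMSNorm on Q and K
    rope_precomputed(q, k, pos);   // rotary embeddings from precomputed tables
    kv_cache_write(k, v, pos);     // append K/V to the on-device cache
    decode_attention(q, attn_out); // single-token attention over the cache
    gpu_argmax(logits)             // only this u32 crosses PCIe back to the host
}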

Previous Release — v4.0.9 (Think Suppression + Output Quality)

  • Think suppression: Logit masking prevents <think> tokens from appearing in final output — users see clean responses without internal reasoning artifacts (masking sketched after this list)
  • Anti-repetition defaults: SamplingConfig::olmo3() now sets rep_penalty=1.15, top_k=50, min_p=0.05 — prevents degenerate looping on OLMo-3-7B-Think
  • Think budget: Configurable max tokens for <think> block (default: 200 tokens) — prevents runaway chain-of-thought consuming full context
  • Output quality: Filler token cleanup, model auto-detection improvements
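
The logit-masking approach behind think suppression is simple to sketch. Below is a minimal illustration, assuming a dense f32 logit buffer and a precomputed set of <think>-related token ids (the function is illustrative, not the ATLAS API):

// Think suppression via logit masking (illustrative sketch, not ATLAS source).
// Banned ids can never be sampled because their logits become -inf
// before the softmax/sampling stage runs.
fn suppress_think_tokens(logits: &mut [f32], think_token_ids: &[u32]) {
    for &id in think_token_ids {
        logits[id as usize] = f32::NEG_INFINITY;
    }
}

Masking runs before sampling, so the v4.0.9 defaults (rep_penalty=1.15, top_k=50, min_p=0.05) then operate on a vocabulary that simply no longer contains the reasoning markers.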

Model Details

Property           Value
-----------------  ------------------------------------------------------------
Base model         allenai/Olmo-3-7B-Think
Architecture       Olmo3ForCausalLM — post-norm + QK-norm, SWA + YaRN RoPE
Parameters         7.3B
Precision          BF16 (bfloat16)
Context            65,536 tokens (YaRN factor=8, original max=8,192)
Vocab              100,278 tokens
License            Apache 2.0
Inference engine   ATLAS v4.1.0 — pure Rust, zero external crate dependencies
ATLAS tests        600/600 passing

Note on model weights: The weights in this repository are the unmodified allenai/Olmo-3-7B-Think base weights. ATLAS is the inference + memory palace engine — not a fine-tune. Use this repo to benefit from the ATLAS ecosystem (stigmergic memory, ASTRA discovery, ZK proofs).
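
The context figures in the Model Details table are internally consistent with the standard YaRN formulas. A quick sanity check (this computation is illustrative; the 0.1·ln(s) + 1 attention-scaling rule is the standard YaRN recipe, not ATLAS internals):

// Sanity check of the YaRN numbers in the table above (illustrative only).
fn main() {
    let original_max = 8_192u32;
    let factor = 8u32;
    assert_eq!(original_max * factor, 65_536); // extended context window
    // Standard YaRN attention scaling: mscale = 0.1 * ln(s) + 1
    let attn_factor = 0.1 * (factor as f64).ln() + 1.0;
    println!("attn_factor = {attn_factor:.4}"); // 1.2079, matching the engine config
}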

ATLAS Inference Engine

This model is verified to run correctly with ATLAS, a pure-Rust LLM inference framework with zero external crate dependencies. ATLAS implements the full OLMo-2/3 architecture from scratch:

  • Post-norm layer ordering — x = residual + rmsnorm(output), matching the HuggingFace Olmo2DecoderLayer reference
  • QK-norm — per-head RMSNorm on Q and K before RoPE (32 heads × 128 head_dim)
  • Sliding Window Attention — 24/32 layers with window=4,096
  • YaRN RoPE — factor=8, original_max_seq_len=8,192, attn_factor=1.2079
  • BF16 W16A32 — weights in BF16 (14 GB VRAM), activations in f32
  • ChatML — <|im_start|>/<|im_end|> chat template with auto-detection
  • Full sampling pipeline — repetition penalty, temperature, top-p, top-k, min-p, frequency/presence penalty
  • Think suppression (v4.0.9) — logit masking for clean API output
  • Full GPU attention path (v4.1.0) — five custom CUDA kernels, zero PCIe transfers during decode
  • StigmergicHook trait (v4.1.0, atlas-infer crate) — GraphPalace bridge for live stigmergic memory reads/writes during token generation
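
As a rough picture of the hook surface, the sketch below shows a StigmergicHook-style trait with one callback per decoded token; the actual atlas-infer trait definition may differ (the method name and signature here are assumptions):

// Hedged sketch of a per-token inference hook (the real atlas-infer
// trait may define different methods). The engine calls the hook as
// tokens are generated, so it can read/write GraphPalace state live.
trait StigmergicHook {
    // Called after each sampled token; may deposit pheromone or
    // consult memory to influence later steps.
    fn on_token(&mut self, step: usize, token_id: u32);
}

struct PheromoneLogger {
    trail: Vec<u32>, // stand-in for a GraphPalace write handle
}

impl StigmergicHook for PheromoneLogger {
    fn on_token(&mut self, step: usize, token_id: u32) {
        self.trail.push(token_id);
        if step % 64 == 0 {
            // e.g. flush a pheromone update into GraphPalace here
        }
    }
}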

Performance (A100-SXM4-40GB)

Metric                    Value
------------------------  --------------------------------
Throughput                61.7 tok/s (BF16 GPU, decode)
VRAM                      ~14 GB
CPU/GPU logit agreement   max diff 0.000015
Model load time           ~108 s (3 shards, 14 GB)

Quick Start with ATLAS

# Build ATLAS from source
git clone https://github.com/web3guru888/ATLAS.git
cd ATLAS
cargo build --release -p atlas-cli

# Download model weights
# (or use huggingface-cli: hf download openhubresearch/ATLAS-OLMo-3-7B-Think-v4)

# Start OpenAI-compatible API server
./target/release/atlas api serve \
  --weights /path/to/ATLAS-OLMo-3-7B-Think-v4 \
  --model olmo3-7b \
  --port 8080

# Query the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olmo3-7b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100,
    "temperature": 0.6,
    "repetition_penalty": 1.15,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05
  }'

Quick Start with HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the unmodified OLMo-3-7B-Think weights in BF16 (~14 GB)
model = AutoModelForCausalLM.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openhubresearch/ATLAS-OLMo-3-7B-Think-v4"
)

# Build a ChatML prompt via the model's chat template
messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
    return_tensors="pt", return_dict=True
).to(model.device)

output = model.generate(
    **inputs, max_new_tokens=200,
    temperature=0.6, top_p=0.95, do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

About ATLAS

ATLAS (Active-inference Training with Learned Adaptive Stigmergy) is a next-generation LLM framework built in pure Rust with zero external crate dependencies — the SQLite principle applied to AI infrastructure. It fuses:

  • GraphPalace — Stigmergic memory palace with pheromone-guided navigation
  • ASTRA — Live discovery engine hitting NASA, WHO, World Bank APIs
  • TRM-CausalValidator — 7M-param recursive validator
  • Champagnat n-Morphic Framework — biologically-grounded training dynamics

22 crates. 600 tests. One coherent system. Zero external Rust dependencies.

Website: atlasagi.org · Observatory: Interactive Demo · Organization: OpenHub Research · Author: Robin Dey

Citation

@software{atlas2026,
  title       = {ATLAS: Active-inference Training with Learned Adaptive Stigmergy},
  author      = {Robin Dey},
  year        = {2026},
  institution = {OpenHub Research, Thailand},
  url         = {https://github.com/web3guru888/ATLAS},
  note        = {Pure Rust LLM framework. v4.1.0: 22 crates, 600 tests,
                 OLMo-3-7B-Think 61.7 tok/s on A100 (BF16, full GPU attention).
                 StigmergicHook GraphPalace bridge. Think suppression + anti-repetition defaults.}
}