olmo2-1b-uc2p79

A patent-pending compressed reference variant of allenai/OLMo-2-0425-1B at the Track A 2.798 bits per weight cohort design target (this specific fit measures 2.7819 bpw effective = 2.7819 from global_bpw body Linears + 0.0059 from codec overhead, both decoded from ultracompress.json). Compressed via the UltraCompress row-overlay quantization method (USPTO 64/049,511 + 64/049,517 β€” patent pending).
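A quick back-of-the-envelope check on those figures, using the 1.00B body-parameter count from the cohort table on this card (the remaining gap to the ~356 MB packed file is codec overhead and non-body tensors):

```python
# Packed size implied by the headline figures: body params x bpw / 8 bits per byte.
body_params = 1.00e9  # body Linears, per the cohort table on this card
bpw = 2.7819          # measured effective bits per weight for this fit

packed_mb = body_params * bpw / 8 / 1e6
print(f"{packed_mb:.1f} MB")  # ~347.7 MB, in line with the ~356 MB on-disk binary
```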

Read this first — this repository ships in dual format.

  • model.safetensors (~2.7 GB) β€” FP16 reconstruction. Loadable directly via transformers.from_pretrained. Use this if your runtime expects standard HF safetensors.
  • model.uc.bin (~356 MB at 2.7819 bpw on-disk) β€” the actually-packed binary at the claimed sub-3-bpw operating point. Loadable via pip install ultracompress. This is the artifact whose disk size matches the headline compression number.

Both files reconstruct the same compressed weights to within FP16 precision (verified bit-equivalent per the pack_v17.py round-trip protocol on this fit). Buyers pick based on runtime: enterprise inference platforms running standard transformers loaders use the safetensors; edge / on-device deployments using the UltraCompress runtime use the packed binary. The model card claims (2.798 bpw cohort design target, 2.7819 bpw measured on this fit) describe the information content of either file — the safetensors is bigger on disk but represents the same compressed model, not a different one.

The ultracompress.json manifest declares both files in its formats block with per-file SHA-256, so uc info validates either format end-to-end.
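What that end-to-end validation amounts to can be sketched in a few lines. This assumes a hypothetical manifest layout of the form `{"files": {name: {"sha256": ..., "size": ...}}}`; the real ultracompress.json schema may differ:

```python
import hashlib
import json
import os

def verify_manifest(root, manifest_name="ultracompress.json"):
    """Check every declared file's size and SHA-256 against disk."""
    with open(os.path.join(root, manifest_name)) as f:
        manifest = json.load(f)
    for name, entry in manifest["files"].items():
        path = os.path.join(root, name)
        if os.path.getsize(path) != entry["size"]:
            raise ValueError(f"size mismatch: {name}")
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Stream in 1 MiB chunks so multi-GB weight files fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != entry["sha256"]:
            raise ValueError(f"sha256 mismatch: {name}")
    return sorted(manifest["files"])
```

uc info additionally surfaces the compression metadata; this sketch covers only the tamper-evidence half.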

Quick start

pip install ultracompress
uc pull SipsaLabs/olmo2-1b-uc2p79
uc info ./models/SipsaLabs_olmo2-1b-uc2p79

The CLI streams the artifact, validates the manifest (SHA-256 + size for every declared file), and surfaces the compression metadata in one read.

Or with huggingface_hub directly:

from huggingface_hub import snapshot_download
local = snapshot_download("SipsaLabs/olmo2-1b-uc2p79")

Loading the model

The reconstructed weights are stored in standard HF FP16 safetensors layout, so any transformers-compatible runtime can load the model. Example:

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load the compressed weights from this repository
local = "./models/SipsaLabs_olmo2-1b-uc2p79"
cfg = AutoConfig.from_pretrained(local)
model = AutoModelForCausalLM.from_pretrained(
    local,
    dtype=torch.float16,
    config=cfg,
).to("cuda")

# The tokenizer is unchanged from the allenai/OLMo-2-0425-1B base.
# NOTE on trust_remote_code: this repository ships only pure quantized
# weights, so trust_remote_code=True is not needed to load the local
# artifact. The flag IS still passed to the upstream tokenizer below
# because the base model's tokenizer uses it; trusting the upstream
# model author is the customer's choice.
#
# We recommend loading the tokenizer directly from the upstream
# OLMo-2-1B repo, which is the path that AutoTokenizer auto-resolves
# most cleanly across transformers versions.
tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B", trust_remote_code=True)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

What's in this artifact

| File | Size | Description |
|------|------|-------------|
| model.safetensors | ~2.7 GB | FP16-reconstructed weights; direct transformers.from_pretrained compatibility |
| model.uc.bin | ~356 MB | Packed UltraCompress binary at 2.7819 bpw on-disk; load via pip install ultracompress |
| ultracompress.json | <2 KB | Provenance manifest: method, bpw, base-model id, USPTO references, license name, per-file SHA-256, and a formats block declaring both weight files |
| config.json | <2 KB | Inherited from the base OLMo-2-1B model |
| tokenizer.json / tokenizer_config.json / special_tokens_map.json / merges.txt / vocab.json / added_tokens.json / chat_template.jinja | ~14 MB | Tokenizer files copied from the base model |
| LICENSE | ~7 KB | Sipsa Labs Research and Evaluation License v1.0 (full text) |
| generation_config.json | <1 KB | Inherited from base |

uc info ./models/SipsaLabs_olmo2-1b-uc2p79 will validate every entry in the manifest's files block against the actual on-disk size and SHA-256 — tamper evidence you can read in one command.

Compression details

| Metric | Value |
|--------|-------|
| Method | UltraCompress row-overlay quantization (Track A) |
| Method version | v17hi |
| Operating-point bpw (cohort design target) | 2.798 |
| Measured effective bpw (this fit, packed-binary on-disk) | 2.7819 |
| Base model | allenai/OLMo-2-0425-1B |
| On-disk file size | ~2.7 GB (FP16 reconstruction; see "Read this first" above) |
| Patent posture | USPTO 64/049,511 (Track A) + 64/049,517 (Track B) — patent pending |
| Filed | 2026-04-25 |

The 2.782-bpw operating point is the v17hi line of the patent-pending row-overlay quantization method described in USPTO 64/049,511. A complementary 2.40-bpw operating point on the same base model is documented internally (v17 line; packed-binary round-trip verified on Qwen3-1.7B and 5 other models in the Sipsa Labs cohort) and will be published as a sibling artifact in this organization.

Catastrophic-failure check

A "catastrophic failure" is defined as a downstream-task perplexity ratio greater than 10Γ— the FP16 baseline. On Sipsa Labs' internal 6-model cohort (TinyLlama-1.1B, OLMo-2-1B, SmolLM2-1.7B, Qwen3-1.7B, Mistral-7B-v0.3, Qwen3-8B) at the Track A operating point: 0 of 6 models exhibit catastrophic failure. This artifact (OLMo-2-1B at 2.782 bpw): non-catastrophic.

The cohort framing matters — this is a property of the method on this cohort, not an absolute claim about every possible base model.
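The definition reads directly as a predicate; the 1.165 input is this artifact's measured WikiText-103 PPL ratio from the cohort table on this card:

```python
# "Catastrophic" per this card: downstream perplexity ratio vs. the FP16
# baseline exceeding 10x.
def is_catastrophic(ppl_ratio_vs_fp16, threshold=10.0):
    return ppl_ratio_vs_fp16 > threshold

print(is_catastrophic(1.165))  # this artifact's measured ratio -> False
```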

Cohort scaling — retention scales with model size

The same Track A 2.798 bpw operating point measured across the 6-model Sipsa cohort (n=500, seed=42, WikiText-103 perplexity ratio):

| Model | Body params | T1 retention vs FP16 | T10 retention vs FP16 | PPL ratio |
|-------|-------------|----------------------|----------------------|-----------|
| OLMo-2-1B | 1.00B | 94.19% | 97.04% | 1.165 |
| TinyLlama-1.1B | 1.10B | 96.37% | 97.88% | 1.097 |
| SmolLM2-1.7B | 1.71B | 93.72% | 96.71% | 1.218 |
| Qwen3-1.7B | 1.72B | 93.81% | 96.55% | 1.225 |
| Mistral-7B-v0.3 | 7.25B | 98.04% | 99.06% | 1.075 |
| Qwen3-8B | 8.19B | 97.63% | 98.84% | 1.067 |

Spearman rank correlation between body-parameter count and T1 retention: +0.486 for UltraCompress. bitsandbytes NF4 at 4.0 bpw on the same cohort: −0.086 — essentially flat.
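The +0.486 figure can be re-derived from the cohort table in a few lines of pure Python (standard Spearman formula; this data has no ties):

```python
# Recompute the Spearman rank correlation between body-parameter count
# and T1 retention from the six cohort rows.
params = [1.00, 1.10, 1.71, 1.72, 7.25, 8.19]        # body params, billions
t1 = [94.19, 96.37, 93.72, 93.81, 98.04, 97.63]      # T1 retention, %

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rp, rt = ranks(params), ranks(t1)
n = len(params)
d2 = sum((a - b) ** 2 for a, b in zip(rp, rt))
rho = 1 - 6 * d2 / (n * (n * n - 1))  # Spearman's rho, no-ties formula
print(round(rho, 3))  # 0.486
```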

Across the cohort, UltraCompress T1 retention spans +4.32 percentage points from its lowest value (93.72%, SmolLM2-1.7B) to its highest (98.04%, Mistral-7B-v0.3); NF4 spans +1.93 pp on the same cohort. UltraCompress's scaling slope is roughly 2.2× NF4's.

The mechanism is design-level: row-overlay's per-row scale + learned codebook + rotation matrix calibrate to the per-model magnitude distribution. Larger transformer matrices give the codebook more rows to learn from. NF4 is a fixed dictionary — no per-model adaptation, no scaling.
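As a design-level illustration only: the toy sketch below shows a per-row scale plus a shared codebook with nearest-entry assignment. The codebook here is a fixed stand-in, and this is not the patented row-overlay method; it only illustrates why a codebook that can adapt per model tracks magnitude distributions in a way a fixed dictionary like NF4 cannot.

```python
# Toy per-row quantizer: one scale per row, indices into a shared codebook.
# Fixed stand-in codebook; an adaptive method would learn this per model.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]

def quantize_row(row):
    scale = max(abs(x) for x in row) or 1.0  # per-row scale
    idx = [min(range(len(CODEBOOK)), key=lambda i, x=x: abs(x / scale - CODEBOOK[i]))
           for x in row]                     # nearest codebook entry per weight
    return scale, idx

def dequantize_row(scale, idx):
    return [scale * CODEBOOK[i] for i in idx]

scale, idx = quantize_row([0.8, -0.1, 0.35, -0.7])
print(dequantize_row(scale, idx))  # [0.8, 0.0, 0.4, -0.8]
```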

n=6 caveat: this is the 6-model cohort tested by Sipsa Labs. The scaling claim is a property of the method on this cohort. Generalization to broader cohorts is the open empirical question — replication invited; open an issue at github.com/sipsalabs/ultracompress/issues with model + seed + result.

Quality benchmarks

A small live benchmark is included below as a sanity-check; the full per-task benchmark numbers are intended to be reproduced in the buyer's own evaluation harness against their own baselines.

Reference benchmark (this artifact)

| Task | Setup | Compressed acc | Compressed acc_norm |
|------|-------|----------------|---------------------|
| HellaSwag | TBD (reproducible via the lm-eval command below) | TBD | TBD |

The reference subset is intentionally small: this is a sanity-check number, not a final eval. The full 10,042-sample HellaSwag benchmark, plus ARC-Challenge, MMLU, and any buyer-specific tasks, should be reproduced by the buyer with the command below. Different sample counts and seed configurations will produce different numbers.

Reproduce

# via lm-eval-harness directly (pointing the tokenizer at the upstream repo
# sidesteps tokenizer-from-local-path resolution issues in transformers 4.57.x):
python -m lm_eval --model hf \
    --model_args "pretrained=./models/SipsaLabs_olmo2-1b-uc2p79,tokenizer=allenai/OLMo-2-0425-1B,dtype=float16,trust_remote_code=True" \
    --tasks hellaswag,arc_challenge,mmlu \
    --limit 500 --batch_size 8 --device cuda:0

For a paired FP16-baseline-vs-compressed comparison on the same task and same seed (the right way to read retention numbers), substitute pretrained=allenai/OLMo-2-0425-1B in a separate run and compare task-by-task.
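Reading retention off the two runs is then a per-task division. All numbers in this sketch are illustrative placeholders, not measured results:

```python
# Per-task retention = compressed accuracy / FP16-baseline accuracy,
# computed from two otherwise identical lm-eval runs (same tasks, same
# --limit, same seed). Placeholder numbers throughout.
baseline = {"hellaswag": 0.480, "arc_challenge": 0.320}    # FP16 run
compressed = {"hellaswag": 0.456, "arc_challenge": 0.296}  # compressed run

retention = {task: compressed[task] / baseline[task] for task in baseline}
for task in sorted(retention):
    print(f"{task}: {retention[task]:.1%} of FP16")
```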

The cohort-level claim (95.6% T1 retention, zero catastrophic failures across 6 models at the Track A operating point) comes from the WikiText-103 perplexity protocol documented in the patent specifications, not from HellaSwag accuracy. Different evaluation surfaces measure different things; the artifact-specific numbers above are the reproducible HellaSwag sanity-check, not the full cohort claim.

For a Compression Assessment engagement that includes the buyer's specific baseline, evaluation tasks, and a written readout: email founder@sipsalabs.com.

Intended use

Permitted under this License (free of charge):

  • Personal, non-commercial research
  • Academic research at non-profit institutions (with attribution)
  • Pre-purchase evaluation by an enterprise considering negotiating a commercial license β€” for a period not to exceed 90 days from first download

Requires a separate commercial license (email legal@sipsalabs.com):

  • Production deployment in any commercial product or service
  • Use in an API or hosted inference service offered to third parties
  • Embedding in or shipping within hardware products, consumer devices, automobiles, robotics platforms
  • Training of any derivative model for commercial use
  • Any use by for-profit entities other than internal evaluation

The full License is in LICENSE and at sipsalabs.com.

Out-of-scope use

This artifact is published for research and evaluation. It is not intended for safety-critical, life-critical, or human-subject decision-making applications. Compression introduces measurable quality regression versus the FP16 baseline; do not deploy this artifact in a setting where that regression is unacceptable. Run the buyer's own evaluation before any production decision.

Limitations

  • The compression methods are post-training and preserve the base model's strengths and weaknesses. Whatever bias, refusal behavior, or out-of-distribution failure modes the OLMo-2-1B FP16 base has, this artifact inherits.
  • This release stores reconstructed weights in FP16 layout β€” the runtime savings live in the loader and the future packed model.uc.bin artifact, not in this model.safetensors file's on-disk footprint.
  • Direct integration with quantization-aware runtimes (llama.cpp / TensorRT-LLM / vLLM quantization paths) is on the v0.2 roadmap. For v0.1.x, transformers and the UltraCompress CLI are the supported load paths.

Reproducibility

Every public claim on this card maps to a verifiable on-disk artifact:

# Pull this artifact
uc pull SipsaLabs/olmo2-1b-uc2p79

# Verify the manifest end-to-end (size + SHA-256 for every declared file)
uc info ./models/SipsaLabs_olmo2-1b-uc2p79

# Reproduce the benchmark numbers in your own evaluation harness
uc bench ./models/SipsaLabs_olmo2-1b-uc2p79 \
    --tasks hellaswag,arc_challenge,mmlu \
    --limit 500 --batch-size 8 --device cuda:0

For a SHA-256 manifest of all training and evaluation inputs that produced this artifact (private context, available under NDA): legal@sipsalabs.com.

Citation

If you use this artifact in research, please cite:

@misc{ounnar2026ultracompress,
  title        = {UltraCompress: Patent-Pending Compression Infrastructure for Large Language Models},
  author       = {{Sipsa Labs, Inc.}},
  year         = {2026},
  note         = {U.S.\ patent applications 64/049,511 and 64/049,517, patent pending. Filed 2026-04-25.},
  howpublished = {\url{https://sipsalabs.com}},
}

Get in touch

Commercial licensing: legal@sipsalabs.com. Compression Assessment engagements: founder@sipsalabs.com.
Sipsa Labs, Inc. — sipsalabs.com — patent pending — USPTO 64/049,511 + 64/049,517 (filed 2026-04-25)
