Orchid 1.0

First competitive LLM trained and aligned in Colombia: a 2B ternary-weight language model fine-tuned from Microsoft BitNet b1.58-2B-4T on a single RTX 3050 laptop (4 GB VRAM). Orchid is multilingual (it inherits BitNet's broad language coverage; alignment fine-tuning focused on English and Spanish), aligned for unbiased responses using ORPO, and designed to run on consumer hardware without cloud dependency.

Open In Colab: try it free on GPU (~3 min setup)

Inference note: Orchid uses the BitNet I2_S (ternary) format with a separate LoRA adapter. Standard llama.cpp cannot serve this combination correctly. Use ternative.cpp, the custom C++ inference engine built for this model.


Model Files

| File | Size | Purpose |
|---|---|---|
| ggml-model-i2_s.gguf | ~1.1 GB | BitNet b1.58-2B-4T base (I2_S ternary format) |
| dpo_aligned-lora.gguf | ~90 MB | ORPO-3 aligned LoRA adapter (F32, 420 tensors) |

Download both files to run Orchid. The base GGUF contains the ternary weights; the adapter applies the alignment fine-tuning at runtime without re-quantizing.


Quick Start

1. Download

huggingface-cli download MicheRomChis/orchid-1.0 \
  ggml-model-i2_s.gguf dpo_aligned-lora.gguf \
  --local-dir ./orchid-models
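
For scripted setups, the same two files can also be fetched from Python. The sketch below assumes the huggingface_hub package is installed and uses its hf_hub_download function; the download_orchid helper and the REPO_ID / ORCHID_FILES names are illustrative and not part of the Orchid repository.

```python
from pathlib import Path

REPO_ID = "MicheRomChis/orchid-1.0"
ORCHID_FILES = ["ggml-model-i2_s.gguf", "dpo_aligned-lora.gguf"]


def download_orchid(local_dir="./orchid-models"):
    """Download the base GGUF and the LoRA adapter; return their local paths."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import hf_hub_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [
        Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target))
        for name in ORCHID_FILES
    ]

# Usage (needs network access):
#   base_path, lora_path = download_orchid()
```

Both paths can then be passed to the --model and --lora flags shown in step 3.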

2. Build ternative

# Linux / macOS (requires cmake 3.18+ and a C++17 compiler)
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build --parallel
cd ..

# Windows (PowerShell; requires MSVC 2022 or MinGW, plus cmake 3.18+)
# --config Release is needed with multi-config generators such as Visual Studio,
# which otherwise build Debug and ignore CMAKE_BUILD_TYPE
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release; cmake --build build --config Release --parallel
cd ..

GPU build (NVIDIA CUDA 12.x): add -DTERNATIVE_CUDA=ON to the cmake command.

3. Generate text

# Linux / macOS
./ternative/build/ternative \
  --model ./orchid-models/ggml-model-i2_s.gguf \
  --lora  ./orchid-models/dpo_aligned-lora.gguf \
  --prompt "¿Cuál es la capital de Colombia?" \
  --max-tokens 200

# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
  --model .\orchid-models\ggml-model-i2_s.gguf `
  --lora  .\orchid-models\dpo_aligned-lora.gguf `
  --prompt "What is photosynthesis? Think step by step." `
  --max-tokens 300

4. Run as OpenAI-compatible server

# Linux / macOS
./ternative/build/ternative \
  --model ./orchid-models/ggml-model-i2_s.gguf \
  --lora  ./orchid-models/dpo_aligned-lora.gguf \
  --server --port 8080

# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
  --model .\orchid-models\ggml-model-i2_s.gguf `
  --lora  .\orchid-models\dpo_aligned-lora.gguf `
  --server --port 8080

Then use any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="orchid",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}]
)
print(response.choices[0].message.content)

Why ternative.cpp?

Standard inference stacks cannot serve LoRA-fine-tuned ternary models correctly:

| Engine | I2_S base | Runtime LoRA | I2_S + LoRA |
|---|---|---|---|
| llama.cpp | ⚠️ type-36 error | ✓ (Q4/Q8 only) | ✗ |
| bitnet.cpp | ✓ | ✗ no adapter path | ✗ |
| ternative.cpp | ✓ | ✓ full precision | ✓ |

The problem: merging a LoRA adapter into an I2_S base and re-quantizing rounds every delta to zero, so the fine-tuning is silently discarded. ternative.cpp avoids this by de-quantizing the I2_S base to F32, applying the LoRA delta at full precision, and casting to F16 for inference.
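
The effect can be reproduced on a toy layer. The NumPy sketch below is an illustration only, not ternative.cpp code: it assumes per-tensor absmean ternary quantization as described for BitNet b1.58, and the matrix size, scale, rank, and delta magnitudes are made-up values chosen to resemble a small LoRA update.

```python
import numpy as np


def absmean_ternary(w):
    """Quantize to {-1, 0, +1} with a per-tensor absmean scale (BitNet b1.58 style)."""
    scale = np.mean(np.abs(w))
    return np.clip(np.round(w / scale), -1, 1), scale


rng = np.random.default_rng(0)

# Toy ternary base layer: codes in {-1, 0, +1} times a made-up scale.
base_codes = rng.integers(-1, 2, size=(64, 64)).astype(np.float32)
base = base_codes * 0.02  # de-quantized F32 weights

# Toy rank-8 LoRA delta with small, made-up magnitudes.
lora_a = rng.normal(0.0, 0.01, size=(64, 8))
lora_b = rng.normal(0.0, 0.01, size=(8, 64))
delta = lora_a @ lora_b

# Path 1: merge, then re-quantize to ternary. The delta is rounded away.
merged_codes, _ = absmean_ternary(base + delta)
print("codes changed by merge + re-quantize:", int(np.sum(merged_codes != base_codes)))

# Path 2: keep the delta at full precision and cast the sum to F16.
runtime = (base + delta).astype(np.float16)
kept = np.mean(runtime != base.astype(np.float16))
print(f"weights that still differ from the base in F16: {kept:.0%}")
```

In this toy setting the re-quantized codes are identical to the base codes, while the F16 weights on the second path still carry the adapter's contribution.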


Benchmark Results

Standard Benchmarks (lm-eval-harness methodology, 50 samples each)

Scored via log-probability against a live ternative.cpp server. The methodology matches lm-evaluation-harness.

| Benchmark | Orchid 1.0 | BitNet b1.58-2B (base) | Delta |
|---|---|---|---|
| ARC-Challenge | 56.0% | 49.9% | +6.1 pp |
| HellaSwag (length-norm) | 52.0% | 68.4% | −16.4 pp |
| WinoGrande | 74.0% | - | - |
| MMLU (57 subjects) | 38.6% | 53.2% | −14.6 pp |

The ARC-Challenge gain (+6.1 pp) suggests that the reasoning fine-tuning transferred. The HellaSwag and MMLU regressions are the expected ORPO alignment tax: the model trades some factual-recall breadth for reasoning quality and bias mitigation, consistent with published DPO/ORPO literature.

WinoGrande at 74.0% is strong for 2B parameters, comparable to the published score of Llama 3.2 3B (~74%).
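
Because each benchmark was scored on 50 samples, the reported accuracies carry sampling uncertainty. As a rough guide (a normal-approximation sketch, not part of the original evaluation), the binomial standard error can be computed as follows. MMLU is left out because the total sample count across its 57 subjects is not stated.

```python
import math


def standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


N = 50  # samples per benchmark, as stated in the section heading
scores = {"ARC-Challenge": 0.56, "HellaSwag": 0.52, "WinoGrande": 0.74}

for name, acc in scores.items():
    se = standard_error(acc, N)
    # 1.96 * SE is the usual normal-approximation 95% half-width.
    print(f"{name}: {acc:.1%} +/- {1.96 * se:.1%} (SE {se:.1%})")
```

At this sample size the standard error is roughly 6 to 7 percentage points, so differences of a few points are best read with that margin in mind.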

Internal Benchmark v2 (semantic scoring, 100 questions, 8 categories)

| Rank | Model | Score |
|---|---|---|
| 1 | Claude 3.5 Sonnet | 89.5% |
| 2 | GPT-4o | 89.2% |
| 3 | Orchid 1.0 | 87.9% |
| 4 | BitNet b1.58-2B base | 84.2% |
| 5 | Kimi k1.5 | 82.2% |
| 6 | Qwen2.5-7B | 78.4% |

Orchid ranks #3 of 11 models on our internal benchmark, above all tested open-weight models including 7B–9B parameter models. Science: 100%, Math: 93.3%, Coding: 93.3%.

Note: the internal benchmark uses semantic similarity scoring and is a relative comparison tool, not a substitute for standard NLP benchmarks.


Training Details

All training was performed on a single NVIDIA RTX 3050 laptop GPU (4 GB VRAM, 16 GB RAM, Windows 11), with no cloud compute.

| Stage | Method | Data | Duration |
|---|---|---|---|
| SFT-A | LoRA r=16 | Reasoning / chain-of-thought (50 samples, validation run) | ~1 h |
| SFT-B | LoRA r=16 | 5,500 samples (5k identity + 500 knowledge) | ~88 h wall-clock |
| ORPO-2 | LoRA r=8 | 2,038 preference pairs (debiasing + UltraFeedback) | ~26 h |
| ORPO-3 | LoRA r=8 | 2,104 preference pairs (Colombia identity focus) | ~54 h |

UltraFeedback note: The ORPO-2 stage includes a subset of preference pairs drawn from the UltraFeedback dataset (Cui et al., 2023, MIT License). We use the published preference pairs as a downstream consumer of the released dataset; we did not commission GPT-4 annotations. Full attribution and licensing responsibility rests with the original dataset authors.

Memory techniques that made 4 GB training possible:

  • Pre-tokenize the dataset before loading the model (prevents startup OOM)
  • device_map="auto": GPU + CPU split via Accelerate
  • Gradient checkpointing + bf16=True
  • ORPO with ref_model=None: saves ~1.2 GB vs DPO
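
The first three techniques can be sketched with standard Hugging Face calls. This is an illustration rather than the author's training script: the helper names, max_length, batch size, and accumulation steps are assumptions, and imports are deferred so that tokenization can finish before any model weights are loaded.

```python
def pretokenize(texts, tokenizer_name, max_length=512):
    """Tokenize the whole dataset up front, before the model occupies memory."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return [
        tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
        for text in texts
    ]


def load_model(model_name):
    """Load the model and let Accelerate split layers between GPU and CPU."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")


def memory_saving_args(output_dir):
    """Keyword arguments intended for transformers.TrainingArguments."""
    return {
        "output_dir": output_dir,
        "gradient_checkpointing": True,      # recompute activations in backward
        "bf16": True,                        # bfloat16 mixed precision
        "per_device_train_batch_size": 1,    # assumption, not from the model card
        "gradient_accumulation_steps": 8,    # assumption, not from the model card
    }
```

The intended order is pretokenize first, then load_model, then building the trainer with the arguments from memory_saving_args.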

Training scripts: github.com/MichelangeloRomeroChisco/orchid


Hardware Requirements

| | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 0 (CPU-only works) | 4 GB (RTX 3050 class) |
| RAM | 8 GB | 16 GB |
| Storage | 1.3 GB | 2 GB |
| OS | Windows / Linux / macOS | - |

GPU mode: all 30 transformer layers offload to GPU using mixed F16 + INT8 quantization (~3.3 GB VRAM). CPU mode: ~6 tok/s with AVX2.


Limitations

  • MMLU at 38.6%: alignment tax from ORPO. Expected and documented in the technical paper.
  • Spanish coverage: 80% on internal benchmark. Functional but not state-of-the-art.
  • Context window: 4,096 tokens (inherited from BitNet base).
  • ternative.cpp required: llama.cpp produces type-36 errors or silently wrong output.
  • Do not use BitsAndBytes: stacking BNB quantization on top of BitNet's runtime ternary quantization is unsupported.
  • Identity requires system prompt: without a system prompt Orchid may respond generically; ORPO baked the identity partially but not completely.
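
For the last point, a system message can be prepended to every request. The sketch below builds an OpenAI-style chat payload; the build_chat_payload helper and the wording of DEFAULT_SYSTEM_PROMPT are hypothetical, since the model card does not publish an official system prompt.

```python
import json

# Hypothetical wording; adjust to the identity you want Orchid to present.
DEFAULT_SYSTEM_PROMPT = "You are Orchid, an AI assistant trained and aligned in Colombia."


def build_chat_payload(user_text, system_prompt=DEFAULT_SYSTEM_PROMPT, model="orchid"):
    """Return an OpenAI-style chat payload with a leading system message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


payload = build_chat_payload("Who are you?")
print(json.dumps(payload, indent=2, ensure_ascii=False))
```

The messages list can be passed as the messages argument of client.chat.completions.create when talking to the server from Quick Start step 4.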

Technical Paper

Full methodology, training details, failure modes, and architecture analysis:

Orchid 1.0: A Reproducible Recipe for Aligned Ternary-Weight Language Models on Consumer Hardware


License

Apache 2.0: free for research and commercial use.

This model is a fine-tuned derivative of Microsoft BitNet b1.58-2B-4T (MIT License).

Copyright (c) Microsoft Corporation. BitNet b1.58-2B-4T is released under the MIT License. The MIT License requires this copyright notice to accompany any distribution of derivative works. Full license text: https://opensource.org/licenses/MIT


Citation

@software{orchid_2026,
  title   = {Orchid 1.0: First Competitive LLM Trained and Aligned in Colombia - Ternary-Weight Fine-Tuning on Consumer Hardware},
  author  = {Romero Chisco, Michelangelo},
  year    = {2026},
  url     = {https://huggingface.co/MicheRomChis/orchid-1.0},
  license = {Apache-2.0},
  note    = {Fine-tuned from Microsoft BitNet b1.58-2B-4T}
}

Acknowledgments

  • Microsoft Research: BitNet b1.58-2B-4T base model and architecture
  • The ggml / llama.cpp project: GGUF format conventions
  • Hugging Face: training libraries (PEFT, TRL, Transformers, Accelerate)