Instructions for using MicheRomChis/orchid-1.0 with libraries and local apps.

Libraries

How to use MicheRomChis/orchid-1.0 with llama-cpp-python:

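# !pip install llama-cpp-python

from llama_cpp import Llama

# Caution: this file is the LoRA adapter alone. Per the inference note below,
# llama.cpp-based stacks cannot serve the I2_S base + LoRA combination correctly.
llm = Llama.from_pretrained(
	repo_id="MicheRomChis/orchid-1.0",
	filename="dpo_aligned-lora.gguf",
)

# create_chat_completion expects a list of role/content messages, not a string.
response = llm.create_chat_completion(
	messages=[{"role": "user", "content": "What is the capital of Colombia?"}]
)
print(response["choices"][0]["message"]["content"])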

Local Apps

llama.cpp

How to use MicheRomChis/orchid-1.0 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
llama-cli -hf MicheRomChis/orchid-1.0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
llama-cli -hf MicheRomChis/orchid-1.0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
./llama-cli -hf MicheRomChis/orchid-1.0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf MicheRomChis/orchid-1.0

Use Docker

docker model run hf.co/MicheRomChis/orchid-1.0

Ollama
How to use MicheRomChis/orchid-1.0 with Ollama:
```
ollama run hf.co/MicheRomChis/orchid-1.0
```

Unsloth Studio

How to use MicheRomChis/orchid-1.0 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Docker Model Runner
How to use MicheRomChis/orchid-1.0 with Docker Model Runner:
```
docker model run hf.co/MicheRomChis/orchid-1.0
```

Lemonade

How to use MicheRomChis/orchid-1.0 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull MicheRomChis/orchid-1.0

Run and chat with the model

lemonade run user.orchid-1.0-{{QUANT_TAG}}

List all available models

lemonade list

Orchid 1.0

First competitive LLM trained and aligned in Colombia — a 2B ternary-weight language model fine-tuned from Microsoft BitNet b1.58-2B-4T on a single RTX 3050 laptop (4 GB VRAM). Orchid is multilingual (inherits BitNet's broad language coverage; alignment fine-tuning focused on English and Spanish), aligned for unbiased responses using ORPO, and designed to run on consumer hardware without cloud dependency.


Inference note: Orchid uses the BitNet I2_S (ternary) format with a separate LoRA adapter. Standard llama.cpp cannot serve this combination correctly. Use ternative.cpp — the custom C++ inference engine built for this model.

Model Files

File	Size	Purpose
`ggml-model-i2_s.gguf`	~1.1 GB	BitNet b1.58-2B-4T base (I2_S ternary format)
`dpo_aligned-lora.gguf`	~90 MB	ORPO-3 aligned LoRA adapter (F32, 420 tensors)

Download both files to run Orchid. The base GGUF contains the ternary weights; the adapter applies the alignment fine-tuning at runtime without re-quantizing.

Quick Start

1. Download

huggingface-cli download MicheRomChis/orchid-1.0 \
  ggml-model-i2_s.gguf dpo_aligned-lora.gguf \
  --local-dir ./orchid-models
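
Before building the engine, it can help to confirm that both files actually landed in the download directory. The following is a minimal sketch and not part of the Orchid tooling: the helper name `missing_model_files` is hypothetical, while the directory and file names are taken from the download command above.

```python
from pathlib import Path

# File names from the Model Files table; both are needed at runtime.
REQUIRED_FILES = ("ggml-model-i2_s.gguf", "dpo_aligned-lora.gguf")


def missing_model_files(model_dir):
    """Return the required file names that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = root / name
        # Treat a zero-byte file as missing: it usually means an aborted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_files("./orchid-models")
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("Both model files are present.")
```

An empty result means the base GGUF and the adapter are both in place; the check does not validate file contents or checksums.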

2. Build ternative

# Linux / macOS — requires cmake 3.18+ and a C++17 compiler
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build --parallel
cd ..

# Windows (PowerShell) — requires MSVC 2022 or MinGW + cmake 3.18+
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release; cmake --build build --parallel
cd ..

GPU build (NVIDIA CUDA 12.x): add -DTERNATIVE_CUDA=ON to the configure step, i.e. the first cmake command (cmake -B build ...).

3. Generate text

# Linux / macOS
./ternative/build/ternative \
  --model ./orchid-models/ggml-model-i2_s.gguf \
  --lora  ./orchid-models/dpo_aligned-lora.gguf \
  --prompt "¿Cuál es la capital de Colombia?" \
  --max-tokens 200

# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
  --model .\orchid-models\ggml-model-i2_s.gguf `
  --lora  .\orchid-models\dpo_aligned-lora.gguf `
  --prompt "What is photosynthesis? Think step by step." `
  --max-tokens 300

4. Run as OpenAI-compatible server

# Linux / macOS
./ternative/build/ternative \
  --model ./orchid-models/ggml-model-i2_s.gguf \
  --lora  ./orchid-models/dpo_aligned-lora.gguf \
  --server --port 8080

# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
  --model .\orchid-models\ggml-model-i2_s.gguf `
  --lora  .\orchid-models\dpo_aligned-lora.gguf `
  --server --port 8080

Then use any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="orchid",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}]
)
print(response.choices[0].message.content)

Why ternative.cpp?

Standard inference stacks cannot serve LoRA-fine-tuned ternary models correctly:

Engine	I2_S base	Runtime LoRA	I2_S + LoRA
llama.cpp	⚠️ type-36 error	✓ (Q4/Q8 only)	✗
bitnet.cpp	✓	✗ no adapter path	✗
ternative.cpp	✓	✓ full precision	✓

The problem: merging a LoRA adapter into an I2_S base and re-quantizing rounds every delta to zero — the fine-tuning is silently discarded. ternative.cpp avoids this by de-quantizing the I2_S base to F32, applying the LoRA delta at full precision, and casting to F16 for inference.
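
The rounding problem described above can be reproduced on a toy matrix. The sketch below is an illustration under stated assumptions, not ternative.cpp code: it uses the absmean ternary rounding described in the BitNet b1.58 paper rather than the actual I2_S packing, and the matrix size, base scale, LoRA rank, and LoRA values are all made up for the example.

```python
def absmean_ternary(weights):
    """Quantize a 2-D list of floats to codes in {-1, 0, 1} plus one scale."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat)
    codes = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return codes, scale


def lora_delta(B, A, scaling):
    """Dense LoRA update: scaling * (B @ A)."""
    rank, cols = len(A), len(A[0])
    return [[scaling * sum(b_row[k] * A[k][j] for k in range(rank))
             for j in range(cols)] for b_row in B]


ROWS, COLS, RANK = 8, 8, 4
BASE_SCALE = 0.05  # illustrative per-tensor scale, not a real Orchid value

# A deterministic ternary base and its de-quantized float form.
base_codes = [[(i + 2 * j) % 3 - 1 for j in range(COLS)] for i in range(ROWS)]
base_f32 = [[BASE_SCALE * c for c in row] for row in base_codes]

# Small LoRA factors, so the update is tiny next to the base scale.
B = [[0.01 for _ in range(RANK)] for _ in range(ROWS)]
A = [[0.01 * ((k + j) % 3 - 1) for j in range(COLS)] for k in range(RANK)]
delta = lora_delta(B, A, scaling=2.0)

merged = [[w + d for w, d in zip(w_row, d_row)]
          for w_row, d_row in zip(base_f32, delta)]

# Path 1: merge, then re-quantize to ternary. The codes come back unchanged.
requantized_codes, _ = absmean_ternary(merged)

# Path 2: keep the merged weights in floating point. The update survives.
max_update = max(abs(d) for row in delta for d in row)

print("codes unchanged after re-quantizing:", requantized_codes == base_codes)
print("largest update kept by the float path:", max_update)
```

In this toy case every update is far smaller than half a quantization step, so rounding returns the original ternary codes and the adapter's contribution disappears, while the float path keeps it. The final cast to F16 is left out of the sketch.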

Benchmark Results

Standard Benchmarks (lm-eval-harness methodology, 50 samples each)

Scored via log-probability on a live ternative.cpp server. Methodology matches lm-evaluation-harness exactly.

Benchmark	Orchid 1.0	BitNet b1.58-2B (base)	Delta
ARC-Challenge	56.0%	49.9%	+6.1 pp
HellaSwag (length-norm)	52.0%	68.4%	−16.4 pp
WinoGrande	74.0%	—	—
MMLU (57 subjects)	38.6%	53.2%	−14.6 pp

The ARC-Challenge gain (+6.1 pp) confirms the reasoning fine-tuning transferred. HellaSwag and MMLU regressions are the expected ORPO alignment tax — the model trades some factual-recall breadth for reasoning quality and bias mitigation, consistent with published DPO/ORPO literature.

WinoGrande at 74.0% is strong for 2B parameters — comparable to the published score of Llama 3.2 3B (~74%).
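
When reading scores measured on 50 samples, it is worth keeping the sampling error in mind. The sketch below applies the standard binomial standard-error formula with a normal-approximation interval; the function names are hypothetical, and the sample count of 50 is taken from the section heading. MMLU is left out because 38.6% is not a multiple of 1/50, so its sample count is presumably different.

```python
import math


def binomial_standard_error(accuracy, n_samples):
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def confidence_interval_95(accuracy, n_samples):
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * binomial_standard_error(accuracy, n_samples)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Orchid 1.0 scores from the table above, each reported on 50 samples.
scores = {"ARC-Challenge": 0.56, "HellaSwag": 0.52, "WinoGrande": 0.74}

for name, acc in scores.items():
    low, high = confidence_interval_95(acc, 50)
    print(f"{name}: {acc:.1%} (95% interval roughly {low:.1%} to {high:.1%})")
```

At 50 samples the standard error near 56% is about 7 percentage points, so differences of a few points between models are within sampling noise and larger sample counts would be needed to confirm them.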

Internal Benchmark v2 (semantic scoring, 100 questions, 8 categories)

Rank	Model	Score
1	Claude 3.5 Sonnet	89.5%
2	GPT-4o	89.2%
3	Orchid 1.0	87.9%
4	BitNet b1.58-2B base	84.2%
5	Kimi k1.5	82.2%
6	Qwen2.5-7B	78.4%

Orchid ranks #3 of 11 models on our internal benchmark (the table above lists the top six), above all tested open-weight models including 7B–9B parameter models. Category scores: Science 100%, Math 93.3%, Coding 93.3%.

Note: the internal benchmark uses semantic similarity scoring and is a relative comparison tool, not a substitute for standard NLP benchmarks.

Training Details

All training was performed on a single NVIDIA RTX 3050 laptop GPU (4 GB VRAM, 16 GB RAM, Windows 11) — no cloud compute.

Stage	Method	Data	Duration
SFT-A	LoRA r=16	Reasoning / chain-of-thought (50 samples, validation run)	~1 h
SFT-B	LoRA r=16	5,500 samples (5k identity + 500 knowledge)	~88 h wall-clock
ORPO-2	LoRA r=8	2,038 preference pairs (debiasing + UltraFeedback)	~26 h
ORPO-3	LoRA r=8	2,104 preference pairs (Colombia identity focus)	~54 h

UltraFeedback note: The ORPO-2 stage includes a subset of preference pairs drawn from the UltraFeedback dataset (Cui et al., 2023, MIT License). We use the published preference pairs as a downstream consumer of the released dataset — we did not commission GPT-4 annotations. Full attribution and licensing responsibility rests with the original dataset authors.

Memory techniques that made 4 GB training possible:

Pre-tokenize dataset before loading model (prevents startup OOM)
device_map="auto" — GPU + CPU split via Accelerate
Gradient checkpointing + bf16=True
ORPO with ref_model=None — saves ~1.2 GB vs DPO
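
The last item follows from the shape of the ORPO objective: it needs only the policy's own log-probabilities, whereas DPO also needs log-probabilities from a frozen reference model held in memory. The sketch below writes out the per-pair loss from the ORPO paper in plain Python. It is an illustration, not the training code: the function names and the default weight `lam=0.1` are assumptions, and the inputs are length-averaged log-probabilities that a real trainer would compute from the model.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT loss on the chosen answer plus a weighted odds-ratio penalty.

    Both inputs come from the model being trained; no reference model appears.
    """
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = softplus(-log_odds_ratio)  # equals -log(sigmoid(ratio))
    return sft_term + lam * odds_ratio_term


# The loss is lower when the chosen answer is the more likely one.
print(orpo_loss(-0.2, -1.5))
print(orpo_loss(-1.5, -0.2))
```

Because no second copy of the model is needed for the reference term, the memory that DPO would spend on it is freed, which is consistent with the saving quoted in the list above.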

Training scripts: github.com/MichelangeloRomeroChisco/orchid

Hardware Requirements

	Minimum	Recommended
GPU VRAM	0 (CPU-only works)	4 GB (RTX 3050 class)
RAM	8 GB	16 GB
Storage	1.3 GB	2 GB
OS	Windows / Linux / macOS	—

GPU mode: all 30 transformer layers offload to GPU using mixed F16 + INT8 quantization (~3.3 GB VRAM). CPU mode: ~6 tok/s with AVX2.

Limitations

MMLU at 38.6% — alignment tax from ORPO. Expected and documented in the technical paper.
Spanish coverage — 80% on internal benchmark. Functional but not state-of-the-art.
Context window — 4,096 tokens (inherited from BitNet base).
ternative.cpp required — llama.cpp produces type-36 errors or silently wrong output.
Do not use BitsAndBytes — stacking BNB quantization on top of BitNet's runtime ternary quantization is unsupported.
Identity requires system prompt — without a system prompt Orchid may respond generically; ORPO baked the identity partially but not completely.
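
For the identity limitation, a system message can be sent with every request. The sketch below uses only the Python standard library and assumes the server from step 4 is running on port 8080, that it exposes the usual OpenAI-style `/v1/chat/completions` route, and that it honors the `system` role. The prompt wording and the helper names `build_chat_payload` and `ask` are hypothetical; the model card does not publish an official system prompt.

```python
import json
import urllib.request

# Hypothetical wording; replace with whatever identity prompt you prefer.
SYSTEM_PROMPT = "You are Orchid, an assistant fine-tuned in Colombia."


def build_chat_payload(user_text, system_prompt=SYSTEM_PROMPT, model="orchid"):
    """Build an OpenAI-style chat request with the system message first."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


def ask(user_text, base_url="http://localhost:8080/v1"):
    """Send one chat request to the local server and return the reply text."""
    payload = json.dumps(build_chat_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("¿Quién eres?"))` sends the question together with the system message. The same `messages` list can be passed to the OpenAI client shown in the Quick Start.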

Technical Paper

Full methodology, training details, failure modes, and architecture analysis:

Orchid 1.0: A Reproducible Recipe for Aligned Ternary-Weight Language Models on Consumer Hardware

License

Apache 2.0 — free for research and commercial use.

This model is a fine-tuned derivative of Microsoft BitNet b1.58-2B-4T (MIT License).

Copyright (c) Microsoft Corporation. BitNet b1.58-2B-4T is released under the MIT License. The MIT License requires this copyright notice to accompany any distribution of derivative works. Full license text: https://opensource.org/licenses/MIT

Citation

@software{orchid_2026,
  title   = {Orchid 1.0: First Competitive LLM Trained and Aligned in Colombia — Ternary-Weight Fine-Tuning on Consumer Hardware},
  author  = {Romero Chisco, Michelangelo},
  year    = {2026},
  url     = {https://huggingface.co/MicheRomChis/orchid-1.0},
  license = {Apache-2.0},
  note    = {Fine-tuned from Microsoft BitNet b1.58-2B-4T}
}

Acknowledgments

Microsoft Research — BitNet b1.58-2B-4T base model and architecture
The ggml / llama.cpp project — GGUF format conventions
HuggingFace — Training libraries (PEFT, TRL, Transformers, Accelerate)

Downloads last month: 2,080

GGUF

Model size: 21.6M params
Architecture: bitnet-b1.58


Model tree for MicheRomChis/orchid-1.0

Base model: microsoft/bitnet-b1.58-2B-4T
Adapter (7): this model

Evaluation results

Metric	Dataset	Split	Score (self-reported)
Accuracy	ARC-Challenge	test	56.000
Accuracy (normalized)	HellaSwag	validation	52.000
Accuracy	WinoGrande	validation	74.000
Accuracy	MMLU	test	38.600