Orchid 1.0
First competitive LLM trained and aligned in Colombia: a 2B ternary-weight language model fine-tuned from Microsoft BitNet b1.58-2B-4T on a single RTX 3050 laptop (4 GB VRAM). Orchid is multilingual (it inherits BitNet's broad language coverage; alignment fine-tuning focused on English and Spanish), aligned for unbiased responses using ORPO, and designed to run on consumer hardware without cloud dependency.
Try it free on GPU (~3 min setup)
Inference note: Orchid uses the BitNet I2_S (ternary) format with a separate LoRA adapter. Standard llama.cpp cannot serve this combination correctly. Use ternative.cpp, the custom C++ inference engine built for this model.
Model Files
| File | Size | Purpose |
|---|---|---|
| ggml-model-i2_s.gguf | ~1.1 GB | BitNet b1.58-2B-4T base (I2_S ternary format) |
| dpo_aligned-lora.gguf | ~90 MB | ORPO-3 aligned LoRA adapter (F32, 420 tensors) |
Download both files to run Orchid. The base GGUF contains the ternary weights; the adapter applies the alignment fine-tuning at runtime without re-quantizing.
Quick Start
1. Download
huggingface-cli download MicheRomChis/orchid-1.0 \
ggml-model-i2_s.gguf dpo_aligned-lora.gguf \
--local-dir ./orchid-models
2. Build ternative
# Linux / macOS: requires cmake 3.18+ and a C++17 compiler
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build --parallel
cd ..
# Windows (PowerShell): requires MSVC 2022 or MinGW + cmake 3.18+
git clone --depth 1 https://github.com/michelangeloromerochisco/ternative
cd ternative
cmake -B build -DCMAKE_BUILD_TYPE=Release; cmake --build build --parallel
cd ..
GPU build (NVIDIA CUDA 12.x): add -DTERNATIVE_CUDA=ON to the cmake command.
3. Generate text
# Linux / macOS
./ternative/build/ternative \
--model ./orchid-models/ggml-model-i2_s.gguf \
--lora ./orchid-models/dpo_aligned-lora.gguf \
--prompt "¿Cuál es la capital de Colombia?" \
--max-tokens 200
# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
--model .\orchid-models\ggml-model-i2_s.gguf `
--lora .\orchid-models\dpo_aligned-lora.gguf `
--prompt "What is photosynthesis? Think step by step." `
--max-tokens 300
4. Run as OpenAI-compatible server
# Linux / macOS
./ternative/build/ternative \
--model ./orchid-models/ggml-model-i2_s.gguf \
--lora ./orchid-models/dpo_aligned-lora.gguf \
--server --port 8080
# Windows (PowerShell)
.\ternative\build\Release\ternative.exe `
--model .\orchid-models\ggml-model-i2_s.gguf `
--lora .\orchid-models\dpo_aligned-lora.gguf `
--server --port 8080
Then use any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
model="orchid",
messages=[{"role": "user", "content": "Explain quantum entanglement simply."}]
)
print(response.choices[0].message.content)
Why ternative.cpp?
Standard inference stacks cannot serve LoRA-fine-tuned ternary models correctly:
| Engine | I2_S base | Runtime LoRA | I2_S + LoRA |
|---|---|---|---|
| llama.cpp | Warning: type-36 error | Yes (Q4/Q8 only) | No |
| bitnet.cpp | Yes | No (no adapter path) | No |
| ternative.cpp | Yes | Yes (full precision) | Yes |
The problem: merging a LoRA adapter into an I2_S base and re-quantizing rounds every delta to zero, so the fine-tuning is silently discarded. ternative.cpp avoids this by de-quantizing the I2_S base to F32, applying the LoRA delta at full precision, and casting to F16 for inference.
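The rounding problem described above can be reproduced with a toy tensor. The sketch below assumes the absmean ternary scheme described for BitNet b1.58 (divide by the mean absolute value, round, clip to [-1, 1]); the tensor size, the scale, and the update magnitude are made-up values, and none of this is ternative.cpp's actual code.

```python
import random


def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of floats to codes in {-1, 0, +1} plus one scale.

    Assumed scheme: divide by the mean absolute value, round to the nearest
    integer, then clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


rng = random.Random(0)

# Toy ternary base tensor: codes times a per-tensor scale (illustrative values).
base_codes = [rng.choice((-1, 0, 1)) for _ in range(1024)]
base_scale = 0.05
base = [c * base_scale for c in base_codes]

# Toy LoRA update, far smaller than the base scale (hypothetical magnitude).
delta = [rng.uniform(-8e-4, 8e-4) for _ in base]

# Path 1: merge the update, then re-quantize to ternary codes.
merged = [w + d for w, d in zip(base, delta)]
requant_codes, _ = absmean_ternary(merged)
changed_codes = sum(a != b for a, b in zip(requant_codes, base_codes))

# Path 2: keep the merged weights at full precision.
largest_kept_update = max(abs(m - w) for m, w in zip(merged, base))

print("ternary codes changed by re-quantizing:", changed_codes)
print("largest update kept at full precision:", largest_kept_update)
```

In this toy setting the re-quantized codes are identical to the base codes, so the update leaves no trace in them, while the full-precision path keeps every per-weight change.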
Benchmark Results
Standard Benchmarks (lm-eval-harness methodology, 50 samples each)
Scored via log-probability on a live ternative.cpp server. Methodology matches lm-evaluation-harness exactly.
| Benchmark | Orchid 1.0 | BitNet b1.58-2B (base) | Delta |
|---|---|---|---|
| ARC-Challenge | 56.0% | 49.9% | +6.1 pp |
| HellaSwag (length-norm) | 52.0% | 68.4% | -16.4 pp |
| WinoGrande | 74.0% | n/a | n/a |
| MMLU (57 subjects) | 38.6% | 53.2% | -14.6 pp |
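Log-probability scoring of a multiple-choice item generally means picking the answer choice the model finds most likely. The sketch below is a minimal illustration of that idea, with optional length normalization by character count as in the HellaSwag (length-norm) row; the function names and the example numbers are hypothetical, and whether the Orchid scorer matches this in every detail is an assumption.

```python
def pick_choice(choice_logprobs, choices, length_normalize=False):
    """Return the index of the highest-scoring answer choice.

    choice_logprobs[i] is the summed token log-probability the model assigns
    to choices[i] as a continuation of the question.
    """
    scores = []
    for logprob, text in zip(choice_logprobs, choices):
        # Length normalization divides by the number of characters in the choice.
        scores.append(logprob / len(text) if length_normalize else logprob)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items, length_normalize=False):
    """Fraction of items where the picked choice equals the gold index."""
    correct = sum(
        pick_choice(item["logprobs"], item["choices"], length_normalize) == item["gold"]
        for item in items
    )
    return correct / len(items)


# Made-up item: the long answer is correct but has a lower raw log-probability.
example = {
    "choices": ["yes", "a much longer and more detailed answer"],
    "logprobs": [-4.0, -12.0],
    "gold": 1,
}
raw_pick = pick_choice(example["logprobs"], example["choices"])
norm_pick = pick_choice(example["logprobs"], example["choices"], length_normalize=True)
```

The example shows why the normalization matters: raw scoring favors the short choice, while the length-normalized score favors the longer one.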
The ARC-Challenge gain (+6.1 pp) suggests the reasoning fine-tuning transferred. The HellaSwag and MMLU regressions are the expected ORPO alignment tax: the model trades some factual-recall breadth for reasoning quality and bias mitigation, consistent with published DPO/ORPO literature.
WinoGrande at 74.0% is strong for 2B parameters, comparable to the published score of Llama 3.2 3B (~74%).
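Because each standard benchmark above was scored on 50 samples, it can help to attach a sampling uncertainty to the reported accuracies. The sketch below uses the usual binomial standard error and a normal-approximation interval; the function names are illustrative, and this is a reading aid, not part of the original evaluation.

```python
import math


def binomial_standard_error(accuracy, n_samples):
    """Standard error of an accuracy measured on n_samples independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def approx_95_interval(accuracy, n_samples):
    """Normal-approximation 95% interval, clamped to [0, 1]."""
    half_width = 1.96 * binomial_standard_error(accuracy, n_samples)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# ARC-Challenge score from the table above: 56.0% on 50 samples.
arc_se = binomial_standard_error(0.56, 50)
arc_low, arc_high = approx_95_interval(0.56, 50)
print(f"standard error: {arc_se:.3f}, interval: {arc_low:.3f} to {arc_high:.3f}")
```

For 56.0% on 50 samples the standard error is about 7 pp, so differences smaller than that are hard to separate from sampling noise at this sample size.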
Internal Benchmark v2 (semantic scoring, 100 questions, 8 categories)
| Rank | Model | Score |
|---|---|---|
| 1 | Claude 3.5 Sonnet | 89.5% |
| 2 | GPT-4o | 89.2% |
| 3 | Orchid 1.0 | 87.9% |
| 4 | BitNet b1.58-2B base | 84.2% |
| 5 | Kimi k1.5 | 82.2% |
| 6 | Qwen2.5-7B | 78.4% |
Orchid ranks #3 of 11 models on our internal benchmark, above all tested open-weight models, including 7B to 9B parameter models. Science: 100%, Math: 93.3%, Coding: 93.3%.
Note: the internal benchmark uses semantic similarity scoring and is a relative comparison tool, not a substitute for standard NLP benchmarks.
Training Details
All training was performed on a single NVIDIA RTX 3050 laptop GPU (4 GB VRAM, 16 GB RAM, Windows 11), with no cloud compute.
| Stage | Method | Data | Duration |
|---|---|---|---|
| SFT-A | LoRA r=16 | Reasoning / chain-of-thought (50 samples, validation run) | ~1 h |
| SFT-B | LoRA r=16 | 5,500 samples (5k identity + 500 knowledge) | ~88 h wall-clock |
| ORPO-2 | LoRA r=8 | 2,038 preference pairs (debiasing + UltraFeedback) | ~26 h |
| ORPO-3 | LoRA r=8 | 2,104 preference pairs (Colombia identity focus) | ~54 h |
UltraFeedback note: The ORPO-2 stage includes a subset of preference pairs drawn from the UltraFeedback dataset (Cui et al., 2023, MIT License). We use the published preference pairs as a downstream consumer of the released dataset; we did not commission GPT-4 annotations. Full attribution and licensing responsibility rests with the original dataset authors.
Memory techniques that made 4 GB training possible:
- Pre-tokenize dataset before loading model (prevents startup OOM)
- device_map="auto": GPU + CPU split via Accelerate
- Gradient checkpointing + bf16=True
- ORPO with ref_model=None: saves ~1.2 GB vs DPO
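The last item works because the ORPO objective only needs log-probabilities from the model being trained. A minimal sketch of that objective as published (supervised loss on the chosen answer plus an odds-ratio penalty), assuming length-averaged log-probabilities as inputs; the function names and the weight `lam` are illustrative and are not taken from the Orchid training scripts.

```python
import math


def log_odds(avg_logprob):
    """log(p / (1 - p)) for p = exp(avg_logprob); requires avg_logprob < 0."""
    return avg_logprob - math.log1p(-math.exp(avg_logprob))


def odds_ratio_loss(chosen_logprob, rejected_logprob):
    """-log(sigmoid(log-odds ratio)); small when the chosen answer is favored."""
    ratio = log_odds(chosen_logprob) - log_odds(rejected_logprob)
    return math.log1p(math.exp(-ratio))


def orpo_loss(chosen_logprob, rejected_logprob, lam=0.1):
    """Supervised NLL on the chosen answer plus the weighted odds-ratio term.

    Both inputs come from the policy being trained, so no frozen reference
    model has to be held in memory. lam is a hypothetical weight.
    """
    nll = -chosen_logprob
    return nll + lam * odds_ratio_loss(chosen_logprob, rejected_logprob)


# Made-up log-probabilities: the penalty is lower when the chosen answer
# is the more likely one.
favored = odds_ratio_loss(-0.5, -1.5)
disfavored = odds_ratio_loss(-1.5, -0.5)
```

DPO, by contrast, compares the policy against a frozen reference model, which is the extra copy the list item refers to.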
Training scripts: github.com/MichelangeloRomeroChisco/orchid
Hardware Requirements
| | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 0 (CPU-only works) | 4 GB (RTX 3050 class) |
| RAM | 8 GB | 16 GB |
| Storage | 1.3 GB | 2 GB |
| OS | Windows / Linux / macOS | n/a |
GPU mode: all 30 transformer layers offload to GPU using mixed F16 + INT8 quantization (~3.3 GB VRAM). CPU mode: ~6 tok/s with AVX2.
Limitations
- MMLU at 38.6%: alignment tax from ORPO. Expected and documented in the technical paper.
- Spanish coverage: 80% on the internal benchmark. Functional but not state-of-the-art.
- Context window: 4,096 tokens (inherited from BitNet base).
- ternative.cpp required: llama.cpp produces type-36 errors or silently wrong output.
- Do not use BitsAndBytes: stacking BNB quantization on top of BitNet's runtime ternary quantization is unsupported.
- Identity requires system prompt: without a system prompt Orchid may respond generically; ORPO baked the identity partially but not completely.
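For the last limitation, a system message can be sent alongside the user message. The sketch below builds such a request for the local server from the Quick Start using only the standard library; the `/chat/completions` path follows from the server being OpenAI-compatible, and the system prompt text is a hypothetical example, not an official Orchid prompt.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # same base URL as the Quick Start client


def build_chat_request(user_text, system_text):
    """Build a chat-completions POST request that includes a system prompt."""
    payload = {
        "model": "orchid",
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Hypothetical system prompt, for illustration only.
request = build_chat_request(
    user_text="Who are you?",
    system_text="You are Orchid, an assistant trained and aligned in Colombia.",
)
```

With the server from step 4 running, `send(request)` returns the model's reply; nothing is sent until that call is made.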
Technical Paper
Full methodology, training details, failure modes, and architecture analysis:
Orchid 1.0: A Reproducible Recipe for Aligned Ternary-Weight Language Models on Consumer Hardware
License
Apache 2.0: free for research and commercial use.
This model is a fine-tuned derivative of Microsoft BitNet b1.58-2B-4T (MIT License).
Copyright (c) Microsoft Corporation. BitNet b1.58-2B-4T is released under the MIT License. The MIT License requires this copyright notice to accompany any distribution of derivative works. Full license text: https://opensource.org/licenses/MIT
Citation
@software{orchid_2026,
title = {Orchid 1.0: First Competitive LLM Trained and Aligned in Colombia - Ternary-Weight Fine-Tuning on Consumer Hardware},
author = {Romero Chisco, Michelangelo},
year = {2026},
url = {https://huggingface.co/MicheRomChis/orchid-1.0},
license = {Apache-2.0},
note = {Fine-tuned from Microsoft BitNet b1.58-2B-4T}
}
Acknowledgments
- Microsoft Research: BitNet b1.58-2B-4T base model and architecture
- The ggml / llama.cpp project: GGUF format conventions
- HuggingFace: Training libraries (PEFT, TRL, Transformers, Accelerate)