---
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---
# RHOMBUS — Official Hugging Face Organization
> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.
<p align="center">
<img src="https://dummyimage.com/180x180/000/fff.png&text=RH" alt="Rhombus logo" />
</p>
---
## 🔷 Who We Are
**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and in **real-world constraints**.
Our guiding principles:
* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, documented for students & startups.
---
## 🧭 Mission
1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.
---
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
*Goal:* beat GPT-2 class baselines with a fraction of compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
*v3 target:* ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
* **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R&D)* — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
* **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
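As an illustration of the dedup/filter/chunk steps a dataset factory like CorpusForge performs, here is a minimal sketch. The function names, normalization choices, and chunk sizes are hypothetical, not CorpusForge's real API:

```python
# Minimal sketch of an exact-dedup + chunking pass, in the spirit of the
# CorpusForge pipeline described above. Names here are illustrative only.
import hashlib

def dedup_exact(docs):
    """Drop documents whose normalized text hashes collide (exact dedup)."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def chunk(text, max_chars=200):
    """Split text into fixed-size character chunks for training export."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

docs = ["Hello world.", "hello world.", "A longer document " * 20]
unique = dedup_exact(docs)          # case-insensitive exact dedup -> 2 docs
chunks = chunk(unique[-1], max_chars=100)
```

A production pipeline would add fuzzy dedup (e.g. MinHash) and quality filters on top of this; the exact-hash pass shown here is only the cheapest first stage.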
---
## 🗺️ Roadmap Snapshot
* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.
> Detailed per-quarter milestones live in the organization Projects board.
---
## 🧩 Organization Layout
We keep repos **single-purpose**, well‑documented, and tagged.
```
Rhombus/
├─ brahma/ # core research, papers, reference impls
├─ water/ # Water v0.x experimental models (Brahma-based)
├─ kishor/ # multilingual reasoning LLMs
├─ karta-135m/ # smol fine-tunes (instruction)
├─ klaa/ # text-to-image models & training
├─ corpusforge/ # dataset factory & CLI
├─ project-fruit/ # data classification + curation pipelines
├─ eval/ # evaluation harness & leaderboards
├─ datasets/ # dataset cards, loaders, governance
└─ docs/ # org-wide specs, style guides, templates
```
### Tagging & Naming
* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.
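A small helper makes the `vMAJOR.MINOR.PATCH` convention machine-checkable; this snippet is illustrative and not part of any Rhombus tooling:

```python
# Parse and sort tags following the vMAJOR.MINOR.PATCH convention above.
import re

def parse_tag(tag):
    """Return (major, minor, patch) for a 'vX.Y.Z' tag, or None if malformed."""
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(map(int, m.groups())) if m else None

tags = ["v1.10.0", "v1.2.3", "v2.0.0", "dev"]
# Numeric sorting keeps v1.10.0 after v1.2.3 (string sort would not).
releases = sorted((t for t in tags if parse_tag(t)), key=parse_tag)
```

Note the numeric tuple key: plain string comparison would incorrectly order `v1.10.0` before `v1.2.3`.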
---
## 📊 Evaluation & Benchmarks
We care about **reasoning** over raw next-token loss. Our standard evals:
* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
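The reproducibility claim above boils down to fixing every source of randomness in the harness. A minimal sketch of seeded exact-match scoring, with a toy dataset and a dummy model standing in for the real ones:

```python
# Sketch of reproducible multiple-choice scoring: a fixed seed controls
# sampling order, and exact-match accuracy is deterministic given the seed.
import random

def evaluate(model_fn, items, seed=1234):
    """Score model_fn on (question, gold) pairs with a seeded shuffle."""
    rng = random.Random(seed)            # fixed seed -> same order every run
    sample = rng.sample(items, k=len(items))
    correct = sum(model_fn(q) == gold for q, gold in sample)
    return correct / len(sample)

items = [("2+2=", "4"), ("capital of France?", "Paris"), ("3*3=", "9")]
always_four = lambda q: "4"              # dummy model for the sketch
acc = evaluate(always_four, items)
```

Publishing the seed, the prompt strings, and the decoder settings alongside a script like this is what lets a third party reproduce a reported number exactly.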
---
## 🔐 Safety, Security & Governance
* **Alignment**: instruction tuning with preference data; safety rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt‑out channels.
* **Ethics**: transparent data sources; clear intended use; red‑line misuse policy.
---
## 📄 Licenses
* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
* **Datasets**: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
> Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
---
## 🧪 Reproducibility Policy
For every release we strive to provide:
* **Training recipe**: data mix, token count, curriculum, batch schedulers.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA256, quantized variants, and safetensors.
* **Configs**: tokenizer, architecture params, decoder settings.
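The SHA256 step above can be verified with a few lines of standard-library Python; the path and digest a user would compare against are whatever the release notes publish:

```python
# Streamed SHA-256 of a checkpoint file, so multi-GB weights never need to
# fit in memory. Compare the result against the digest in the release notes.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

Usage: `sha256_of("model.safetensors")` and compare the string to the published checksum before loading the weights.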
---
## 🧱 Contribution Guide (Quick Start)
### 1) Discuss
Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.
### 2) Develop
* Fork the repo and create an `exp/<topic>` branch
* Follow code style (ruff/black for Python; mypy optional)
* Add or update docs and unit tests
### 3) Submit
Open a PR to `dev` with:
* motivation, design notes
* benchmarks (even small‑scale)
* safety considerations
### 4) Review & Merge
* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)
> See `CONTRIBUTING.md` in each repo for details.
---
## 🧾 Templates
Below are **copy‑ready** card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
- en
- hi
library_name: transformers
tags:
- reasoning
- small-language-model
- multilingual
- rhombus
- brahma
model-index:
- name: <MODEL_NAME>
  results:
  - task: {type: text-generation}
    dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
    metrics:
    - name: MMLU
      type: mmlu
      value: <score>
    - name: ARC
      type: arc
      value: <score>
---
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:** …
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
```
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
- dataset
- rhombus
- instruction
language:
- en
- hi
pretty_name: <DATASET_NAME>
---
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```
### 📙 Space Card (template)
````markdown
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````
---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)
- **General**: hello@rhombus.ai
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>