---
title: README
emoji: 🐠
colorFrom: red
colorTo: green
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---

# RHOMBUS — Official Hugging Face Organization

> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/DWHIcstIU3ngbF0TfWqsU.png" alt="Rhombus" width="180" />
</p>

---

## 🔷 Who We Are

**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and in **real-world constraints**.

Our guiding principles:

* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, documented for students & startups.

---

## 🧭 Mission

1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.

---

## 📦 Key Projects (Active/Planned)

### 🧪 Architectures & Models

* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
  *Goal:* beat GPT-2 class baselines with a fraction of compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
  *v3 target:* \~2.2B params, balanced for edge + server.

### 🖼️ Generative & Multimodal

* **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R\&D)* — lightweight text‑to‑speech optimized for clarity on consumer GPUs.

### 🧰 Tooling

* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training (see the sketch after this list).
* **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.

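The kind of steps CorpusForge automates can be sketched in plain Python. This is a minimal illustration of hash-based exact dedup plus overlapping chunking, not the CorpusForge API itself (which lives in its own repo):

```python
import hashlib

def dedup(texts):
    # Drop exact duplicates by content hash; first occurrence wins.
    seen, unique = set(), []
    for t in texts:
        h = hashlib.sha256(t.strip().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

def chunk(text, max_chars=2048, overlap=128):
    # Split a document into fixed-size chunks with a small overlap
    # so no training example is cut off without context.
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

docs = ["same doc", "same doc", "a longer document " * 500]
pieces = [c for d in dedup(docs) for c in chunk(d)]
```
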
---

## 🗺️ Roadmap Snapshot

* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.

> Detailed per-quarter milestones live in the organization Projects board.

---

## 🧩 Organization Layout

We keep repos **single-purpose**, well‑documented, and tagged.

```
Rhombus/
├─ brahma/                   # core research, papers, reference impls
├─ water/                    # Water v0.x experimental models (Brahma-based)
├─ kishor/                   # multilingual reasoning LLMs
├─ karta-135m/               # smol fine-tunes (instruction)
├─ klaa/                     # text-to-image models & training
├─ corpusforge/              # dataset factory & CLI
├─ project-fruit/            # data classification + curation pipelines
├─ eval/                     # evaluation harness & leaderboards
├─ datasets/                 # dataset cards, loaders, governance
└─ docs/                     # org-wide specs, style guides, templates
```

### Tagging & Naming

* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.

---

## 📊 Evaluation & Benchmarks

We care about **reasoning** over raw next-token loss. Our standard evals:

* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness

> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
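For the language suites we typically drive scoring through the EleutherAI harness (see Tooling below). A minimal sketch, assuming `lm-eval` v0.4+ and a placeholder model id:

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness, v0.4+)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face transformers backend
    model_args="pretrained=<MODEL_ID>,dtype=float16",  # <MODEL_ID> is a placeholder
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])                              # per-task metric dict
```

Task names and few-shot counts are pinned per release so scores stay comparable across checkpoints.
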

---

## 🔐 Safety, Security & Governance

* **Alignment**: instruction tuning with preference data; safety rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt‑out channels.
* **Ethics**: transparent data sources; clear intended use; red‑line misuse policy.

---

## 📄 Licenses

* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
* **Datasets**: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.

> Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.

---

## 🧪 Reproducibility Policy

For every release we strive to provide:

* **Training recipe**: data mix, token count, curriculum, batch schedulers.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA256, quantized variants, and safetensors (verification sketch below).
* **Configs**: tokenizer, architecture params, decoder settings.

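Hash checks keep checkpoint downloads honest. A minimal verification sketch (the file name and expected digest are placeholders):

```python
import hashlib

def sha256sum(path, buf_size=1 << 20):
    # Stream the file in 1 MiB blocks so multi-GB checkpoints fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(buf_size):
            h.update(block)
    return h.hexdigest()

expected = "<SHA256_FROM_RELEASE_NOTES>"  # placeholder: copy from the release page
assert sha256sum("model.safetensors") == expected, "checkpoint hash mismatch"
```
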
---

## 🧱 Contribution Guide (Quick Start)

### 1) Discuss

Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.

### 2) Develop

* Fork the repo, create `exp/<topic>` branch
* Follow code style (ruff/black for Python, mypy optional)
* Add/Update docs and unit tests

### 3) Submit

Open a PR to `dev` with:

* motivation, design notes
* benchmarks (even small‑scale)
* safety considerations

### 4) Review & Merge

* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)

> See `CONTRIBUTING.md` in each repo for details.

---

## 🧾 Templates

Below are **copy‑ready** card templates you can use across Rhombus repositories.

### 📘 Model Card (template)

```yaml
---
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
tags:
  - reasoning
  - small-language-model
  - multilingual
  - rhombus
  - brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task: {type: text-generation}
        dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---

# <MODEL_NAME>

## Summary
One‑paragraph description, positioning, and key capabilities.

## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:**

## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>

## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.

## Safety
Known limitations, bias notes, and refusal behavior.

## License
Apache-2.0 (see `LICENSE`).
```

### 📗 Dataset Card (template)

```yaml
---
license: cc-by-4.0
tags:
  - dataset
  - rhombus
  - instruction
language:
  - en
  - hi
pretty_name: <DATASET_NAME>
---

# <DATASET_NAME>

## Summary
High‑level description and purpose.

## Source & Collection
List all sources, filters, dedup steps, and justification.

## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).

## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples

## Licensing
Origin licenses with links; redistribution terms.

## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```

### 📙 Space Card (template)

````markdown
# <SPACE_NAME>

Interactive demo for `<MODEL_NAME>`.

## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU

## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````

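For reference, a minimal `app.py` skeleton such a Space might ship. This is a sketch only: the model id is a placeholder, and the "safety toggle" here is a simple system-prompt switch, not a production content filter:

```python
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="<MODEL_ID>")  # placeholder model id

SAFE_PREFIX = "You are a helpful, harmless assistant.\n"

def generate(prompt, safe_mode):
    # Prepend a guarded system prompt when the safety toggle is on.
    text = (SAFE_PREFIX + prompt) if safe_mode else prompt
    out = generator(text, max_new_tokens=128, do_sample=True)
    return out[0]["generated_text"]

demo = gr.Interface(
    fn=generate,
    inputs=[gr.Textbox(label="Prompt"), gr.Checkbox(label="Safety toggle", value=True)],
    outputs=gr.Textbox(label="Output"),
    title="<SPACE_NAME>",
)

if __name__ == "__main__":
    demo.launch()
```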

---

## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR

---

## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)  
- **General**: hello@rhombus.ai  
- **Updates**: Follow our HF org and star repos to get release notifications.

> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.

---

## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:

[![HF Spaces](https://img.shields.io/badge/🤗-Spaces-blue.svg)](#)
[![Models](https://img.shields.io/badge/Models-Release-brightgreen.svg)](#)
[![Datasets](https://img.shields.io/badge/Datasets-Live-orange.svg)](#)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-black.svg)](#)
[![Twitter Follow](https://img.shields.io/twitter/follow/rhombus_ai?style=social)](#)

---

## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs

---

<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>

