---
title: README
emoji: 🐠
colorFrom: red
colorTo: green
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---

# RHOMBUS — Official Hugging Face Organization

> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.

<p align="center">
  <img src="https://dummyimage.com/180x180/000/fff.png&text=RH" alt="Rhombus logo" width="180" height="180" />
</p>

---

## 🔷 Who We Are

**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and under **real-world constraints**.

Our guiding principles:

* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, and documented for students & startups.

---

## 🧭 Mission

1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.

---

## 📦 Key Projects (Active/Planned)

### 🧪 Architectures & Models

* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
  *Goal:* beat GPT-2-class baselines with a fraction of the compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
  *v3 target:* ~2.2B params, balanced for edge + server.

### 🖼️ Generative & Multimodal

* **Klaa** — text-to-image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R&D)* — lightweight text-to-speech optimized for clarity on consumer GPUs.

### 🧰 Tooling

* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, and export for training.
* **Project Fruit** — dataset classification + high-quality custom instruction sets for fine-tuning compact models.
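The exact-dedup step a dataset factory performs can be illustrated with a minimal hash-based pass. This is a sketch of the general technique only; `normalize` and `exact_dedup` are illustrative names, not CorpusForge APIs.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized document, drop the rest."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "Something else"]
print(exact_dedup(docs))  # ['Hello  world', 'Something else']
```

Real pipelines add near-duplicate detection (e.g., MinHash) on top of this exact pass.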

---

## 🗺️ Roadmap Snapshot

* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.

> Detailed per-quarter milestones live in the organization Projects board.

---

## 🧩 Organization Layout

We keep repos **single-purpose**, well-documented, and tagged.

```
Rhombus/
├─ brahma/         # core research, papers, reference impls
├─ water/          # Water v0.x experimental models (Brahma-based)
├─ kishor/         # multilingual reasoning LLMs
├─ karta-135m/     # smol fine-tunes (instruction)
├─ klaa/           # text-to-image models & training
├─ corpusforge/    # dataset factory & CLI
├─ project-fruit/  # data classification + curation pipelines
├─ eval/           # evaluation harness & leaderboards
├─ datasets/       # dataset cards, loaders, governance
└─ docs/           # org-wide specs, style guides, templates
```

### Tagging & Naming

* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short-lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.

---

## 📊 Evaluation & Benchmarks

We care about **reasoning** over raw next-token loss. Our standard evals:

* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG-bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP score, prompt adherence, style robustness

> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
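Multiple-choice benchmarks like MMLU and ARC reduce to arg-max scoring over answer choices. A minimal sketch of that loop, with a stubbed scorer standing in for a real model (the data layout here is illustrative, not our harness's actual schema):

```python
from collections.abc import Callable

def mc_accuracy(items: list[dict], score: Callable[[str, str], float]) -> float:
    """Score every (question, choice) pair; count items where the gold
    choice receives the highest score (arg-max accuracy)."""
    correct = 0
    for item in items:
        scores = [score(item["question"], c) for c in item["choices"]]
        if scores.index(max(scores)) == item["answer"]:
            correct += 1
    return correct / len(items)

# Stub "model": always prefers the longest choice, a classic bias baseline.
longest = lambda q, c: float(len(c))

items = [
    {"question": "2+2=?", "choices": ["4", "five"], "answer": 0},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": 0},
]
print(mc_accuracy(items, longest))  # 0.5
```

In a real harness, `score` would be the model's log-likelihood of the choice given the question, with fixed prompts and seeds for reproducibility.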

---

## 🔐 Safety, Security & Governance

* **Alignment**: instruction tuning with preference data; safety-rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt-out channels.
* **Ethics**: transparent data sources; clear intended use; red-line misuse policy.
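As a rough illustration of what a PII-scrubbing pass looks like: real pipelines combine regexes with NER models and human review, and the patterns below are deliberately simplified examples, not our production rules.

```python
import re

# Simplified patterns for two obvious identifier classes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def scrub(text: str) -> str:
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Mail alice@example.com or call +91 98765 43210."))
# Mail [EMAIL] or call [PHONE].
```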

---

## 📄 Licenses

* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license, depending on risk profile.
* **Datasets**: original content under CC-BY-4.0 or CC-BY-SA-4.0; third-party content per source.

> Each repo contains a clear `LICENSE` and `NOTICE` with third-party attributions.

---

## 🧪 Reproducibility Policy

For every release we strive to provide:

* **Training recipe**: data mix, token count, curriculum, batch schedules.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA-256 hashes, quantized variants, and safetensors.
* **Configs**: tokenizer, architecture params, decoder settings.
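Producing the SHA-256 digests for a release is straightforward with the standard library. A sketch; the manifest layout (file name to digest) is illustrative, not a fixed Rhombus format.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB checkpoints never load into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def manifest(release_dir: Path) -> dict[str, str]:
    """Map each checkpoint file name in a release directory to its SHA-256 digest."""
    return {p.name: sha256_file(p) for p in sorted(release_dir.glob("*.safetensors"))}
```

Verification on the consumer side is the same call compared against the published digest.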

---

## 🧱 Contribution Guide (Quick Start)

### 1) Discuss

Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.

### 2) Develop

* Fork the repo and create an `exp/<topic>` branch
* Follow code style (ruff/black for Python; mypy optional)
* Add/update docs and unit tests

### 3) Submit

Open a PR to `dev` with:

* motivation and design notes
* benchmarks (even small-scale)
* safety considerations

### 4) Review & Merge

* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)

> See `CONTRIBUTING.md` in each repo for details.

---

## 🧾 Templates

Below are **copy-ready** card templates you can use across Rhombus repositories.

### 📘 Model Card (template)

```markdown
---
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
tags:
  - reasoning
  - small-language-model
  - multilingual
  - rhombus
  - brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task: {type: text-generation}
        dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---

# <MODEL_NAME>

## Summary

One-paragraph description, positioning, and key capabilities.

## Intended Use

- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:** …

## Training

- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>

## Evaluation

Report exact prompts, seeds, decoders, and links to scripts.

## Safety

Known limitations, bias notes, and refusal behavior.

## License

Apache-2.0 (see `LICENSE`).
```
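Cards like the template above can also be stamped out programmatically. A small sketch using `string.Template`; the field names are illustrative, mirroring the `<...>` placeholders, and this is not an official generator.

```python
from string import Template

# Minimal card skeleton with $-style placeholders (abbreviated for the example).
CARD = Template("""\
# $model_name

## Summary

$summary

## Training

- **Tokens:** $tokens
""")

card = CARD.substitute(
    model_name="Water-100M",
    summary="Compact Brahma-based research model.",
    tokens="20B",
)
print(card)
```

`Template.substitute` raises `KeyError` on a missing field, which catches incomplete cards before publishing.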
### 📗 Dataset Card (template)

```markdown
---
license: cc-by-4.0
tags:
  - dataset
  - rhombus
  - instruction
language:
  - en
  - hi
pretty_name: <DATASET_NAME>
---

# <DATASET_NAME>

## Summary

High-level description and purpose.

## Source & Collection

List all sources, filters, dedup steps, and justification.

## Processing

Tokenization, chunking, cleaning, quality enhancement (if any).

## Structure

- **Splits:** train/val/test sizes
- **Fields:** schema and examples

## Licensing

Origin licenses with links; redistribution terms.

## Ethical Considerations

PII policy, redactions, opt-out mechanism.
```

### 📙 Space Card (template)

````markdown
# <SPACE_NAME>

Interactive demo for `<MODEL_NAME>`.

## Features

- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU

## Run Locally

```bash
pip install -r requirements.txt
python app.py
```
````

---

## 🛠️ Tooling & Dev Environment

- **Languages**: Python, Rust (perf-critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: EleutherAI lm-evaluation-harness, custom harness in `eval/`
- **CI**: pre-commit, pytest, minimal eval sanity per PR

---

## 📬 Contact & Community

- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)
- **General**: hello@rhombus.ai
- **Updates**: follow our HF org and star repos to get release notifications.

> Want to collaborate on **education-grade** AI that runs on modest hardware? We'd love to hear from you.

---

## 🏷️ Badges (optional examples)

Add these to repo READMEs as needed:

[![License](https://img.shields.io/badge/license-Apache--2.0-blue)](#)
[![Models](https://img.shields.io/badge/models-small%20%26%20efficient-black)](#)
[![Reasoning](https://img.shields.io/badge/focus-reasoning-critical)](#)
[![Offline](https://img.shields.io/badge/runs-offline-success)](#)
[![Education](https://img.shields.io/badge/education-first-blueviolet)](#)

---

## 🧭 Quick Links

- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org-wide), per-repo READMEs

---

<p align="center">Made with ◇ by the Rhombus team — <em>geometry over noise.</em></p>