---
title: README
emoji: 🐠
colorFrom: red
colorTo: green
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---
# RHOMBUS — Official Hugging Face Organization
> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/DWHIcstIU3ngbF0TfWqsU.png" alt="Rhombus logo" width="180" />
</p>
---
## 🔷 Who We Are
**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and in **real-world constraints**.
Our guiding principles:
* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, documented for students & startups.
---
## 🧭 Mission
1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.
---
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
*Goal:* beat GPT-2-class baselines with a fraction of the compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
*v3 target:* ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
* **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R&D)* — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
* **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
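The dedup-and-chunk steps above can be sketched in a few lines of Python; `dedup` and `chunk` are illustrative helpers, not CorpusForge's actual CLI or API:

```python
# Hypothetical sketch of a CorpusForge-style pass: exact dedup via hashing,
# then fixed-size chunking for training export.
import hashlib

def dedup(docs):
    """Drop exact duplicate documents, keeping the first occurrence."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def chunk(text, size=32):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["alpha beta", "alpha beta", "gamma delta"]
unique = dedup(docs)                 # ["alpha beta", "gamma delta"]
pieces = chunk(unique[0], size=5)    # ["alpha", " beta"]
```

A production pipeline would add near-duplicate detection (e.g. MinHash) and token-aware chunking, but the shape of the pass is the same.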
---
## 🗺️ Roadmap Snapshot
* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.
> Detailed per-quarter milestones live in the organization Projects board.
---
## 🧩 Organization Layout
We keep repos **single-purpose**, well‑documented, and tagged.
```
Rhombus/
├─ brahma/ # core research, papers, reference impls
├─ water/ # Water v0.x experimental models (Brahma-based)
├─ kishor/ # multilingual reasoning LLMs
├─ karta-135m/ # smol fine-tunes (instruction)
├─ klaa/ # text-to-image models & training
├─ corpusforge/ # dataset factory & CLI
├─ project-fruit/ # data classification + curation pipelines
├─ eval/ # evaluation harness & leaderboards
├─ datasets/ # dataset cards, loaders, governance
└─ docs/ # org-wide specs, style guides, templates
```
### Tagging & Naming
* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.
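As a small illustration of the release-tag convention above, a sketch (the tags are placeholders, not tied to any real Rhombus repo) that validates the `vMAJOR.MINOR.PATCH` scheme:

```python
# Sketch: check that a release tag follows vMAJOR.MINOR.PATCH.
import re

SEMVER_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")

def is_release_tag(tag: str) -> bool:
    """True if `tag` matches the vMAJOR.MINOR.PATCH release scheme."""
    return SEMVER_TAG.fullmatch(tag) is not None

assert is_release_tag("v1.2.3")         # valid release tag
assert not is_release_tag("exp/topic")  # experiment branches are not releases
```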
---
## 📊 Evaluation & Benchmarks
We care about **reasoning** over raw next-token loss. Our standard evals:
* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
---
## 🔐 Safety, Security & Governance
* **Alignment**: instruction tuning with preference data; safety rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt‑out channels.
* **Ethics**: transparent data sources; clear intended use; red‑line misuse policy.
---
## 📄 Licenses
* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
* **Datasets**: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
> Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
---
## 🧪 Reproducibility Policy
For every release we strive to provide:
* **Training recipe**: data mix, token count, curriculum, batch schedulers.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA256, quantized variants, and safetensors.
* **Configs**: tokenizer, architecture params, decoder settings.
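Consumers can verify a downloaded checkpoint against its published digest; `sha256_of` below is an illustrative helper and the file is a stand-in, not a real release artifact:

```python
# Sketch: verify a released checkpoint against its published SHA256 digest.
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB blocks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Stand-in file playing the role of a downloaded checkpoint:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake weights")
    path = tmp.name

digest = sha256_of(path)
assert digest == hashlib.sha256(b"fake weights").hexdigest()
os.remove(path)
```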
---
## 🧱 Contribution Guide (Quick Start)
### 1) Discuss
Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.
### 2) Develop
* Fork the repo and create an `exp/<topic>` branch
* Follow code style (ruff/black for Python; mypy optional)
* Add or update docs and unit tests
### 3) Submit
Open a PR to `dev` with:
* motivation, design notes
* benchmarks (even small‑scale)
* safety considerations
### 4) Review & Merge
* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)
> See `CONTRIBUTING.md` in each repo for details.
---
## 🧾 Templates
Below are **copy‑ready** card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
- en
- hi
library_name: transformers
tags:
- reasoning
- small-language-model
- multilingual
- rhombus
- brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task: {type: text-generation}
        dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:**
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
```
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
- dataset
- rhombus
- instruction
language:
- en
- hi
pretty_name: <DATASET_NAME>
---
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```
### 📙 Space Card (template)
````markdown
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````
---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: EleutherAI LM Evaluation Harness, plus a custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)
- **General**: hello@rhombus.ai
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:
[![HF Spaces](https://img.shields.io/badge/🤗-Spaces-blue.svg)](#)
[![Models](https://img.shields.io/badge/Models-Release-brightgreen.svg)](#)
[![Datasets](https://img.shields.io/badge/Datasets-Live-orange.svg)](#)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-black.svg)](#)
[![Twitter Follow](https://img.shields.io/twitter/follow/rhombus_ai?style=social)](#)
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>