---
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---
# RHOMBUS — Official Hugging Face Organization
> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.
<p align="center">
<img src="https://dummyimage.com/180x180/000/fff.png&text=RH" alt="Rhombus logo" />
</p>
---
## 🔷 Who We Are
**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and in **real-world constraints**.
Our guiding principles:
* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, documented for students & startups.
---
## 🧭 Mission
1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.
---
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
*Goal:* beat GPT-2 class baselines with a fraction of compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
*v3 target:* ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
* **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R&D)* — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
* **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
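As an illustration of the dedup/filter/chunk steps a dataset factory like CorpusForge performs, here is a minimal sketch. The function names, normalization choices, and chunk sizes are hypothetical, not CorpusForge's real API:

```python
# Minimal sketch of an exact-dedup + chunking pass, in the spirit of the
# CorpusForge pipeline described above. Names here are illustrative only.
import hashlib

def dedup_exact(docs):
    """Drop documents whose normalized text hashes collide (exact dedup)."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def chunk(text, max_chars=200):
    """Split text into fixed-size character chunks for training export."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

docs = ["Hello world.", "hello world.", "A longer document " * 20]
unique = dedup_exact(docs)          # case-insensitive exact dedup -> 2 docs
chunks = chunk(unique[-1], max_chars=100)
```

A production pipeline would add fuzzy dedup (e.g. MinHash) and quality filters on top of this; the exact-hash pass shown here is only the cheapest first stage.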
---
## 🗺️ Roadmap Snapshot
* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.
> Detailed per-quarter milestones live in the organization Projects board.
---
## 🧩 Organization Layout
We keep repos **single-purpose**, well‑documented, and tagged.
```
Rhombus/
├─ brahma/ # core research, papers, reference impls
├─ water/ # Water v0.x experimental models (Brahma-based)
├─ kishor/ # multilingual reasoning LLMs
├─ karta-135m/ # smol fine-tunes (instruction)
├─ klaa/ # text-to-image models & training
├─ corpusforge/ # dataset factory & CLI
├─ project-fruit/ # data classification + curation pipelines
├─ eval/ # evaluation harness & leaderboards
├─ datasets/ # dataset cards, loaders, governance
└─ docs/ # org-wide specs, style guides, templates
```
### Tagging & Naming
* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.
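A small helper makes the `vMAJOR.MINOR.PATCH` convention machine-checkable; this snippet is illustrative and not part of any Rhombus tooling:

```python
# Parse and sort tags following the vMAJOR.MINOR.PATCH convention above.
import re

def parse_tag(tag):
    """Return (major, minor, patch) for a 'vX.Y.Z' tag, or None if malformed."""
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(map(int, m.groups())) if m else None

tags = ["v1.10.0", "v1.2.3", "v2.0.0", "dev"]
# Numeric sorting keeps v1.10.0 after v1.2.3 (string sort would not).
releases = sorted((t for t in tags if parse_tag(t)), key=parse_tag)
```

Note the numeric tuple key: plain string comparison would incorrectly order `v1.10.0` before `v1.2.3`.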
---
## 📊 Evaluation & Benchmarks
We care about **reasoning** over raw next-token loss. Our standard evals:
* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
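The reproducibility claim above boils down to fixing every source of randomness in the harness. A minimal sketch of seeded exact-match scoring, with a toy dataset and a dummy model standing in for the real ones:

```python
# Sketch of reproducible multiple-choice scoring: a fixed seed controls
# sampling order, and exact-match accuracy is deterministic given the seed.
import random

def evaluate(model_fn, items, seed=1234):
    """Score model_fn on (question, gold) pairs with a seeded shuffle."""
    rng = random.Random(seed)            # fixed seed -> same order every run
    sample = rng.sample(items, k=len(items))
    correct = sum(model_fn(q) == gold for q, gold in sample)
    return correct / len(sample)

items = [("2+2=", "4"), ("capital of France?", "Paris"), ("3*3=", "9")]
always_four = lambda q: "4"              # dummy model for the sketch
acc = evaluate(always_four, items)
```

Publishing the seed, the prompt strings, and the decoder settings alongside a script like this is what lets a third party reproduce a reported number exactly.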
---
## 🔐 Safety, Security & Governance
* **Alignment**: instruction tuning with preference data; safety rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt‑out channels.
* **Ethics**: transparent data sources; clear intended use; red‑line misuse policy.
---
## 📄 Licenses
* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
* **Datasets**: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
> Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
---
## 🧪 Reproducibility Policy
For every release we strive to provide:
* **Training recipe**: data mix, token count, curriculum, batch schedulers.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA256, quantized variants, and safetensors.
* **Configs**: tokenizer, architecture params, decoder settings.
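The SHA256 step above can be verified with a few lines of standard-library Python; the path and digest a user would compare against are whatever the release notes publish:

```python
# Streamed SHA-256 of a checkpoint file, so multi-GB weights never need to
# fit in memory. Compare the result against the digest in the release notes.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

Usage: `sha256_of("model.safetensors")` and compare the string to the published checksum before loading the weights.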
---
## 🧱 Contribution Guide (Quick Start)
### 1) Discuss
Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.
### 2) Develop
* Fork the repo and create an `exp/<topic>` branch
* Follow code style (ruff/black for Python; mypy optional)
* Add or update docs and unit tests
### 3) Submit
Open a PR to `dev` with:
* motivation, design notes
* benchmarks (even small‑scale)
* safety considerations
### 4) Review & Merge
* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)
> See `CONTRIBUTING.md` in each repo for details.
---
## 🧾 Templates
Below are **copy‑ready** card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
- en
- hi
library_name: transformers
tags:
- reasoning
- small-language-model
- multilingual
- rhombus
- brahma
model-index:
- name: <MODEL_NAME>
  results:
  - task: {type: text-generation}
    dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
    metrics:
    - name: MMLU
      type: mmlu
      value: <score>
    - name: ARC
      type: arc
      value: <score>
---
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:** …
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
```
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
- dataset
- rhombus
- instruction
language:
- en
- hi
pretty_name: <DATASET_NAME>
---
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```
### 📙 Space Card (template)
````markdown
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````
---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)
- **General**: hello@rhombus.ai
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>