---
title: README
emoji: 🐠
colorFrom: red
colorTo: green
sdk: static
pinned: true
license: apache-2.0
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/5wwTu4HUsWHqOA7mL-WFJ.png
---
# RHOMBUS — Official Hugging Face Organization
> **Clean geometry. Bold ideas. Practical AI.**
> Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67f03a82cb606619f36f9a51/DWHIcstIU3ngbF0TfWqsU.png" alt="Rhombus logo" width="180" />
</p>
---
## 🔷 Who We Are
**Rhombus** is an independent AI research & engineering studio focused on **small, efficient, and reasoning-strong models**. We prototype new architectures, build high-quality datasets, and ship production tools that work **offline**, on **low compute**, and in **real-world constraints**.
Our guiding principles:
* **Geometry over noise**: clear structure, measurable outcomes, minimal bloat.
* **Small-first**: design models that outperform their size class.
* **Reasoning-centric**: prioritise logic, reliability, and controllability.
* **Accessible**: reproducible, transparent, documented for students & startups.
---
## 🧭 Mission
1. **Re-think model architecture** beyond classic Transformers for efficiency and robustness.
2. **Compress intelligence** — make 50M–2B parameter models reason like much larger ones.
3. **Democratise training** with tooling that runs on consumer GPUs and CPU-only environments.
4. **Ship pragmatic AI** — tools that solve real problems in coding, data, education, and research.
---
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
* **Brahma** — a *post-transformer* research line targeting minimal compute with strong reasoning and robustness.
*Goal:* beat GPT-2-class baselines with a fraction of the compute.
* **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
* **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
* **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills.
*v3 target:* ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
* **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
* **Rhombus TTS** *(R&D)* — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
* **Rhombus CorpusForge** (aka *DataCrafter*) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
* **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
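The dedup-and-chunk steps above can be sketched in a few lines of Python; `dedup` and `chunk` are illustrative helpers, not CorpusForge's actual CLI or API:

```python
# Hypothetical sketch of a CorpusForge-style pass: exact dedup via hashing,
# then fixed-size chunking for training export.
import hashlib

def dedup(docs):
    """Drop exact duplicate documents, keeping the first occurrence."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def chunk(text, size=32):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["alpha beta", "alpha beta", "gamma delta"]
unique = dedup(docs)                 # ["alpha beta", "gamma delta"]
pieces = chunk(unique[0], size=5)    # ["alpha", " beta"]
```

A production pipeline would add near-duplicate detection (e.g. MinHash) and token-aware chunking, but the shape of the pass is the same.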
---
## 🗺️ Roadmap Snapshot
* **2025**: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
* **2026**: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
* **2027**: Brahma v1 reference; inference SDK; offline QA/coding assistant.
* **2028–2030**: Scaled Brahma family; unified multimodal small models; education-first deployments.
> Detailed per-quarter milestones live in the organization Projects board.
---
## 🧩 Organization Layout
We keep repos **single-purpose**, well‑documented, and tagged.
```
Rhombus/
├─ brahma/ # core research, papers, reference impls
├─ water/ # Water v0.x experimental models (Brahma-based)
├─ kishor/ # multilingual reasoning LLMs
├─ karta-135m/ # smol fine-tunes (instruction)
├─ klaa/ # text-to-image models & training
├─ corpusforge/ # dataset factory & CLI
├─ project-fruit/ # data classification + curation pipelines
├─ eval/ # evaluation harness & leaderboards
├─ datasets/ # dataset cards, loaders, governance
└─ docs/ # org-wide specs, style guides, templates
```
### Tagging & Naming
* Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
* Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
* Releases: semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.
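As a small illustration of the release-tag convention above, a sketch (the tags are placeholders, not tied to any real Rhombus repo) that validates the `vMAJOR.MINOR.PATCH` scheme:

```python
# Sketch: check that a release tag follows vMAJOR.MINOR.PATCH.
import re

SEMVER_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")

def is_release_tag(tag: str) -> bool:
    """True if `tag` matches the vMAJOR.MINOR.PATCH release scheme."""
    return SEMVER_TAG.fullmatch(tag) is not None

assert is_release_tag("v1.2.3")         # valid release tag
assert not is_release_tag("exp/topic")  # experiment branches are not releases
```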
---
## 📊 Evaluation & Benchmarks
We care about **reasoning** over raw next-token loss. Our standard evals:
* **Language**: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
* **Coding**: HumanEval+, MBPP, Codeforces-style synthetic stacks
* **Safety**: jailbreak suites, refusal correctness, harmful content filters
* **Image** (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
> *We publish exact prompts, seeds, decoders, and compute for reproducibility.*
---
## 🔐 Safety, Security & Governance
* **Alignment**: instruction tuning with preference data; safety rail prompts; content filters on output.
* **Security**: supply-chain checksums, signed releases, deterministic builds when possible.
* **Privacy**: strict dataset licensing review; PII scrubbing; opt‑out channels.
* **Ethics**: transparent data sources; clear intended use; red‑line misuse policy.
---
## 📄 Licenses
* **Code**: Apache-2.0 (preferred) or MIT when noted.
* **Models**: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
* **Datasets**: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
> Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
---
## 🧪 Reproducibility Policy
For every release we strive to provide:
* **Training recipe**: data mix, token count, curriculum, batch schedulers.
* **Compute**: GPU/TPU type, hours, energy notes.
* **Exact checkpoints**: with SHA256, quantized variants, and safetensors.
* **Configs**: tokenizer, architecture params, decoder settings.
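Consumers can verify a downloaded checkpoint against its published digest; `sha256_of` below is an illustrative helper and the file is a stand-in, not a real release artifact:

```python
# Sketch: verify a released checkpoint against its published SHA256 digest.
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB blocks and return its hex SHA256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Stand-in file playing the role of a downloaded checkpoint:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake weights")
    path = tmp.name

digest = sha256_of(path)
assert digest == hashlib.sha256(b"fake weights").hexdigest()
os.remove(path)
```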
---
## 🧱 Contribution Guide (Quick Start)
### 1) Discuss
Open an **issue** in the relevant repo with a clear proposal. Use the `proposal` template.
### 2) Develop
* Fork the repo and create an `exp/<topic>` branch
* Follow code style (ruff/black for Python; mypy optional)
* Add or update docs and unit tests
### 3) Submit
Open a PR to `dev` with:
* motivation, design notes
* benchmarks (even small‑scale)
* safety considerations
### 4) Review & Merge
* 2 approvals minimum for core repos
* CI must pass (lint, tests, basic eval sanity)
> See `CONTRIBUTING.md` in each repo for details.
---
## 🧾 Templates
Below are **copy‑ready** card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
- en
- hi
library_name: transformers
tags:
- reasoning
- small-language-model
- multilingual
- rhombus
- brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task: {type: text-generation}
        dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:**
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
```
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
- dataset
- rhombus
- instruction
language:
- en
- hi
pretty_name: <DATASET_NAME>
---
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```
### 📙 Space Card (template)
````markdown
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````
---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: EleutherAI LM Evaluation Harness, plus a custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: security@rhombus.ai (PGP available)
- **General**: hello@rhombus.ai
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:
[![HF Spaces](https://img.shields.io/badge/🤗-Spaces-blue.svg)](#)
[![Models](https://img.shields.io/badge/Models-Release-brightgreen.svg)](#)
[![Datasets](https://img.shields.io/badge/Datasets-Live-orange.svg)](#)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-black.svg)](#)
[![Twitter Follow](https://img.shields.io/twitter/follow/rhombus_ai?style=social)](#)
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>