Chroma-4B / README.md

doc: Update GitHub Repo Link

a2a5f41 verified about 12 hours ago

6.79 kB

	---
	license: apache-2.0
	library_name: transformers
	tags:
	- Speech-to-Speech
	- Multimodal
	inference: false
	language:
	- en
	pipeline_tag: any-to-any
	---

	<div align="center">

	<p style="font-size: 28px; font-weight: bold; margin: 0;">
	FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning
	</p>
	</div>

	<div style="display: flex; justify-content: center; margin: 2rem 0; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;">
	<div style="display: flex; align-items: center; gap: 18px; font-size: 1.25rem;">
	<svg width="80" height="50" viewBox="0 0 48 28" fill="none" xmlns="http://www.w3.org/2000/svg" style="filter: drop-shadow(0 0 16px rgba(168, 85, 247, 0.5));">
	<path
	d="M14 21C10.134 21 7 17.866 7 14C7 10.134 10.134 7 14 7C17.5 7 20 9 21.5 11L26.5 17C28 19 30.5 21 34 21C37.866 21 41 17.866 41 14C41 10.134 37.866 7 34 7C30.5 7 28 9 26.5 11L21.5 17C20 19 17.5 21 14 21Z"
	stroke="url(#apple_chroma_gradient)"
	stroke-width="3.5"
	stroke-linecap="round"
	stroke-linejoin="round"
	/>
	<defs>
	<linearGradient id="apple_chroma_gradient" x1="7" y1="21" x2="41" y2="7" gradientUnits="userSpaceOnUse">
	<stop offset="0" stop-color="#A855F7" />
	<stop offset="0.5" stop-color="#EC4899" />
	<stop offset="1" stop-color="#F97316" />
	</linearGradient>
	</defs>
	</svg>
	<div style="display: flex; align-items: baseline; gap: 6px;">
	<span style="background: linear-gradient(to right, #A855F7, #EC4899, #F97316); -webkit-background-clip: text; -webkit-text-fill-color: transparent; font-weight: 800; font-size: 2.1rem; letter-spacing: -0.5px; white-space: nowrap;">
	Chroma
	</span>
	<span style="color: #94a3b8; font-size: 1.15rem; font-weight: 500; letter-spacing: 0.5px; white-space: nowrap;">
	by <span style="color: #e2e8f0;">Flash</span><span style="color: #60a5fa;">Labs</span>
	</span>
	</div>
	</div>
	</div>

	<div align="center">

	<h1>🚀 Get Started with <a href="https://www.flashlabs.ai/flashai-voice-agents">Voice Agents</a>!</h1>

	<p><strong>Production-ready voice AI solutions</strong> powered by Chroma \| <strong>Open-source model</strong> for developers & researchers</p>

	[![Voice Agents](https://img.shields.io/badge/🎯%20Voice%20Agents-blue?style=for-the-badge)](https://www.flashlabs.ai/flashai-voice-agents)
	[![Download Model](https://img.shields.io/badge/🤗%20Download%20Model-orange?style=for-the-badge)](https://huggingface.co/FlashLabs/Chroma-4B)
	[![Technical Report](https://img.shields.io/badge/📄%20Technical%20Report-red?style=for-the-badge)](https://arxiv.org/abs/2601.11141)
	[![GitHub Repository](https://img.shields.io/badge/GitHub%20Repository-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma)
	</div>

	## Model Description

	Chroma 1.0 is an advanced multimodal model developed by [FlashLabs](https://flashlabs.ai). It is designed to understand and generate content across multiple modalities, including text and audio. As a virtual human model, Chroma possesses the ability to process auditory inputs and respond with both text and synthesized speech, enabling natural voice interactions.

	- Model Type: Multimodal Causal Language Model
	- Developed by: FlashLabs
	- Language(s): English
	- License: Apache-2.0
	- Model Architecture:
	- Reasoner: Based on Qwen2.5-Omni-3B
	- Backbone: Based on Llama3 (16 layers, 2048 hidden size)
	- Decoder: Based on Llama3(4 layers, 1024 hidden size)
	- Codec: Mimi (24kHz sampling rate)

	## Model Architecture

	<img src="assets/model_architecture.png" alt="Model Architecture" width="800" />


	## Capabilities

	Chroma 1.0 is capable of:
	- Speech Understanding: Processing user audio input directly.
	- Multimodal Generation: Generating coherent text and speech responses simultaneously.
	- Voice Cloning: utilizing reference audio prompts to guide speech generation style.

	## Usage

	### Installation

	Ensure you have the necessary dependencies installed. You may need the latest versions of `transformers` and `torch`.

	```bash
	pip install transformers torch
	```

	### Loading the Model

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	model_id = "FlashLabs/Chroma-4B" # Or local path

	# Load model
	model = AutoModelForCausalLM.from_pretrained(
	model_id,
	trust_remote_code=True,
	device_map="auto"
	)

	# Load processor
	processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
	```

	### Inference Example

	Here is how to perform a simple conversation with audio input and audio output:

	```python
	import torch
	from IPython.display import Audio



	# Construct conversation history
	system_prompt = (
	"You are Chroma, an advanced virtual human created by the FlashLabs. "
	"You possess the ability to understand auditory inputs and generate both text and speech."
	)
	conversation = [[
	{
	"role": "system",
	"content": [
	{"type": "text", "text": system_prompt}
	],
	},
	{
	"role": "user",
	"content": [
	# Input audio file path
	{"type": "audio", "audio": "assets/make_taco.wav"},
	],
	},
	]]

	# Provide reference audio/text for style or context
	prompt_text = ["War and bloodshed throughout the world."]
	prompt_audio = ["assets/reference_audio.wav"]

	# Process inputs
	inputs = processor(
	conversation,
	add_generation_prompt=True,
	tokenize=False,
	prompt_audio=prompt_audio,
	prompt_text=prompt_text
	)

	# Move inputs to device
	device = model.device
	inputs = {k: v.to(device) for k, v in inputs.items()}

	# 2. Generate
	output = model.generate(
	**inputs,
	max_new_tokens=100,
	do_sample=True,
	temperature=0.7,
	top_p=0.9,
	use_cache=True
	)

	# 3. Decode Audio
	# The model outputs raw tokens; we decode the audio part using the codec
	audio_values = model.codec_model.decode(output.permute(0, 2, 1)).audio_values

	# Save or play audio (e.g., in Jupyter)
	Audio(audio_values[0].cpu().detach().numpy(), rate=24_000)
	```

	### Citation

	If you use Chroma in your research, please cite:

	```bibtex
	@misc{chen2026flashlabschroma10realtime,
	title={FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning},
	author={Tanyu Chen and Tairan Chen and Kai Shen and Zhenghua Bao and Zhihui Zhang and Man Yuan and Yi Shi},
	year={2026},
	eprint={2601.11141},
	archivePrefix={arXiv},
	primaryClass={cs.SD},
	url={https://arxiv.org/abs/2601.11141},
	}
	```

	### Contact

	For questions or issues, please contact: chroma@flashlabs.ai