Text Generation · Transformers · Safetensors · ouro · looped-language-model · reasoning · recurrent-depth · conversational · custom_code
Instructions for using ByteDance/Ouro-2.6B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use ByteDance/Ouro-2.6B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ByteDance/Ouro-2.6B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ByteDance/Ouro-2.6B", trust_remote_code=True, dtype="auto")
```
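For finer control over decoding, the model can be paired with its tokenizer directly. A minimal sketch, assuming the checkpoint ships a chat template; `max_new_tokens` and `device_map="auto"` are illustrative choices, not the model's documented settings:

```python
# Generation sketch: load tokenizer + model, apply the chat template, decode.
# trust_remote_code is required because the repo ships custom modeling code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ByteDance/Ouro-2.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ByteDance/Ouro-2.6B", trust_remote_code=True, dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```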
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ByteDance/Ouro-2.6B with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ByteDance/Ouro-2.6B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance/Ouro-2.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
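Because the vLLM server exposes an OpenAI-compatible API, it can also be queried from the official `openai` Python client instead of curl. A sketch, assuming the server started above is running locally on port 8000 (the `api_key` value is a placeholder, since a local server typically ignores it):

```python
# Query the local vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ByteDance/Ouro-2.6B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```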
Use Docker

```shell
docker model run hf.co/ByteDance/Ouro-2.6B
```
- SGLang
How to use ByteDance/Ouro-2.6B with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ByteDance/Ouro-2.6B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance/Ouro-2.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
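The SGLang endpoint is likewise OpenAI-compatible, so any HTTP client works. A sketch using `requests` against the server started above on port 30000:

```python
# POST a chat completion to the local SGLang server.
import requests

payload = {
    "model": "ByteDance/Ouro-2.6B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```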
Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ByteDance/Ouro-2.6B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance/Ouro-2.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use ByteDance/Ouro-2.6B with Docker Model Runner:
```shell
docker model run hf.co/ByteDance/Ouro-2.6B
```
Fix UniversalTransformerCache.get_mask_sizes for batched generation
#4 by KristianS7 · opened
- modeling_ouro.py +12 -0
modeling_ouro.py CHANGED

```diff
@@ -214,6 +214,18 @@ class UniversalTransformerCache(Cache):
         self.key_cache[idx] = key_entry.index_select(0, beam_idx.to(device))
         self.value_cache[idx] = value_entry.index_select(0, beam_idx.to(device))
 
+    def get_mask_sizes(self, cache_position: torch.Tensor, layer_idx: int = 0) -> tuple[int, int]:
+        """Return (kv_length, kv_offset) accounting for cached tokens.
+
+        The inherited Cache.get_mask_sizes checks ``self.layers`` which is
+        always empty for UniversalTransformerCache, causing it to return
+        ``(query_length, 0)`` instead of ``(cached_length + query_length, 0)``
+        during autoregressive decoding.
+        """
+        query_length = cache_position.shape[0]
+        seq_length = self.get_seq_length(layer_idx)
+        return seq_length + query_length, 0
+
     @property
     def is_compileable(self) -> bool:
         return False
```
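A hedged repro sketch of the scenario the patch targets: batched autoregressive decoding, which exercises `get_mask_sizes` at every decode step. The padding fallback is an assumption about the tokenizer, not something stated in the PR:

```python
# Batched generation exercising UniversalTransformerCache during decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ByteDance/Ouro-2.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ByteDance/Ouro-2.6B", trust_remote_code=True, dtype="auto", device_map="auto"
)

# Assumption: fall back to EOS for padding if no pad token is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Who are you?", "What is the capital of France?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# With the fix, get_mask_sizes returns (cached_length + query_length, 0)
# at each step, so the attention mask spans the full KV cache.
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Per the patch's docstring, the inherited behavior reported a kv_length covering only the new query positions, so decode steps could not attend correctly over the cached prefix.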