Upload 2 files
Browse files- MonoidForCausalLM.py +27 -20
MonoidForCausalLM.py
CHANGED
|
@@ -454,31 +454,38 @@ class MonoidAttention(nn.Module):
|
|
| 454 |
return self.o_proj(o), new_state
|
| 455 |
|
| 456 |
# ──────────────────────────────────────────────────────────
|
| 457 |
-
# Inference prefill (use_cache=True, T>1):
|
| 458 |
-
# 推理预填充 (use_cache=True, T>1):
|
| 459 |
# ──────────────────────────────────────────────────────────
|
| 460 |
-
#
|
| 461 |
-
#
|
| 462 |
-
#
|
| 463 |
-
#
|
|
|
|
|
|
|
| 464 |
if use_cache:
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
|
| 473 |
-
|
| 474 |
-
|
| 475 |
-
|
| 476 |
-
|
| 477 |
-
|
|
|
|
|
|
|
|
|
|
| 478 |
if monoid_cache is not None:
|
| 479 |
monoid_cache.update(self.layer_idx, final_state)
|
| 480 |
|
| 481 |
-
|
|
|
|
|
|
|
| 482 |
o = o.transpose(1, 2).contiguous().view(B, T, -1)
|
| 483 |
return self.o_proj(o), final_state
|
| 484 |
|
|
|
|
| 454 |
return self.o_proj(o), new_state
|
| 455 |
|
| 456 |
# ──────────────────────────────────────────────────────────
|
| 457 |
+
# Inference prefill (use_cache=True, T>1): parallel scan + readout
|
| 458 |
+
# 推理预填充 (use_cache=True, T>1): 并行扫描 + 读出
|
| 459 |
# ──────────────────────────────────────────────────────────
|
| 460 |
+
# Uses the same parallel_scan_with_state as training to leverage
|
| 461 |
+
# Triton CUDA kernel acceleration instead of O(T) Python loop.
|
| 462 |
+
# Memory: O(B·H·T·d²) — same as training path.
|
| 463 |
+
# 使用与训练相同的 parallel_scan_with_state 来利用
|
| 464 |
+
# Triton CUDA 核函数加速, 而非 O(T) 的 Python 循环。
|
| 465 |
+
# 内存: O(B·H·T·d²) — 与训练路径相同。
|
| 466 |
if use_cache:
|
| 467 |
+
kv = torch.einsum('bhtd, bhte -> bhtde', k, v) # [B,H,T,d,d]
|
| 468 |
+
states, (log_acc, S_T) = parallel_scan_with_state(log_alpha, kv)
|
| 469 |
+
|
| 470 |
+
# Add h0 contribution: S_t += (∏_{i=0}^{t} α_i) · h0
|
| 471 |
+
# 加回 h0 贡献: S_t += (∏_{i=0}^{t} α_i) · h0
|
| 472 |
+
cum_log_alpha = torch.cumsum(log_alpha, dim=2) # [B,H,T,1]
|
| 473 |
+
h0_decay = torch.exp(cum_log_alpha).unsqueeze(-1) # [B,H,T,1,1]
|
| 474 |
+
states = states + h0_decay * self.h0.unsqueeze(2) # broadcast h0 [1,H,1,d,d]
|
| 475 |
+
|
| 476 |
+
# Final state includes h0 contribution
|
| 477 |
+
# 最终状态包含 h0 贡献
|
| 478 |
+
total_h0_decay = torch.exp(log_acc).unsqueeze(-1) # [B,H,1,1]
|
| 479 |
+
S_final = S_T + total_h0_decay * self.h0.squeeze(0) # [B,H,d,d] (squeeze batch dim of h0)
|
| 480 |
+
# h0 is [1,H,d,d], squeeze(0) removed for clarity but expand also works
|
| 481 |
+
final_state = (log_acc, S_final)
|
| 482 |
+
|
| 483 |
if monoid_cache is not None:
|
| 484 |
monoid_cache.update(self.layer_idx, final_state)
|
| 485 |
|
| 486 |
+
# Vectorized readout: o_t = q_t ยท S_t for all t
|
| 487 |
+
# 向量化读出: 一次性计算所有 t 的 o_t = q_t · S_t
|
| 488 |
+
o = torch.einsum('bhtd, bhtde -> bhte', q, states) # [B,H,T,d]
|
| 489 |
o = o.transpose(1, 2).contiguous().view(B, T, -1)
|
| 490 |
return self.o_proj(o), final_state
|
| 491 |
|