---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Nano-Preview
base_model_relation: quantized
---
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Nano Preview"
style="max-width: 100%; height: auto;"
>
</picture>
</div>

# Trinity Nano Preview FP8-Block

Trinity Nano Preview is a preview of Arcee AI's 6B MoE model with 1B active parameters. It is the small model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.

This is a chat-tuned model with a delightful personality and charm we think users will love. Note that this model pushes the limits of sparsity in small language models, with only 800M non-embedding parameters active per token, and as such **may be unstable** in certain use cases, especially in this preview.

This is an *experimental* release: it's fun to talk to, but it will not be hosted anywhere, so download it and try it out yourself!

***

Trinity Nano Preview is trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used for [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog [here](https://www.arcee.ai/blog/the-trinity-manifesto).

***

**This repository contains the FP8 block-quantized weights of Trinity-Nano-Preview (FP8 weights and activations with per-block scaling).**

## Model Details

* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 6B total, 1B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training Tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Nano-Preview#license)

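To build intuition for the expert configuration above (128 experts, 8 routed per token, plus 1 shared expert that is always active), here is a toy top-k routing sketch in numpy. This is purely illustrative and not the model's actual routing code, which lives in the `AfmoeForCausalLM` implementation:

```python
import numpy as np

def route_token(router_logits, k=8):
    """Pick the top-k experts for one token and normalize their gate weights.

    Illustrative sketch only; the shared expert runs unconditionally
    alongside whatever this router selects.
    """
    top = np.argsort(router_logits)[-k:][::-1]      # indices of the k highest-scoring experts
    gates = np.exp(router_logits[top] - router_logits[top].max())
    gates /= gates.sum()                            # softmax over the selected experts only
    return top, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=128)                       # one router score per expert
experts, gates = route_token(logits, k=8)
# 8 routed experts fire per token, and the 1 shared expert always runs with them
print(len(experts), float(gates.sum()))
```

Only the selected experts' FFNs execute, which is why a 6B-parameter model can run with roughly 1B active parameters per token.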
***

<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>

## Quantization Details

- **Scheme:** `FP8 Block` (FP8 weights and activations, per-block scaling with E8M0 scale format)
- **Format:** `compressed-tensors`
- **Intended use:** High-throughput FP8 deployment of Trinity-Nano-Preview with near-lossless quality, optimized for NVIDIA Hopper/Blackwell GPUs
- **Supported backends:** [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM), vLLM CUTLASS, Triton

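As a rough illustration of the scheme above: per-block FP8 quantization stores one power-of-two scale per block of weights (E8M0 scales are pure exponents, no mantissa) and maps the scaled values into the FP8 E4M3 range. A minimal numpy sketch, assuming a hypothetical block size of 128 and using integer rounding as a stand-in for the actual FP8 cast:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

def quantize_block(w):
    """Quantize one weight block with a single power-of-two (E8M0-style) scale."""
    amax = np.abs(w).max()
    # Smallest power of two that maps amax into the E4M3 range
    scale = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX)) if amax > 0 else 1.0
    # Stand-in for the FP8 cast; real kernels round to the E4M3 grid instead
    q = np.clip(np.round(w / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize_block(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=128)        # one 128-element weight block
q, s = quantize_block(w)
w_hat = dequantize_block(q, s)
print(float(np.abs(w - w_hat).max()))       # per-element error bounded by scale / 2
```

Because each block gets its own scale, a single outlier only degrades precision within its own block rather than across the whole tensor, which is what makes the format near-lossless in practice.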
### Running our model

- [vLLM](https://huggingface.co/arcee-ai/Trinity-Nano-Preview-FP8-Block#vllm)
- [Transformers](https://huggingface.co/arcee-ai/Trinity-Nano-Preview-FP8-Block#transformers)

## vLLM

Supported in vLLM release 0.18.0+ with DeepGEMM FP8 MoE acceleration.

```shell
# pip
pip install "vllm>=0.18.0"
```

Serving the model with DeepGEMM enabled:

```shell
VLLM_USE_DEEP_GEMM=1 vllm serve arcee-ai/Trinity-Nano-Preview-FP8-Block \
  --trust-remote-code \
  --max-model-len 4096 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --tool-call-parser hermes
```

Serving without DeepGEMM (falls back to CUTLASS/Triton):

```shell
vllm serve arcee-ai/Trinity-Nano-Preview-FP8-Block \
  --trust-remote-code \
  --max-model-len 4096 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --tool-call-parser hermes
```
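Once the server is up, it exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). A minimal sketch of a chat-completions request body; the sampling values here are illustrative choices, not requirements:

```python
import json

def build_chat_request(messages, model="arcee-ai/Trinity-Nano-Preview-FP8-Block"):
    """Build an OpenAI-style /v1/chat/completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": 256,
        "temperature": 0.5,
    }

payload = build_chat_request([{"role": "user", "content": "Who are you?"}])
# POST this as JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```

The same payload works with `curl` or with the official `openai` Python client pointed at the server's base URL.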

## Transformers

Install `transformers` from the `main` branch:

```shell
git clone https://github.com/huggingface/transformers.git
cd transformers

# pip
pip install '.[torch]'

# uv
uv pip install '.[torch]'
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/Trinity-Nano-Preview-FP8-Block"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.5,
    top_k=50,
    top_p=0.95
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## License

Trinity-Nano-Preview-FP8-Block is released under the Apache-2.0 license.