---
language:
- en
- ko
library_name: transformers
license: other
license_name: upstage-solar-license
pipeline_tag: text-generation
tags:
- upstage
- solar
- moe
- 100b
- llm
- nota
- quantization
---

# **Solar-Open-100B-NotaMoEQuant-Int4**

This repository provides **Upstage’s flagship model, [Solar-Open-100B](https://huggingface.co/upstage/Solar-Open-100B)**, packaged with [**Nota AI**](https://www.nota.ai/)’s proprietary quantization technique developed specifically for Mixture-of-Experts (MoE) LLMs. Unlike conventional quantization methods, this approach is designed to mitigate the representation distortion that can occur when experts are mixed under quantization in MoE architectures.

## Overview

- **Base model:** [Solar-Open-100B](https://huggingface.co/upstage/Solar-Open-100B) 
- **Quantization:** Int4 weight-only
- **Packing format:** `auto_round:auto_gptq` (ensuring backend compatibility with PyTorch and vLLM)
- **Quantization group size:** 128
- **Supported tensor parallel sizes:** {1,2}
- **Hardware Requirements:** 
    * **Minimum:** 2 x NVIDIA A100 (80GB)    
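
"Int4 weight-only with group size 128" means each group of 128 weights shares a single scale and zero-point. The sketch below is a generic illustration of asymmetric group-wise 4-bit fake-quantization, not Nota AI's proprietary method; only the `group_size` value mirrors this checkpoint's configuration.

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 128):
    """Generic asymmetric group-wise int4 fake-quantization:
    each group of `group_size` weights shares one scale/zero-point."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 int4 levels: 0..15
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 15)  # stored codes
    deq = (q - zero) * scale                               # dequantized view
    return q.reshape(out_f, in_f), deq.reshape(out_f, in_f)

torch.manual_seed(0)
w = torch.randn(4, 256)
q, deq = fake_quant_int4(w)
print(f"max reconstruction error: {(w - deq).abs().max():.4f}")
```

At inference time, GPTQ-style kernels (as used by the `auto_round:auto_gptq` packing) keep the 4-bit codes packed and dequantize on the fly.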

## License
This repository contains both model weights and code,
which are licensed under different terms:

1. **Model weights** (`*.safetensors`)
   Licensed under the **Upstage Solar License**.
   See: https://huggingface.co/upstage/Solar-Open-100B/blob/main/LICENSE

2. **Code** (`*.py`, `*.json`, `*.jinja` files)
   Licensed under the **Apache License 2.0**.
   See: https://www.apache.org/licenses/LICENSE-2.0


## Performance

- English

|                 |**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**AutoRound**|**cyankiwi AWQ**|
|---              | ---               | ---                            | ---         | ---            |
|PPL (WikiText-2)↓|6.06               |**6.81**                        |7.12         |30.52           |
|PPL (C4)↓        |20.37              |**20.84**                       |20.94        |50.16           |
|PIQA↑            |82.37              |**82.75**                       |82.05        |78.94           |
|BoolQ↑           |84.89              |84.86                           |**85.29**    |68.87           |
|ARC-E↑           |87.25              |**86.48**                       |85.77        |83.12           |
|ARC-C↑           |61.43              |**61.69**                       |60.84        |56.40           |
|TruthfulQA↑      |59.25              |**60.14**                       |59.18        |52.38           |
|WinoGrande↑      |76.09              |**75.77**                       |**75.77**    |68.59           | 

- Korean

|                 |**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**AutoRound**|**cyankiwi AWQ**|
|---              | ---               | ---                            | ---         | ---            |
|HRM8K↑           |81.52              |80.68                           |**81.56**    |32.67           |
|MMLU-ProX-Lite↑  |55.44              |**51.84**                       |51.26        |6.19            |
|KoBEST↑          |62.00              |**62.80**                       |61.80        |61.80           |
|CLiCK↑           |71.33              |**70.03**                       |69.77        |51.18           |

- Model weight memory footprint

|**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**cyankiwi AWQ**|
| ---               | ---                            | ---            |
|191.2 GB           |51.9 GB                         |57.0 GB         |


* Note 
  - ↑ / ↓ denote the direction of improvement: higher is better (↑), lower is better (↓).
  - cyankiwi AWQ is a publicly available [INT4 (4-bit AWQ) quantized version of Solar-Open-100B](https://huggingface.co/cyankiwi/Solar-Open-100B-AWQ-4bit).
  - Because we used a smaller thinking budget, the results for HRM8K and CLiCK are slightly lower than the numbers reported in the original Solar-Open-100B repository.
  - Memory refers to the pure VRAM footprint occupied only by the model weights.
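
As a rough sanity check on the footprint table, the numbers follow from bytes-per-parameter plus a small overhead for the group-wise scales and zero-points. The parameter count below (~102B) and the overhead estimate are approximations, not official figures:

```python
# Back-of-envelope weight-memory estimate (approximate; the exact
# parameter count and quantization overheads differ in practice).
params = 102e9                      # ~102B parameters (assumption)
bf16_gb = params * 2 / 1024**3      # 2 bytes/param in bf16
int4_gb = params * 0.5 / 1024**3    # 0.5 bytes/param for int4 codes
# group-wise scale + zero-point: roughly 4 extra bytes per 128 weights
overhead_gb = params / 128 * 4 / 1024**3
print(f"bf16 ~ {bf16_gb:.1f} GB, int4 ~ {int4_gb + overhead_gb:.1f} GB")
```

This lands close to the measured 191.2 GB / 51.9 GB in the table above.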

## Inference
### Transformers

Install the required dependencies:

```bash
pip install -U transformers kernels torch accelerate auto-round==0.8.0
```

Run inference with the following code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nota-ai/Solar-Open-100B-NotaMoEQuant-Int4"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare input
messages = [{"role": "user", "content": "who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate response
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
)
generated_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1] :])
print(generated_text)
```

### vLLM
Create and activate a Python virtual environment:
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```

Install Solar Open's optimized vLLM:
```bash
VLLM_PRECOMPILED_WHEEL_LOCATION="https://github.com/vllm-project/vllm/releases/download/v0.12.0/vllm-0.12.0-cp38-abi3-manylinux_2_31_x86_64.whl" \
VLLM_USE_PRECOMPILED=1 \
uv pip install git+https://github.com/UpstageAI/vllm.git@v0.12.0-solar-open
```

Start the vLLM server (for 2 GPUs):
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve nota-ai/Solar-Open-100B-NotaMoEQuant-Int4 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 2 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.8
```
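
Once running, the server exposes the OpenAI-compatible API (by default on `localhost:8000`; host and port here are assumptions, not part of this repository). A minimal client sketch that builds the request payload; the actual call is commented out so the snippet runs without a live server:

```python
import json
import urllib.request

payload = {
    "model": "nota-ai/Solar-Open-100B-NotaMoEQuant-Int4",
    "messages": [{"role": "user", "content": "who are you?"}],
    "max_tokens": 512,
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # default vLLM address (assumption)
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```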