---
language:
- en
- ko
library_name: transformers
license: other
license_name: upstage-solar-license
pipeline_tag: text-generation
tags:
- upstage
- solar
- moe
- 100b
- llm
- nota
- quantization
---
# **Solar-Open-100B-NotaMoeQuant-Int4**
This repository provides **Upstage’s flagship model, [Solar-Open-100B](https://huggingface.co/upstage/Solar-Open-100B)**, quantized with [**Nota AI**](https://www.nota.ai/)’s proprietary quantization technique, developed specifically for Mixture-of-Experts (MoE) LLMs. Unlike conventional quantization methods, it is designed to mitigate the representation distortion that can occur when experts are mixed under quantization in MoE architectures.
## Overview
- **Base model:** [Solar-Open-100B](https://huggingface.co/upstage/Solar-Open-100B)
- **Quantization:** Int4 weight-only
- **Packing format:** `auto_round:auto_gptq` (ensuring backend compatibility with PyTorch and vLLM)
- **Quantization group size:** 128
- **Supported tensor parallel sizes:** {1,2}
- **Hardware requirements:**
  * **Minimum:** 2 x NVIDIA A100 (80 GB)
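
To make "Int4 weight-only" with a group size of 128 concrete, here is a minimal sketch of plain group-wise round-to-nearest quantization in pure Python. It is an illustration only: it is not Nota AI's proprietary method and not AutoRound, and the function names and the toy weight row are hypothetical. Each group of 128 weights gets its own scale and offset, and every weight is stored as a 4-bit code in the range 0 to 15.

```python
import math


def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1  # largest code: 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Toy "weight row" of 256 values, which splits into two groups of 128.
row = [math.sin(0.37 * i) * (1 + i / 64) for i in range(256)]
groups = quantize_row(row)
restored = [w for codes, scale, lo in groups for w in dequantize_group(codes, scale, lo)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_err)
```

Because each group carries its own scale, the reconstruction error of a weight is bounded by half the scale of its group, which is why smaller groups generally trade extra metadata for lower error.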
## License
This repository contains both model weights and code, which are licensed under different terms:
1. **Model weights** (`*.safetensors`): licensed under the **Upstage Solar License**.
   See: https://huggingface.co/upstage/Solar-Open-100B/blob/main/LICENSE
2. **Code** (`*.py`, `*.json`, `*.jinja` files): licensed under the **Apache License 2.0**.
   See: https://www.apache.org/licenses/LICENSE-2.0
## Performance
- English
| |**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**AutoRound**|**cyankiwi AWQ**|
|--- | --- | --- | --- | --- |
|PPL (WikiText-2)↓|6.06 |**6.81** |7.12 |30.52 |
|PPL (C4)↓ |20.37 |**20.84** |20.94 |50.16 |
|PIQA↑ |82.37 |**82.75** |82.05 |78.94 |
|BoolQ↑ |84.89 |84.86 |**85.29** |68.87 |
|ARC-E↑ |87.25 |**86.48** |85.77 |83.12 |
|ARC-C↑ |61.43 |**61.69** |60.84 |56.40 |
|TruthfulQA↑ |59.25 |**60.14** |59.18 |52.38 |
|WinoGrande↑ |76.09 |**75.77** |**75.77** |68.59 |
- Korean
| |**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**AutoRound**|**cyankiwi AWQ**|
|--- | --- | --- | --- | --- |
|HRM8K↑ |81.52 |80.68 |**81.56** |32.67 |
|MMLU-ProX-Lite↑ |55.44 |**51.84** |51.26 |6.19 |
|KoBEST↑ |62.00 |**62.80** |61.80 |61.80 |
|CLiCK↑ |71.33 |**70.03** |69.77 |51.18 |
- Model weight memory footprint
|**Solar-Open-100B**|**Nota MoE Quantization (Ours)**|**cyankiwi AWQ**|
| --- | --- | --- |
|191.2 GB |51.9 GB |57.0 GB |
* Note
- ↑ / ↓ denote the direction of improvement: higher is better (↑), lower is better (↓).
- cyankiwi AWQ is a publicly available [INT4 (4-bit AWQ) quantized version of Solar-Open-100B](https://huggingface.co/cyankiwi/Solar-Open-100B-AWQ-4bit).
- Because we used a smaller thinking budget, the results for HRM8K and CLiCK are slightly lower than the numbers reported in the original Solar-Open-100B repository.
- Memory refers to the VRAM occupied by the model weights alone.
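
As a quick check on the memory footprint table, the short sketch below computes how many times smaller each quantized checkpoint is than the original weights. The footprint values are copied from the table; the function name `compression_ratio` is only illustrative.

```python
# Weight-only memory footprints in GB, copied from the table above.
FOOTPRINTS_GB = {
    "Solar-Open-100B": 191.2,
    "Nota MoE Quantization": 51.9,
    "cyankiwi AWQ": 57.0,
}


def compression_ratio(name, baseline="Solar-Open-100B"):
    """Return how many times smaller `name` is than the baseline weights."""
    return FOOTPRINTS_GB[baseline] / FOOTPRINTS_GB[name]


for name in ("Nota MoE Quantization", "cyankiwi AWQ"):
    print(f"{name}: {compression_ratio(name):.2f}x smaller than the baseline")
```

Under these figures the Nota checkpoint is roughly 3.68x smaller than the original weights, compared with roughly 3.35x for the AWQ checkpoint.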
## Inference
### Transformers
Install the required dependencies:
```bash
pip install -U transformers kernels torch accelerate auto-round==0.8.0
```
Run inference with the following code:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nota-ai/Solar-Open-100B-NotaMoEQuant-Int4"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare input
messages = [{"role": "user", "content": "who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate response
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
)

# Decode only the newly generated tokens (skip the prompt)
generated_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1] :])
print(generated_text)
```
### vLLM
Create and activate a Python virtual environment:
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Install Solar Open's optimized vLLM:
```bash
VLLM_PRECOMPILED_WHEEL_LOCATION="https://github.com/vllm-project/vllm/releases/download/v0.12.0/vllm-0.12.0-cp38-abi3-manylinux_2_31_x86_64.whl" \
VLLM_USE_PRECOMPILED=1 \
uv pip install git+https://github.com/UpstageAI/vllm.git@v0.12.0-solar-open
```
Start the vLLM server (for 2 GPUs):
```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve nota-ai/Solar-Open-100B-NotaMoEQuant-Int4 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 2 \
    --max-num-seqs 64 \
    --gpu-memory-utilization 0.8
```
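
Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8000` (vLLM's default port, since the command above does not set `--port`). The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the sampling values simply mirror the Transformers example.

```python
import json
import urllib.request

MODEL_ID = "nota-ai/Solar-Open-100B-NotaMoEQuant-Int4"
BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_request(prompt, max_tokens=1024, temperature=0.8, top_p=0.95):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat_request(payload, base_url=BASE_URL, timeout=600):
    """POST the payload to the chat completions endpoint and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_chat_request(build_chat_request("who are you?"))` returns the parsed response; in the OpenAI-compatible format the reply text is typically found under `["choices"][0]["message"]["content"]`.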