---
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-32B
---

Thanks to naver-hyperclovax

# HyperCLOVA X SEED 32B Think - 4bit Quantized

This is a 4-bit quantized version of [naver-hyperclovax/HyperCLOVAX-SEED-Think-32B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B), produced with bitsandbytes NF4 quantization and double quantization to reduce the memory footprint.

## Model Overview

HyperCLOVA X SEED 32B Think is an advanced vision-language thinking model that extends the SEED Think 14B line.

## Quantization Details

- **Quantization Method**: bitsandbytes NF4 (NormalFloat 4-bit)
- **Compute dtype**: bfloat16
- **Storage dtype**: uint8
- **Double Quantization**: Enabled

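To give a feel for what NF4 with double quantization means for memory, here is a rough back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement of this checkpoint: the parameter count is taken as a nominal 32 billion, the block sizes (64 weights per quantization block, 256 blocks per second-level group) are the defaults described for QLoRA-style quantization, and the function name `estimate_weight_memory_gb` is hypothetical. Layers that stay unquantized, activations, and the KV cache are not included, so real usage will be higher.

```python
def estimate_weight_memory_gb(
    n_params: float,
    bits_per_weight: float = 4.0,
    block_size: int = 64,            # assumption: weights per quantization block
    double_quant: bool = True,
    double_quant_block: int = 256,   # assumption: blocks per second-level group
) -> float:
    """Rough size of block-quantized weights in GB (1 GB = 1e9 bytes)."""
    if double_quant:
        # One 8-bit scale per block, plus one fp32 constant per group of scales.
        overhead_bits = 8 / block_size + 32 / (block_size * double_quant_block)
    else:
        # One fp32 scale per block.
        overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 32e9  # assumption: nominal parameter count implied by "32B"

with_dq = estimate_weight_memory_gb(N_PARAMS)
without_dq = estimate_weight_memory_gb(N_PARAMS, double_quant=False)
bf16 = N_PARAMS * 16 / 8 / 1e9

print(f"NF4 + double quantization: {with_dq:.2f} GB")
print(f"NF4 without double quantization: {without_dq:.2f} GB")
print(f"bfloat16 (unquantized): {bf16:.2f} GB")
```

Under these assumptions the quantized weights come to roughly 16.5 GB, against about 64 GB in bfloat16, and double quantization saves about 1.5 GB compared with plain NF4.
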
## Installation

### Requirements

```bash
pip install torch transformers bitsandbytes accelerate
```

### Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jjjssjs/HyperCLOVAX-SEED-Think-32B-4bit"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    fix_mistral_regex=True,
)

# Load quantized model (quantization config is in config.json)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Generate (the prompt is Korean for "What is quantum mechanics?")
inputs = tokenizer("양자역학이 뭐야?", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Usage Examples

### Basic Text Generation

```python
# Reuses `tokenizer` and `model` from Quick Start
prompt = "Explain quantum computing in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Image Understanding

```python
from PIL import Image

# Load image
image = Image.open("example.jpg")

# Prepare inputs
# NOTE: the tokenizer call below encodes only the text; `image` is not passed
# to the model as written. Image inputs need the model's multimodal
# preprocessing; check the base model card for how image inputs are prepared.
text = "Describe this image in detail."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Multi-turn Conversation

```python
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give me an example?"},
]

# Process conversation
inputs = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,  # end the prompt with the assistant turn marker
    return_dict=True,            # return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Features:**

- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support
- Image/Video understanding
- Korean-centric reasoning
- Long-context understanding (128K tokens)

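Because reasoning mode wraps the model's intermediate reasoning in `<think>...</think>`, it is often handy to separate that part from the final answer. The helper below is a minimal sketch: `split_reasoning` is a hypothetical name, and it assumes the tags appear literally in the decoded text. If the tags are registered as special tokens, decoding with `skip_special_tokens=True` may strip them, in which case decode with `skip_special_tokens=False` first.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> span.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from decoded model output.

    If no <think> block is found, reasoning is empty and the whole
    text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


decoded = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(decoded)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Only the first `<think>` block is extracted; output with several blocks would need a loop over `THINK_RE.finditer`.
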
## Performance Considerations

### Advantages of 4-bit Quantization

- **Memory Efficient**: Fits on consumer GPUs
- **Fast Loading**: ~8 seconds vs minutes for full precision
- **Cost Effective**: No need for expensive A100 80GB GPUs
- **Practical Deployment**: Suitable for edge devices and personal use

### Trade-offs

- **Slight Quality Loss**: Minor degradation in output quality compared to full precision
- **Inference Speed**: ~4.5 tokens/sec (may vary by hardware)
- **Precision**: 4-bit weights vs 16-bit (original)

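Since the quoted throughput depends on hardware, it can be worth measuring it on your own setup. The sketch below is one way to do that under stated assumptions: `measure_tokens_per_second` and `estimated_seconds` are hypothetical helpers, and the stand-in `fake_generate` only simulates a generation call so the example runs without a GPU.

```python
import time
from typing import Callable


def measure_tokens_per_second(generate_fn: Callable[[], int]) -> float:
    """Time one call to generate_fn, which must return the number of new tokens."""
    start = time.perf_counter()
    n_new_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time was not positive")
    return n_new_tokens / elapsed


def estimated_seconds(n_new_tokens: int, tokens_per_second: float = 4.5) -> float:
    """Expected wall-clock time at a given throughput (default: the figure quoted above)."""
    return n_new_tokens / tokens_per_second


def fake_generate() -> int:
    """Stand-in for a real generation call; pretends to emit 10 tokens."""
    time.sleep(0.05)
    return 10


print(f"Measured: {measure_tokens_per_second(fake_generate):.1f} tokens/sec")
print(f"200 tokens at 4.5 tokens/sec: about {estimated_seconds(200):.0f} s")
```

With the real model, `generate_fn` would wrap `model.generate(...)` and return the output length minus the prompt length, for example `outputs.shape[-1] - inputs["input_ids"].shape[-1]`.
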
## Known Issues

- The tokenizer may emit a warning about its regex pattern. It can be ignored, or fixed by passing `fix_mistral_regex=True` when loading the tokenizer.
- Some vision packages may show import warnings. These do not affect text-only inference.

## Benchmark Results

**Note**: Quantized model benchmarks pending. Performance may differ slightly from the original model.

For original model benchmarks, see: [HyperCLOVAX-SEED-Think-32B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B)

## License

This model is licensed under the **HyperCLOVA X SEED 32B Think Model License Agreement**.