---
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-32B
---

Thanks to naver-hyperclovax

# HyperCLOVA X SEED 32B Think - 4bit Quantized

This is a 4-bit quantized version of [naver-hyperclovax/HyperCLOVAX-SEED-Think-32B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B) using bitsandbytes NF4 quantization with double quantization for improved memory efficiency.

## Model Overview

HyperCLOVA X SEED 32B Think is an advanced vision-language thinking model that extends the SEED Think 14B line.

## Quantization Details

- **Quantization Method**: bitsandbytes NF4 (NormalFloat 4-bit)
- **Compute dtype**: bfloat16
- **Storage dtype**: uint8
- **Double Quantization**: Enabled

## Installation

### Requirements

```bash
pip install torch transformers bitsandbytes accelerate
```

### Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jjjssjs/HyperCLOVAX-SEED-Think-32B-4bit"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    fix_mistral_regex=True,
)

# Load quantized model (quantization config is in config.json)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Generate (the Korean prompt asks "What is quantum mechanics?")
inputs = tokenizer("양자역학이 뭐야?", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
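Continuing from the Quick Start snippet, you can sanity-check the memory savings with `get_memory_footprint()`, which transformers provides on every loaded model:

```python
# Continuing from Quick Start: report the loaded model's memory footprint.
# A 4-bit model should occupy roughly a quarter of the bf16 footprint,
# plus some quantization overhead.
gb = model.get_memory_footprint() / (1024 ** 3)
print(f"Model memory footprint: {gb:.1f} GiB")
```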
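Alternatively, if you prefer to quantize the original full-precision checkpoint on the fly instead of downloading this repo, the settings listed under Quantization Details correspond to a `BitsAndBytesConfig` like the following sketch (the weights in this repo already carry these settings in `config.json`, so this step is optional):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Mirrors the Quantization Details above: NF4 with double quantization
# and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "naver-hyperclovax/HyperCLOVAX-SEED-Think-32B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```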
## Usage Examples

### Basic Text Generation

```python
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Image Understanding

```python
from PIL import Image
from transformers import AutoProcessor

# For image inputs, use the model's processor rather than the plain
# tokenizer so the image is actually passed to the model (the exact
# preprocessing API depends on this model's remote code).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("example.jpg")

# Prepare inputs
text = "Describe this image in detail."
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Multi-turn Conversation

```python
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give me an example?"}
]

# Process conversation
inputs = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Features:**
- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support
- Image/Video understanding
- Korean-centric reasoning
- Long-context understanding (128K tokens)

## Performance Considerations

### Advantages of 4-bit Quantization

- **Memory Efficient**: Fits on consumer GPUs
- **Fast Loading**: ~8 seconds vs minutes for full precision
- **Cost Effective**: No need for expensive A100 80GB GPUs
- **Practical Deployment**: Suitable for edge devices and personal use

### Trade-offs

- **Slight Quality Loss**: Minor degradation in output quality compared to full precision
- **Inference Speed**: ~4.5 tokens/sec (may vary by hardware)
- **Precision**: 4-bit weights vs 16-bit (original)

## Known Issues

- Tokenizer warning about a regex pattern (can be ignored, or fixed by passing `fix_mistral_regex=True` as in the Quick Start above)
- Some vision packages may show import warnings (does not affect text-only inference)

## Benchmark Results

**Note**: Quantized model benchmarks are pending. Performance may differ slightly from the original model.

For original model benchmarks, see: [HyperCLOVAX-SEED-Think-32B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-32B)

## License

This model is licensed under the **HyperCLOVA X SEED 32B Think Model License Agreement**.