Update README.md
README.md
CHANGED
@@ -253,6 +253,61 @@ print(generated_text)

---
### **System Requirements**

| Precision | Total VRAM Usage | VRAM per GPU (2 GPUs) | VRAM per GPU (4 GPUs) |
|-----------|------------------|-----------------------|-----------------------|
| **FP32 (Full Precision)** | ~24GB | ~12GB | ~6GB |
| **FP16 (Half Precision)** | **~14GB** | **~7GB** | **~3.5GB** |
| **8-bit Quantization** | ~8GB | ~4GB | ~2GB |
| **4-bit Quantization** | ~4GB | ~2GB | ~1GB |

**Important Notes:**

- **Multi-GPU setups** distribute model memory usage across available GPUs.
- Using **`device_map="auto"`** in `transformers` automatically balances memory across devices (see the sketch below).
- **Quantized versions (8-bit, 4-bit)** are planned to further reduce VRAM requirements.
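
A minimal sketch of a multi-GPU FP16 load with `device_map="auto"`, assuming two or more CUDA devices are visible (the dtype corresponds to the FP16 row in the table above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "luvGPT/deepseek-uncensored-lore"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate shard the FP16 weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Optional sanity check: size of the loaded weights in bytes
print(model.get_memory_footprint())
```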
---

### **Loading the Model in 4-bit and 8-bit Quantization**

To reduce memory usage, you can load the model using **4-bit or 8-bit quantization** via **bitsandbytes**.

#### **Install Required Dependencies**

```bash
pip install transformers accelerate bitsandbytes
```

#### **Load Model in 8-bit Quantization**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# Define quantization config for 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 8-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```
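#### **Load Model in 4-bit Quantization**

The 4-bit path is analogous. The sketch below uses `load_in_4bit=True` with NF4 quantization and FP16 compute; these are commonly used `bitsandbytes` settings rather than values specific to this model, so adjust as needed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# 4-bit NF4 quantization with FP16 compute (illustrative settings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 4-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```

Generation then works the same way as in the earlier examples: tokenize a prompt and call `model.generate`.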
---

### **Future Work**

- **GGUF Format Support**: We plan to provide a **GGUF-quantized version** of this model, making it compatible with **llama.cpp** and other lightweight inference frameworks.
- **Fine-tuning & Alignment**: Exploring reinforcement learning and user feedback loops to improve storytelling accuracy and coherence.
- **Optimized Inference**: Integrating FlashAttention and Triton optimizations for even faster performance.

## Limitations

- **Bias**: Outputs may reflect biases present in the original DeepSeek model or training dataset.
- **Context Length**: Limited to 1,000 tokens per sequence (see the truncation sketch below).
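
Because of this limit, long prompts should be truncated before generation. A minimal sketch (the `long_prompt` variable and the `max_length` budget are illustrative; leave headroom for the tokens you intend to generate):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("luvGPT/deepseek-uncensored-lore")

long_prompt = "..."  # placeholder for a prompt that may exceed the context window

# Truncate so prompt tokens plus generated tokens stay within the 1,000-token limit
inputs = tokenizer(
    long_prompt,
    truncation=True,
    max_length=800,  # leaves ~200 tokens of headroom for generation
    return_tensors="pt"
)
```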