Mistral-Merged (Compressed QLoRA Model)
Overview
This model is a compressed and fine-tuned version of:
mistralai/Voxtral-Mini-4B-Realtime-2602
The objective of this project is to reduce inference cost, memory usage, and energy consumption while maintaining acceptable output quality.
The model is optimized for:
- Efficient inference
- Low GPU memory usage
- vLLM deployment
- Energy-aware benchmarking
Compression Techniques Used
The following compression and optimization techniques were applied:
1. QLoRA (Quantized Low-Rank Adaptation)
- Parameter-efficient fine-tuning of a quantized base model
- Only a small set of low-rank adapter parameters is updated
- Significantly reduces training memory
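To see why QLoRA trains so few parameters, the sketch below estimates the adapter parameter count for rank 32 on four attention projections. The hidden size and layer count are assumptions for illustration, not the real model's dimensions:

```python
# Rough, illustrative estimate of how few parameters LoRA trains
# compared with full fine-tuning. hidden_size and num_layers are
# assumed values, not taken from the actual model config.
hidden_size = 3072          # assumed model dimension
num_layers = 32             # assumed number of transformer layers
rank = 32                   # LoRA rank used in this project

# Each adapted projection (q/k/v/o) gets two low-rank matrices:
# A (hidden_size x rank) and B (rank x hidden_size).
params_per_module = 2 * hidden_size * rank
modules_per_layer = 4       # q_proj, k_proj, v_proj, o_proj
lora_params = num_layers * modules_per_layer * params_per_module

full_params = 4_000_000_000  # ~4B-parameter base model
fraction = lora_params / full_params
print(f"LoRA params: {lora_params:,} ({fraction:.3%} of the base model)")
```

Under these assumptions, the adapters amount to well under 1% of the base model's parameters, which is what makes fine-tuning fit in modest GPU memory.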
2. 8-bit Quantization
Implemented using:
BitsAndBytesConfig(load_in_8bit=True)
Benefits:
- Lower VRAM usage
- Faster loading
- Reduced energy consumption
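A minimal loading sketch using the `transformers` quantization API. The repository id is taken from this card; the actual `from_pretrained` call is left commented out because it downloads the full model and requires a GPU with `bitsandbytes` installed:

```python
from transformers import BitsAndBytesConfig

# 8-bit weight quantization config, as used in this project.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Loading the model with this config (commented out here to avoid
# a multi-gigabyte download; requires a CUDA GPU + bitsandbytes):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "madhurithika22/mistral-compressed",
#     quantization_config=bnb_config,
#     device_map="auto",
# )
```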
3. LoRA Adapters
LoRA adapters were trained and merged into the base model.
Configuration:
- Rank (r): 32
- Alpha: 32
- Dropout: 0.05
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
Training Details
Dataset
Training dataset:
golden_set_global
Task:
- Exact text copying / continuation
- Multilingual sequence reproduction
Epochs
- 5 epochs
Optimizer
- AdamW
Learning rate:
- 2e-4
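The hyperparameters above can be expressed as a `transformers` TrainingArguments fragment. The batch size, precision flag, and output directory below are placeholders, not values from this card:

```python
from transformers import TrainingArguments

# Hyperparameters from the training details above; batch size,
# fp16, and output_dir are assumed placeholders.
training_args = TrainingArguments(
    output_dir="mistral-compressed-qlora",  # placeholder
    num_train_epochs=5,
    learning_rate=2e-4,
    optim="adamw_torch",                    # AdamW optimizer
    per_device_train_batch_size=4,          # assumed
    fp16=True,                              # assumed
)
```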
Inference Configuration
The model is intended to run using:
- vLLM
vllm serve --config vllm_config.yaml
Evaluation
Evaluation metrics used:
- Semantic Similarity Accuracy
- Word Error Rate (WER)
- Energy Consumption (CodeCarbon)
The model was benchmarked on multilingual text reproduction tasks.
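Word Error Rate, one of the metrics listed above, is the word-level edit distance between a reference and a hypothesis, normalized by the reference length. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution (sat -> sit) and one deletion (the) over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better; 0.0 means the output reproduces the reference exactly, which is the target behavior for the exact-copying task used in training.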
Model tree for madhurithika22/mistral-compressed
- Base model: mistralai/Ministral-3-3B-Base-2512
- Fine-tuned from: mistralai/Voxtral-Mini-4B-Realtime-2602