# Ministral-3B-3B-Reasoning Neuron Model
This is a pre-compiled AWS Neuron version of [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512) for inference on AWS Inferentia2/Trainium instances.
## Model Details
- **Base Model**: mistralai/Ministral-3B-3B-Reasoning-2512
- **Architecture**: Ministral3 with YaRN RoPE scaling
- **Tensor Parallel**: 2
- **Batch Size**: 1
- **Sequence Length**: 4096
- **Dtype**: bfloat16
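As a rough sanity check, the settings above imply the following weight-memory footprint per Neuron core (a back-of-the-envelope estimate only; actual usage is higher once the KV cache and compiled graph buffers are included):

```python
# Rough weight-memory estimate for a ~3B-parameter model in bfloat16,
# sharded across tp_degree=2 Neuron cores. Illustrative only.
n_params = 3e9          # ~3B parameters (approximate)
bytes_per_param = 2     # bfloat16 = 2 bytes per parameter
tp_degree = 2

total_gib = n_params * bytes_per_param / 2**30
per_core_gib = total_gib / tp_degree

print(f"total weights: ~{total_gib:.1f} GiB, per core: ~{per_core_gib:.1f} GiB")
```

About 5.6 GiB of weights total, or roughly 2.8 GiB per core, which fits comfortably on the two cores of an inf2.xlarge.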
## Requirements
- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers
## Installation
```bash
pip install optimum-neuron transformers huggingface_hub
```
## Usage
### Method 1: Using the helper function
```python
from huggingface_hub import hf_hub_download
# Download and execute the custom module to register the model classes.
# Note: this executes downloaded code; review ministral3_neuron.py first.
module_path = hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")
with open(module_path) as f:
    exec(f.read())
# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")
# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download
# First, register the custom model classes (this executes downloaded code; review it first)
module_path = hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")
with open(module_path) as f:
    exec(f.read())
# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer
model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")
# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Important Notes
1. **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading; the file registers the Ministral3 architecture in optimum-neuron's model registry. Since this executes downloaded code, review the file before running it.
2. **Hardware Requirements**: This model is compiled for tensor parallelism of 2, requiring at least 2 Neuron cores. Use inf2.xlarge or larger.
3. **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.
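Because the compiled graph is fixed at 4096 tokens, prompt plus generated tokens must fit within that budget. A minimal pre-flight check before calling `generate` (a sketch; the helper name `fits_in_context` is ours, not part of this repo):

```python
MAX_SEQ_LEN = 4096  # compiled sequence length of this model

def fits_in_context(n_prompt_tokens: int, max_new_tokens: int,
                    limit: int = MAX_SEQ_LEN) -> bool:
    """Return True if prompt + generation fits in the compiled context."""
    return n_prompt_tokens + max_new_tokens <= limit

# Example: a 4000-token prompt leaves room for at most 96 new tokens.
print(fits_in_context(4000, 96))   # True
print(fits_in_context(4000, 100))  # False
```

Count prompt tokens with the tokenizer (`len(tokenizer(text)["input_ids"])`) and cap `max_new_tokens` accordingly.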
## Model Configuration
The model was exported with the following neuron configuration:
```json
{
"batch_size": 1,
"sequence_length": 4096,
"tp_degree": 2,
"torch_dtype": "bfloat16",
"on_device_sampling": true,
"fused_qkv": true
}
```
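If you want to read these settings programmatically (for example, to pick a safe `max_new_tokens` at runtime), `neuron_config.json` can be parsed with the standard library. The snippet below uses an inline copy of the JSON above for illustration:

```python
import json

# In practice, load the file shipped with the model:
#   with open("neuron_config.json") as f: cfg = json.load(f)
cfg = json.loads("""
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
""")

print(cfg["sequence_length"])  # 4096
print(cfg["tp_degree"])        # 2
```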
## Files
- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template
## License
Please refer to the original model's license at [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512).
## Acknowledgments
This model was compiled using [optimum-neuron](https://github.com/huggingface/optimum-neuron) for AWS Neuron devices.