# Ministral-3B-3B-Reasoning Neuron Model
This is a pre-compiled AWS Neuron version of [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512) for inference on AWS Inferentia2/Trainium instances.
## Model Details
- **Base Model**: mistralai/Ministral-3B-3B-Reasoning-2512
- **Architecture**: Ministral3 with YaRN RoPE scaling
- **Tensor Parallelism**: 2
- **Batch Size**: 1
- **Sequence Length**: 4096
- **Dtype**: bfloat16
## Requirements
- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers
## Installation
```bash
pip install optimum-neuron transformers huggingface_hub
```
## Usage
### Method 1: Using the helper function
```python
from huggingface_hub import hf_hub_download

# Download and execute the custom module to register the model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
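Both methods above register the custom classes by `exec()`-ing the downloaded file. A minimal alternative sketch loads the same file through `importlib` instead, so the code runs inside a proper module namespace; the helper name below is illustrative and not part of this repository:

```python
import importlib.util

def load_module_from_path(path: str, name: str = "ministral3_neuron"):
    """Import a Python file as a module; an alternative to exec(open(path).read())."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file, which performs its registrations
    return module

# Usage with the path returned by hf_hub_download:
# load_module_from_path(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py"))
```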
## Important Notes
1. **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading. This file registers the Ministral3 model architecture in optimum-neuron's model registry.
2. **Hardware Requirements**: This model is compiled for tensor parallelism of 2, requiring at least 2 Neuron cores. Use inf2.xlarge or larger.
3. **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.
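Because the sequence length is fixed at compile time, prompt tokens plus requested new tokens must fit inside the compiled window. A small sketch of that check (the helper name is an illustration, not part of this repository):

```python
COMPILED_SEQ_LEN = 4096  # from neuron_config.json

def fits_compiled_window(prompt_tokens: int, max_new_tokens: int,
                         seq_len: int = COMPILED_SEQ_LEN) -> bool:
    """True if prompt plus generated tokens stay within the compiled length."""
    return prompt_tokens + max_new_tokens <= seq_len

print(fits_compiled_window(4000, 50))  # True: 4050 <= 4096
print(fits_compiled_window(4090, 50))  # False: 4140 > 4096
```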
## Model Configuration
The model was exported with the following neuron configuration:
```json
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
```
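A compiled Neuron artifact only serves requests that match these shapes, so it can be worth reading `neuron_config.json` and sanity-checking it before sending traffic. A sketch using the values above (inlined here rather than read from disk):

```python
import json

# The compilation settings from this model card, as they would appear
# in neuron_config.json.
neuron_config = json.loads("""
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
""")

# tp_degree determines how many Neuron cores the instance must provide.
assert neuron_config["tp_degree"] == 2
assert neuron_config["sequence_length"] == 4096
print(neuron_config["torch_dtype"])  # prints "bfloat16"
```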
## Files
- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template
## License
Please refer to the original model's license at [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512).
## Acknowledgments
This model was compiled using [optimum-neuron](https://github.com/huggingface/optimum-neuron) for AWS Neuron devices.