# Ministral-3B-3B-Reasoning Neuron Model

This is a pre-compiled AWS Neuron version of [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512) for inference on AWS Inferentia2/Trainium instances.

## Model Details

- **Base Model**: mistralai/Ministral-3B-3B-Reasoning-2512
- **Architecture**: Ministral3 with YaRN RoPE scaling
- **Tensor Parallel**: 2
- **Batch Size**: 1
- **Sequence Length**: 4096
- **Dtype**: bfloat16

## Requirements

- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers

## Installation

```bash
pip install optimum-neuron transformers huggingface_hub
```

## Usage

### Method 1: Using the helper function

```python
from huggingface_hub import hf_hub_download

# Download and execute the custom module to register the model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load the model and tokenizer (load_ministral3 is defined by ministral3_neuron.py)
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Method 2: Manual loading

```python
from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Important Notes

1. **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading. This file registers the Ministral3 model architecture in optimum-neuron's model registry.
2. **Hardware Requirements**: This model is compiled for a tensor parallelism degree of 2 and therefore requires at least 2 Neuron cores. Use inf2.xlarge or larger.
3. **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.

## Model Configuration

The model was exported with the following neuron configuration:

```json
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
```

## Files

- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template

## License

Please refer to the original model's license at [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512).

## Acknowledgments

This model was compiled using [optimum-neuron](https://github.com/huggingface/optimum-neuron) for AWS Neuron devices.
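## Sizing Note

As a back-of-the-envelope sketch of why the tensor-parallel degree of 2 maps onto two Neuron cores, the static weight footprint per core can be estimated as follows. The ~3e9 parameter count is inferred from the "3B" in the model name and the 2 bytes per value from the bfloat16 dtype; this is illustrative arithmetic, not a measured figure.

```python
# Back-of-the-envelope: weight memory per Neuron core under tensor parallelism.
# Assumptions (illustrative): ~3e9 parameters (from the "3B" in the model name),
# 2 bytes per parameter (bfloat16), tp_degree=2 (from neuron_config.json).
params = 3_000_000_000
bytes_per_param = 2   # bfloat16
tp_degree = 2         # weights are sharded across 2 Neuron cores
per_core_gb = params * bytes_per_param / tp_degree / 1e9
print(f"~{per_core_gb:.1f} GB of weights per core")  # ~3.0 GB
```

This covers only the sharded weights; the KV cache for the 4096-token sequence and intermediate activations occupy additional device memory, which is part of why at least two cores (inf2.xlarge or larger) are required.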