# Ministral-3B-3B-Reasoning Neuron Model
This is a pre-compiled AWS Neuron version of mistralai/Ministral-3B-3B-Reasoning-2512 for inference on AWS Inferentia2/Trainium instances.
## Model Details
- Base Model: mistralai/Ministral-3B-3B-Reasoning-2512
- Architecture: Ministral3 with YaRN RoPE scaling
- Tensor Parallel: 2
- Batch Size: 1
- Sequence Length: 4096
- Dtype: bfloat16
## Requirements
- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers
## Installation

```bash
pip install optimum-neuron transformers huggingface_hub
```
## Usage

### Method 1: Using the helper function
```python
from huggingface_hub import hf_hub_download

# Download and execute the custom module to register model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Important Notes

- **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading. The file registers the Ministral3 model architecture in optimum-neuron's model registry.
- **Hardware Requirements**: This model is compiled for a tensor parallel degree of 2 and requires at least 2 Neuron cores. Use inf2.xlarge or larger.
- **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.
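Because the model is compiled with a fixed sequence length, the prompt and the generated tokens together must fit within the 4096-token budget. A minimal sketch of that bookkeeping (the helper name is illustrative, not part of this repository):

```python
# Compiled maximum sequence length for this model (prompt + generated tokens).
SEQ_LEN = 4096

def max_new_tokens_for(prompt_token_count: int, seq_len: int = SEQ_LEN) -> int:
    """Return how many new tokens can still be generated for a given prompt."""
    remaining = seq_len - prompt_token_count
    return max(remaining, 0)

# Example: a 4000-token prompt leaves room for at most 96 new tokens.
print(max_new_tokens_for(4000))  # -> 96
```

Clamping `max_new_tokens` this way avoids requesting generations that exceed the compiled sequence length.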
## Model Configuration

The model was exported with the following neuron configuration:

```json
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
```
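Before loading on a given instance, it can help to sanity-check the compiled settings against the hardware, since `tp_degree` must not exceed the number of Neuron cores the instance exposes. A sketch that parses the configuration (inlined here for illustration; in practice, read `neuron_config.json` from the downloaded snapshot):

```python
import json

# The neuron configuration shown above, inlined as a string for illustration.
neuron_config = json.loads("""
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
""")

available_cores = 2  # e.g. inf2.xlarge exposes 2 Neuron cores

# Loading fails at runtime if the compiled tensor-parallel degree
# exceeds the cores available on the instance.
assert neuron_config["tp_degree"] <= available_cores, (
    f"Need {neuron_config['tp_degree']} Neuron cores, have {available_cores}"
)
print("tp_degree:", neuron_config["tp_degree"])
```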
## Files

- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template
## License
Please refer to the original model's license at mistralai/Ministral-3B-3B-Reasoning-2512.
## Acknowledgments
This model was compiled using optimum-neuron for AWS Neuron devices.