# Ministral-3B-3B-Reasoning Neuron Model
This is a pre-compiled AWS Neuron version of mistralai/Ministral-3B-3B-Reasoning-2512 for inference on AWS Inferentia2/Trainium instances.
## Model Details
- Base Model: mistralai/Ministral-3B-3B-Reasoning-2512
- Architecture: Ministral3 with YaRN RoPE scaling
- Tensor Parallel: 2
- Batch Size: 1
- Sequence Length: 4096
- Dtype: bfloat16
## Requirements
- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers
## Installation

```bash
pip install optimum-neuron transformers huggingface_hub
```
## Usage

### Method 1: Using the helper function
```python
from huggingface_hub import hf_hub_download

# Download and execute the custom module to register model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Important Notes

- **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading. The file registers the Ministral3 model architecture in optimum-neuron's model registry.
- **Hardware Requirements**: This model is compiled for a tensor parallel degree of 2 and requires at least 2 Neuron cores. Use inf2.xlarge or larger.
- **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.
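Because the model is compiled with a fixed sequence length, the prompt and the generated tokens together must fit within the 4096-token budget. A minimal sketch of that bookkeeping (the helper name is illustrative, not part of this repository):

```python
# Compiled maximum sequence length for this model (prompt + generated tokens).
SEQ_LEN = 4096

def max_new_tokens_for(prompt_token_count: int, seq_len: int = SEQ_LEN) -> int:
    """Return how many new tokens can still be generated for a given prompt."""
    remaining = seq_len - prompt_token_count
    return max(remaining, 0)

# Example: a 4000-token prompt leaves room for at most 96 new tokens.
print(max_new_tokens_for(4000))  # -> 96
```

Clamping `max_new_tokens` this way avoids requesting generations that exceed the compiled sequence length.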
## Model Configuration

The model was exported with the following neuron configuration:

```json
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
```
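Before loading on a given instance, it can help to sanity-check the compiled settings against the hardware, since `tp_degree` must not exceed the number of Neuron cores the instance exposes. A sketch that parses the configuration (inlined here for illustration; in practice, read `neuron_config.json` from the downloaded snapshot):

```python
import json

# The neuron configuration shown above, inlined as a string for illustration.
neuron_config = json.loads("""
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
""")

available_cores = 2  # e.g. inf2.xlarge exposes 2 Neuron cores

# Loading fails at runtime if the compiled tensor-parallel degree
# exceeds the cores available on the instance.
assert neuron_config["tp_degree"] <= available_cores, (
    f"Need {neuron_config['tp_degree']} Neuron cores, have {available_cores}"
)
print("tp_degree:", neuron_config["tp_degree"])
```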
## Files

- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template
## License
Please refer to the original model's license at mistralai/Ministral-3B-3B-Reasoning-2512.
## Acknowledgments
This model was compiled using optimum-neuron for AWS Neuron devices.