
Ministral-3B-3B-Reasoning Neuron Model

This is a pre-compiled AWS Neuron version of mistralai/Ministral-3B-3B-Reasoning-2512 for inference on AWS Inferentia2/Trainium instances.

Model Details

  • Base Model: mistralai/Ministral-3B-3B-Reasoning-2512
  • Architecture: Ministral3 with YaRN RoPE scaling
  • Tensor Parallel: 2
  • Batch Size: 1
  • Sequence Length: 4096
  • Dtype: bfloat16

Requirements

  • AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
  • Python 3.10+
  • optimum-neuron
  • neuronx-distributed
  • transformers

Installation

pip install optimum-neuron transformers huggingface_hub

Usage

Method 1: Using the helper function

from huggingface_hub import hf_hub_download

# Download and execute the custom module to register model classes
# (this runs remote code; review the file before executing it)
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Method 2: Manual loading

from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Important Notes

  1. Custom Code Required: This model requires executing the ministral3_neuron.py file before loading. This file registers the Ministral3 model architecture in optimum-neuron's model registry.

  2. Hardware Requirements: This model is compiled for tensor parallelism of 2, requiring at least 2 Neuron cores. Use inf2.xlarge or larger.

  3. Sequence Length: The model is compiled for a maximum sequence length of 4096 tokens.
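Because the sequence length is fixed at compile time, it is worth checking that a prompt plus its generation budget fits before calling generate. A minimal sketch (the fits_in_context helper is hypothetical, not part of this repo; 4096 is the compiled limit from the configuration below):

```python
COMPILED_SEQ_LEN = 4096  # fixed at Neuron compile time

def fits_in_context(prompt_token_count, max_new_tokens, seq_len=COMPILED_SEQ_LEN):
    """Check that the prompt plus the generation budget stays within the compiled length."""
    return prompt_token_count + max_new_tokens <= seq_len

# A 4000-token prompt leaves room for 96 new tokens, not 100.
print(fits_in_context(4000, 100))  # False
print(fits_in_context(4000, 50))   # True
```

If the check fails, either truncate the prompt (e.g. tokenizer(..., truncation=True, max_length=...)) or lower max_new_tokens; the compiled model cannot be extended past 4096 tokens at runtime.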

Model Configuration

The model was exported with the following neuron configuration:

{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
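To verify programmatically that a downloaded artifact matches these settings, the shipped neuron_config.json can be read with the standard library. A sketch, inlining the configuration shown above (field names mirror that JSON):

```python
import json

# The configuration shown above, as it appears in neuron_config.json.
raw = """
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
"""

neuron_config = json.loads(raw)

# A mismatch here means the artifact was compiled for different settings
# than your instance provides (e.g. fewer Neuron cores than tp_degree).
assert neuron_config["tp_degree"] == 2
assert neuron_config["sequence_length"] == 4096
print(neuron_config["torch_dtype"])  # bfloat16
```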

Files

  • model.pt - Compiled Neuron model with weights
  • config.json - Model configuration
  • neuron_config.json - Neuron compilation configuration
  • ministral3_neuron.py - Custom code for model registration
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json - Tokenizer files
  • chat_template.jinja - Chat template
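After downloading (e.g. via huggingface_hub's snapshot_download), you can sanity-check that the local snapshot is complete before loading. A small standard-library sketch (missing_files is a hypothetical helper, not part of this repo):

```python
from pathlib import Path
import tempfile

# Files this repository is expected to contain (from the list above).
REQUIRED_FILES = [
    "model.pt",
    "config.json",
    "neuron_config.json",
    "ministral3_neuron.py",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "chat_template.jinja",
]

def missing_files(model_dir):
    """Return the required files absent from a local snapshot directory."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]

# Demo against a temporary directory holding only two of the files.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "config.json").touch()
    (Path(d) / "model.pt").touch()
    print(missing_files(d))  # the six files still missing
```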

License

Please refer to the original model's license at mistralai/Ministral-3B-3B-Reasoning-2512.

Acknowledgments

This model was compiled using optimum-neuron for AWS Neuron devices.
