# Ministral-3B-3B-Reasoning Neuron Model

This is a pre-compiled AWS Neuron version of [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512) for inference on AWS Inferentia2/Trainium instances.

## Model Details

- **Base Model**: mistralai/Ministral-3B-3B-Reasoning-2512
- **Architecture**: Ministral3 with YaRN RoPE scaling
- **Tensor Parallel**: 2
- **Batch Size**: 1
- **Sequence Length**: 4096
- **Dtype**: bfloat16

## Requirements

- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers

## Installation

```bash
pip install optimum-neuron transformers huggingface_hub
```

## Usage

### Method 1: Using the helper function

```python
from huggingface_hub import hf_hub_download

# Download and execute the custom module to register the model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Load the model and tokenizer (load_ministral3 is defined by ministral3_neuron.py)
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")

# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Method 2: Manual loading

```python
from huggingface_hub import hf_hub_download

# First, register the custom model classes
exec(open(hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")).read())

# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Important Notes

1. **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading. This file registers the Ministral3 model architecture in optimum-neuron's model registry.
2. **Hardware Requirements**: This model is compiled for a tensor parallelism degree of 2 and therefore requires at least 2 Neuron cores. Use inf2.xlarge or larger.
3. **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.

## Model Configuration

The model was exported with the following neuron configuration:

```json
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
```

## Files

- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template

## License

Please refer to the original model's license at [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512).

## Acknowledgments

This model was compiled using [optimum-neuron](https://github.com/huggingface/optimum-neuron) for AWS Neuron devices.
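## Sizing Note

As a back-of-the-envelope sketch of why the tensor-parallel degree of 2 maps onto two Neuron cores, the static weight footprint per core can be estimated as follows. The ~3e9 parameter count is inferred from the "3B" in the model name and the 2 bytes per value from the bfloat16 dtype; this is illustrative arithmetic, not a measured figure.

```python
# Back-of-the-envelope: weight memory per Neuron core under tensor parallelism.
# Assumptions (illustrative): ~3e9 parameters (from the "3B" in the model name),
# 2 bytes per parameter (bfloat16), tp_degree=2 (from neuron_config.json).
params = 3_000_000_000
bytes_per_param = 2   # bfloat16
tp_degree = 2         # weights are sharded across 2 Neuron cores
per_core_gb = params * bytes_per_param / tp_degree / 1e9
print(f"~{per_core_gb:.1f} GB of weights per core")  # ~3.0 GB
```

This covers only the sharded weights; the KV cache for the 4096-token sequence and intermediate activations occupy additional device memory, which is part of why at least two cores (inf2.xlarge or larger) are required.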