# Ministral-3B-3B-Reasoning Neuron Model
This is a pre-compiled AWS Neuron version of [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512) for inference on AWS Inferentia2/Trainium instances.
## Model Details
- **Base Model**: mistralai/Ministral-3B-3B-Reasoning-2512
- **Architecture**: Ministral3 with YaRN RoPE scaling
- **Tensor Parallel**: 2
- **Batch Size**: 1
- **Sequence Length**: 4096
- **Dtype**: bfloat16
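As a rough sanity check, the settings above imply the following weight-memory footprint per Neuron core (a back-of-the-envelope estimate only; actual usage is higher once the KV cache and compiled graph buffers are included):

```python
# Rough weight-memory estimate for a ~3B-parameter model in bfloat16,
# sharded across tp_degree=2 Neuron cores. Illustrative only.
n_params = 3e9          # ~3B parameters (approximate)
bytes_per_param = 2     # bfloat16 = 2 bytes per parameter
tp_degree = 2

total_gib = n_params * bytes_per_param / 2**30
per_core_gib = total_gib / tp_degree

print(f"total weights: ~{total_gib:.1f} GiB, per core: ~{per_core_gib:.1f} GiB")
```

About 5.6 GiB of weights total, or roughly 2.8 GiB per core, which fits comfortably on the two cores of an inf2.xlarge.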
## Requirements
- AWS Inferentia2 or Trainium instance (e.g., inf2.xlarge, inf2.8xlarge, trn1.2xlarge)
- Python 3.10+
- optimum-neuron
- neuronx-distributed
- transformers
## Installation
```bash
pip install optimum-neuron transformers huggingface_hub
```
## Usage
### Method 1: Using the helper function
```python
from huggingface_hub import hf_hub_download
# Download and execute the custom module to register the model classes.
# Note: this executes downloaded code; review ministral3_neuron.py first.
module_path = hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")
with open(module_path) as f:
    exec(f.read())
# Load model and tokenizer
model, tokenizer = load_ministral3("YOUR_USERNAME/ministral3-neuron")
# Generate text
inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Method 2: Manual loading
```python
from huggingface_hub import hf_hub_download
# First, register the custom model classes (this executes downloaded code; review it first)
module_path = hf_hub_download("YOUR_USERNAME/ministral3-neuron", "ministral3_neuron.py")
with open(module_path) as f:
    exec(f.read())
# Then load using optimum-neuron
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer
model = NeuronModelForCausalLM.from_pretrained("YOUR_USERNAME/ministral3-neuron")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/ministral3-neuron")
# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Important Notes
1. **Custom Code Required**: This model requires executing the `ministral3_neuron.py` file before loading; the file registers the Ministral3 architecture in optimum-neuron's model registry. Since this executes downloaded code, review the file before running it.
2. **Hardware Requirements**: This model is compiled for tensor parallelism of 2, requiring at least 2 Neuron cores. Use inf2.xlarge or larger.
3. **Sequence Length**: The model is compiled for a maximum sequence length of 4096 tokens.
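Because the compiled graph is fixed at 4096 tokens, prompt plus generated tokens must fit within that budget. A minimal pre-flight check before calling `generate` (a sketch; the helper name `fits_in_context` is ours, not part of this repo):

```python
MAX_SEQ_LEN = 4096  # compiled sequence length of this model

def fits_in_context(n_prompt_tokens: int, max_new_tokens: int,
                    limit: int = MAX_SEQ_LEN) -> bool:
    """Return True if prompt + generation fits in the compiled context."""
    return n_prompt_tokens + max_new_tokens <= limit

# Example: a 4000-token prompt leaves room for at most 96 new tokens.
print(fits_in_context(4000, 96))   # True
print(fits_in_context(4000, 100))  # False
```

Count prompt tokens with the tokenizer (`len(tokenizer(text)["input_ids"])`) and cap `max_new_tokens` accordingly.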
## Model Configuration
The model was exported with the following neuron configuration:
```json
{
"batch_size": 1,
"sequence_length": 4096,
"tp_degree": 2,
"torch_dtype": "bfloat16",
"on_device_sampling": true,
"fused_qkv": true
}
```
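If you want to read these settings programmatically (for example, to pick a safe `max_new_tokens` at runtime), `neuron_config.json` can be parsed with the standard library. The snippet below uses an inline copy of the JSON above for illustration:

```python
import json

# In practice, load the file shipped with the model:
#   with open("neuron_config.json") as f: cfg = json.load(f)
cfg = json.loads("""
{
  "batch_size": 1,
  "sequence_length": 4096,
  "tp_degree": 2,
  "torch_dtype": "bfloat16",
  "on_device_sampling": true,
  "fused_qkv": true
}
""")

print(cfg["sequence_length"])  # 4096
print(cfg["tp_degree"])        # 2
```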
## Files
- `model.pt` - Compiled Neuron model with weights
- `config.json` - Model configuration
- `neuron_config.json` - Neuron compilation configuration
- `ministral3_neuron.py` - Custom code for model registration
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - Tokenizer files
- `chat_template.jinja` - Chat template
## License
Please refer to the original model's license at [mistralai/Ministral-3B-3B-Reasoning-2512](https://huggingface.co/mistralai/Ministral-3B-3B-Reasoning-2512).
## Acknowledgments
This model was compiled using [optimum-neuron](https://github.com/huggingface/optimum-neuron) for AWS Neuron devices.