# Phi-3.5 Mini Instruct - Quantized ONNX Model (Consolidated)

## Model Overview

This is Microsoft's Phi-3.5-mini-instruct model, quantized to INT8 and optimized for Qualcomm Snapdragon NPU deployment. This version consolidates all files into a single directory for easier deployment.

## Model Specifications

- **Base Model**: microsoft/Phi-3.5-mini-instruct
- **Size**: 7292.4 MB on disk (INT8 quantized)
- **Compression**: roughly 50% smaller than the unquantized original
- **Format**: ONNX INT8 quantized with external data
- **Files**: 203 files total
- **Target**: Qualcomm Snapdragon NPUs
## Quick Start

### Installation

```bash
pip install onnxruntime transformers numpy
```
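To confirm the packages imported cleanly, a quick version check helps (note that the plain `onnxruntime` package runs on CPU; NPU acceleration requires a build that includes the QNN execution provider):

```bash
python -c "import onnxruntime as ort, transformers; print(ort.__version__, transformers.__version__)"
```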
### Basic Usage

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer (trust_remote_code is needed for Phi-3.5's custom tokenizer code)
tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare input
text = "Hello, what is artificial intelligence?"
inputs = tokenizer(text, return_tensors="np", max_length=64, truncation=True, padding="max_length")

# Run inference (if your export also expects an attention mask, add
# inputs["attention_mask"] to the feed dict)
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
logits = outputs[0]

print(f"Input: {text}")
print(f"Output shape: {logits.shape}")
```
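The raw output is a logits tensor, typically of shape `(batch, sequence, vocab)`. As a minimal follow-on sketch (reusing `logits` and `tokenizer` from the block above), here is how to turn the last position into next-token probabilities and inspect the top candidates:

```python
# Convert the last-position logits into next-token probabilities
last_logits = logits[0, -1, :]                    # shape: (vocab_size,)
probs = np.exp(last_logits - last_logits.max())   # numerically stable softmax
probs /= probs.sum()

# Show the five most likely next tokens
top5 = np.argsort(probs)[::-1][:5]
for token_id in top5:
    print(repr(tokenizer.decode([int(token_id)])), float(probs[token_id]))
```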
### Text Generation Example

```python
def generate_response(prompt, max_new_tokens=50):
    # Tokenize (no padding here, so the model only sees real tokens)
    inputs = tokenizer(prompt, return_tensors="np", max_length=64, truncation=True)
    input_ids = inputs["input_ids"]
    generated_tokens = []
    for _ in range(max_new_tokens):
        # Get model prediction (no KV cache: each step re-runs the full
        # sequence, which is fine for short demo generations)
        outputs = session.run(None, {"input_ids": input_ids})
        logits = outputs[0]
        # Get next token (greedy decoding)
        next_token_id = int(np.argmax(logits[0, -1, :]))
        generated_tokens.append(next_token_id)
        # Stop on EOS
        if next_token_id == tokenizer.eos_token_id:
            break
        # Append the new token for the next iteration
        input_ids = np.concatenate([input_ids, [[next_token_id]]], axis=1)
    # Decode response
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response


# Example
response = generate_response("What is machine learning?")
print(f"Response: {response}")
```
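Since Phi-3.5-mini-instruct is an instruct-tuned model, prompts generally work better when formatted with the chat template shipped in `tokenizer_config.json`. A short sketch using `transformers`' `apply_chat_template` (reusing the `tokenizer` and `generate_response` defined above):

```python
# Format a user message with the model's own chat template before generating
messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate_response(prompt))
```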
## Testing Script

```python
#!/usr/bin/env python3
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np


def test_model():
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
    session = ort.InferenceSession("model.onnx")

    test_cases = [
        "Hello, how are you?",
        "What is the capital of France?",
        "Explain artificial intelligence in simple terms.",
    ]

    for i, text in enumerate(test_cases, 1):
        print(f"\n{i}. Input: {text}")
        inputs = tokenizer(text, return_tensors="np", max_length=64,
                           truncation=True, padding="max_length")
        outputs = session.run(None, {"input_ids": inputs["input_ids"]})
        print(f"   Output shape: {outputs[0].shape}")

    print("\nAll tests passed!")


if __name__ == "__main__":
    test_model()
```
## Performance Expectations

- **Inference Speed**: 2-3x faster than CPU on Snapdragon NPUs
- **Memory Usage**: ~4GB RAM required
- **Tokens/Second**: 8-15 on Snapdragon 8cx Gen 2
- **Latency**: <100ms for short sequences
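These figures depend heavily on hardware, sequence length, and execution provider. A rough way to measure throughput on your own device, reusing the `generate_response` helper defined above (token counting here is approximate):

```python
import time

prompt = "Explain quantization in one paragraph."
start = time.perf_counter()
response = generate_response(prompt, max_new_tokens=32)
elapsed = time.perf_counter() - start

# Re-tokenize the output to estimate how many tokens were produced
n_tokens = len(tokenizer(response, return_tensors="np")["input_ids"][0])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```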
## File Structure

```
model.onnx              # Main ONNX model file
tokenizer.json          # Tokenizer vocabulary
tokenizer_config.json   # Tokenizer configuration
config.json             # Model configuration
onnx__MatMul_*          # External weight data files (129 files)
*.weight                # Additional model weights
```
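A small pre-flight check can catch incomplete copies before you load the model (a sketch based on the listing above; the expected count of 129 comes from this README):

```python
# Verify the external-data files the graph references are present
from pathlib import Path

model_dir = Path(".")
assert (model_dir / "model.onnx").exists(), "model.onnx missing"
matmul_files = list(model_dir.glob("onnx__MatMul_*"))
print(f"Found {len(matmul_files)} external MatMul weight files (expected 129)")
```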
## Important Notes

1. **All Files Required**: Keep all files in the same directory. The model.onnx file references external data files.
2. **Memory Requirements**: Ensure you have at least 4GB of available RAM.
3. **Qualcomm NPU Setup**: For optimal performance on Qualcomm hardware:

   ```python
   # Use QNN execution provider (when available)
   providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
   session = ort.InferenceSession("model.onnx", providers=providers)
   ```
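ONNX Runtime silently falls back to the next provider in the list if QNN is unavailable (the QNN provider ships in dedicated builds, e.g. the `onnxruntime-qnn` package on Windows on ARM). You can verify what was actually selected:

```python
# Providers compiled into this onnxruntime build
print(ort.get_available_providers())

# Providers the session actually ended up using
providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())
```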
## Deployment on Qualcomm Devices

### Windows on ARM

1. Copy all files to your device
2. Install ONNX Runtime: `pip install onnxruntime`
3. Run the test script to verify, as shown below
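Assuming you saved the testing script above as `test_model.py` (a filename chosen here for illustration) in the model directory:

```bash
python test_model.py
```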
### Android (with QNN SDK)

1. Use ONNX Runtime Mobile with QNN support
2. Package all files in your app bundle
3. Initialize with the QNN execution provider
## Troubleshooting

**Model fails to load:**
- Ensure all files are in the same directory
- Check that you have sufficient RAM (4GB+)

**Slow inference:**
- Try enabling graph optimizations:

  ```python
  sess_options = ort.SessionOptions()
  sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
  session = ort.InferenceSession("model.onnx", sess_options)
  ```

**Out of memory:**
- Reduce the sequence length, e.g. `max_length=32` (see the sketch below)
- Process smaller batches
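For example, the shorter-sequence workaround looks like this (reusing the `tokenizer` and `session` from Quick Start):

```python
# Cap prompts at 32 tokens to halve activation memory relative to max_length=64
inputs = tokenizer(text, return_tensors="np", max_length=32,
                   truncation=True, padding="max_length")
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
```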
## License

This model inherits the license from microsoft/Phi-3.5-mini-instruct.

---

*Quantized and optimized for Qualcomm Snapdragon NPU deployment*