# Qwen3 Model with Falcon Tokenizer - Usage Guide

## Model Details

- **Architecture**: Qwen3 (Grouped Query Attention, RMS Norm, Q/K Norm, RoPE)
- **Tokenizer**: Falcon-H1-0.5B-Instruct (32K vocabulary)
- **Special Tokens** (verified in the sketch after this list):
  - BOS: `<|begin_of_text|>` (id: 17)
  - EOS: `<|end_of_text|>` (id: 11)
  - PAD: `<|pad|>` (id: 0)
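
A minimal sketch for confirming the tokenizer setup matches the values above. The `tiiuae/Falcon-H1-0.5B-Instruct` hub ID is an assumption; point `from_pretrained` at wherever this model's tokenizer actually lives:

```python
from transformers import AutoTokenizer

# Assumed hub ID; substitute the local path to this model's tokenizer if needed.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon-H1-0.5B-Instruct")

# These should match the special-token IDs documented above.
print(tokenizer.bos_token, tokenizer.bos_token_id)  # expected: <|begin_of_text|> 17
print(tokenizer.eos_token, tokenizer.eos_token_id)  # expected: <|end_of_text|> 11
print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: <|pad|> 0
print(len(tokenizer))                               # expected: 32768 (32K)
```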

## Important Notes

1. This model combines the Qwen3 architecture with the Falcon tokenizer.
2. The vocabulary size is 32K (the Falcon standard).
3. The model uses Qwen3-specific features such as q_norm/k_norm layers.
4. All token IDs must fall within the 0-32767 range (see the validation sketch below).
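
Because the embedding table only covers IDs 0 through 32767, a CPU-side range check before any forward pass is far easier to debug than the device-side assert an out-of-range ID triggers on the GPU. A minimal sketch (the helper name is mine, not part of any API):

```python
import torch

VOCAB_SIZE = 32768  # Falcon 32K vocabulary

def check_token_ids(input_ids: torch.Tensor) -> None:
    """Raise early if any token ID falls outside the model's embedding table."""
    if input_ids.numel() == 0:
        return
    lo, hi = int(input_ids.min()), int(input_ids.max())
    if lo < 0 or hi >= VOCAB_SIZE:
        raise ValueError(f"token IDs must lie in [0, {VOCAB_SIZE - 1}], got [{lo}, {hi}]")
```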

## Batch Processing Tips

- Use conservative batch sizes (start with 1-4 and scale up as memory allows)
- Ensure all sequences in a batch are padded to the same length
- Monitor CUDA memory usage (e.g., with `torch.cuda.memory_allocated()`)
- Wrap inference in `torch.no_grad()`, as shown in the sketch after this list
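
Putting these tips together, here is a sketch of a batched generation loop under stated assumptions: the checkpoint and tokenizer paths are placeholders, a CUDA device is available, and the generation parameters are illustrative rather than tuned:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths; point these at the actual checkpoint and tokenizer.
model = AutoModelForCausalLM.from_pretrained("path/to/qwen3-falcon-model").cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("path/to/falcon-tokenizer")

prompts = ["Hello, world!", "Explain grouped query attention."]

# Left padding keeps each prompt adjacent to its generated continuation
# when decoding a batch with a causal LM.
tokenizer.padding_side = "left"
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():  # inference only: skip autograd bookkeeping
    output_ids = model.generate(
        **batch,
        max_new_tokens=64,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```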

## Common Issues

- If CUDA indexing errors occur, check that every token ID is < 32768 (the locator sketch after this list can pinpoint offenders)
- Ensure padded positions use the `<|pad|>` token (id 0)
- Tokenize all sequences in a batch with the same tokenizer and settings
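
If the range check above fails on a batch, a locator like the sketch below (the function name is hypothetical) reports exactly which row and position hold an out-of-range ID, which is usually faster than bisecting the batch by hand:

```python
import torch

VOCAB_SIZE = 32768

def find_bad_tokens(input_ids: torch.Tensor) -> list[tuple[int, int, int]]:
    """Return (row, position, token_id) for every out-of-range ID in a 2-D batch."""
    bad = ((input_ids < 0) | (input_ids >= VOCAB_SIZE)).nonzero(as_tuple=False)
    return [(int(r), int(c), int(input_ids[r, c])) for r, c in bad]

# Example: row 1, column 1 holds an ID beyond the 32K vocabulary.
ids = torch.tensor([[17, 5, 9, 11], [17, 40000, 3, 11]])
print(find_bad_tokens(ids))  # [(1, 1, 40000)]
```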