# Qwen3 8M Model with Falcon-H1-0.5B-Instruct Tokenizer

## Model Description
This is a small Qwen3-architecture model (2,183,552 parameters) paired with the Falcon-H1-0.5B-Instruct tokenizer (32K vocabulary).

- **Architecture**: Qwen3 (Grouped Query Attention, RMS Normalization, Q/K Normalization, RoPE)
- **Tokenizer**: Falcon-H1-0.5B-Instruct (32K vocab)
- **Parameters**: 2,183,552
- **Precision**: BF16
- **Format**: SafeTensors
- **Vocabulary Size**: 32768

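As a quick sanity check, the parameter count can be recomputed after loading the checkpoint (a minimal sketch; the local path follows the Usage section below):

```python
from transformers import Qwen3ForCausalLM

model = Qwen3ForCausalLM.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")

# Sum over all weight tensors; expected: 2,183,552
print(f"{sum(p.numel() for p in model.parameters()):,}")
```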
## Configuration
- vocab_size: 32768
- hidden_size: 64
- num_attention_heads: 4
- num_key_value_heads: 2
- num_hidden_layers: 2
- intermediate_size: 160
- head_dim: 16
- max_position_embeddings: 4096



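For reference, the values above map onto a `transformers` `Qwen3Config` roughly as follows. This is a minimal sketch: `tie_word_embeddings=True` is an assumption (it is consistent with the stated 2,183,552 parameter count), and the special-token IDs are taken from the next section.

```python
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=32768,
    hidden_size=64,
    num_attention_heads=4,
    num_key_value_heads=2,      # grouped query attention: 4 query heads share 2 KV heads
    num_hidden_layers=2,
    intermediate_size=160,
    head_dim=16,
    max_position_embeddings=4096,
    bos_token_id=17,
    eos_token_id=11,
    pad_token_id=0,
    tie_word_embeddings=True,   # assumption: tied input/output embeddings
)
model = Qwen3ForCausalLM(config)  # random init, as noted under Important Notes
```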
## Special Tokens

- BOS: `<|begin_of_text|>` (id: 17)
- EOS: `<|end_of_text|>` (id: 11)
- PAD: `<|pad|>` (id: 0)


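To confirm the tokenizer reports the same IDs, a quick check (a minimal sketch; expected values are the ones listed above) looks like:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")
print(tokenizer.bos_token, tokenizer.bos_token_id)  # <|begin_of_text|> 17
print(tokenizer.eos_token, tokenizer.eos_token_id)  # <|end_of_text|> 11
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <|pad|> 0
```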

## Usage

```python
import torch
from transformers import Qwen3ForCausalLM, AutoTokenizer

model = Qwen3ForCausalLM.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Batch processing (start small)
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for batched generation
texts = ["Hello", "How are you", "Good morning"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
```



## Important Notes

- The model uses the Qwen3 architecture with the Falcon tokenizer (32K vocabulary)
- All token IDs must be < 32768, or embedding lookups will fail with CUDA indexing errors (see the check below)
- Start with small batch sizes (1-4) and increase gradually
- Pad batched inputs consistently to prevent dimension mismatches
- The model is initialized with random weights and requires fine-tuning before it produces useful output
- Compatible with Qwen3 APIs but uses the Falcon vocabulary
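
The token-ID bound from the notes above can be enforced before any forward pass; a minimal sketch, with 32768 mirroring the configured `vocab_size`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")
ids = tokenizer("Hello, world!", return_tensors="pt")["input_ids"]

# Every ID must index into the 32K embedding table
assert int(ids.max()) < 32768, "token id exceeds the 32K vocabulary"
```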