---
language: en
license: mit
tags:
- pytorch
- text-generation
- qwen3
- tinystories
---

# Qwen3-0.6B Pre-trained on TinyStories

This is a Qwen3-0.6B model pre-trained on the TinyStories dataset for 200k iterations.

## Model Details

- **Architecture**: Qwen3-0.6B
- **Training Data**: TinyStories dataset from HuggingFace
- **Training Iterations**: 200,000
- **Parameters**: ~596M unique parameters
- **Tokenizer**: GPT-2 tokenizer (tiktoken)
- **Training Loss**: Available in training history
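
Once the model is instantiated (see Quick Start below), you can sanity-check the parameter count:

```python
# After building the model as shown in "Load and Use" below:
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expect roughly 596M
```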

## Quick Start

### Download the Model

```python
from huggingface_hub import hf_hub_download

# Download the pre-trained weights
model_path = hf_hub_download(
    repo_id="vuminhtue/qwen3-200k-tinystories",
    filename="Qwen3_200k_model_params.pt",
)

# Download the config
config_path = hf_hub_download(
    repo_id="vuminhtue/qwen3-200k-tinystories",
    filename="config.json",
)
```
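
If you prefer a single call, `snapshot_download` fetches every file in the repo at once:

```python
from huggingface_hub import snapshot_download

# Downloads the full repository and returns the local directory path.
local_dir = snapshot_download(repo_id="vuminhtue/qwen3-200k-tinystories")
```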

### Load and Use

```python
import torch
import tiktoken
from Qwen3_model import Qwen3Model  # you need this file from the original code

# Set up the model configuration
QWEN3_CONFIG = {
    "vocab_size": 151936,
    "context_length": 40960,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 28,
    "hidden_dim": 3072,
    "head_dim": 128,
    "qk_norm": True,
    "n_kv_groups": 8,
    "rope_base": 1000000.0,
    "dtype": torch.bfloat16,
}

# Load the pre-trained weights
model = Qwen3Model(QWEN3_CONFIG)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.load_state_dict(torch.load(model_path, map_location=device))
model = model.to(device)
model.eval()

# Tokenizer used during training
tokenizer = tiktoken.get_encoding("gpt2")
# A minimal generation loop is sketched below.
```
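
The repository does not ship a generation helper, so here is a minimal greedy-decoding sketch. It assumes `model(input_ids)` takes a `(batch, seq_len)` token tensor and returns `(batch, seq_len, vocab_size)` logits; check `Qwen3_model.py` for the actual forward signature.

```python
@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=100):
    # Encode the prompt and add a batch dimension.
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids)  # assumed shape: (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())

print(generate(model, tokenizer, "Once upon a time"))
```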

## Training Details

- **Optimizer**: AdamW with weight decay (0.1); a sketch of the setup follows this list
- **Learning Rate**: 1e-4 with warmup and cosine decay
- **Batch Size**: 32 with gradient accumulation (32 steps)
- **Context Length**: 128 tokens
- **Mixed Precision**: bfloat16 training
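
The full training script is not included in this card; the following is a minimal sketch of the configuration above. The warmup length and the `loader` yielding `(input, target)` token batches are illustrative assumptions, not values from the actual run.

```python
import math
import torch

accum_steps = 32       # gradient accumulation steps (from the list above)
max_steps = 200_000    # training iterations
warmup_steps = 1_000   # assumption: the actual warmup length is not stated
peak_lr = 1e-4

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step, (x, y) in enumerate(loader):  # hypothetical DataLoader of token batches
    # bfloat16 mixed precision; assumes a CUDA device.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), y.flatten()
        )
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```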

## Model Architecture

- Grouped Query Attention (GQA) with 8 KV groups
- RoPE (Rotary Position Embeddings)
- RMSNorm for normalization (see the sketch below)
- SiLU activation function
- 28 transformer layers
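
For reference, RMSNorm normalizes by the root-mean-square of the activations without subtracting the mean. The actual layer lives in `Qwen3_model.py`; a standard PyTorch implementation looks like this:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by RMS, no mean-centering."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Compute the norm in float32 for numerical stability, then cast back.
        x32 = x.float()
        rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x32 * rms * self.weight).to(x.dtype)
```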

## Performance

The model was trained on TinyStories, a dataset of simple stories for children. It can generate coherent short stories in a similar style.

## Citation

If you use this model, please cite:

```bibtex
@misc{qwen3-tinystories-2025,
  author       = {Tue Vu},
  title        = {Qwen3-0.6B Pre-trained on TinyStories},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/vuminhtue/qwen3-200k-tinystories}},
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue on the HuggingFace model page.