---
license: apache-2.0
language:
- en
tags:
- iko
- gpt2-medium
- conversational
- reddit
- qlora
- ties-merge
pipeline_tag: text-generation
base_model: gpt2-medium
datasets:
- dolma
- fineweb
---

# iko-2 (355M)

**iko-2** is the second model in the iko series: a GPT-2 Medium (355M parameters) language model that combines:

1. **iko-1 knowledge** (GPT-2 124M fine-tuned on 700K FineWeb documents) via distillation
2. **Reddit conversational style** from the Dolma v1.6 Reddit corpus

## Training Details

### Architecture
- **Base model:** GPT-2 Medium (355M parameters)
- **Training method:** 4-bit QLoRA with gradient checkpointing
- **LoRA config:** r=32, alpha=64, target modules: `c_attn`, `c_proj`, `c_fc`
- **Merge strategy:** TIES (TrIm, Elect Sign, and merge) with 80% density

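The TIES merge step above can be sketched as a small NumPy function (a minimal illustration of the trim / elect-sign / disjoint-mean idea, not the actual merge code used for this model):

```python
import numpy as np

def ties_merge(task_vectors, density=0.8):
    """Merge task vectors with TIES: trim, elect sign, disjoint mean."""
    trimmed = []
    for tv in task_vectors:
        # Trim: zero out all but the top `density` fraction by magnitude
        k = max(int(round(density * tv.size)), 1)
        thresh = np.sort(np.abs(tv))[::-1][k - 1]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per-parameter sign of the summed trimmed values
    sign = np.sign(stacked.sum(axis=0))
    # Merge: average only the values that agree with the elected sign
    agree = (np.sign(stacked) == sign) & (stacked != 0.0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts
```

With 80% density, the 20% smallest-magnitude deltas in each task vector are dropped before sign election, which is what keeps conflicting low-magnitude updates from cancelling the dominant ones.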
### Training Data
- **Reddit Dolma v1.6** (~10,000 examples, 85% of the training mix)
- **iko-1 distillation corpus** (~1,800 synthetic examples, 15% replay)
- **SuRe (Synthetic Replay)** to prevent catastrophic forgetting

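An 85/15 replay mix like the one above can be built by oversampling the replay corpus against the main corpus. A minimal sketch (the function name and exact sampling scheme are illustrative assumptions, not the actual SuRe pipeline):

```python
import random

def mix_with_replay(main, replay, replay_frac=0.15, seed=0):
    """Combine a main corpus with replay examples so that
    `replay_frac` of the final mix comes from the replay pool."""
    rng = random.Random(seed)
    # Number of replay draws needed to hit the target fraction
    n_replay = round(len(main) * replay_frac / (1 - replay_frac))
    sampled = [replay[rng.randrange(len(replay))] for _ in range(n_replay)]
    mixed = main + sampled
    rng.shuffle(mixed)  # interleave so replay is spread across training
    return mixed
```

Replaying a small slice of the iko-1 distillation corpus during the Reddit fine-tune is what keeps the merged model from overwriting the knowledge transferred from iko-1.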
### Hyperparameters
- Learning rate: 4e-5 with cosine schedule
- Layer-wise learning rates: embeddings 0.1×, bottom layers 0.3×, middle 1.0×, top 0.8×
- Warmup: 80 steps
- Effective batch size: 16
- Sequence length: 512
- Optimizer: 8-bit AdamW
- Training time: 15 minutes on a T4 GPU

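The layer-wise scaling above amounts to per-parameter-group multipliers on the base 4e-5 rate. A minimal sketch for GPT-2 Medium's 24 transformer blocks (the exact bottom/middle/top boundaries chosen here are assumptions):

```python
BASE_LR = 4e-5
LR_SCALES = {"embeddings": 0.1, "bottom": 0.3, "middle": 1.0, "top": 0.8}

def lr_for(param_name, n_layers=24):
    """Return the learning rate for a GPT-2 parameter by name."""
    if param_name.startswith(("transformer.wte", "transformer.wpe")):
        return BASE_LR * LR_SCALES["embeddings"]
    if param_name.startswith("transformer.h."):
        layer = int(param_name.split(".")[2])
        if layer < n_layers // 3:          # layers 0-7
            return BASE_LR * LR_SCALES["bottom"]
        if layer < 2 * n_layers // 3:      # layers 8-15
            return BASE_LR * LR_SCALES["middle"]
        return BASE_LR * LR_SCALES["top"]  # layers 16-23
    return BASE_LR  # e.g. the final layer norm
```

In practice each multiplier would define one optimizer parameter group, with the cosine schedule applied on top of every group's base rate.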
### Knowledge Transfer Pipeline
```
GPT-2 (124M) → [FineWeb fine-tune] → iko-1
                                       ↓ distillation
GPT-2 Medium (355M) → [QLoRA + Reddit + Replay] → [TIES merge] → iko-2
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("iko-01/iko-002")
tokenizer = AutoTokenizer.from_pretrained("iko-01/iko-002")

input_text = "The best thing about learning is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Model Series

| Model | Parameters | Training Data | Method |
|-------|------------|---------------|--------|
| iko-1 | 124M | FineWeb (700K docs) | QLoRA on GPT-2 |
| **iko-2** | **355M** | **Reddit + iko-1 distillation** | **QLoRA + TIES merge on GPT-2 Medium** |

## Limitations
- This model inherits biases present in Reddit data and GPT-2's pretraining corpus
- Not suitable for production use without additional safety fine-tuning
- Generated text may contain informal language reflecting Reddit's conversational style

## License
Apache 2.0