---
license: apache-2.0
language:
- en
tags:
- iko
- gpt2-medium
- conversational
- reddit
- qlora
- ties-merge
pipeline_tag: text-generation
base_model: gpt2-medium
datasets:
- dolma
- fineweb
---

# iko-2 (355M)

**iko-2** is the second model in the iko series: a GPT-2 Medium (355M parameters) language model that combines:

1. **iko-1 knowledge** (GPT-2 124M fine-tuned on 700K FineWeb documents) via distillation
2. **Reddit conversational style** from the Dolma v1.6 Reddit corpus

## Training Details

### Architecture
- **Base model:** GPT-2 Medium (355M parameters)
- **Training method:** 4-bit QLoRA with gradient checkpointing
- **LoRA config:** r=32, alpha=64, target modules: `c_attn`, `c_proj`, `c_fc`
- **Merge strategy:** TIES (TrIm, Elect Sign, and merge) with 80% density

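The TIES merge step above can be sketched as a small NumPy function (a minimal illustration of the trim / elect-sign / disjoint-mean idea, not the actual merge code used for this model):

```python
import numpy as np

def ties_merge(task_vectors, density=0.8):
    """Merge task vectors with TIES: trim, elect sign, disjoint mean."""
    trimmed = []
    for tv in task_vectors:
        # Trim: zero out all but the top `density` fraction by magnitude
        k = max(int(round(density * tv.size)), 1)
        thresh = np.sort(np.abs(tv))[::-1][k - 1]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per-parameter sign of the summed trimmed values
    sign = np.sign(stacked.sum(axis=0))
    # Merge: average only the values that agree with the elected sign
    agree = (np.sign(stacked) == sign) & (stacked != 0.0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts
```

With 80% density, the 20% smallest-magnitude deltas in each task vector are dropped before sign election, which is what keeps conflicting low-magnitude updates from cancelling the dominant ones.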
### Training Data
- **Reddit Dolma v1.6** (~10,000 examples, 85% of the training mix)
- **iko-1 distillation corpus** (~1,800 synthetic examples, 15% replay)
- **SuRe (Synthetic Replay)** to prevent catastrophic forgetting

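An 85/15 replay mix like the one above can be built by oversampling the replay corpus against the main corpus. A minimal sketch (the function name and exact sampling scheme are illustrative assumptions, not the actual SuRe pipeline):

```python
import random

def mix_with_replay(main, replay, replay_frac=0.15, seed=0):
    """Combine a main corpus with replay examples so that
    `replay_frac` of the final mix comes from the replay pool."""
    rng = random.Random(seed)
    # Number of replay draws needed to hit the target fraction
    n_replay = round(len(main) * replay_frac / (1 - replay_frac))
    sampled = [replay[rng.randrange(len(replay))] for _ in range(n_replay)]
    mixed = main + sampled
    rng.shuffle(mixed)  # interleave so replay is spread across training
    return mixed
```

Replaying a small slice of the iko-1 distillation corpus during the Reddit fine-tune is what keeps the merged model from overwriting the knowledge transferred from iko-1.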
### Hyperparameters
- Learning rate: 4e-5 with cosine schedule
- Layer-wise learning rates: embeddings 0.1×, bottom layers 0.3×, middle 1.0×, top 0.8×
- Warmup: 80 steps
- Effective batch size: 16
- Sequence length: 512
- Optimizer: 8-bit AdamW
- Training time: 15 minutes on a T4 GPU

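The layer-wise scaling above amounts to per-parameter-group multipliers on the base 4e-5 rate. A minimal sketch for GPT-2 Medium's 24 transformer blocks (the exact bottom/middle/top boundaries chosen here are assumptions):

```python
BASE_LR = 4e-5
LR_SCALES = {"embeddings": 0.1, "bottom": 0.3, "middle": 1.0, "top": 0.8}

def lr_for(param_name, n_layers=24):
    """Return the learning rate for a GPT-2 parameter by name."""
    if param_name.startswith(("transformer.wte", "transformer.wpe")):
        return BASE_LR * LR_SCALES["embeddings"]
    if param_name.startswith("transformer.h."):
        layer = int(param_name.split(".")[2])
        if layer < n_layers // 3:          # layers 0-7
            return BASE_LR * LR_SCALES["bottom"]
        if layer < 2 * n_layers // 3:      # layers 8-15
            return BASE_LR * LR_SCALES["middle"]
        return BASE_LR * LR_SCALES["top"]  # layers 16-23
    return BASE_LR  # e.g. the final layer norm
```

In practice each multiplier would define one optimizer parameter group, with the cosine schedule applied on top of every group's base rate.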
### Knowledge Transfer Pipeline
```
GPT-2 (124M) → [FineWeb fine-tune] → iko-1
                                       ↓ distillation
GPT-2 Medium (355M) → [QLoRA + Reddit + Replay] → [TIES merge] → iko-2
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("iko-01/iko-002")
tokenizer = AutoTokenizer.from_pretrained("iko-01/iko-002")

input_text = "The best thing about learning is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Model Series

| Model | Parameters | Training Data | Method |
|-------|------------|---------------|--------|
| iko-1 | 124M | FineWeb (700K docs) | QLoRA on GPT-2 |
| **iko-2** | **355M** | **Reddit + iko-1 distillation** | **QLoRA + TIES merge on GPT-2 Medium** |

## Limitations
- This model inherits biases present in Reddit data and GPT-2's pretraining corpus
- Not suitable for production use without additional safety fine-tuning
- Generated text may contain informal language reflecting Reddit's conversational style

## License
Apache 2.0