🌟 Twinkel LLM - 72M Parameters (v0.1-alpha)

Twinkel LLM is an experimental 72M parameter language model created by Kunal Pandey as a learning project.

⚠️ Status: Early experimental release (v0.1-alpha)

🚀 Quick Start (CPU Inference)

⚠️ Important: This model currently works best on CPU. GPU inference has known issues; a fix is planned for v0.2.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained(
    "Kunal7370944861/Twinkel-LLM-72M",
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Kunal7370944861/Twinkel-LLM-72M",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device_map="cpu"  # Force CPU for stability
)

# Generate response
def chat(message):
    messages = [{"role": "user", "content": message}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False  # avoids a token_type_ids error in generate()
    )
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Test
response = chat("What is Python?")
print(response)

📋 Model Details

  • Parameters: ~72 million (72M)
  • Architecture: Custom decoder-only transformer
    • Hidden size: 448
    • Layers: 6
    • Attention: Grouped Query Attention (GQA)
    • FFN: SwiGLU activation
    • Position encoding: RoPE
  • Context length: 512 tokens
  • Tokenizer: SmolLM3 tokenizer (128K vocab)
  • Training: Pre-trained on C4 + instruction fine-tuning
  • Creator: Kunal Pandey
  • License: Apache 2.0

⚠️ Known Limitations

  1. GPU Inference Issues

    • Model currently has compatibility issues with GPU inference
    • CUDA assert errors occur during GPU loading
    • Workaround: Use CPU inference (as shown above)
    • Fix is planned for v0.2
  2. Model Size

    • Only 72M parameters (much smaller than production models)
    • Limited knowledge and reasoning capabilities
    • May produce inconsistent or incorrect responses
  3. Context Window

    • Limited to 512 tokens
    • Cannot handle long conversations or documents
  4. Response Quality

    • As an experimental model, its responses may be:
      • Off-topic or irrelevant
      • Repetitive
      • Factually incorrect
    • Not suitable for production use
  5. Language

    • Primarily English
    • Limited multilingual support
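Because the 512-token window must hold both the prompt and the generated tokens, long inputs need to be trimmed before calling generate(). A minimal sketch of tail-keeping truncation on a token-id list (the function name and budget split are illustrative, not part of the model's API):

```python
MAX_CONTEXT = 512  # model's context window, per the card above

def fit_prompt(token_ids, max_new_tokens=100, max_context=MAX_CONTEXT):
    """Trim a token-id list so prompt + generation fits in the window.

    Keeps the most recent tokens (the tail) and drops the oldest ones,
    which usually preserves the latest conversation turns.
    """
    budget = max_context - max_new_tokens
    return token_ids[-budget:] if len(token_ids) > budget else token_ids

ids = list(range(600))   # stand-in for a 600-token prompt
trimmed = fit_prompt(ids)
print(len(trimmed))      # 412  (512 - 100 reserved for generation)
```

In practice you would apply this to `inputs["input_ids"]` before generation, or pass `truncation=True, max_length=412` to the tokenizer for the same effect.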

🎯 Intended Use

This is an experimental educational project suitable for:

✅ Learning about LLM architecture
✅ Understanding model training and fine-tuning
✅ Experimenting with small language models
✅ CPU-based inference testing

NOT suitable for:

  • Production applications
  • Critical or safety-sensitive tasks
  • High-quality text generation
  • GPU-accelerated inference (until v0.2)

🛠️ Training Details

Pre-training

  • Dataset: C4 (English)
  • Steps: 20,000
  • Batch size: 32 (effective)
  • Hardware: Kaggle P100 GPU
  • Optimization: AdamW with mixed precision

Fine-tuning

  • Dataset: Custom instruction dataset (~70K samples)
  • Epochs: 2-3
  • Learning rate: 1e-4
  • Hardware: Kaggle P100 GPU
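An "effective" batch size of 32 typically means per-device batch size multiplied by gradient-accumulation steps. The exact split is not documented here, so the numbers below show one plausible configuration, not the actual one:

```python
# One plausible way to reach an effective batch of 32 on a single P100:
per_device_batch = 8   # assumption: what fits in GPU memory
grad_accum_steps = 4   # assumption: gradients accumulated before each step
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)  # 32
```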

🐛 Troubleshooting

GPU CUDA Error

AcceleratorError: CUDA error: device-side assert triggered

Solution: Force CPU inference:

model = AutoModelForCausalLM.from_pretrained(
    "Kunal7370944861/Twinkel-LLM-72M",
    trust_remote_code=True,
    device_map="cpu"  # Add this
)

token_type_ids Error

ValueError: The following `model_kwargs` are not used: ['token_type_ids']

Solution: Disable token_type_ids:

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    return_token_type_ids=False  # Add this
)
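If your tokenizer version ignores `return_token_type_ids`, you can also drop the key defensively after encoding. A small sketch (the helper name is illustrative; it works on any dict-like encoding, including the BatchEncoding the tokenizer returns):

```python
def strip_token_type_ids(encoding):
    """Remove token_type_ids so model.generate() doesn't reject it."""
    encoding.pop("token_type_ids", None)  # no-op if the key is absent
    return encoding

# Plain-dict stand-in for a tokenizer's BatchEncoding:
inputs = {"input_ids": [[1, 2, 3]], "token_type_ids": [[0, 0, 0]]}
print(sorted(strip_token_type_ids(inputs)))  # ['input_ids']
```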

📊 Performance

This is an experimental model with limited capabilities:

  • Size: 72M parameters (vs billions in production models)
  • Quality: Basic responses, may be off-topic
  • Speed (CPU): ~5-10 tokens/second on a standard CPU
  • Reliability: Experimental, expect issues
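To check the tokens/second figure on your own hardware, time a generate() call and divide the number of new tokens by the elapsed time. The timing helper below is model-independent (a `time.sleep` stands in for generation so the sketch runs anywhere):

```python
import time

def tokens_per_second(n_new_tokens, generate_fn):
    """Time generate_fn() and return throughput in tokens/second."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# With the model, something like:
#   tokens_per_second(100, lambda: model.generate(**inputs, max_new_tokens=100))
rate = tokens_per_second(100, lambda: time.sleep(0.05))  # stand-in workload
print(f"{rate:.0f} tokens/s")
```

Note this measures end-to-end throughput including the forward passes for the prompt; for a stricter per-token figure, subtract the time of a 1-token generation first.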

🔮 Future Plans

Version 0.2 (Planned):

  • Fix GPU compatibility issues
  • Improve response quality
  • Add proper identity training
  • Increase context length
  • Better instruction following

🙏 Acknowledgments

  • Creator: Kunal Pandey
  • Tokenizer: Based on SmolLM3 (Hugging Face)
  • Training data: C4 dataset (AllenAI)
  • Inspiration: SmolLM project

📜 License

Apache 2.0 - Free for commercial and research use.

⚠️ Disclaimer

This is an experimental educational project. The model:

  • May produce incorrect, biased, or inappropriate content
  • Has not been safety-tested or aligned
  • Should not be used in production environments
  • Is provided "as-is" without warranties

Use at your own risk for experimental and educational purposes only.

📧 Contact

For questions, issues, or feedback, please open an issue on the model repository.


Model Status: 🚧 Experimental Alpha
Created by: Kunal Pandey
Version: 0.1-alpha
Last updated: January 2026
