---
language: en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- deepseek
- cpu-optimized
- transformer
- language-model
- tinystories
- grouped-query-attention
- rotary-position-embeddings
- rmsnorm
- swiglu
datasets:
- roneneldan/TinyStories
---

# Shoonya Model v0.2 - DeepSeek CPU-Optimized

This model is a CPU-optimized version of the Shoonya language model, incorporating techniques from the DeepSeek team for efficient inference on CPU hardware.

## Model Description

**Shoonya Model v0.2** is a lightweight transformer-based language model designed for efficient CPU inference. It incorporates architectural optimizations inspired by DeepSeek's research to achieve better performance on CPU hardware while maintaining good generation quality.

### Model Details

- **Developed by:** VaidhyaMegha
- **Model type:** Transformer-based language model
- **Language(s):** English
- **Training Data:** TinyStories dataset
- **Parameters:** 16.41M
- **Context Length:** 512 tokens
- **Hidden Size:** 256
- **Attention Heads:** 8
- **Key-Value Heads:** 4
- **Hidden Layers:** 6
- **License:** MIT
- **Repository:** [GitHub - VaidhyaMegha/Shoonya](https://github.com/VaidhyaMegha/Shoonya)
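The architectural hyperparameters above can be captured in a small configuration object. This is an illustrative sketch, not the repository's actual config class; the field names are assumptions, but the values come from the Model Details list.

```python
from dataclasses import dataclass

@dataclass
class ShoonyaConfig:
    """Illustrative config mirroring the Model Details above (names are hypothetical)."""
    hidden_size: int = 256
    num_attention_heads: int = 8
    num_key_value_heads: int = 4       # GQA: every 2 query heads share one KV head
    num_hidden_layers: int = 6
    max_position_embeddings: int = 512
    sliding_window: int = 256

    @property
    def head_dim(self) -> int:
        # Per-head dimension: 256 / 8 = 32
        return self.hidden_size // self.num_attention_heads

cfg = ShoonyaConfig()
print(cfg.head_dim)  # 32
```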

## DeepSeek CPU Optimizations

This model incorporates the following optimizations from the DeepSeek team:

1. **Grouped-Query Attention (GQA)** with a 2:1 query-to-key/value head ratio (8 query heads, 4 KV heads) - Reduces memory usage and computational cost by sharing each key/value projection across two query heads
2. **Rotary Position Embeddings (RoPE)** - Provides better positional encoding with improved extrapolation to longer sequences
3. **RMSNorm** - Offers improved training stability compared to LayerNorm
4. **SwiGLU activation** - Provides better performance in feed-forward networks compared to standard GELU
5. **Sliding Window Attention** with window size 256 - Reduces memory usage for longer sequences by limiting attention to a local window
6. **ONNX export** - Enables optimized runtime on various hardware platforms
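To make the GQA idea above concrete, here is a minimal PyTorch sketch of grouped-query attention with the card's shapes (hidden size 256, 8 query heads, 4 KV heads). It is not the repository's implementation, and it omits RoPE and the sliding-window mask for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: 8 query heads share 4 key/value heads (2:1 ratio)."""
    def __init__(self, hidden_size=256, num_heads=8, num_kv_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        # K/V projections are half the size of Q: only num_kv_heads worth of output
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so it serves num_heads // num_kv_heads query heads
        rep = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 16, 256)
attn = GroupedQueryAttention()
print(attn(x).shape)  # torch.Size([1, 16, 256])
```

The savings come from the K/V projections and cache being half the size of a standard multi-head layout, which matters most for CPU inference where memory bandwidth dominates.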

## Intended Uses & Limitations

**Intended Uses:**
- Educational purposes to understand transformer architecture and optimizations
- Research on efficient language model deployment
- Text generation for simple creative writing tasks
- Baseline for further fine-tuning on specific tasks

**Limitations:**
- The model is trained on a limited dataset (TinyStories) and has a relatively small parameter count
- It may not perform well on complex reasoning tasks or specialized domains
- The model has not been extensively evaluated for biases or harmful outputs

## Training Procedure

### Training Data

The model was trained on the [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories), which contains simple stories suitable for young children, generated by GPT-3.5/4.

### Training Hyperparameters

- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Batch Size:** 4
- **Weight Decay:** 0.01
- **Warmup Steps:** 100
- **Gradient Accumulation Steps:** 4
- **Training Device:** CPU (Mac Mini M4)
- **Training Epochs:** 5
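The hyperparameters above imply a loop roughly like the following. This is a hedged sketch with a stand-in model and random data, not the actual training script in the Shoonya repository; the warmup schedule shown (linear ramp over 100 steps) is one common interpretation of "warmup steps".

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model and data; the real script trains the Shoonya transformer on TinyStories.
model = torch.nn.Linear(256, 256)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

warmup_steps = 100
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

accum_steps = 4  # gradient accumulation: effective batch = 4 (batch) x 4 = 16
for step in range(8):  # stands in for iterating the TinyStories dataloader
    x = torch.randn(4, 256)          # batch size 4
    loss = ((model(x) - x) ** 2).mean()
    (loss / accum_steps).backward()  # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```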

## Note on Quantization

The quantized version of this model is not included due to PyTorch quantization limitations on Mac M-series chips. See `quantization_note.md` for instructions on quantizing the model on a compatible system.
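On a compatible system, dynamic int8 quantization of the model's linear layers might look like the sketch below. The model here is a stand-in, and this is not the repository's quantization script; `torch.quantization.quantize_dynamic` requires a supported backend (e.g. fbgemm on x86).

```python
import torch
import torch.nn as nn

# Stand-in for the Shoonya model; dynamic quantization targets nn.Linear layers.
model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
model.eval()

# Weights become int8; activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    y = quantized(x)
```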

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("VaidhyaMegha/Shoonya")
tokenizer = AutoTokenizer.from_pretrained("VaidhyaMegha/Shoonya")

# Generate text (do_sample=True is required for temperature/top_p to take effect)
input_text = "Once upon a time"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

## Evaluation Results

The model reached the following metrics at the end of training:
- **Final Loss:** 7.21
- **Final Perplexity:** 1358.28
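These two numbers are consistent: perplexity is the exponential of the cross-entropy loss, and the small gap from the reported value comes from rounding the logged loss.

```python
import math

final_loss = 7.21                  # reported cross-entropy loss (nats/token)
perplexity = math.exp(final_loss)  # ≈ 1352.9; the reported 1358.28 corresponds
                                   # to the unrounded loss (~7.214)
```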

## Ethical Considerations

This model is trained on the TinyStories dataset, which was designed to be suitable for children and contains simple, non-harmful content. However, as with any language model, it may still produce unexpected or potentially problematic outputs. Users should exercise caution and implement appropriate content filtering if deploying this model in production environments.

## Citations

```bibtex
@article{eldan2023tinystories,
  title={{TinyStories: How Small Can Language Models Be and Still Speak Coherent English?}},
  author={Eldan, Ronen and Li, Yuanzhi},
  journal={arXiv preprint arXiv:2305.07759},
  year={2023}
}
```

## License

This model is released under the MIT License.