|
|
--- |
|
|
language: |
|
|
- sr |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-generation |
|
|
- reasoning |
|
|
- serbian |
|
|
- asterisk |
|
|
- aspp |
|
|
- hybrid-architecture |
|
|
- multilingual |
|
|
datasets: |
|
|
- ODA-Mixture-100k |
|
|
- ultrachat_200k_serbian |
|
|
metrics: |
|
|
- accuracy |
|
|
- perplexity |
|
|
base_model: Geilim-1B-Instruct |
|
|
model-index: |
|
|
- name: Geilim-1B-SR-Instruct |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# Geilim-1B-SR-Instruct |
|
|
|
|
|
<div align="center"> |
|
|
<h3>🇷🇸 Serbian Reasoning Model - AI Democratization Project</h3>
|
|
<p><em>Bringing advanced reasoning capabilities to the Serbian language</em></p>
|
|
</div> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Geilim-1B-SR-Instruct** is a 1.3B parameter Serbian reasoning model that combines: |
|
|
- **Base**: Geilim-1B-Instruct (1B parameters, Llama-3 architecture, 16 layers) |
|
|
- **Architecture**: Asterisk hybrid ASPP + Attention |
|
|
- **Training**: 50% ODA-Mixture-100k (reasoning) + 50% UltraChat Serbian (conversations) |
|
|
- **Goal**: Democratize AI by bringing reasoning to underrepresented languages |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- ✅ **Hybrid Architecture**: All 16 layers use ASPP + standard Attention
- ✅ **Graph-based Reasoning**: Union-Find structure with 6-step iterative propagation
- ✅ **π-flow Refinement**: 4-step continuous flow dynamics for enhanced reasoning
- ✅ **Bilingual**: Serbian language with preserved English reasoning capabilities
- ✅ **Efficient**: ~1.3B total parameters, trainable on 2x consumer GPUs
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
```
Input → Embedding
    ↓
Layers 0-15: Hybrid ASPP + Attention (ALL 16 layers)
    ├─ ASPP Branch (Union-Find graph reasoning)
    │   ├─ 6-step iterative propagation
    │   ├─ Hidden dim: 512 (reduced from 2048)
    │   └─ π-flow: 4-step refinement
    └─ Attention Branch (standard self-attention)
    ↓
Gated Fusion: output = gate * ASPP(x) + (1-gate) * Attention(x)
    ↓
Output → LM Head
```
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Parameters**: ~1.3B (1B base + 300M ASPP/π-flow)
|
|
- **Layers**: 16 (all hybrid) |
|
|
- **Hidden Size**: 2048 |
|
|
- **Attention Heads**: 32 |
|
|
- **KV Heads**: 8 (GQA) |
|
|
- **Vocabulary**: 128,256 tokens |
|
|
- **Context Length**: 131,072 tokens (with RoPE scaling) |
|
|
- **Precision**: bfloat16 |
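
As a rough sanity check on these specifications, the bfloat16 weight footprint can be estimated directly from the parameter count (a back-of-the-envelope calculation; activations and the KV cache are extra):

```python
# Approximate weight memory for ~1.3B parameters stored in bfloat16 (2 bytes each).
params = 1.3e9
bytes_per_param = 2
print(f"~{params * bytes_per_param / 1e9:.1f} GB of weights")  # ~2.6 GB, in line with the ~3 GB inference figure below
```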
|
|
|
|
|
### ASPP Configuration |
|
|
|
|
|
- **Hidden Dim**: 512 (dimensionality reduction) |
|
|
- **Iteration Steps**: 6 |
|
|
- **Dropout**: 0.15 |
|
|
- **Graph Structure**: Union-Find (parent-only connections) |
|
|
|
|
|
### π-flow Configuration
|
|
|
|
|
- **Steps**: 4 |
|
|
- **Scale**: 0.4 |
|
|
- **Gating**: Adaptive per-token |
|
|
- **Purpose**: Multi-step refinement in probability space |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
1. **Serbian Language Tasks**: |
|
|
- Conversational AI in Serbian |
|
|
- Question answering in Serbian |
|
|
- Text generation and completion |
|
|
|
|
|
2. **Reasoning Tasks**: |
|
|
- Mathematical problem solving |
|
|
- Code generation and debugging |
|
|
- Step-by-step logical reasoning |
|
|
|
|
|
3. **Bilingual Applications**: |
|
|
- Serbian-English translation assistance |
|
|
- Cross-lingual reasoning tasks |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- Production-critical applications without further testing |
|
|
- Tasks requiring real-time factual accuracy (model may hallucinate) |
|
|
- Languages other than Serbian and English (limited support) |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers accelerate |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "NoesisLab/Geilim-1B-SR-Instruct" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
trust_remote_code=True, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
) |
|
|
|
|
|
# Serbian conversation |
|
|
messages = [ |
|
|
{"role": "user", "content": "Kakvu ulogu igraju nagrade i pozitivno pojaΔanje u dresuri Bigla i kako se mogu efikasno koristiti bez podsticanja loΕ‘eg ponaΕ‘anja?"} |
|
|
] |
|
|
|
|
|
# Apply chat template |
|
|
input_text = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer(input_text, return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=200, |
|
|
temperature=0.7, |
|
|
top_p=0.9, |
|
|
repetition_penalty=1.1, |
|
|
do_sample=True, |
|
|
) |
|
|
|
|
|
# Decode |
|
|
response = tokenizer.decode( |
|
|
outputs[0][inputs['input_ids'].shape[1]:], |
|
|
skip_special_tokens=True |
|
|
) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Recommended Generation Parameters |
|
|
|
|
|
```python |
|
|
generation_config = { |
|
|
"max_new_tokens": 200, |
|
|
"temperature": 0.7, # Balance creativity and coherence |
|
|
"top_p": 0.9, # Nucleus sampling |
|
|
"repetition_penalty": 1.1, # Reduce repetition |
|
|
"do_sample": True, |
|
|
} |
|
|
``` |
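
These settings can be passed straight to `generate` by unpacking the dictionary (reusing `model`, `tokenizer`, and `inputs` from the usage example above):

```python
outputs = model.generate(**inputs, **generation_config)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```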
|
|
|
|
|
|
|
|
## Training Data |
|
|
|
|
|
### Dataset Composition |
|
|
|
|
|
The model was trained on a balanced mix of two datasets: |
|
|
|
|
|
#### 1. ODA-Mixture-100k (50% - Reasoning Data) |
|
|
|
|
|
**101,306 reasoning samples** across three domains: |
|
|
|
|
|
- **Math** (50,244 samples): AM-Thinking-v1-Distilled-math |
|
|
- Mathematical problem solving with step-by-step reasoning |
|
|
- Format: instruction → response (reasoning trace) → final answer
|
|
|
|
|
- **Code** (50,245 samples): AM-Thinking-v1-Distilled-code |
|
|
- Programming problems with detailed solutions |
|
|
- Code generation, debugging, and explanation tasks |
|
|
|
|
|
- **General** (817 samples): LIMO |
|
|
- General reasoning tasks |
|
|
- Logic puzzles, common sense reasoning |
|
|
|
|
|
#### 2. UltraChat Serbian (50% - Language Data) |
|
|
|
|
|
**207,588 high-quality Serbian conversations**: |
|
|
|
|
|
- Translated from UltraChat 200k |
|
|
- Multi-turn dialogues covering diverse topics |
|
|
- Topics: science, culture, daily life, reasoning, education |
|
|
- Format: `messages_srb` (Serbian), `messages_eng` (English reference) |
|
|
|
|
|
### Data Mixing Strategy |
|
|
|
|
|
- **Balanced 50/50 split**: Preserve reasoning while learning Serbian |
|
|
- **Automatic sampling**: The larger dataset is downsampled to match the smaller one
- **Total samples**: ~100k used for training (drawn from the ~202k balanced pool)
|
|
- **Train/Test split**: 95% / 5% |
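
A minimal sketch of this mixing strategy with the 🤗 `datasets` library is shown below. The dataset paths are placeholders for the two sources named above, and the per-source budget is an assumption chosen to reproduce the ~100k total:

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder repo IDs -- substitute the actual locations of the two datasets.
reasoning = load_dataset("ODA-Mixture-100k", split="train")        # ~101k reasoning samples
serbian = load_dataset("ultrachat_200k_serbian", split="train")    # ~207k Serbian conversations

# Balanced 50/50 mix: draw the same number of samples from each source.
budget = 50_000
reasoning = reasoning.shuffle(seed=42).select(range(budget))
serbian = serbian.shuffle(seed=42).select(range(budget))

mixed = concatenate_datasets([reasoning, serbian]).shuffle(seed=42)

# 95% / 5% train/test split.
splits = mixed.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```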
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
- **Epochs**: 2 |
|
|
- **Batch Size**: 2 per device |
|
|
- **Gradient Accumulation**: 8 steps (effective batch size = 16) |
|
|
- **Learning Rate**: 5e-5 |
|
|
- **Warmup Ratio**: 0.1 (10% of training) |
|
|
- **Weight Decay**: 0.05 |
|
|
- **Max Gradient Norm**: 1.0 |
|
|
- **Optimizer**: AdamW |
|
|
- **Precision**: bfloat16 mixed precision |
|
|
- **Gradient Checkpointing**: Enabled |
|
|
- **Max Sequence Length**: 2048 tokens |
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
- **Framework**: HuggingFace Transformers + TRL SFTTrainer |
|
|
- **Distributed Training**: Accelerate (multi-GPU) |
|
|
- **GPUs**: 1x RTX PRO 6000 |
|
|
- **Training Time**: ~6-8 hours |
|
|
- **Memory per GPU**: ~15GB |
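
A minimal sketch of how these hyperparameters map onto TRL's `SFTConfig`/`SFTTrainer` (argument names can vary slightly across TRL versions; `model`, `train_ds`, and `eval_ds` are assumed from the earlier sketches):

```python
from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    output_dir="geilim-1b-sr-instruct",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size = 16
    learning_rate=5e-5,
    warmup_ratio=0.1,
    weight_decay=0.05,
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,                        # the hybrid ASPP + Attention model
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```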
|
|
|
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Qualitative Evaluation |
|
|
|
|
|
The model demonstrates: |
|
|
- ✅ Fluent Serbian language generation
- ✅ Step-by-step reasoning in Serbian
- ✅ Mathematical problem solving
- ✅ Code understanding and generation
- ✅ Multi-turn conversation capabilities
|
|
|
|
|
|
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
|
|
|
1. **Language Coverage**: Primarily trained on Serbian and English; limited support for other languages |
|
|
2. **Factual Accuracy**: May generate plausible but incorrect information (hallucination) |
|
|
3. **Context Length**: The model supports 131k-token contexts, but performance may degrade on very long inputs
|
|
4. **Domain Specificity**: Best performance on conversational and reasoning tasks; may struggle with highly specialized domains |
|
|
5. **Training Data**: Limited to ~100k samples; may not cover all Serbian language variations |
|
|
|
|
|
### Potential Biases |
|
|
|
|
|
- **Translation Bias**: The Serbian data is translated from English and may not reflect natural Serbian expressions
|
|
- **Domain Bias**: Reasoning data focuses on math and code; may be less effective on other domains |
|
|
- **Cultural Bias**: Training data may reflect Western cultural perspectives |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Verify factual claims with authoritative sources |
|
|
- Test thoroughly before deployment in production |
|
|
- Monitor for biased or inappropriate outputs |
|
|
- Consider fine-tuning on domain-specific data for specialized applications |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
### AI Democratization |
|
|
|
|
|
This model is part of an effort to democratize AI by bringing advanced capabilities to underrepresented languages. Serbian, despite having ~12 million speakers, has limited AI resources compared to high-resource languages. |
|
|
|
|
|
### Responsible Use |
|
|
|
|
|
Users should: |
|
|
- Be aware of potential biases and limitations |
|
|
- Not use for malicious purposes (misinformation, harassment, etc.) |
|
|
- Respect privacy and data protection regulations |
|
|
- Consider societal impact of deployments |
|
|
|
|
|
### Environmental Impact |
|
|
|
|
|
- **Training**: ~6-8 hours on 1x RTX PRO 6000 GPU
|
|
- **Carbon Footprint**: Estimated ~5-10 kg CO2eq (depends on energy source) |
|
|
- **Inference**: Efficient at 1.3B parameters, suitable for edge deployment |
|
|
|
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Asterisk Architecture |
|
|
|
|
|
The model uses the **Asterisk** architecture, which combines: |
|
|
|
|
|
1. **ASPP (Adjacency-Structured Parallel Propagation)**: |
|
|
- Graph-based reasoning with Union-Find structure |
|
|
- Each token maintains parent pointer: `parent[i] = i-1` |
|
|
- Iterative message passing: `h_i^(t+1) = φ(h_i^(t), h_parent[i])`
|
|
- 6 propagation steps per layer |
|
|
|
|
|
2. **π-flow Refinement**:
|
|
- Continuous flow dynamics: `h' = h + α * v(h)`
|
|
- Learnable velocity field for multi-step refinement |
|
|
- Adaptive per-token gating |
|
|
- 4 refinement steps per layer |
|
|
|
|
|
3. **Hybrid Fusion**: |
|
|
- Parallel execution of ASPP and standard Attention |
|
|
- Gated combination: `output = gate * ASPP(x) + (1-gate) * Attention(x)` |
|
|
- Applied to all 16 layers (see the sketch below)
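
The following PyTorch sketch illustrates how one hybrid layer could put these pieces together. It is an illustration only: the class, parameter names, and the causal parent-chain realization of the Union-Find structure are assumptions, not the released modeling code.

```python
import torch
import torch.nn as nn

class HybridASPPAttentionLayer(nn.Module):
    """Illustrative sketch of one Asterisk hybrid layer (names and details are assumptions)."""

    def __init__(self, hidden_size=2048, aspp_dim=512, aspp_steps=6,
                 pi_flow_steps=4, pi_flow_scale=0.4, dropout=0.15):
        super().__init__()
        # Attention branch (stand-in for the model's grouped-query self-attention)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=32, batch_first=True)
        # ASPP branch: project down to 512, propagate along the parent chain, project back up
        self.down = nn.Linear(hidden_size, aspp_dim)
        self.up = nn.Linear(aspp_dim, hidden_size)
        self.msg = nn.Linear(2 * aspp_dim, aspp_dim)  # φ: combine a node with its parent
        self.drop = nn.Dropout(dropout)
        self.aspp_steps = aspp_steps
        # π-flow: learnable velocity field with an adaptive per-token gate
        self.vel = nn.Linear(aspp_dim, aspp_dim)
        self.flow_gate = nn.Linear(aspp_dim, 1)
        self.pi_flow_steps, self.pi_flow_scale = pi_flow_steps, pi_flow_scale
        # Gate that fuses the two branches
        self.fuse_gate = nn.Linear(hidden_size, 1)

    def aspp_branch(self, x):
        h = self.down(x)                                  # [batch, seq, 512]
        for _ in range(self.aspp_steps):                  # 6-step iterative propagation
            parent = torch.roll(h, shifts=1, dims=1)      # parent[i] = i - 1
            parent[:, 0] = h[:, 0]                        # token 0 is its own parent (root)
            h = self.drop(torch.tanh(self.msg(torch.cat([h, parent], dim=-1))))
        for _ in range(self.pi_flow_steps):               # 4-step π-flow refinement
            gate = torch.sigmoid(self.flow_gate(h))       # adaptive per-token gate
            h = h + self.pi_flow_scale * gate * self.vel(h)   # h' = h + α * v(h)
        return self.up(h)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        g = torch.sigmoid(self.fuse_gate(x))              # gated fusion
        return g * self.aspp_branch(x) + (1 - g) * attn_out
```

As a quick shape check, `HybridASPPAttentionLayer()(torch.randn(1, 8, 2048))` returns a `[1, 8, 2048]` tensor, matching the residual-stream width listed above.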
|
|
|
|
|
### Model Configuration |
|
|
|
|
|
```json |
|
|
{ |
|
|
"model_type": "asterisk", |
|
|
"hidden_size": 2048, |
|
|
"num_hidden_layers": 16, |
|
|
"num_attention_heads": 32, |
|
|
"num_key_value_heads": 8, |
|
|
"intermediate_size": 8192, |
|
|
"vocab_size": 128256, |
|
|
"max_position_embeddings": 131072, |
|
|
|
|
|
"aspp_hidden_dim": 512, |
|
|
"aspp_num_steps": 6, |
|
|
"aspp_dropout": 0.15, |
|
|
"aspp_num_neighbors": 1, |
|
|
|
|
|
"pi_flow": true, |
|
|
"pi_flow_steps": 4, |
|
|
"pi_flow_scale": 0.4, |
|
|
"pi_flow_use_gate": true, |
|
|
|
|
|
"hybrid_layer_indices": null |
|
|
} |
|
|
``` |
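
The same values can be inspected programmatically, since `trust_remote_code=True` loads the custom `asterisk` configuration class (the field names below are assumed to match the JSON above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("NoesisLab/Geilim-1B-SR-Instruct", trust_remote_code=True)

print(config.model_type)                               # "asterisk"
print(config.num_hidden_layers, config.hidden_size)    # 16, 2048
print(getattr(config, "aspp_num_steps", None))         # 6
print(getattr(config, "pi_flow_steps", None))          # 4
```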
|
|
|
|
|
## Comparison with Other Models |
|
|
|
|
|
| Model | Base | Params | Layers | Language | Reasoning | Architecture |
|-------|------|--------|--------|----------|-----------|--------------|
| SmolLM2-135M | - | 135M | 30 | English | ❌ | Transformer |
| Asterisk | SmolLM2 | 171M | 30 | English | ✅ ASPP | Hybrid |
| **Geilim-1B-SR** | Geilim-1B | 1.3B | 16 | Serbian | ✅ ASPP | Hybrid |
|
|
|
|
|
### Advantages |
|
|
|
|
|
- ✅ **Efficient Size**: 1.3B parameters, suitable for consumer hardware
- ✅ **Full Hybrid**: All 16 layers use ASPP + Attention
- ✅ **Bilingual**: Serbian + English capabilities
- ✅ **Reasoning**: Math, code, and general reasoning
- ✅ **Fast Training**: ~6-8 hours on 1x RTX PRO 6000
- ✅ **Low Memory**: ~3GB inference, ~20GB training per GPU
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
### Inference |
|
|
|
|
|
- **Minimum**: 1x GPU with 8GB VRAM (e.g., RTX 3060) |
|
|
- **Recommended**: 1x GPU with 16GB+ VRAM (e.g., RTX 4080, A100) |
|
|
- **CPU Only**: Possible but slow (~10-20x slower) |
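
For a quick CPU-only test (no GPU required), the model can be loaded without `device_map`; this is a sketch, and generation will be considerably slower than on a GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NoesisLab/Geilim-1B-SR-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# float32 is the safest CPU default; bfloat16 also works on recent CPUs and halves memory.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float32,
)
```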
|
|
|
|
|
### Training |
|
|
|
|
|
- **Minimum**: 2x GPU with 24GB VRAM (e.g., RTX 3090/4090) |
|
|
- **Recommended**: 2x GPU with 40GB VRAM (e.g., A100) |
|
|
- **Memory**: ~20GB per GPU with gradient checkpointing |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
- **NoesisLab** |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex |
|
|
@software{geilim_1b_sr_2026, |
|
|
title={Geilim-1B-SR-Instruct: Serbian Reasoning Model with Asterisk Architecture}, |
|
|
author={NoesisLab}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/NoesisLab/Geilim-1B-SR-Instruct}, |
|
|
note={AI Democratization - Bringing reasoning to underrepresented languages} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Related Papers |
|
|
|
|
|
```bibtex |
|
|
@article{asterisk_2026, |
|
|
title={Asterisk: Hybrid ASPP-Attention Architecture for Efficient Reasoning}, |
|
|
author={NoesisLab}, |
|
|
year={2026}, |
|
|
note={Graph-based reasoning with Union-Find propagation} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Geilim-1B-Instruct**: Base model (Llama-3 architecture, 1B parameters) |
|
|
- **ODA-Mixture-100k**: Reasoning dataset (Math, Code, General) |
|
|
- **UltraChat**: High-quality conversation dataset |
|
|
- **Serbian NLP Community**: Language support and feedback |
|
|
- **HuggingFace**: Transformers library and model hosting |
|
|
- **Accelerate**: Distributed training framework |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the **Apache 2.0 License**, same as the base model. |
|
|
|
|
|
``` |
|
|
Copyright 2026 Asterisk Project |
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
|
See the License for the specific language governing permissions and |
|
|
limitations under the License. |
|
|
``` |
|
|
|
|
|
|
|
|
## Version History |
|
|
|
|
|
- **v1.0** (2026-02): Initial release |
|
|
- 1.3B parameters (1B base + 300M ASPP/π-flow)
|
|
- Trained on 100k samples (50% ODA-Mixture + 50% UltraChat Serbian) |
|
|
- All 16 layers use hybrid ASPP + Attention |
|
|
- Supports Serbian and English |
|
|
|
|
|
## Contact and Support |
|
|
|
|
|
|
|
|
- **Email**: lizx93@mail2.sysu.edu.cn |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<h3>🇷🇸 Democratizing AI, one language at a time!</h3>
|
|
<p><em>Making advanced AI technology accessible to every language</em></p>
</div>