---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---

# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---

## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.
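Because the vocabulary is byte-level (0–255), tokenization reduces to raw UTF-8 bytes. A minimal sketch of what encoding and decoding could look like (illustrative only, not the actual MiniText code):

```python
def encode(text: str) -> list[int]:
    """Map a string to byte-level token ids (0-255)."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map byte-level token ids back to a string."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("o gato é")
print(ids)          # 9 ids: the accented "é" contributes two UTF-8 bytes
print(decode(ids))  # round-trips back to "o gato é"
```

A byte-level vocabulary keeps the embedding table tiny and guarantees there are no out-of-vocabulary characters, at the cost of longer sequences for accented Portuguese text.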

---

## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:

- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.

---

## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12000
- Loss function: Cross-entropy
- Hardware: AMD Ryzen 5 5600G (CPU), 32 GB RAM (~0.72 TFLOPS)

Training includes checkpointing and continuation support.
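The procedure above can be sketched as a next-character training loop. This is an illustrative stand-in, not the actual MiniText code: it trains a byte-level bigram logit table (the real model presumably conditions on more context) with Adam and cross-entropy, using the listed learning rate and sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
text = "o gato é um animal. o gato come. " * 40   # toy stand-in corpus
data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)

W = np.zeros((256, 256))                   # logits[prev_byte, next_byte]
m, v = np.zeros_like(W), np.zeros_like(W)  # Adam moment estimates
lr, b1, b2, eps, seq_len = 3e-4, 0.9, 0.999, 1e-8, 64

def loss_and_grad(W, xs, ys):
    logits = W[xs]                                # (T, 256), a copy
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(ys)), ys]).mean()
    probs[np.arange(len(ys)), ys] -= 1.0          # dL/dlogits
    grad = np.zeros_like(W)
    np.add.at(grad, xs, probs / len(ys))          # accumulate per row
    return loss, grad

losses = []
for step in range(1, 501):
    i = rng.integers(0, len(data) - seq_len - 1)  # random 64-byte chunk
    xs, ys = data[i:i + seq_len], data[i + 1:i + seq_len + 1]
    loss, g = loss_and_grad(W, xs, ys)
    losses.append(loss)
    m = b1 * m + (1 - b1) * g                     # Adam update
    v = b2 * v + (1 - b2) * g * g
    W -= lr * (m / (1 - b1**step)) / (np.sqrt(v / (1 - b2**step)) + eps)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first batch starts at the uniform baseline of ln(256) ≈ 5.545 and the loss falls from there; the bigram table only keeps the sketch self-contained and small.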

---

## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment

---

## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal`

Note: Output quality varies due to the minimal size of the model.
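Outputs like the sample above come from plain autoregressive sampling over bytes. A hedged sketch of such a loop, where `model_logits` is a hypothetical stand-in for the trained network (here it just favors the space byte), not MiniText's real interface:

```python
import math
import random

def model_logits(context: list[int]) -> list[float]:
    # Hypothetical stand-in for the trained network: one logit per
    # byte value; this dummy simply favors the space character.
    logits = [0.0] * 256
    logits[ord(" ")] = 5.0
    return logits

def sample(logits: list[float], temperature: float, rng: random.Random) -> int:
    # Temperature-scaled softmax, then draw one byte id.
    scaled = [l / temperature for l in logits]
    mx = max(scaled)
    weights = [math.exp(l - mx) for l in scaled]
    return rng.choices(range(256), weights=weights, k=1)[0]

def generate(prompt: str, n_bytes: int = 10, temperature: float = 0.8) -> str:
    rng = random.Random(0)
    ctx = list(prompt.encode("utf-8"))
    for _ in range(n_bytes):
        ctx.append(sample(model_logits(ctx), temperature, rng))
    return bytes(ctx).decode("utf-8", errors="replace")

print(generate("o gato é"))
```

Lower temperatures sharpen the distribution toward the model's preferred bytes, which is one common way to trade diversity for coherence in a model this small.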

---

## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---

## Ethical Considerations

This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.

---

## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---

## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---

## License

Apache License 2.0

Made by: Arthur Samuel (loboGOAT)
|