---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
params: 10000
---

# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---

## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.
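
A byte-level vocabulary means no tokenizer is needed: text maps directly to its UTF-8 byte values, each in 0–255. A minimal sketch of this mapping (illustrative only — the actual MiniText preprocessing code is not reproduced here):

```python
def encode(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (each in 0-255)."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map byte values back to text; invalid sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("o gato é")
# "é" occupies two bytes in UTF-8, so the byte sequence is one longer
# than the character count (9 bytes for 8 characters).
assert decode(ids) == "o gato é"
```

Note that accented Portuguese characters span multiple bytes, so the model must also learn valid UTF-8 continuation patterns.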

---

## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:

- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.
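
As an illustration, a corpus with these properties can be generated from a few subject–verb–object templates. The word lists below are hypothetical — the actual MiniText corpus is not published — but they show the style of data involved:

```python
import random

# Hypothetical word lists; the real MiniText corpus is not published.
SUBJECTS = ["o gato", "o cachorro", "a menina", "o menino"]
VERBS = ["come", "vê", "gosta de", "tem"]
OBJECTS = ["um animal", "a comida", "o livro", "uma casa"]

def make_corpus(n_sentences: int, seed: int = 0) -> str:
    """Generate simple subject-verb-object sentences with heavy repetition."""
    rng = random.Random(seed)
    sentences = [
        f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(OBJECTS)}."
        for _ in range(n_sentences)
    ]
    return " ".join(sentences)

print(make_corpus(3))
```

The small, fixed vocabulary and repeated sentence frames are what make pattern learning feasible at ~10K parameters.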

---

## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12,000
- Loss function: cross-entropy
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)

Training includes checkpointing and continuation support.
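
The next-character objective reduces to cross-entropy over 256 byte classes at each position. A framework-free sketch of the per-position loss (assuming the model emits one logit per byte value):

```python
import math

def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits),
    computed with the numerically stable log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Before training, a randomly initialized model is near-uniform over
# the 256 byte classes, so the loss starts near ln(256) ≈ 5.545 nats
# per character and falls as the model learns byte statistics.
uniform = [0.0] * 256
print(cross_entropy(uniform, ord("a")))  # ≈ 5.545
```

This also gives a useful sanity check during training: a loss stuck near ln(256) means the model is learning nothing.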

---

## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment

---

## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal`

Note: Output quality varies due to the minimal size of the model.

---

## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---

## Ethical Considerations

This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.

---

## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---

## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---

## License

Apache 2.0 (as declared in this card's `license` metadata field)

Made by: Arthur Samuel (loboGOAT)