---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---
# MiniText-v1.0
## Model Summary
MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.
The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.
This model is intended for **research, education, and experimentation**.
---
## Model Details
- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K
This is a **base model**, not a chat model.
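Because the vocabulary is byte-level, "tokenization" is just raw UTF-8 encoding. A minimal sketch of that mapping (function names are illustrative, not the project's actual API):

```python
def encode(text: str) -> list[int]:
    # Byte-level vocabulary: each UTF-8 byte (0-255) is one token id.
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    # Invert the mapping; errors="replace" guards against invalid byte runs
    # that a small model can emit mid-multibyte-character.
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("o gato é")
# "é" occupies two bytes in UTF-8 (0xC3 0xA9), so the id sequence
# is one token longer than the character count of the string.
print(ids)          # [111, 32, 103, 97, 116, 111, 32, 195, 169]
print(decode(ids))  # o gato é
```

Note that accented Portuguese characters span multiple tokens, which is part of what the model has to learn at this scale.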
---
## Training Data
The model was trained on a **synthetic Portuguese dataset** designed to emphasize:
- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement
The dataset intentionally avoids:
- Instruction-following
- Dialog formatting
- Reasoning traces
This design allows clear observation of **language emergence** in small models.
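A generator in the spirit described above might look like the following sketch. The word lists and templates are hypothetical examples, not the project's actual data script:

```python
import random

# Illustrative vocabulary: simple subject-verb-complement templates,
# common nouns and verbs, repetition through random resampling.
NOUNS = ["gato", "cão", "menino", "livro"]
VERBS = ["é", "tem", "vê"]
COMPLEMENTS = ["um animal", "uma casa", "o sol"]

def make_sentence(rng: random.Random) -> str:
    # One flat SVO pattern; no dialog markers, no instructions.
    return f"o {rng.choice(NOUNS)} {rng.choice(VERBS)} {rng.choice(COMPLEMENTS)}"

def make_corpus(n: int, seed: int = 0) -> str:
    # Seeded for reproducibility; one sentence per line.
    rng = random.Random(seed)
    return "\n".join(make_sentence(rng) for _ in range(n))

print(make_corpus(3, seed=42))
```

Keeping the templates this narrow is what makes emergence observable: any structure in the output beyond the templates is attributable to the model, not the data.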
---
## Training Procedure
- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12000
- Loss function: Cross-Entropy Loss
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)
Training includes checkpointing and continuation support.
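"Single-stream autoregressive training" means the corpus is treated as one continuous byte sequence and cut into fixed-length windows, with targets shifted one position relative to inputs. A sketch of that batching step (names are illustrative; the actual training code is not published here):

```python
SEQ_LEN = 64  # sequence length from the table above

def make_windows(data: bytes, seq_len: int = SEQ_LEN):
    """Cut one continuous byte stream into (input, target) windows.

    Targets are the inputs shifted by one byte, which is exactly the
    next-character prediction objective: target[t] = input[t + 1].
    """
    ids = list(data)
    n_windows = (len(ids) - 1) // seq_len  # need one extra byte for the shift
    pairs = []
    for i in range(n_windows):
        start = i * seq_len
        x = ids[start : start + seq_len]
        y = ids[start + 1 : start + seq_len + 1]
        pairs.append((x, y))
    return pairs

corpus = "o gato é um animal. ".encode("utf-8") * 50
pairs = make_windows(corpus)
x, y = pairs[0]
# The shift property: y is x advanced by one position.
assert x[1:] == y[:-1]
```

Cross-entropy is then computed between the model's 256-way logits at each position and the corresponding target byte.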
---
## Intended Use
### Supported Use Cases
- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions
### Out-of-Scope Use Cases
- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment
---
## Example Output
Prompt:

```
o gato é
```

Sample generation:

```
o gato é um animal
```
Note: Output quality varies due to the minimal size of the model.
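Generation is a plain autoregressive sampling loop over bytes. The sketch below assumes a hypothetical `model` callable that maps a list of byte ids to 256 next-byte logits; it is not the project's actual inference API:

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    # Softmax with temperature; lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(model, prompt: str, max_new_bytes: int = 32, temperature: float = 0.8):
    # `model` is a hypothetical callable: byte ids -> 256 logits for the next byte.
    ids = list(prompt.encode("utf-8"))
    for _ in range(max_new_bytes):
        ids.append(sample_next(model(ids), temperature))
    # errors="replace" tolerates incomplete multi-byte characters at the tail.
    return bytes(ids).decode("utf-8", errors="replace")
```

At ~10K parameters, temperature matters a great deal: greedy or low-temperature decoding tends to loop, while high temperatures dissolve into noise.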
---
## Limitations
- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned
These limitations are **expected and intentional**.
---
## Ethical Considerations
This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.
---
## Future Work
Planned extensions of the MiniText family include:
- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)
Each version will remain linked to this base model.
---
## Citation
If you use MiniText-v1.0 in research or educational material, please cite the project repository.
---
## License
Apache License 2.0 (see the `license` field in the metadata above)
Made by: Arthur Samuel Galego Panucci Figueiredo (loboGOAT)