---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---

# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---

## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.
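The byte-level vocabulary means any UTF-8 string maps directly to integer tokens in 0–255, with no learned tokenizer. A minimal sketch of the encode/decode round trip (function names are illustrative, not taken from the released code):

```python
# Byte-level tokenization as described under "Vocabulary: Byte-level (0-255)".
# These helper names are illustrative; the released code may differ.

def encode(text: str) -> list[int]:
    # UTF-8 bytes serve directly as token ids; vocabulary size is fixed at 256
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    # Partial multi-byte sequences can appear mid-generation, so replace
    # invalid bytes instead of raising
    return bytes(tokens).decode("utf-8", errors="replace")

ids = encode("o gato é")  # "é" encodes to two bytes, so len(ids) == 9
assert decode(ids) == "o gato é"
```

One consequence of this scheme is that accented Portuguese characters cost more than one token, which slightly lengthens sequences relative to character counts.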

---

## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:
- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.

---

## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12000
- Loss function: Cross-Entropy Loss
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)

Training includes checkpointing and continuation support.
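The single-stream autoregressive setup above can be sketched as follows. This is a hedged illustration of how (input, target) pairs and the cross-entropy objective are typically formed for next-character prediction, not the project's actual training code:

```python
import math

SEQ_LEN = 64  # matches the sequence length listed above

def make_examples(stream: list[int], seq_len: int = SEQ_LEN):
    """Slice one long token stream into (input, target) pairs where the
    target is the input shifted by one position (next-character prediction)."""
    examples = []
    for i in range(len(stream) - seq_len):
        x = stream[i : i + seq_len]
        y = stream[i + 1 : i + seq_len + 1]
        examples.append((x, y))
    return examples

def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for one next-character prediction,
    computed with the usual max-shift for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]
```

With a 256-way byte vocabulary, an untrained model should start near a loss of `log(256) ≈ 5.545`, which gives a quick sanity check during the first training steps.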

---

## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment

---

## Example Output

Prompt:

```
o gato é
```

Sample generation:

```
o gato é um animal
```

Note: Output quality varies due to the minimal size of the model.
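Generation from a byte-level base model like this is typically a plain autoregressive sampling loop: feed the prompt bytes, sample one byte from the model's 256-way distribution, append it, and repeat. The sketch below assumes a stand-in `model(tokens) -> 256 logits` interface, which is not part of the released code:

```python
import math
import random

def generate(model, prompt: bytes, max_new: int = 32, temperature: float = 1.0) -> bytes:
    """Autoregressive byte-by-byte sampling. `model` is any callable that
    maps the current token list to 256 next-byte logits (stand-in interface)."""
    tokens = list(prompt)
    for _ in range(max_new):
        logits = model(tokens)
        # temperature scaling followed by a stable softmax
        scaled = [v / temperature for v in logits]
        m = max(scaled)
        probs = [math.exp(v - m) for v in scaled]
        total = sum(probs)
        probs = [p / total for p in probs]
        tokens.append(random.choices(range(256), weights=probs)[0])
    return bytes(tokens)
```

Lower temperatures make the tiny model's output more repetitive but more coherent, which is often the better trade-off at this parameter count.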

---

## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---

## Ethical Considerations

This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.

---

## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---

## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---

## License

MIT License


Made by: Arthur Samuel (loboGOAT)