---
license: mit
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---
# MiniText-v1.0
## Model Summary
MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.
The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.
This model is intended for **research, education, and experimentation**.
---
## Model Details
- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K
This is a **base model**, not a chat model.
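The exact layer configuration is not published on this card. As a rough illustration only, a byte-level model in this parameter range could be laid out as in the PyTorch sketch below; the GRU architecture and the dimensions are assumptions, not the actual MiniText design.

```python
import torch
import torch.nn as nn

class TinyCharLM(nn.Module):
    """Illustrative byte-level LM; sizes are assumptions, not MiniText's."""

    def __init__(self, vocab_size=256, embed_dim=12, hidden_dim=24):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # 256 * 12 = 3,072 params
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # ~2,700 params
        self.head = nn.Linear(hidden_dim, vocab_size)                # ~6,400 params

    def forward(self, x, h=None):
        emb = self.embed(x)         # (batch, seq) -> (batch, seq, embed_dim)
        out, h = self.rnn(emb, h)   # autoregressive hidden states
        return self.head(out), h    # logits over the 256 byte vocabulary
```

At these sizes the sketch lands near the stated ~10K parameters, which is the scale the "minimum viable architecture" question targets.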
---
## Training Data
The model was trained on a **synthetic Portuguese dataset** designed to emphasize:
- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement
The dataset intentionally avoids:
- Instruction-following
- Dialog formatting
- Reasoning traces
This design allows clear observation of **language emergence** in small models.
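The dataset itself is not distributed with this card. A minimal sketch of how a template-based synthetic corpus with these properties might be generated (the word lists below are hypothetical placeholders):

```python
import random

# Hypothetical word lists; the actual MiniText corpus is not published.
SUBJECTS = ["o gato", "o cachorro", "a casa", "o menino"]
VERBS = ["é", "tem", "vê"]
COMPLEMENTS = ["um animal", "uma bola", "grande", "pequeno"]

def make_corpus(n_sentences=1_000, seed=0):
    """Sample short subject-verb-complement sentences; repeating a small
    template set provides the repetition and reinforcement noted above."""
    rng = random.Random(seed)
    return "\n".join(
        f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(COMPLEMENTS)}"
        for _ in range(n_sentences)
    )
```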
---
## Training Procedure
- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12,000
- Loss function: Cross-entropy
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)
Training includes checkpointing and continuation support.
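Putting the hyperparameters above together, a single-stream training loop might look like the sketch below; the `TinyCharLM` model, the random-window batching, and the checkpoint format are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train(model, corpus, steps=12_000, seq_len=64, lr=3e-4,
          ckpt_path="minitext.pt"):
    """Next-byte prediction with Adam and cross-entropy, as listed above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    data = torch.tensor(list(corpus.encode("utf-8")), dtype=torch.long)
    for step in range(steps):
        # Pick a random 64-byte window; the next byte at each position is the target.
        i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
        x = data[i : i + seq_len].unsqueeze(0)
        y = data[i + 1 : i + seq_len + 1].unsqueeze(0)
        logits, _ = model(x)
        loss = F.cross_entropy(logits.view(-1, 256), y.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 1_000 == 0:
            # Save a checkpoint for the continuation support mentioned above.
            torch.save({"step": step, "model": model.state_dict()}, ckpt_path)
    return model
```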
---
## Intended Use
### Supported Use Cases
- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions
### Out-of-Scope Use Cases
- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment
---
## Example Output
Prompt:

```
o gato é
```

Sample generation:

```
o gato é um animal
```
Note: Output quality varies due to the minimal size of the model.
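The decoding strategy behind the sample above is not documented; one simple possibility is temperature sampling over byte logits, sketched here against the hypothetical model from the Model Details section.

```python
import torch

@torch.no_grad()
def generate(model, prompt="o gato é", max_new=40, temperature=1.0):
    """Feed the prompt bytes, then sample one byte at a time."""
    x = torch.tensor(list(prompt.encode("utf-8")), dtype=torch.long).unsqueeze(0)
    logits, h = model(x)
    for _ in range(max_new):
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)   # sample the next byte
        x = torch.cat([x, nxt], dim=1)
        logits, h = model(nxt, h)           # reuse the hidden state
    return bytes(x[0].tolist()).decode("utf-8", errors="replace")
```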
---
## Limitations
- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned
These limitations are **expected and intentional**.
---
## Ethical Considerations
This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.
---
## Future Work
Planned extensions of the MiniText family include:
- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)
Each version will remain linked to this base model.
---
## Citation
If you use MiniText-v1.0 in research or educational material, please cite the project repository.
---
## License
MIT License
Made by: Arthur Samuel (loboGOAT)