Arthur Samuel Galego Panucci Figueiredo
committed on
Update README.md
# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture** capable of producing structured natural language, without pretraining, instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---
## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.
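The exact layer composition of the custom MiniText architecture is not spelled out here, so the following is only an illustrative sketch of a byte-level next-character model at roughly this parameter budget; the `TinyCharLM` name and the GRU layer are assumptions, not the actual implementation.

```python
# Illustrative sketch only: a byte-level next-character model of roughly the
# size described above (~10K parameters). The GRU layer and the TinyCharLM
# name are assumptions; the real MiniText architecture is custom.
import torch
import torch.nn as nn

class TinyCharLM(nn.Module):
    def __init__(self, vocab_size=256, emb_dim=16, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # byte IDs 0-255
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)     # next-byte logits

    def forward(self, x, state=None):
        h, state = self.rnn(self.embed(x), state)
        return self.head(h), state

model = TinyCharLM()
print(sum(p.numel() for p in model.parameters()))  # roughly 10K parameters
```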
---
## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:

- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.
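The generation script for the synthetic corpus is not included in this card; as a hypothetical illustration of how a template-based corpus with simple, repetitive structure can be built, the nouns, adjectives, templates, and file name below are made up for the example.

```python
# Hypothetical illustration of a template-based synthetic Portuguese corpus
# with simple structure and heavy repetition; the real dataset script and
# its vocabulary are not published here.
import random

nouns = ["gato", "cachorro", "pássaro", "cavalo"]
adjectives = ["pequeno", "grande", "feliz"]
templates = [
    "o {noun} é um animal",
    "o {noun} é {adj}",
    "o {noun} corre",
    "o {noun} dorme",
]

def make_corpus(n_sentences=1000, seed=0):
    random.seed(seed)
    lines = []
    for _ in range(n_sentences):
        template = random.choice(templates)
        lines.append(template.format(noun=random.choice(nouns),
                                      adj=random.choice(adjectives)))
    return "\n".join(lines)

with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write(make_corpus())
```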
---
## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12000
- Loss function: Cross-entropy loss
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)

Training includes checkpointing and continuation support.
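A minimal sketch of the loop implied by these hyperparameters, reusing the hypothetical `TinyCharLM` stand-in from the Model Details sketch; the file names, checkpoint layout, and the reading of "epochs" as optimization steps over random windows are assumptions rather than the project's actual code.

```python
# Minimal training-loop sketch matching the hyperparameters listed above
# (Adam, lr 3e-4, sequence length 64, cross-entropy, 12000 steps, CPU).
# File names, checkpoint layout, and the TinyCharLM stand-in are assumptions.
import os
import torch
import torch.nn.functional as F

data = torch.tensor(list(open("corpus.txt", "rb").read()), dtype=torch.long)

model = TinyCharLM()                                   # stand-in model from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
seq_len, ckpt_path = 64, "minitext_ckpt.pt"

if os.path.exists(ckpt_path):                          # continuation support
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])

for step in range(12000):
    i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    x = data[i : i + seq_len].unsqueeze(0)             # (1, 64) input bytes
    y = data[i + 1 : i + seq_len + 1].unsqueeze(0)     # next-byte targets
    logits, _ = model(x)
    loss = F.cross_entropy(logits.reshape(-1, 256), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 1000 == 0:                               # periodic checkpointing
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, ckpt_path)
```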
---
## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment
---
## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal`

Note: Output quality varies due to the minimal size of the model.
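As an illustration of how such a completion can be produced with byte-level autoregressive sampling, the sketch below again uses the hypothetical `TinyCharLM` stand-in; the temperature and output length are arbitrary choices, not the project's actual inference settings.

```python
# Illustrative byte-level sampling loop for a prompt such as "o gato é".
# The TinyCharLM stand-in, temperature, and output length are assumptions;
# this is not the project's actual inference code.
import torch

def generate(model, prompt, max_new_bytes=40, temperature=0.8):
    model.eval()
    ids = torch.tensor([list(prompt.encode("utf-8"))], dtype=torch.long)
    out = ids[0].tolist()
    with torch.no_grad():
        logits, state = model(ids)                        # consume the prompt
        for _ in range(max_new_bytes):
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            next_byte = torch.multinomial(probs, 1)
            out.append(next_byte.item())
            logits, state = model(next_byte.view(1, 1), state)
    # invalid UTF-8 from the tiny model is replaced rather than raising
    return bytes(out).decode("utf-8", errors="replace")

print(generate(model, "o gato é"))   # e.g. "o gato é um animal ..."
```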
---
## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---
## Ethical Considerations

This model does not include safety filtering or alignment mechanisms. It should not be used in applications involving sensitive or high-risk domains.

---
## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---
## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---
## License

MIT License

Made by: Arthur Samuel (loboGOAT)