---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---
# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny character-level language model trained from scratch to learn basic Portuguese text patterns. The goal of the project is to explore the minimum viable neural architecture capable of producing structured natural language, without pretraining, instruction tuning, or external corpora.

This model is intended for research, education, and experimentation.
## Model Details
- Architecture: Custom MiniText (character-level)
- Training Objective: Next-character prediction
- Vocabulary: Byte-level (0–255)
- Language: Portuguese (basic)
- Initialization: Random (no pretrained weights)
- Training: Single-stream autoregressive training
- Parameters: ~10K
This is a base model, not a chat model.
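Because the vocabulary is byte-level (0–255), "character-level" in practice means one token per UTF-8 byte, so accented Portuguese characters occupy more than one token. A quick round-trip sketch:

```python
# Byte-level vocabulary: every character maps to its UTF-8 bytes (0-255),
# so accented Portuguese characters occupy more than one token.
text = "o gato é um animal"
ids = list(text.encode("utf-8"))      # "é" becomes two bytes: 0xC3, 0xA9
decoded = bytes(ids).decode("utf-8")  # lossless round trip
print(len(text), len(ids))            # 18 characters, 19 byte tokens
```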
## Training Data
The model was trained on a synthetic Portuguese dataset designed to emphasize:
- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement
The dataset intentionally avoids:
- Instruction-following
- Dialog formatting
- Reasoning traces
This design allows clear observation of language emergence in small models.
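The dataset itself is not published with the card; the sketch below is only a hypothetical recipe for the kind of template-based synthetic corpus described above. The word lists and templates are illustrative assumptions, not the actual data.

```python
import random

# Hypothetical recipe for the kind of synthetic corpus described above.
# The word lists and templates are illustrative assumptions, not the real data.
random.seed(0)
subjects = ["o gato", "a menina", "o menino", "o cachorro"]
verbs = ["é", "tem", "vê"]
complements = ["um animal", "um livro", "uma flor", "uma bola"]

# Repetition over a small template space reinforces the frequent patterns
# the card says the dataset emphasizes.
lines = [
    f"{random.choice(subjects)} {random.choice(verbs)} {random.choice(complements)}"
    for _ in range(1000)
]
corpus = "\n".join(lines)
print(corpus.splitlines()[0])
```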
## Training Procedure
- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12,000
- Loss function: cross-entropy
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)
Training includes checkpointing and continuation support.
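The card does not state a framework, so the sketch below is a hypothetical NumPy reimplementation of the listed recipe (Adam at 3e-4, sequence length 64, next-byte cross-entropy) on a toy 256×256 bigram-logit table standing in for the real MiniText architecture:

```python
import numpy as np

# Hypothetical sketch of the listed recipe: Adam (lr 3e-4), sequence length 64,
# next-byte cross-entropy. The toy bigram logit table below stands in for the
# real (unpublished) MiniText architecture.
rng = np.random.default_rng(0)
V = 256                                    # byte-level vocabulary
W = np.zeros((V, V))                       # W[prev_byte] -> logits over next byte
m, v = np.zeros_like(W), np.zeros_like(W)  # Adam first/second moments
lr, b1, b2, eps = 3e-4, 0.9, 0.999, 1e-8
seq_len = 64

data = list("o gato é um animal. ".encode("utf-8")) * 50  # toy corpus

def loss_and_grad(W, xs, ys):
    logits = W[xs]                                    # (seq_len, V), a copy
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(ys)), ys]).mean()  # cross-entropy
    p[np.arange(len(ys)), ys] -= 1.0                  # dL/dlogits
    p /= len(ys)
    grad = np.zeros_like(W)
    np.add.at(grad, xs, p)                            # scatter into byte rows
    return loss, grad

losses = []
for step in range(1, 301):
    i = int(rng.integers(0, len(data) - seq_len - 1))
    xs = np.array(data[i:i + seq_len])
    ys = np.array(data[i + 1:i + seq_len + 1])        # shifted by one byte
    loss, grad = loss_and_grad(W, xs, ys)
    m = b1 * m + (1 - b1) * grad                      # Adam update
    v = b2 * v + (1 - b2) * grad ** 2
    W -= lr * (m / (1 - b1 ** step)) / (np.sqrt(v / (1 - b2 ** step)) + eps)
    losses.append(loss)
    # checkpointing/continuation could be as simple as np.save("ckpt.npy", W)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first loss is exactly ln(256) ≈ 5.545, the entropy of a uniform next-byte guess, which is a useful sanity check for any byte-level model at initialization.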
## Intended Use

### Supported Use Cases
- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions
### Out-of-Scope Use Cases
- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment
## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal`

Note: output quality varies due to the minimal size of the model.
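Generation from a byte-level model is plain autoregressive sampling over next-byte distributions. In the sketch below, smoothed bigram counts over a toy corpus stand in for the trained weights (an assumption; real MiniText logits would replace `counts`):

```python
import numpy as np

# Hypothetical sampler: smoothed bigram counts over a toy corpus stand in
# for the trained model; real MiniText weights would replace `counts`.
rng = np.random.default_rng(42)

corpus = "o gato é um animal. o gato tem um livro. ".encode("utf-8")
counts = np.ones((256, 256))  # Laplace smoothing over all byte pairs
for a, b in zip(corpus, corpus[1:]):
    counts[a, b] += 10.0      # reinforce observed transitions

def generate(prompt, n_bytes=20, temperature=0.8):
    out = list(prompt.encode("utf-8"))
    for _ in range(n_bytes):
        logits = np.log(counts[out[-1]]) / temperature
        p = np.exp(logits - logits.max())
        p /= p.sum()
        out.append(int(rng.choice(256, p=p)))
    # sampled bytes may form invalid UTF-8 sequences, hence errors="replace"
    return bytes(out).decode("utf-8", errors="replace")

print(generate("o gato é"))
```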
## Limitations
- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned
These limitations are expected and intentional.
## Ethical Considerations
This model does not include safety filtering or alignment mechanisms. It should not be used in applications involving sensitive or high-risk domains.
## Future Work
Planned extensions of the MiniText family include:
- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)
Each version will remain linked to this base model.
## Citation
If you use MiniText-v1.0 in research or educational material, please cite the project repository.
## License

Apache-2.0

Made by: Arthur Samuel (loboGOAT)