---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
parameters: 10k
---

# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture** capable of producing structured natural language, without pretraining, instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---

## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.

---

## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:

- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.

---

## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12,000
- Loss function: cross-entropy
- Hardware: CPU only (AMD Ryzen 5 5600G, 32 GB RAM, ~0.72 TFLOPS)

Training includes checkpointing, so interrupted runs can be resumed.

---

## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment

---

## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal` ("the cat is an animal")

Note: Output quality varies due to the minimal size of the model.
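The byte-level vocabulary and autoregressive generation described above can be sketched in a few dependency-free lines. This is an illustration only: `model` below is a hypothetical stand-in for the trained MiniText network, assumed to map a byte context to 256 scores.

```python
def encode(text: str) -> list[int]:
    """Map text to byte IDs in the 0-255 vocabulary."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map byte IDs back to text (multi-byte UTF-8 characters
    such as 'é' span two IDs)."""
    return bytes(ids).decode("utf-8", errors="replace")

def generate(model, prompt: str, max_new_bytes: int = 32) -> str:
    """Greedy next-byte generation, one byte at a time.
    `model(ids)` is assumed to return 256 scores, one per byte."""
    ids = encode(prompt)
    for _ in range(max_new_bytes):
        logits = model(ids)
        ids.append(max(range(256), key=lambda b: logits[b]))
    return decode(ids)
```

Sampling with a temperature instead of the greedy `max` would trade determinism for variety, which matters for a model this small.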
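The next-character objective reduces to ordinary cross-entropy over the 256-way byte vocabulary. A minimal sketch of the loss for one 64-byte training window, assuming a hypothetical `model(context)` that returns a 256-long probability list (the real training code is not shown here):

```python
import math

def next_byte_loss(model, window: list[int]) -> float:
    """Average cross-entropy of predicting each next byte.

    For every position t in the window, the model sees bytes
    [0..t] and is scored on the probability it assigned to
    byte t+1.
    """
    total = 0.0
    for t in range(len(window) - 1):
        probs = model(window[:t + 1])
        total -= math.log(probs[window[t + 1]])
    return total / (len(window) - 1)
```

A model that knows nothing scores ln(256) ≈ 5.55 nats per byte, which is the natural baseline to compare the training curve against.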
---

## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---

## Ethical Considerations

This model does not include safety filtering or alignment mechanisms. It should not be used in applications involving sensitive or high-risk domains.

---

## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---

## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---

## License

MIT License

Made by: Arthur Samuel (loboGOAT)