Arthur Samuel Galego Panucci FIgueiredo committed on
Commit 5231594 · verified · 1 Parent(s): 7a8e531

Update README.md

Files changed (1)
  1. README.md +110 -35
README.md CHANGED
@@ -7,59 +7,134 @@ pipeline_tag: text-generation
7
 
8
  # MiniText-v1.0
9
 
10
- MiniText-v1.0 is a minimal character-level language model trained from scratch
 
 
11
  to learn basic Portuguese text patterns.
12
 
13
- This project in the future will explore all the language modeling limits:
14
- reasoning
15
- math
16
- code
17
- (ALL WITH 10K PARAMETERS)
 
 
18
 
19
- This project explores the lower limits of language modeling:
20
- how small can a neural network be and still produce coherent text?
21
 
22
- ## Model details
23
 
24
- - Architecture: custom MiniText (character-level)
25
- - Parameters: 10k (educational scale)
26
- - Training data: synthetic Portuguese dataset
27
- - Training objective: next-character prediction
28
- - Language: Portuguese (basic)
29
-
30
 
31
- ## What this model can do
32
 
33
- - Generate simple Portuguese words and sentences
34
- - Learn grammatical structure
35
- - Mix domains (language + math) as a base model
36
 
37
- ## What this model is NOT
38
 
39
- - Not a chatbot
40
- - Not instruction-tuned
41
- - Not reasoning-capable
42
- - Not safe for production use
43
 
44
- This is a **base model** intended for research, experimentation, and education.
45
 
46
- ## Example output
47
 
48
- Input:
49
  o gato é
50
 
51
- Output (example):
52
  o gato é um animal
53
 
54
- ## How to run inference
55
 
56
- python infer.py
57
 
58
- License
59
- MIT
60
 
61
- Training Environment
62
- CPU - AMD Ryzen 5 5600G 32GB
63
- Epochs - 12000
64
 
65
  Made by: Arthur Samuel(loboGOAT)
 
7
 
8
  # MiniText-v1.0
9
 
10
+ ## Model Summary
11
+
12
+ MiniText-v1.0 is a tiny **character-level language model** trained from scratch
13
  to learn basic Portuguese text patterns.
14
 
15
+ The goal of this project is to explore the **minimum viable neural architecture**
16
+ capable of producing structured natural language, without pretraining,
17
+ instruction tuning, or external corpora.
18
+
19
+ This model is intended for **research, education, and experimentation**.
20
+
21
+ ---
22
 
23
+ ## Model Details
 
24
 
25
+ - **Architecture:** Custom MiniText (character-level)
26
+ - **Training Objective:** Next-character prediction
27
+ - **Vocabulary:** Byte-level (0–255)
28
+ - **Language:** Portuguese (basic)
29
+ - **Initialization:** Random (no pretrained weights)
30
+ - **Training:** Single-stream autoregressive training
31
+ - **Parameters:** ~10K
32
 
33
+ This is a **base model**, not a chat model.
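
The byte-level vocabulary and next-character objective listed above can be illustrated with a minimal sketch in plain Python. The helper names `encode`, `decode`, and `next_char_pairs` are hypothetical, and the non-overlapping windowing is an assumption; the README does not show how MiniText actually prepares its batches. The default window of 64 matches the sequence length given under Training Procedure.

```python
def encode(text: str) -> list[int]:
    """Map text to byte IDs in the range 0..255 using UTF-8."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map byte IDs back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def next_char_pairs(ids: list[int], seq_len: int = 64) -> list[tuple[list[int], list[int]]]:
    """Split ids into (inputs, targets) windows, with targets shifted one step ahead.

    Hypothetical helper: windows do not overlap, and the last one may be shorter.
    """
    pairs = []
    for start in range(0, len(ids) - 1, seq_len):
        targets = ids[start + 1 : start + 1 + seq_len]
        inputs = ids[start : start + len(targets)]
        pairs.append((inputs, targets))
    return pairs


# "é" takes two bytes in UTF-8, so these 8 characters become 9 byte IDs.
ids = encode("o gato é")
pairs = next_char_pairs(ids, seq_len=4)
```

One consequence of a byte-level vocabulary is visible here: accented Portuguese characters span two IDs, so a model of this kind has to learn to emit both bytes in order to produce a valid character.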
34
 
35
+ ---
36
 
37
+ ## Training Data
 
 
38
 
39
+ The model was trained on a **synthetic Portuguese dataset** designed to emphasize:
40
 
41
+ - Simple sentence structure
42
+ - Common verbs and nouns
43
+ - Basic grammar patterns
44
+ - Repetition and reinforcement
45
 
46
+ The dataset intentionally avoids:
47
+ - Instruction-following
48
+ - Dialog formatting
49
+ - Reasoning traces
50
 
51
+ This design allows clear observation of **language emergence** in small models.
52
 
53
+ ---
54
+
55
+ ## Training Procedure
56
+
57
+ - Optimizer: Adam
58
+ - Learning rate: 3e-4
59
+ - Sequence length: 64
60
+ - Epochs: 12000
61
+ - Loss function: Cross-Entropy Loss
62
+ - CPU - AMD Ryzen 5 5600G 32GB (0.72 TFLOPS)
63
+
64
+ Training includes checkpointing and continuation support.
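
The loss and optimizer listed above can be sketched in plain Python, without any framework. This is only an illustration under stated assumptions: `cross_entropy` and `adam_step` are hypothetical helper names, the learning rate of 3e-4 comes from the list, and `beta1`, `beta2`, and `eps` are the commonly used Adam defaults, which the README does not state.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def adam_step(params, grads, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; `t` is the 1-based step count.

    beta1, beta2, and eps are assumed defaults, not values from the README.
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first-moment estimate
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m[i] / (1 - beta1 ** t)            # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# With uniform logits over the 256 byte values, the loss equals ln(256).
initial_loss = cross_entropy([0.0] * 256, target=ord("o"))

# One update on a single toy parameter with gradient 1.0.
params, grads = [1.0], [1.0]
m, v = [0.0], [0.0]
adam_step(params, grads, m, v, t=1)
```

Because of the bias correction, the first Adam step moves each parameter by roughly the learning rate in the direction opposite to its gradient, whatever the gradient's magnitude.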
65
+
66
+ ---
67
+
68
+ ## Intended Use
69
+
70
+ ### Supported Use Cases
71
+
72
+ - Educational experiments
73
+ - Language modeling research
74
+ - Studying emergent structure in small neural networks
75
+ - Baseline comparisons for future MiniText versions
76
+
77
+ ### Out-of-Scope Use Cases
78
+
79
+ - Conversational agents
80
+ - Instruction-following systems
81
+ - Reasoning or math tasks
82
+ - Production deployment
83
+
84
+ ---
85
+
86
+ ## Example Output
87
+
88
+ Prompt:
89
  o gato é
90
 
91
+ Sample generation:
92
  o gato é um animal
93
 
94
+ Note: Output quality varies due to the minimal size of the model.
95
+
96
+ ---
97
+
98
+ ## Limitations
99
+
100
+ - Limited vocabulary and coherence
101
+ - No reasoning or factual understanding
102
+ - Susceptible to repetition and noise
103
+ - Not aligned or safety-tuned
104
+
105
+ These limitations are **expected and intentional**.
106
+
107
+ ---
108
+
109
+ ## Ethical Considerations
110
+
111
+ This model does not include safety filtering or alignment mechanisms.
112
+ It should not be used in applications involving sensitive or high-risk domains.
113
+
114
+ ---
115
+
116
+ ## Future Work
117
+
118
+ Planned extensions of the MiniText family include:
119
+
120
+ - MiniText-v1.1-Lang (improved Portuguese fluency)
121
+ - MiniText-Math (symbolic pattern learning)
122
+ - MiniText-Chat (conversation fine-tuning)
123
+ - MiniText-Reasoning (structured token experiments)
124
+
125
+ Each version will remain linked to this base model.
126
+
127
+ ---
128
+
129
+ ## Citation
130
+
131
+ If you use MiniText-v1.0 in research or educational material, please cite the project repository.
132
+
133
+ ---
134
 
135
+ ## License
136
 
137
+ MIT License
 
138
 
139
 
140
  Made by: Arthur Samuel(loboGOAT)