# MultivexAI/Plyx-15M

**MultivexAI/Plyx-15M** is a 15-million-parameter language model trained entirely from scratch. It is designed for efficiency, demonstrating that a sharp focus on data quality can produce a capable foundation model even at this small size.

Plyx-15M is intended for quick testing, research into data efficiency, and specialized fine-tuning tasks where model size must be kept small.
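
The sketch below shows one way to load the model for a quick test. It assumes the checkpoint exposes the standard Hugging Face `transformers` causal-LM interface; the prompt and sampling settings are illustrative only.

```python
# Minimal generation sketch for Plyx-15M. Assumes the standard
# transformers causal-LM interface; settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The most efficient way to learn is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,   # keep generations short for a quick check
    do_sample=True,      # sample rather than greedy-decode
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```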

**Model Series Note:** This is **Version 1** of the Plyx model family. We are continuing this work and plan to release future models in various sizes. We expect to publish initial performance benchmarks for Plyx-15M here soon.

## Pre-training Data

Plyx-15M was trained exclusively on a carefully selected set of high-quality datasets, prioritizing accuracy and structure; a sketch showing how these corpora can be streamed follows the list.

1. **`fineweb-pro` (Ultra-Clean Web Text):** A highly refined subset of general web text, aggressively filtered with automated cleaning tools to remove common errors and noise, giving the model a clean grounding in everyday language.
2. **`fineweb-edu` (Learning Materials):** Content focused on education and instruction, providing the model with a solid base in clear, organized knowledge.
3. **`finepdfs` (Structured Documents):** Specialized knowledge sourced from millions of professional reports and complex documents (PDFs). This ensures the model is exposed to formal, technical writing styles and organized information structures.
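
As referenced above, here is one way to stream and interleave these corpora with the Hugging Face `datasets` library. The dataset IDs match the corpora listed above, but the subset name, the `text` column, and the mixture weights are assumptions for illustration, not the actual Plyx-15M recipe.

```python
# Hedged sketch: stream the three corpora and interleave them into one mix.
# The subset name, text column, and weights are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

fineweb_pro = load_dataset("gair-prox/FineWeb-pro", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
finepdfs = load_dataset(
    "HuggingFaceFW/finepdfs", "eng_Latn",  # assumed English subset name
    split="train", streaming=True,
)

mix = interleave_datasets(
    [fineweb_pro, fineweb_edu, finepdfs],
    probabilities=[0.5, 0.3, 0.2],  # illustrative weights only
    seed=42,
)

for example in mix.take(3):
    print(example["text"][:200])  # peek at the first few mixed documents
```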

### Limitations

**Plyx-15M is a very small model (15 million parameters).** Its overall performance will be limited compared to models with billions of parameters. It should primarily be used for research, highly specific tasks, or as a base for fine-tuning, not as a replacement for a general-purpose language model. A minimal fine-tuning sketch follows.
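
The following sketch uses the standard `transformers` Trainer for the fine-tuning use case. The corpus file, sequence length, and hyperparameters are placeholders, not recommendations.

```python
# Minimal fine-tuning sketch for Plyx-15M with the transformers Trainer.
# "my_corpus.txt" and all hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Some tokenizers ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load a plain-text corpus and tokenize it for causal-LM training.
dataset = load_dataset("text", data_files="my_corpus.txt", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="plyx-15m-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```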

## License
The data used for pre-training (`fineweb-pro`, `fineweb-edu`, and `finepdfs`) is derived from sources made available under the **ODC-By 1.0 license**. Users must also abide by the [CommonCrawl Terms of Use](https://commoncrawl.org/terms-of-use/). We do not alter the license of any of the underlying data.