# MultivexAI/Plyx-15M

**MultivexAI/Plyx-15M** is a 15-million-parameter language model trained entirely from scratch. It is designed for efficiency, demonstrating that a sharp focus on data quality can produce a capable foundation model even at this small size.

Plyx-15M is intended for quick testing, research into data efficiency, and specialized fine-tuning tasks where model size must be kept small.
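
The sketch below shows one way to load the model for a quick test. It assumes the checkpoint exposes the standard Hugging Face `transformers` causal-LM interface; the prompt and sampling settings are illustrative only.

```python
# Minimal generation sketch for Plyx-15M. Assumes the standard
# transformers causal-LM interface; settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The most efficient way to learn is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,   # keep generations short for a quick check
    do_sample=True,      # sample rather than greedy-decode
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```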

**Model Series Note:** This is **Version 1** of the Plyx model family. We are continuing this work and plan to release future models in various sizes. We expect to publish initial performance benchmarks for Plyx-15M here soon.

## Pre-training Data

Plyx-15M was trained exclusively on a carefully selected set of high-quality datasets, prioritizing accuracy and structure; a sketch showing how these corpora can be streamed follows the list.

1. **`fineweb-pro` (Ultra-Clean Web Text):** A highly refined subset of general web text, aggressively filtered with automated cleaning tools to remove common errors and noise, giving the model a clean grounding in everyday language.
2. **`fineweb-edu` (Learning Materials):** Content focused on education and instruction, providing the model with a solid base in clear, organized knowledge.
3. **`finepdfs` (Structured Documents):** Specialized knowledge sourced from millions of professional reports and complex documents (PDFs). This ensures the model is exposed to formal, technical writing styles and organized information structures.
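
As referenced above, here is one way to stream and interleave these corpora with the Hugging Face `datasets` library. The dataset IDs match the corpora listed above, but the subset name, the `text` column, and the mixture weights are assumptions for illustration, not the actual Plyx-15M recipe.

```python
# Hedged sketch: stream the three corpora and interleave them into one mix.
# The subset name, text column, and weights are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

fineweb_pro = load_dataset("gair-prox/FineWeb-pro", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
finepdfs = load_dataset(
    "HuggingFaceFW/finepdfs", "eng_Latn",  # assumed English subset name
    split="train", streaming=True,
)

mix = interleave_datasets(
    [fineweb_pro, fineweb_edu, finepdfs],
    probabilities=[0.5, 0.3, 0.2],  # illustrative weights only
    seed=42,
)

for example in mix.take(3):
    print(example["text"][:200])  # peek at the first few mixed documents
```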

### Limitations

**Plyx-15M is a very small model (15 million parameters).** Its overall performance will be limited compared to models with billions of parameters. It should primarily be used for research, highly specific tasks, or as a base for fine-tuning, not as a replacement for a general-purpose language model. A minimal fine-tuning sketch follows.
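
The following sketch uses the standard `transformers` Trainer for the fine-tuning use case. The corpus file, sequence length, and hyperparameters are placeholders, not recommendations.

```python
# Minimal fine-tuning sketch for Plyx-15M with the transformers Trainer.
# "my_corpus.txt" and all hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Some tokenizers ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load a plain-text corpus and tokenize it for causal-LM training.
dataset = load_dataset("text", data_files="my_corpus.txt", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="plyx-15m-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```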

## License
The data used for pre-training (`fineweb-pro`, `fineweb-edu`, and `finepdfs`) is derived from sources made available under the **ODC-By 1.0 license**. Users must also abide by the [CommonCrawl Terms of Use](https://commoncrawl.org/terms-of-use/). We do not alter the license of any of the underlying data.