Tags: Text Generation · Transformers · Safetensors · llama · text-generation-inference
MultivexAI committed · Commit 6854049 · verified · 1 Parent(s): 0a65c2c

Update README.md

Files changed (1):
  README.md +13 -15
README.md CHANGED
@@ -1,25 +1,23 @@
- ---
- library_name: transformers
- datasets:
- - HuggingFaceFW/finepdfs
- - HuggingFaceFW/fineweb-edu
- - gair-prox/FineWeb-pro
- license: mit
- ---
  # MultivexAI/Plyx-15M

- **MultivexAI/Plyx-15M** is a 15 million parameter language model trained entirely from scratch. It is built for efficiency, focusing on the quality of its pre-training data to create a compact, capable foundation model. Plyx-15M is ideal for quick experimentation, research into data efficiency, and deployment on resource-limited hardware.

- **Model Series Note:** This is **Version 1** of the Plyx model family. We are continuing our research and plan to release future models with diverse parameter counts. We expect to publish initial performance benchmarks for this model shortly.

  ## Pre-training Data

- Plyx-15M was trained on a strategic blend of premium datasets, prioritizing data integrity and structure.

- 1. **`fineweb-pro` (Refined Web Text):** This data is a heavily filtered and refined subset of general web content. It uses advanced, automated processes to remove noise and errors, providing the model with a very clean foundation in general language.
- 2. **`fineweb-edu` (Structured Learning):** A corpus focused on educational and instructional materials, giving the model a solid base in clear, organized knowledge.
- 3. **`finepdfs` (Technical Documents):** High-quality, specialized knowledge sourced from millions of professional documents and reports (PDFs). This ensures the model is exposed to formal, structured text and complex writing styles.

  ## License

- The data used for pre-training, including `fineweb-pro` and `finepdfs`, is derived from FineWeb and related sources, which are made available under the **ODC-By 1.0 license**. Users must also abide by the [CommonCrawl Terms of Use](https://commoncrawl.org/terms-of-use/). We do not alter the license of any of the underlying data.

  # MultivexAI/Plyx-15M

+ **MultivexAI/Plyx-15M** is a 15 million parameter language model trained entirely from scratch. It is designed for efficiency, demonstrating that a strong focus on pre-training data quality can yield a capable foundation model even at this small size.
+ 
+ Plyx-15M is intended for quick experimentation, research into data efficiency, and specialized fine-tuning tasks where model size must be kept small.
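+ 
+ Since the repository is tagged as a Transformers-compatible llama-style checkpoint, it should load through the standard `AutoModelForCausalLM` API. A minimal, untested sketch (the prompt and sampling settings below are illustrative placeholders, not tuned recommendations):
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Load the 15M-parameter checkpoint from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("MultivexAI/Plyx-15M")
+ model = AutoModelForCausalLM.from_pretrained("MultivexAI/Plyx-15M")
+ 
+ # Generate a short continuation of an example prompt.
+ inputs = tokenizer("Data quality matters because", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```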
+ 
+ **Model Series Note:** This is **Version 1** of the Plyx model family. We are continuing this work and plan to release future models in various sizes. We expect to publish initial performance benchmarks for Plyx-15M here soon.

  ## Pre-training Data

+ Plyx-15M was trained exclusively on a carefully selected set of premium datasets, prioritizing accuracy and structure; a sketch for inspecting them follows the list.
+ 
+ 1. **`fineweb-pro` (Ultra-Clean Web Text):** A highly refined subset of general web content, aggressively filtered with automated tools to remove common errors and noise, giving the model a clean grounding in everyday language.
+ 2. **`fineweb-edu` (Learning Materials):** Content focused on education and instruction, providing the model with a solid base of clear, organized knowledge.
+ 3. **`finepdfs` (Structured Documents):** Specialized knowledge sourced from millions of professional reports and complex documents (PDFs), exposing the model to formal, technical writing styles and organized information structures.
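+ 
+ These sources correspond to the Hub dataset ids `gair-prox/FineWeb-pro`, `HuggingFaceFW/fineweb-edu`, and `HuggingFaceFW/finepdfs`. A hedged sketch for peeking at each one with the `datasets` library (default configs are assumed; some of these repos may require an explicit subset name):
+ 
+ ```python
+ from datasets import load_dataset
+ 
+ # Stream one record from each pre-training source without a full download.
+ for repo_id in ["gair-prox/FineWeb-pro", "HuggingFaceFW/fineweb-edu", "HuggingFaceFW/finepdfs"]:
+     ds = load_dataset(repo_id, split="train", streaming=True)  # config name omitted (assumption)
+     sample = next(iter(ds))
+     print(repo_id, "->", sorted(sample.keys()))
+ ```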
+ 
+ ### Limitations

+ **Plyx-15M is a very small model (15 million parameters).** Its overall performance will be limited compared to models with billions of parameters. It should be used primarily for research, narrowly scoped tasks, or as a base for fine-tuning, not as a replacement for a general-purpose language model.
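+ 
+ For the fine-tuning use case, a hypothetical sketch with the standard `Trainer` API (the dataset id, `text` column, and hyperparameters below are placeholders, not project recommendations):
+ 
+ ```python
+ from datasets import load_dataset
+ from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)
+ 
+ tokenizer = AutoTokenizer.from_pretrained("MultivexAI/Plyx-15M")
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token  # llama-style tokenizers often lack a pad token
+ model = AutoModelForCausalLM.from_pretrained("MultivexAI/Plyx-15M")
+ 
+ # Placeholder corpus: swap in your own dataset with a "text" column.
+ raw = load_dataset("your-org/your-text-dataset", split="train")
+ tokenized = raw.map(
+     lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
+     batched=True, remove_columns=raw.column_names,
+ )
+ 
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(output_dir="plyx-15m-finetuned",
+                            per_device_train_batch_size=8, num_train_epochs=1),
+     train_dataset=tokenized,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ ```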
 
 
  ## License

+ The data used for pre-training (`fineweb-pro`, `fineweb-edu`, and `finepdfs`) is derived from sources made available under the **ODC-By 1.0 license**. Users must also abide by the [CommonCrawl Terms of Use](https://commoncrawl.org/terms-of-use/). We do not alter the license of any of the underlying data.