---
library_name: transformers
datasets:
  - HuggingFaceFW/finepdfs
  - HuggingFaceFW/fineweb-edu
  - gair-prox/FineWeb-pro
license: mit
---

# MultivexAI/Plyx-15M

MultivexAI/Plyx-15M is a 15-million-parameter language model trained entirely from scratch. It is designed for maximum efficiency, demonstrating that a strong focus on data quality can produce a capable foundation model even at this minimal size.

Plyx-15M is intended for quick testing, research into data efficiency, and specialized fine-tuning tasks where model size must be kept small.
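For quick testing, the model can be loaded with the `transformers` library. The following is a minimal generation sketch; the prompt and sampling settings are illustrative, not recommended defaults:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative prompt and sampling settings; tune for your task.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```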

**Model Series Note:** This is Version 1 of the Plyx model family. We are continuing this work and plan to release future models in a range of sizes. We expect to publish initial performance benchmarks for Plyx-15M here soon.

## Pre-training Data

Plyx-15M was trained exclusively on a carefully selected set of high-quality datasets, prioritizing accuracy and structure:

  1. **fineweb-pro** (ultra-clean web text): A highly refined subset of general internet content, aggressively filtered with automated quality tools to remove common errors and noise, giving the model a clean grounding in everyday language.
  2. **fineweb-edu** (learning materials): Content focused on education and instruction, providing the model with a solid base of clear, organized knowledge.
  3. **finepdfs** (structured documents): Specialized knowledge sourced from millions of professional reports and complex documents (PDFs), exposing the model to formal, technical writing styles and organized information structures.
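As a quick way to inspect this data mix, the corpora can be streamed with the `datasets` library. This is a sketch under the assumption that each repo exposes a default configuration with a `text` field; some may require an explicit config name (for example, a language subset), so check each dataset card:

```python
from datasets import load_dataset

# Stream one record from each pre-training dataset without downloading
# the full corpora. Dataset IDs are taken from this model card's metadata.
# Assumption: the default config and a "text" field; some repos may need
# an explicit config name (e.g., a language subset).
for repo_id in [
    "gair-prox/FineWeb-pro",
    "HuggingFaceFW/fineweb-edu",
    "HuggingFaceFW/finepdfs",
]:
    ds = load_dataset(repo_id, split="train", streaming=True)
    sample = next(iter(ds))
    print(repo_id, "->", sample["text"][:200])
```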

## Limitations

Plyx-15M is a very small model (15 million parameters). Its overall performance will be limited compared to models with billions of parameters. It should primarily be used for research, highly specific tasks, or as a base for fine-tuning, not as a replacement for a general-purpose language model.
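For the fine-tuning use case mentioned above, a minimal causal-LM sketch with the `transformers` `Trainer` might look like the following. The dataset choice, sequence length, and hyperparameters are illustrative assumptions, not a recommended recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "MultivexAI/Plyx-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Llama-style tokenizers often lack a pad token; reuse EOS if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative dataset choice; substitute your own task-specific corpus.
raw = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="plyx-15m-finetuned",
    per_device_train_batch_size=16,  # illustrative hyperparameters
    num_train_epochs=1,
    learning_rate=5e-4,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

A model this small trains quickly even on modest hardware, which is what makes it practical as a fine-tuning base for narrow tasks.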

## License

The data used for pre-training (fineweb-pro, fineweb-edu, and finepdfs) is derived from sources made available under the ODC-By 1.0 license. Users must also abide by the CommonCrawl Terms of Use. We do not alter the license of any of the underlying data.