Model Overview
Description:
Nemotron-CLIMB FastText Classifiers are five lightweight, CPU-based fastText text classifiers (quality, advertisement, informational_value, cultural_value, and educational_value) developed by NVIDIA as part of the Nemotron-CLIMB data curation pipeline. Their sole purpose is to efficiently estimate the suitability of candidate web documents for large-language-model training at scale, enabling automated data quality control before any model training occurs. These models are ready for commercial use.
License/Terms of Use:
The fastText library used for training is developed by Meta Research and released under the MIT License. The NVIDIA-trained classifier weights are released under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
These classifiers are intended for use by ML engineers and data scientists who are building or refining pre-training corpora for large language models. The specific use case is automated scoring and filtering of web-crawled documents across five quality dimensions (text quality, advertisement content, informational value, cultural value, and educational value) as part of a data curation pipeline.
Reference(s):
- fastText: Library for efficient text classification and representation learning
- DCLM (DataComp-LM): Source data pool derived from Common Crawl
- nvidia/Nemotron-4-340B-Instruct: Teacher LLM used for annotation
Model Architecture:
Architecture Type: Shallow Neural Network (fastText supervised classifier)
Network Architecture: fastText supervised model with bag-of-words and n-gram input representation. Each classifier is a separate binary model file.
This model builds on the fastText supervised classification framework developed by Meta Research.
Number of model parameters: Each model contains 300-dimensional word embeddings over a large vocabulary derived from ~1 million web documents, resulting in a binary model file of approximately 6.8 GB per classifier.
Design Choices: The classifiers were produced through a two-stage knowledge distillation process:
- LLM-based annotation (teacher signal). Approximately 1 million web documents, sourced from the publicly available DCLM (DataComp-LM) data pool (itself derived from Common Crawl), were evaluated by nvidia/Nemotron-4-340B-Instruct. Each document was truncated to 2,048 tokens and scored on a 0-5 Likert scale across multiple quality dimensions using a detailed rubric prompt. The rubric assesses text quality, presence of promotional language, informational depth, cultural significance, and educational value.
- FastText classifier training (student models). For each quality dimension, a separate fastText supervised classifier was trained on the LLM-generated labels. Training used an 80/10/10 train/validation/test split with the following hyperparameters: learning rate 0.289, 7 epochs, 2-word n-grams, and 300-dimensional embeddings. A minimal training sketch follows this list.
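The sketch below shows how one of these classifiers could be reproduced with the fastText Python bindings using the hyperparameters listed above. The training-file names and the example label line are illustrative assumptions; fastText's supervised format expects one document per line, prefixed with its label.

```python
import fasttext

# Assumed filenames for the 80/10/10 split of one quality dimension.
# fastText supervised format: one document per line, prefixed with the
# LLM-distilled Likert score, e.g.
#   __label__4 A well-written article about photosynthesis ...
TRAIN_FILE = "train_quality.txt"
VALID_FILE = "valid_quality.txt"

# Hyperparameters from the model card: learning rate 0.289, 7 epochs,
# 2-word n-grams, 300-dimensional embeddings.
model = fasttext.train_supervised(
    input=TRAIN_FILE,
    lr=0.289,
    epoch=7,
    wordNgrams=2,
    dim=300,
)

# Evaluate on the held-out split; test() returns (N, precision@1, recall@1).
n, p_at_1, r_at_1 = model.test(VALID_FILE)
print(f"validation: n={n}, P@1={p_at_1:.3f}, R@1={r_at_1:.3f}")

model.save_model("best_model_quality.bin")
```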
Input(s):
Input Type(s): Text
Input Format(s):
- Text: Plain-text string (UTF-8 encoded)
Input Parameters:
- Text: One-Dimensional (1D)
Other Properties Related to Input: Documents are expected to be web-crawled text. During the teacher-annotation stage, documents were truncated to 2,048 tokens. At inference time, fastText processes the full input text. No special pre-processing is required beyond standard text normalization; fastText tokenizes on whitespace and processes one line of input at a time.
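As a practical note on input handling, the fastText Python bindings score one line at a time and reject strings containing newline characters, so web documents should be collapsed to a single line first. A minimal sketch (the helper name is ours, not part of this release):

```python
def prepare_document(raw_text: str) -> str:
    """Collapse a web-crawled document into a single UTF-8 line.

    fastText's predict() processes one line at a time and raises an error
    on embedded newlines, so all whitespace runs are reduced to one space.
    """
    return " ".join(raw_text.split())

doc = prepare_document("Some web page text...\nwith embedded\nnewlines.")
```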
Output(s):
Output Type(s): Text (classification label with confidence score)
Output Format(s):
- Text: A predicted label in the range __label__0 through __label__5 (corresponding to the 0-5 Likert scale), along with an associated probability score.
Output Parameters:
- Text: One-Dimensional (1D)
Other Properties Related to Output: Each classifier outputs a discrete score from 0 to 5 representing the quality of the input document along its respective dimension (quality, advertisement, informational value, cultural value, or educational value). Higher scores indicate higher quality / value. The classifiers run at high throughput on CPU hardware and do not require GPU acceleration.
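A minimal inference sketch with the fastText Python bindings, assuming the filename from the Model Version(s) table below; parsing the numeric score out of the label string follows the __label__0 through __label__5 format described above:

```python
import fasttext

# Load one of the five classifiers (CPU-only; no GPU required).
model = fasttext.load_model("best_model_quality.bin")

text = "A clear, well-structured article explaining photosynthesis."
labels, probs = model.predict(text)  # e.g. (('__label__4',), array([0.83]))

# Recover the 0-5 Likert score from the label string.
score = int(labels[0].replace("__label__", ""))
print(f"quality score: {score} (p={probs[0]:.2f})")
```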
Software Integration:
Runtime Engine:
- fastText (Python bindings or command-line interface)
- Not Applicable (N/A): No NVIDIA-specific runtime engine is required
Supported Hardware Microarchitecture Compatibility:
- CPU-only: These models are designed to run on standard x86_64 or ARM CPU hardware. No GPU is required.
Supported Operating System(s):
- Linux
- macOS
- Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
| Classifier | Filename | Size | Version |
|---|---|---|---|
| Quality | best_model_quality.bin | ~6.8 GB | v1.0 |
| Advertisement | best_model_advertisement.bin | ~6.8 GB | v1.0 |
| Informational Value | best_model_informational_value.bin | ~6.8 GB | v1.0 |
| Cultural Value | best_model_cultural_value.bin | ~6.8 GB | v1.0 |
| Educational Value | best_model_educational_value.bin | ~6.8 GB | v1.0 |
All five classifiers are v1.0 releases produced from the same knowledge-distillation pipeline.
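To illustrate how the five model files might be combined in a curation pipeline, here is a hedged sketch; the score threshold and the keep/drop rule are illustrative assumptions, not part of this release:

```python
import fasttext

# Filenames from the Model Version(s) table above.
CLASSIFIER_FILES = {
    "quality": "best_model_quality.bin",
    "advertisement": "best_model_advertisement.bin",
    "informational_value": "best_model_informational_value.bin",
    "cultural_value": "best_model_cultural_value.bin",
    "educational_value": "best_model_educational_value.bin",
}

models = {name: fasttext.load_model(path) for name, path in CLASSIFIER_FILES.items()}

def score_document(text: str) -> dict:
    """Return the 0-5 Likert score from each of the five classifiers."""
    text = " ".join(text.split())  # fastText expects single-line input
    return {
        name: int(model.predict(text)[0][0].replace("__label__", ""))
        for name, model in models.items()
    }

scores = score_document("Example web document text ...")

# Illustrative filtering rule: keep documents that score at least 3 on
# every dimension (higher scores indicate higher quality / value).
keep = all(score >= 3 for score in scores.values())
```

Note that each model file is roughly 6.8 GB, so keeping all five resident at once requires on the order of 34 GB of RAM; a memory-constrained pipeline could instead load the classifiers one at a time and score the corpus dimension by dimension.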
Training, Testing, and Evaluation Datasets:
Training Dataset:
Data Modality:
- Text
Training Data Size:
Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset:
- Hybrid: Automated, Synthetic (LLM-generated labels via Nemotron-4-340B-Instruct)
Labeling Method by dataset:
- Synthetic: Labels were generated by nvidia/Nemotron-4-340B-Instruct using a detailed rubric prompt that assesses text quality, presence of promotional language, informational depth, cultural significance, and educational value on a 0-5 Likert scale.
Properties: ~800,000 text documents per classifier (80% of ~1M total). Content is English-language web text sourced from Common Crawl via the DCLM data pool. The data may include publicly available web content of various types (articles, blogs, forums, etc.).
Testing Dataset:
Data Collection Method by dataset:
- Hybrid: Automated, Synthetic
Labeling Method by dataset:
- Synthetic
Properties: ~100,000 text documents per classifier (10% of ~1M total). Same source distribution as training data.
Evaluation Dataset:
Data Collection Method by dataset:
- Hybrid: Automated, Synthetic
Labeling Method by dataset:
- Synthetic
Properties: ~100,000 text documents per classifier (10% of ~1M total). Same source distribution as training data.
Inference:
Acceleration Engine: N/A (CPU inference via the fastText library)
Test Hardware:
- Standard x86_64 CPU (no GPU required)
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups [protected classes] in model design and testing: | Not Applicable. |
| Measures taken to mitigate against unwanted bias: | Not Applicable. |
Explainability
| Field | Response |
|---|---|
| Intended Task/Domain: | Text Classification / LLM Pre-training Data Curation |
| Model Type: | Shallow Neural Network (fastText supervised classifier) |
| Intended Users: | ML engineers and data scientists building or refining pre-training corpora for large language models. |
| Output: | Text (A predicted label __label__0 through __label__5 corresponding to a 0-5 Likert scale, along with an associated probability score) |
| Describe how the model works: | Input text is tokenized into bag-of-words and n-gram features by the fastText library. These features are mapped to 300-dimensional word embeddings, which are averaged and passed through a linear classifier to produce a predicted quality score (0-5 Likert scale) for the given dimension. Each of the five classifiers (quality, advertisement, informational value, cultural value, educational value) is a separate binary model trained via knowledge distillation from Nemotron-4-340B-Instruct labels. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable. |
| Technical Limitations & Mitigation: | The classifiers are trained on English web text only; they may produce unreliable scores for non-English documents or highly specialized domains (e.g., code, mathematical notation). Labels are distilled from an LLM teacher (Nemotron-4-340B-Instruct), so classifier outputs are bounded by the teacher model's judgment and may inherit its systematic errors. Documents were truncated to 2,048 tokens during the teacher-annotation stage, so very long documents may have been scored based on partial content. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Precision, Recall, F1 Score (per Likert-scale class), Throughput (documents/second on CPU) |
| Potential Known Risks: | Misclassification of documents could lead to high-quality content being filtered out or low-quality content being retained in LLM training corpora, which may degrade downstream model performance. The classifiers may also inherit biases present in the teacher LLM's scoring rubric or in the Common Crawl source data. |
| Licensing: | NVIDIA Open Model License |
Safety & Security
| Field | Response |
|---|---|
| Model Application Field(s): | Large Language Model Data Curation |
| Describe the life critical impact (if present). | Not Applicable. |
| Use Case Restrictions: | Abide by the NVIDIA Open Model License. |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access restrictions were enforced during training, and dataset license constraints were adhered to. |
Privacy
| Privacy Subcard |
|---|
| The Nemotron-CLIMB FastText Classifiers were trained on large-scale publicly available data that may contain text relating to people. NVIDIA collected and used this data in compliance with applicable data protection and privacy laws. These models were not designed to specifically derive insights or otherwise learn from any personal data contained in the datasets. |
| NVIDIA uses a combination of filters, data minimization techniques, and other guardrails to help prevent personal data from being recited by our models. We employ automated tools and data processing techniques during pre-training or training to identify and filter certain categories of personal data. |
| Please review NVIDIA's Applicable Privacy Policy for more information. |
Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns here.