metadata
title: Ultra-FineWeb-L2-Selector
emoji: 
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.5.1
python_version: '3.10'
app_file: app.py
pinned: false
license: apache-2.0

⚡ Ultra-FineWeb Classifier

A lightweight fastText-based classifier that filters web data for quality, supporting both English and Chinese text.

🌟 Features

  • Fast Inference: Based on fastText for efficient classification
  • Bilingual Support: Works with both English (en) and Chinese (zh) content
  • Quality Scoring: Returns a quality score from 0 to 1
  • Easy to Use: Simple web interface powered by Gradio
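As a rough illustration of how a fastText prediction could be turned into the single 0 to 1 quality score mentioned above, here is a minimal sketch. It assumes the model is queried so that it returns every label with its probability (in the fastText Python API, `model.predict(text, k=-1)` returns a `(labels, probabilities)` pair) and that the "high quality" label is named `__label__pos`. That label name and both helper functions are hypothetical and are not taken from this Space's `app.py`.

```python
from typing import Sequence


def to_single_line(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces.

    fastText's Python `predict` processes one line at a time and rejects
    input containing newline characters, so multi-line documents need
    flattening before classification.
    """
    return " ".join(text.split())


def quality_score(
    labels: Sequence[str],
    probs: Sequence[float],
    positive_label: str = "__label__pos",  # assumed label name
) -> float:
    """Return the probability of the positive label, clamped to [0, 1].

    `labels` and `probs` are expected in the shape fastText returns them:
    two parallel sequences, one entry per predicted label.
    """
    for label, prob in zip(labels, probs):
        if label == positive_label:
            # fastText probabilities can drift slightly above 1.0
            # because of floating-point error, so clamp the value.
            return min(max(float(prob), 0.0), 1.0)
    raise ValueError(f"{positive_label!r} not found in predicted labels")
```

With a real model, the call would look something like `quality_score(*model.predict(to_single_line(doc), k=-1))`, where `model` comes from `fasttext.load_model` and the model path depends on how the classifier files are distributed.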

📊 Quality Score Interpretation

| Score Range | Quality Level | Recommendation |
|---|---|---|
| ≥ 0.7 | 🌟 High Quality | Suitable for LLM training |
| 0.4 - 0.7 | 📊 Medium Quality | May need review |
| < 0.4 | ⚠️ Low Quality | Likely not suitable |
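
The score bands above can be expressed as a small helper. This is a sketch only: the function name `quality_level` is hypothetical, and it assumes the boundaries behave as the table implies, with 0.7 counting as high quality and 0.4 counting as medium quality.

```python
def quality_level(score: float) -> str:
    """Map a quality score in [0, 1] to the level names used in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.7:
        return "High Quality"    # suitable for LLM training
    if score >= 0.4:
        return "Medium Quality"  # may need review
    return "Low Quality"         # likely not suitable


# Example: keep only documents that reach the top band.
scored_docs = [("doc-a", 0.91), ("doc-b", 0.55), ("doc-c", 0.12)]
kept = [name for name, score in scored_docs if quality_level(score) == "High Quality"]
```

Whether medium-quality documents are kept, dropped, or sent for manual review is a pipeline decision; the table only recommends reviewing them.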

🔗 Links

📝 Citation

@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}

📄 License

Apache 2.0