cabinet-data_curation
updated
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Paper
•
2507.01352
•
Published
•
56
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Paper
•
2507.13563
•
Published
•
53
Scaling Laws for Optimal Data Mixtures
Paper
•
2507.09404
•
Published
•
37
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
•
2511.14993
•
Published
•
230
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper
•
2512.16676
•
Published
•
212
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Paper
•
2512.04324
•
Published
•
155
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
•
2512.14051
•
Published
•
45
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Paper
•
2511.06307
•
Published
•
52
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Paper
•
2511.18050
•
Published
•
38
FineVision: Open Data Is All You Need
Paper
•
2510.17269
•
Published
•
75
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Paper
•
2510.23587
•
Published
•
67
RAG-Anything: All-in-One RAG Framework
Paper
•
2510.12323
•
Published
•
56
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Paper
•
2508.21148
•
Published
•
140
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Paper
•
2509.18154
•
Published
•
53
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Paper
•
2508.01191
•
Published
•
238
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Paper
•
2508.10975
•
Published
•
60
Alchemist: Turning Public Text-to-Image Data into Generative Gold
Paper
•
2505.19297
•
Published
•
84
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Paper
•
2509.09680
•
Published
•
43
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Paper
•
2512.16905
•
Published
•
32
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Paper
•
2511.12609
•
Published
•
105
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Paper
•
2511.16043
•
Published
•
109
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models
Paper
•
2512.00590
•
Published
•
48
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Paper
•
2510.16872
•
Published
•
109
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Paper
•
2509.12201
•
Published
•
106
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Paper
•
2503.18406
•
Published
•
3