gaoxin

GX-XinGao

AI & ML interests

None yet

Recent Activity

updated a collection 28 days ago

R-Select

Organizations

upvoted a paper 22 days ago

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Paper • 2606.22138 • Published 25 days ago • 25

upvoted 2 papers about 2 months ago

ACC: Compiling Agent Trajectories for Long-Context Training

Paper • 2605.21850 • Published May 21 • 61

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Paper • 2605.15963 • Published May 15 • 17

upvoted 2 papers 3 months ago

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Paper • 2604.10480 • Published Apr 12 • 20

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Paper • 2604.04771 • Published Apr 6 • 125

upvoted 2 papers 4 months ago

Motivation in Large Language Models

Paper • 2603.14347 • Published Mar 15 • 17

Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training

Paper • 2603.07223 • Published Mar 7 • 13

upvoted a paper 5 months ago

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Paper • 2602.11089 • Published Feb 11 • 18

upvoted 2 papers 6 months ago

Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

Paper • 2601.17027 • Published Jan 17 • 42

Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Paper • 2601.09733 • Published Dec 30, 2025 • 9

upvoted 2 papers 7 months ago

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Paper • 2512.19673 • Published Dec 22, 2025 • 66

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Paper • 2512.14051 • Published Dec 16, 2025 • 47

upvoted a paper 8 months ago

GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

Paper • 2511.11134 • Published Nov 14, 2025 • 33

upvoted 2 papers 12 months ago

Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning

Paper • 2507.17512 • Published Jul 23, 2025 • 37

REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Paper • 2507.10541 • Published Jul 14, 2025 • 30

upvoted an article about 1 year ago

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Article • loubnabnl, anton-l, davanstrien • Mar 20, 2024 • 115

upvoted 4 papers about 1 year ago

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Paper • 2504.21117 • Published Apr 29, 2025 • 26

CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges

Paper • 2504.19093 • Published Apr 27, 2025 • 18

A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

Paper • 2504.12322 • Published Apr 11, 2025 • 28

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Paper • 2504.09925 • Published Apr 14, 2025 • 39
