meituan-longcat/LongCat-Flash-Prover

Maternion committed 21 days ago

Commit 992f781 · verified · 1 Parent(s): 9532d65

Update README.md

Files changed (1)

README.md +1 -1


@@ -52,7 +52,7 @@ tags:
 ## Introduction
-We introduce **LongCat-Flash-Prover**, a flagship $560$-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR).
+We introduce **LongCat-Flash-Prover**, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR).
 We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving.
 To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch.
 During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels.