Spaces:

FrictionAI
/

SokratesAI

Sleeping

App Files Files Community

Alleinzellgaenger commited on Aug 18, 2025

Commit

8f87470

1 Parent(s): 19b82ef

Add lennart version

Browse files

Files changed (4) hide show

backend/app.py +72 -378
backend/documents/lennart.txt +497 -0
backend/prompts/system_prompt.txt +104 -0
backend/prompts/transition_prompt.txt +29 -0

backend/app.py CHANGED Viewed

@@ -13,6 +13,33 @@ import json
 # Load environment variables
 load_dotenv()
 app = FastAPI()
 # Enable CORS
@@ -40,6 +67,7 @@ class ChatRequest(BaseModel):
     nextChunk: Optional[str] = None
     action: Optional[str] = None  # 'skip', 'understood', or None
     document: Optional[str] = None
 @app.post("/api/chat")
 async def chat_endpoint(request: ChatRequest):
@@ -49,201 +77,28 @@ async def chat_endpoint(request: ChatRequest):
     current_chunk = request.currentChunk or request.chunk or "No specific chunk provided"
     next_chunk = request.nextChunk or ""
     action = request.action
-    document = request.document or """
-    # Auswertung Versuch F44: Zeeman Effekt
-    Dominic Holst, Moritz Pfau
-    October 23, 2020
-    ## 1 Magnetfeld und Hysterese
-    Zu Beginn des Versuchs haben wir mit Hilfe des Teslameters die Magnetfeldstärke B an der Position der Cd-Lampe bei verschiedenen Spulenströmen gemessen (siehe Messwerte in Tabelle 1 im Laborbuch). In Figure 1 sind die gemessenen Feldstärken als Funktion der Stromstärke aufgetragen.
-    Anhand der Fehlerbalken und der praktisch identischen Überlagerung der beiden linearen Fitgeraden für auf- und absteigende Stromstärken, wird deutlich, dass **keine Hystereseeffekte vorliegen**. Der lineare Fit wurde hierbei nur auf die Stromstärken bis einschl. 10A angewandt, da für größere Stromstärken das Magnetfeld nicht in direktem proportionalen Zusammenhang ansteigt. Dies ist mit Sättigungseffekten der Magnetisierung des Eisenkerns der verwendeten Spule zu erklären.
-    *Figure 1: Messung des Magnetfelds als Funktion der Stromstärke*
-    ## 2 Qualitative Beobachtung des Zeeman Effekts
-    Mit Hilfe der CMOS Kamera wurde das Spektrum des emittierten Lichts der Cadmiumlampe unter Verwendung des Lummer Gehercke Interferometers beobachtet. Die Beobachtungen wurden in longitudinaler und transversaler Richtung zum Magnetfeld durchgeführt.
-    ### 2.1 Longitudinale Richtung:
-    **ohne Filter:**
-    Es sind deutlich zwei Linien pro Ordnung zu erkennen. Dies sind die σ+ und σ' Linien. Die π Linie ist in longitudinaler Richtung nicht zu beobachten
-    **mit λ/4-Plättchen und Polarisationsfilter:**
-    Von der Cadmiumlampe aus betrachtet wird zuerst ein λ/4-Plättchen und danach ein Polarisationsfilter in den Strahlengang gebracht. Je nach Ausrichtung der Filter zueinander wird nun eine der beiden Linien ausgeblendet.
-    **-45° Winkel:**
-    Stehen λ/4-Plättchen und Polarisationsfilter zueinander im −45° Winkel, wird das zirkular polarisierte Licht der σ¯ Linie um 45° verschoben linear polarisiert und somit vom Polarisationsfilter abgeschirmt. Folglich ist in dieser Konstellation nur die linke der beiden σ Linien zu beobachten.
-    **+45° Winkel:**
-    Stehen λ/4-Plättchen und Polarisationsfilter zueinander im +45° Winkel, ist nach analogem Prinzip wie zuvor nur die rechte Linie auf dem Kamerabild zu beobachten.
-    *Figure 2: Bilder der CMOS Kamera in longitudinaler Richtung mit a) λ/4-Plättchen und Polarisationsfilter im −45° Winkel, b) ohne Filter und c) Filter im +45° Winkel*
-    ### 2.2 Transversale Richtung:
-    **ohne Filter:**
-    Es sind deutlich drei Linien pro Ordnung zu erkennen. Dies sind die σ⁺, π und σ⁻ Linien.
-    **mit Polarisationsfilter horizontal (in B-Feld Richtung):**
-    Die beiden σ-Linien sind vollständig ausgeblendet. Die π- Linie ist deutlich sichtbar.
-    **mit Polarisationsfilter vertikal (90° zu B-Feld Richtung):**
-    Die beiden σ-Linien sind klar sichtbar. Die π-Linie ist ausgeblendet.
-    *Figure 3: Bilder der CMOS Kamera in vertikaler Richtung mit a) keinem Filter, b) Polarisationsfilter horizontal und c) Polarisationsfilter vertikal*
-    Wie in Figure 3 gut zu erkennen ist, sind die ausgeblendeten Linien in beiden Konfigurationen weiterhin leicht sichtbar. Dies ist auf das nicht perfekt homogene Magnetfeld am Ort der Ca-Lampe zurückzuführen. Das Licht ist also nicht perfekt zirkular bzw. in B-Feld Richtung polarisiert, weshalb ein vollständiges Ausblenden im Experiment nicht zu beobachten ist.
-    ## 3 Spektroskopie des Zeemaneffekts
-    ### 3.1 Bestimmen des Zeemanshifts
-    Die Messdaten bei verschiedene Stromstärken wurden jeweils in einem Plot dargestellt. Um für den Fit möglichst saubere Messkurven des Spektrums zu verwenden, wurde die Messreihe bei I = 8A nicht in die Datenauswertung einbezogen, da die Aufspaltung der Cadmiumlinie nur schwer zu beobachten war. Das gleich gilt für die 8. Interferenzodnung, die nicht berücksichtigt wurde. Für die Datenauswertung fließen also die Nullte bis 7. Ordnung jeweils bei 9 bis 13 Ampere ein.
-    Als Funktion um die Messdaten zu fitten wurde ein Pseudo-Voigt-Profil verwendet. Die drei Kurven einer Ordnung wurden hierbei gemeinsam mit der Summe dreier Pseudo-Voigt-Profile gefittet. In Figure 4 sind exemplarisch anhand der Daten für I = 12A die Messdaten und der abschnittsweise Fit zu erkennen.
-    *Figure 4: Messdaten und Voigt-Fit bei Spulenstrom I = 12A*
-    Anhand der Fitparameter wird die Position der σ und π Linien bestimmt. Die Fehler der Fitparameter sind extrem klein (≈ 0,1px) und eigenen sich nicht als realistische Fehler für unsere weitere Rechnung. Als minimalen Fehler nehmen wir daher die Auflösung der Kamera an (1px) und skalieren alle Fehler so, dass der kleineste Fehler exakt 1px beträgt. Die anderen Fehler sind dann entsprechend linear skaliert größer. Dies berücksichtigt die unterschiedliche Qualität der Fits auf unterschiedliche Interferenz-Ordnungen, bringt die Fehler aber in einen experimentell realistischen Bereich.
-    Für die Berechnung des Zeemanshifts müssen die Verzerrungseffekte der Lummer-Gehrcke-Platte beachtet werden. Hierfür wird die Position der π-Linien gegen der Interferenzordnung k der entsprechenden Linie aufgetragen. Der funktionelle Zusammenhang dieser beiden Größen wird durch eine quadratische Funktion k = f(a) approximiert:
-    k = f(a) = ba² + ca + d (1)
-    Wir verwenden hier eine Taylor-Näherung für eine in der Realität deutlich kompliziertere Funktion. Dies ist aber, wie in Figure 5 gut ersichtlich, für unsere Zwecke weitaus ausreichend.
-    Die beiden σ-Linien können auf den quadratischen Fit f(a) projiziert werden, wodurch wir die jeweilige (nicht mehr ganzzahligen) Ordnung der σ-Linien erhalten. In Figure 5 ist (wieder exemplarisch für I = 12A) die optische Verzerrung der Platte aufgetragen.
-    *Figure 5: Verzerrungseffekte der Lummer-Gehrcke-Platte bei I = 12A*
-    Die Differenz zur ganzzahligen Ordnung der zugehörigen π-Linie ergibt δk. Für eine (kleine) Wellenlängenverschiebung δλ gilt:
-    δλ = δk / Δk * λ² / (2d * sqrt(n² − 1)) (2)
-    Für den Abstand Δk zweier Ordnungen gilt Δk = 1. Für die Wellenlänge λ der betrachten Linie verwenden wir den in Part 2 bestimmten Wert von λ = (643, 842 ± 0, 007)nm.
-    Wir kennen nun die Wellenlänge des Zeemanshift für jede von uns betrachtete Linie. Mit dem Zusammenhang zwischen Wellenlänge und Energie E = hc/λ lässt sich nun die Energieverschiebung der Linine bestimmen. Wir nehmen an, dass die Wellenlängenverschiebung δλ klein gegenüber der absoluten Wellenlänge λ ist, und erhalten daher für die Energieverschiebung δE in guter Näherung:
-    δE = (hc/λ²) * δλ (3)
-    Abschließend nehmen wir den Durchschnitt aller Werte δE für eine Stromstärke I.
-    ### 3.2 Bestimmen des Bohrschen Magnetons μB
-    Für die Energieverschiebung beim Zeemaneffekt gilt:
-    δE = μB · ml · B (4)
-    Da es sich bei der betrachteten Cadmiumlinie um einen ¹D₂ → ¹P₁ Übergang handelt gilt hier ml = ±1. Somit folgt für das Bohrsche Magneton μB als Funktion des Spulenstroms I:
-    μB(I) = δE(I) / B(I) (5)
-    Die Magnetfeldstärke B(I) wurde hier anhand der Messwerte aus Teil 1 des Experiments bestimmt.
-    Wir erhalten für jeden Spulenstrom I einen experimentell bestimmten Wert des Bohrschen Magnetons μB. Unsere Ergebnisse sind in Figure 6 graphisch dargestellt.
-    *Figure 6: Experimentell bestimmte Werte für das Bohrsche Magneton bei unterschiedlichen Spulenströmen I*
-    Für den experimentellen Mittelwert erhalten wir:
-    μB,exp = (10, 1 ± 0.8) · 10⁻²⁴ J/T
-    Der Literaturwert beträgt:
-    μB,lit = 9, 27400949 · 10⁻²⁴ J/T
-    Unsere experimentell ermittelte Wert weicht also um 1,2 Sigma vom Literaturwert ab. Die Abweichung ist folglich nicht signifikant.
-    ### 3.3 Kritische Betrachtung der Ergebnisse
-    Erfreulicherweise scheint unsere experimentelle Methode keine signifikante Abweichung zwischen Literaturwert und experimentellem Wert des Bohrschen Magnetons zu ergeben. Wir befinden uns mit unserem Wert im niedrigen 2-Sigma-Intervall. Dennoch ist kritisch anzumerken, dass wir einen vergleichsweise großen realtiven Fehler auf unser Messergebnis von 7,1% erhalten. Das bedeutet, unsere Abweichung ist zwar nicht sigifikant, dennoch weicht unser experimenteller Wert um knapp 10% vom Literaturwert ab. Der verwendete experimentelle Aufbau ist folglich nur bedingt für eine exakte Bestimmung des Bohrschen Magnetons geeigent.
-    Die beiden dominierenden Fehlerquellen sind zum einen die Bestimmung des Magnetfeldes B am Ort der Cadmium Lampe (Inhomogenitäten, exakte Platzierung der Lampe) und zum anderen die Wahl der Fehler der Positionen der π- und σ -Linien im Spektrum.
-    Zum Vergleich: Legt man den Fehler prinzipiell für alle Linien auf 1px, also die maximale Auflösung der Kamera, fest und verzichtet auf eine Skalierung der Fehler, beträgt die Abweichung des exp. Werts zum Literaturwert schon 2,8 Sigma. Wählt man analog für den Fehler der Linien 2px, da beispielsweise ein Maximum auch exakt zwischen zwei Pixelreihen liegen kann, liegt die Abweichung bei 1,4 Sigma.
-    ## 4 Quantitative Betrachtung des Spektrums
-    ### 4.1 Wellenlänge rote Cd-Linie
-    *Figure 7: Neonspektrum*
-    Zunächst wird der Untergrund von den Messdaten abgezogen, um Störungen durch Rauschen oder Sondereffekte wie kosmische Strahlung oder Umgebungsquellen zu eliminieren. Sollten sich in den Spektren negative Werte befinden, ist dies auf zufällige Unterschiede im Rauschen zurückzuführen. Anhand bekannter Linien des Neonspektrums werden den Pixeln nun Wellenlängen zugeordnet. Hierfür wurde der Bereich des Neonspektrums aufgenommen, in dem sich auch die rote Linie des Cadmiumspektrums befindet. In 7 sieht man das Neonspektrum und die Peaks, an die jeweils ein Voigt-Profil gelegt wurde. Jetzt kann man den identifizierten Linien ihre jeweilige Wellenlänge zuordnen und einen polynomiellen Zusammenhang finden. Wir haben uns für eine Gerade entschieden, die wie in Figure 8 zu sehen gut zu den Daten passt.
-    Schließlich wird ein Voigt-Profil an die gemessene rote Cd-Linie gelegt, wie in Figure 9 gezeigt. Umrechnung anhand der Kalibrierung führt auf einen Wert von λcd = (643,842 ± 0,007)nm. Dies befindet sich im 1σ-Bereich des Literaturwertes von λlit = 643, 84695nm. Der Fehler ist Ergebnis der Gauß'schen Fehlerfortpflanzung.
-    *Figure 8: Kalibrationsgerade*
-    ### 4.2 Kritische Betrachtung der Ergebnisse
-    Messwert und theoretische Vorhersage für die bestimmte Linie stimmen innerhalb statistischer Schwankungen überein. Dies ist umso interessanter, wenn man die Unsicherheit des Messergebnisses betrachtet, die kleiner als 0,002% ist. Der absolute Fehler ist, wenn man die Steigung der Kalibrationsgeraden betrachtet, kleiner als 1px. Er besteht ausschließlich aus Abweichungen der numerischen Fits. Berücksichtigt man Ungenauigkeiten des CMOS Sensors oder die Möglichkeit, dass je nach Lage des Messwerts auch eine Abweichung um weniger als 1px eine größere Messwertschwankung verursachen kann, da die Pixel nur diskrete Werte messen können, liegt eine nachträgliche Anpassung nahe. Skaliert man die Unsicherheit auf 1px, liegt der Fehler des Messwerts bei 0,012nm. Damit ist der relative Fehler weiterhin kleiner 0,005%.
-    Zur hohen Genauigkeit trägt vor allem das gute Messverfahren bei. Spektrometer und Datenaufnahme per Computer lassen wenig Raum für Abweichungen. Wie die Daten zeigen, haben wir dabei eine Quelle für einen möglichen großen systematischen Fehler umgangen: Die Kamera wurde auf das Spektrometer nur locker aufgesteckt. Hätte sich deren Position zwischen Neon- und Cadmiummmessung z.B. durch Erschütterung des Labortisches verändert, hätte die Energiekalibrierung nicht mehr zur Messung der Cadmiumlinie gepasst.
-    Abbildung 6 zeigt unerwartetes Verhalten. Obwohl der Magnet ausgeschaltet war, sind drei Maxima zu sehen, deren Flanken sehr steil abfallen. Vergleicht man mit den Messungen im Magnetfeld, ähneln sich die Strukturen. Möglich ist, dass die Eisenkernspule, in der sich die Lampe während der Messung befand eine Restmagnetisierung aufwies, die eine Aufspaltung herbeigeführt hat.
-    *Figure 9: Cadmium rote Linie*
-    """
     # Create system prompt for research paper tutor with transition support
     is_transition = action in ['skip', 'understood']
     if is_transition:
-        system_prompt = f"""
-        You are PaperMentor, an expert academic tutor guiding the user through a continuous learning journey of an academic paper.
-        The user has just {action} the previous section and is transitioning to a new topic. This is part of a continuous conversation where you maintain context and adapt based on the user's actions.
-        User's Action: {action}
-        Previous Section:
-        {current_chunk}
-        New Section to Introduce:
-        {next_chunk}
-        Full Document for Context:
-        {document}
-        Your response should:
-        1. **Acknowledge the transition**: Briefly reference their choice to {action} the previous section
-        2. **Provide smooth continuity**: Connect the previous section to this new one naturally
-        3. **Introduce the new section**: Present the new topic with enthusiasm and context
-        4. **Adapt your approach**: If they skipped, perhaps adjust to be more engaging. If they understood, acknowledge their progress
-        5. **Begin new exploration**: Start the 3-question sequence for this new section
-        Maintain the same conversational style and focus on phenomenological understanding.
-        """
     else:
-        system_prompt = f"""
-        You are PaperMentor, an expert academic tutor. Your purpose is to guide a user to a deep, phenomenological understanding of an academic paper.
-        The user's primary goal is to: "phänomenologisch verstehen, was passiert, was beobachtet wurde und warum das so ist, mit wenig Fokus auf Formeln, sondern Fokus auf intuitivem Verständnis und dem experimentellen Ansatz." (phenomenologically understand what is happening, what was observed, and why, with little focus on formulas but a strong focus on intuitive understanding and the experimental approach).
-        Your entire interaction must be guided by this goal.
-        You will be given a specific chunk of the paper to discuss, as well as the full document for context.
-        ---
-        Current Chunk:
-        {current_chunk}
-        ---
-        Full Document for Context:
-        {document}
-        ---
-        Your interaction must follow this specific conversational flow:
-    1.  **Greeting and Contextualization:**
-        *   Begin with a friendly greeting.
-        *   First, briefly explain what this chunk is about in simple terms.
-        *   Then, place this chunk within the larger context of the paper. Explain its purpose in the overall argument. For instance: "Here, the authors are presenting the core observation that the rest of the paper will attempt to explain," or "This section lays the theoretical groundwork for the experiment they describe later."
-    2.  **Socratic Questioning (The 3-Question Rule):**
-        *   Your main task is to test and deepen the user's understanding through a series of exactly three questions about the current chunk.
-        *   **First Question:** Ask a single, open-ended question that probes the user's intuitive grasp of the chunk's most important concept. The question must align with the user's goal (e.g., "In simple terms, what did the researchers actually observe here?" or "Why was it necessary for them to design the experiment in this specific way?"). **Always ask only one question at a time.**
-        *   **If the user answers correctly:** Affirm their understanding (e.g., "Exactly," "That's a great way to put it") and immediately ask a *second, deeper question*. This question should build upon their correct answer, asking for more detail or to consider the implications.
-        *   **If the user answers the second question correctly:** Again, affirm their response and ask a *third, even more probing question*. This final question should challenge them to think about the "why" or the broader significance of the information.
-        *   **If the user answers incorrectly at any point:** Gently correct the misunderstanding. Provide a clear, intuitive explanation, always connecting it to the experimental observations and the "why." After your explanation, re-ask the question in a slightly different way to ensure they now understand, then continue the 3-question sequence.
-    3.  **Moving On:**
-        *   After the user has successfully answered all three questions, congratulate them on their solid understanding.
-        *   Conclude by explicitly giving them the choice to continue or stay. Say something like: "Excellent, it seems you have a very solid grasp of this part. Shall we move on to the next section, or is there anything here you'd like to explore further?"
-    **Important Behaviors:**
-    *   **Language:** The entire conversation must be in English, as indicated by the user's goal.
-    *   **Focus:** Always prioritize intuitive, conceptual, and experimental understanding over formal, mathematical details.
-    *   **Pacing:** The flow is dictated by the user's successful answers. Move from one question to the next smoothly.
-    *   **Structure:** Importantly, maintain a clear and logical flow in the conversation. Never loose track of the objective.
-    *   **Markdown:** Output markdown if you think it is useful. Break your response into reasonable sections.
-    Begin the conversation.
-    """
     anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
     if not anthropic_api_key:
@@ -326,200 +181,35 @@ async def chat_stream(request: ChatRequest):
     next_chunk = request.nextChunk or ""
     action = request.action
-    document = request.document or """
-    # Auswertung Versuch F44: Zeeman Effekt
-    Dominic Holst, Moritz Pfau
-    October 23, 2020
-    ## 1 Magnetfeld und Hysterese
-    Zu Beginn des Versuchs haben wir mit Hilfe des Teslameters die Magnetfeldstärke B an der Position der Cd-Lampe bei verschiedenen Spulenströmen gemessen (siehe Messwerte in Tabelle 1 im Laborbuch). In Figure 1 sind die gemessenen Feldstärken als Funktion der Stromstärke aufgetragen.
-    Anhand der Fehlerbalken und der praktisch identischen Überlagerung der beiden linearen Fitgeraden für auf- und absteigende Stromstärken, wird deutlich, dass **keine Hystereseeffekte vorliegen**. Der lineare Fit wurde hierbei nur auf die Stromstärken bis einschl. 10A angewandt, da für größere Stromstärken das Magnetfeld nicht in direktem proportionalen Zusammenhang ansteigt. Dies ist mit Sättigungseffekten der Magnetisierung des Eisenkerns der verwendeten Spule zu erklären.
-    *Figure 1: Messung des Magnetfelds als Funktion der Stromstärke*
-    ## 2 Qualitative Beobachtung des Zeeman Effekts
-    Mit Hilfe der CMOS Kamera wurde das Spektrum des emittierten Lichts der Cadmiumlampe unter Verwendung des Lummer Gehercke Interferometers beobachtet. Die Beobachtungen wurden in longitudinaler und transversaler Richtung zum Magnetfeld durchgeführt.
-    ### 2.1 Longitudinale Richtung:
-    **ohne Filter:**
-    Es sind deutlich zwei Linien pro Ordnung zu erkennen. Dies sind die σ+ und σ' Linien. Die π Linie ist in longitudinaler Richtung nicht zu beobachten
-    **mit λ/4-Plättchen und Polarisationsfilter:**
-    Von der Cadmiumlampe aus betrachtet wird zuerst ein λ/4-Plättchen und danach ein Polarisationsfilter in den Strahlengang gebracht. Je nach Ausrichtung der Filter zueinander wird nun eine der beiden Linien ausgeblendet.
-    **-45° Winkel:**
-    Stehen λ/4-Plättchen und Polarisationsfilter zueinander im −45° Winkel, wird das zirkular polarisierte Licht der σ¯ Linie um 45° verschoben linear polarisiert und somit vom Polarisationsfilter abgeschirmt. Folglich ist in dieser Konstellation nur die linke der beiden σ Linien zu beobachten.
-    **+45° Winkel:**
-    Stehen λ/4-Plättchen und Polarisationsfilter zueinander im +45° Winkel, ist nach analogem Prinzip wie zuvor nur die rechte Linie auf dem Kamerabild zu beobachten.
-    *Figure 2: Bilder der CMOS Kamera in longitudinaler Richtung mit a) λ/4-Plättchen und Polarisationsfilter im −45° Winkel, b) ohne Filter und c) Filter im +45° Winkel*
-    ### 2.2 Transversale Richtung:
-    **ohne Filter:**
-    Es sind deutlich drei Linien pro Ordnung zu erkennen. Dies sind die σ⁺, π und σ⁻ Linien.
-    **mit Polarisationsfilter horizontal (in B-Feld Richtung):**
-    Die beiden σ-Linien sind vollständig ausgeblendet. Die π- Linie ist deutlich sichtbar.
-    **mit Polarisationsfilter vertikal (90° zu B-Feld Richtung):**
-    Die beiden σ-Linien sind klar sichtbar. Die π-Linie ist ausgeblendet.
-    *Figure 3: Bilder der CMOS Kamera in vertikaler Richtung mit a) keinem Filter, b) Polarisationsfilter horizontal und c) Polarisationsfilter vertikal*
-    Wie in Figure 3 gut zu erkennen ist, sind die ausgeblendeten Linien in beiden Konfigurationen weiterhin leicht sichtbar. Dies ist auf das nicht perfekt homogene Magnetfeld am Ort der Ca-Lampe zurückzuführen. Das Licht ist also nicht perfekt zirkular bzw. in B-Feld Richtung polarisiert, weshalb ein vollständiges Ausblenden im Experiment nicht zu beobachten ist.
-    ## 3 Spektroskopie des Zeemaneffekts
-    ### 3.1 Bestimmen des Zeemanshifts
-    Die Messdaten bei verschiedene Stromstärken wurden jeweils in einem Plot dargestellt. Um für den Fit möglichst saubere Messkurven des Spektrums zu verwenden, wurde die Messreihe bei I = 8A nicht in die Datenauswertung einbezogen, da die Aufspaltung der Cadmiumlinie nur schwer zu beobachten war. Das gleich gilt für die 8. Interferenzodnung, die nicht berücksichtigt wurde. Für die Datenauswertung fließen also die Nullte bis 7. Ordnung jeweils bei 9 bis 13 Ampere ein.
-    Als Funktion um die Messdaten zu fitten wurde ein Pseudo-Voigt-Profil verwendet. Die drei Kurven einer Ordnung wurden hierbei gemeinsam mit der Summe dreier Pseudo-Voigt-Profile gefittet. In Figure 4 sind exemplarisch anhand der Daten für I = 12A die Messdaten und der abschnittsweise Fit zu erkennen.
-    *Figure 4: Messdaten und Voigt-Fit bei Spulenstrom I = 12A*
-    Anhand der Fitparameter wird die Position der σ und π Linien bestimmt. Die Fehler der Fitparameter sind extrem klein (≈ 0,1px) und eigenen sich nicht als realistische Fehler für unsere weitere Rechnung. Als minimalen Fehler nehmen wir daher die Auflösung der Kamera an (1px) und skalieren alle Fehler so, dass der kleineste Fehler exakt 1px beträgt. Die anderen Fehler sind dann entsprechend linear skaliert größer. Dies berücksichtigt die unterschiedliche Qualität der Fits auf unterschiedliche Interferenz-Ordnungen, bringt die Fehler aber in einen experimentell realistischen Bereich.
-    Für die Berechnung des Zeemanshifts müssen die Verzerrungseffekte der Lummer-Gehrcke-Platte beachtet werden. Hierfür wird die Position der π-Linien gegen der Interferenzordnung k der entsprechenden Linie aufgetragen. Der funktionelle Zusammenhang dieser beiden Größen wird durch eine quadratische Funktion k = f(a) approximiert:
-    k = f(a) = ba² + ca + d (1)
-    Wir verwenden hier eine Taylor-Näherung für eine in der Realität deutlich kompliziertere Funktion. Dies ist aber, wie in Figure 5 gut ersichtlich, für unsere Zwecke weitaus ausreichend.
-    Die beiden σ-Linien können auf den quadratischen Fit f(a) projiziert werden, wodurch wir die jeweilige (nicht mehr ganzzahligen) Ordnung der σ-Linien erhalten. In Figure 5 ist (wieder exemplarisch für I = 12A) die optische Verzerrung der Platte aufgetragen.
-    *Figure 5: Verzerrungseffekte der Lummer-Gehrcke-Platte bei I = 12A*
-    Die Differenz zur ganzzahligen Ordnung der zugehörigen π-Linie ergibt δk. Für eine (kleine) Wellenlängenverschiebung δλ gilt:
-    δλ = δk / Δk * λ² / (2d * sqrt(n² − 1)) (2)
-    Für den Abstand Δk zweier Ordnungen gilt Δk = 1. Für die Wellenlänge λ der betrachten Linie verwenden wir den in Part 2 bestimmten Wert von λ = (643, 842 ± 0, 007)nm.
-    Wir kennen nun die Wellenlänge des Zeemanshift für jede von uns betrachtete Linie. Mit dem Zusammenhang zwischen Wellenlänge und Energie E = hc/λ lässt sich nun die Energieverschiebung der Linine bestimmen. Wir nehmen an, dass die Wellenlängenverschiebung δλ klein gegenüber der absoluten Wellenlänge λ ist, und erhalten daher für die Energieverschiebung δE in guter Näherung:
-    δE = (hc/λ²) * δλ (3)
-    Abschließend nehmen wir den Durchschnitt aller Werte δE für eine Stromstärke I.
-    ### 3.2 Bestimmen des Bohrschen Magnetons μB
-    Für die Energieverschiebung beim Zeemaneffekt gilt:
-    δE = μB · ml · B (4)
-    Da es sich bei der betrachteten Cadmiumlinie um einen ¹D₂ → ¹P₁ Übergang handelt gilt hier ml = ±1. Somit folgt für das Bohrsche Magneton μB als Funktion des Spulenstroms I:
-    μB(I) = δE(I) / B(I) (5)
-    Die Magnetfeldstärke B(I) wurde hier anhand der Messwerte aus Teil 1 des Experiments bestimmt.
-    Wir erhalten für jeden Spulenstrom I einen experimentell bestimmten Wert des Bohrschen Magnetons μB. Unsere Ergebnisse sind in Figure 6 graphisch dargestellt.
-    *Figure 6: Experimentell bestimmte Werte für das Bohrsche Magneton bei unterschiedlichen Spulenströmen I*
-    Für den experimentellen Mittelwert erhalten wir:
-    μB,exp = (10, 1 ± 0.8) · 10⁻²⁴ J/T
-    Der Literaturwert beträgt:
-    μB,lit = 9, 27400949 · 10⁻²⁴ J/T
-    Unsere experimentell ermittelte Wert weicht also um 1,2 Sigma vom Literaturwert ab. Die Abweichung ist folglich nicht signifikant.
-    ### 3.3 Kritische Betrachtung der Ergebnisse
-    Erfreulicherweise scheint unsere experimentelle Methode keine signifikante Abweichung zwischen Literaturwert und experimentellem Wert des Bohrschen Magnetons zu ergeben. Wir befinden uns mit unserem Wert im niedrigen 2-Sigma-Intervall. Dennoch ist kritisch anzumerken, dass wir einen vergleichsweise großen realtiven Fehler auf unser Messergebnis von 7,1% erhalten. Das bedeutet, unsere Abweichung ist zwar nicht sigifikant, dennoch weicht unser experimenteller Wert um knapp 10% vom Literaturwert ab. Der verwendete experimentelle Aufbau ist folglich nur bedingt für eine exakte Bestimmung des Bohrschen Magnetons geeigent.
-    Die beiden dominierenden Fehlerquellen sind zum einen die Bestimmung des Magnetfeldes B am Ort der Cadmium Lampe (Inhomogenitäten, exakte Platzierung der Lampe) und zum anderen die Wahl der Fehler der Positionen der π- und σ -Linien im Spektrum.
-    Zum Vergleich: Legt man den Fehler prinzipiell für alle Linien auf 1px, also die maximale Auflösung der Kamera, fest und verzichtet auf eine Skalierung der Fehler, beträgt die Abweichung des exp. Werts zum Literaturwert schon 2,8 Sigma. Wählt man analog für den Fehler der Linien 2px, da beispielsweise ein Maximum auch exakt zwischen zwei Pixelreihen liegen kann, liegt die Abweichung bei 1,4 Sigma.
-    ## 4 Quantitative Betrachtung des Spektrums
-    ### 4.1 Wellenlänge rote Cd-Linie
-    *Figure 7: Neonspektrum*
-    Zunächst wird der Untergrund von den Messdaten abgezogen, um Störungen durch Rauschen oder Sondereffekte wie kosmische Strahlung oder Umgebungsquellen zu eliminieren. Sollten sich in den Spektren negative Werte befinden, ist dies auf zufällige Unterschiede im Rauschen zurückzuführen. Anhand bekannter Linien des Neonspektrums werden den Pixeln nun Wellenlängen zugeordnet. Hierfür wurde der Bereich des Neonspektrums aufgenommen, in dem sich auch die rote Linie des Cadmiumspektrums befindet. In 7 sieht man das Neonspektrum und die Peaks, an die jeweils ein Voigt-Profil gelegt wurde. Jetzt kann man den identifizierten Linien ihre jeweilige Wellenlänge zuordnen und einen polynomiellen Zusammenhang finden. Wir haben uns für eine Gerade entschieden, die wie in Figure 8 zu sehen gut zu den Daten passt.
-    Schließlich wird ein Voigt-Profil an die gemessene rote Cd-Linie gelegt, wie in Figure 9 gezeigt. Umrechnung anhand der Kalibrierung führt auf einen Wert von λcd = (643,842 ± 0,007)nm. Dies befindet sich im 1σ-Bereich des Literaturwertes von λlit = 643, 84695nm. Der Fehler ist Ergebnis der Gauß'schen Fehlerfortpflanzung.
-    *Figure 8: Kalibrationsgerade*
-    ### 4.2 Kritische Betrachtung der Ergebnisse
-    Messwert und theoretische Vorhersage für die bestimmte Linie stimmen innerhalb statistischer Schwankungen überein. Dies ist umso interessanter, wenn man die Unsicherheit des Messergebnisses betrachtet, die kleiner als 0,002% ist. Der absolute Fehler ist, wenn man die Steigung der Kalibrationsgeraden betrachtet, kleiner als 1px. Er besteht ausschließlich aus Abweichungen der numerischen Fits. Berücksichtigt man Ungenauigkeiten des CMOS Sensors oder die Möglichkeit, dass je nach Lage des Messwerts auch eine Abweichung um weniger als 1px eine größere Messwertschwankung verursachen kann, da die Pixel nur diskrete Werte messen können, liegt eine nachträgliche Anpassung nahe. Skaliert man die Unsicherheit auf 1px, liegt der Fehler des Messwerts bei 0,012nm. Damit ist der relative Fehler weiterhin kleiner 0,005%.
-    Zur hohen Genauigkeit trägt vor allem das gute Messverfahren bei. Spektrometer und Datenaufnahme per Computer lassen wenig Raum für Abweichungen. Wie die Daten zeigen, haben wir dabei eine Quelle für einen möglichen großen systematischen Fehler umgangen: Die Kamera wurde auf das Spektrometer nur locker aufgesteckt. Hätte sich deren Position zwischen Neon- und Cadmiummmessung z.B. durch Erschütterung des Labortisches verändert, hätte die Energiekalibrierung nicht mehr zur Messung der Cadmiumlinie gepasst.
-    Abbildung 6 zeigt unerwartetes Verhalten. Obwohl der Magnet ausgeschaltet war, sind drei Maxima zu sehen, deren Flanken sehr steil abfallen. Vergleicht man mit den Messungen im Magnetfeld, ähneln sich die Strukturen. Möglich ist, dass die Eisenkernspule, in der sich die Lampe während der Messung befand eine Restmagnetisierung aufwies, die eine Aufspaltung herbeigeführt hat.
-    *Figure 9: Cadmium rote Linie*
-    """
     # Create system prompt for research paper tutor with transition support
     is_transition = action in ['skip', 'understood']
-    if is_transition:
-        system_prompt = f"""
-        You are PaperMentor, an expert academic tutor guiding the user through a continuous learning journey of an academic paper.
-        The user has just {action} the previous section and is transitioning to a new topic. This is part of a continuous conversation where you maintain context and adapt based on the user's actions.
-        User's Action: {action}
-        Previous Section:
-        {current_chunk}
-        New Section to Introduce:
-        {next_chunk}
-        Full Document for Context:
-        {document}
-        Your response should:
-        1. **Acknowledge the transition**: Briefly reference their choice to {action} the previous section
-        2. **Provide smooth continuity**: Connect the previous section to this new one naturally
-        3. **Introduce the new section**: Present the new topic with enthusiasm and context
-        4. **Adapt your approach**: If they skipped, perhaps adjust to be more engaging. If they understood, acknowledge their progress
-        5. **Begin new exploration**: Start the 3-question sequence for this new section
-        Maintain the same conversational style and focus on phenomenological understanding.
-        """
     else:
-        system_prompt = f"""
-        You are PaperMentor, an expert academic tutor. Your purpose is to guide a user to a deep, phenomenological understanding of an academic paper.
-        The user's primary goal is to: "phänomenologisch verstehen, was passiert, was beobachtet wurde und warum das so ist, mit wenig Fokus auf Formeln, sondern Fokus auf intuitivem Verständnis und dem experimentellen Ansatz." (phenomenologically understand what is happening, what was observed, and why, with little focus on formulas but a strong focus on intuitive understanding and the experimental approach).
-        Your entire interaction must be guided by this goal.
-        You will be given a specific chunk of the paper to discuss, as well as the full document for context.
-        ---
-        Current Chunk:
-        {current_chunk}
-        ---
-        Full Document for Context:
-        {document}
-        ---
-        Your interaction must follow this specific conversational flow:
-    1.  **Greeting and Contextualization:**
-        *   Begin with a friendly greeting.
-        *   First, briefly explain what this chunk is about in simple terms.
-        *   Then, place this chunk within the larger context of the paper. Explain its purpose in the overall argument. For instance: "Here, the authors are presenting the core observation that the rest of the paper will attempt to explain," or "This section lays the theoretical groundwork for the experiment they describe later."
-    2.  **Socratic Questioning (The 3-Question Rule):**
-        *   Your main task is to test and deepen the user's understanding through a series of exactly three questions about the current chunk.
-        *   **First Question:** Ask a single, open-ended question that probes the user's intuitive grasp of the chunk's most important concept. The question must align with the user's goal (e.g., "In simple terms, what did the researchers actually observe here?" or "Why was it necessary for them to design the experiment in this specific way?"). **Always ask only one question at a time.**
-        *   **If the user answers correctly:** Affirm their understanding (e.g., "Exactly," "That's a great way to put it") and immediately ask a *second, deeper question*. This question should build upon their correct answer, asking for more detail or to consider the implications.
-        *   **If the user answers the second question correctly:** Again, affirm their response and ask a *third, even more probing question*. This final question should challenge them to think about the "why" or the broader significance of the information.
-        *   **If the user answers incorrectly at any point:** Gently correct the misunderstanding. Provide a clear, intuitive explanation, always connecting it to the experimental observations and the "why." After your explanation, re-ask the question in a slightly different way to ensure they now understand, then continue the 3-question sequence.
-    3.  **Moving On:**
-        *   After the user has successfully answered all three questions, congratulate them on their solid understanding.
-        *   Conclude by explicitly giving them the choice to continue or stay. Say something like: "Excellent, it seems you have a very solid grasp of this part. Shall we move on to the next section, or is there anything here you'd like to explore further?"
-    **Important Behaviors:**
-    *   **Language:** The entire conversation must be in English, as indicated by the user's goal.
-    *   **Focus:** Always prioritize intuitive, conceptual, and experimental understanding over formal, mathematical details.
-    *   **Pacing:** The flow is dictated by the user's successful answers. Move from one question to the next smoothly.
-    *   **Structure:** Importantly, maintain a clear and logical flow in the conversation. Never loose track of the objective.
-    *   **Markdown:** Output markdown if you think it is useful. Break your response into reasonable sections.
-    Begin the conversation.
-    """
     anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
     if not anthropic_api_key:
@@ -561,6 +251,10 @@ async def chat_stream(request: ChatRequest):
                 model="claude-sonnet-4-20250514",
                 max_tokens=10000,
                 system=system_prompt,  # system prompt here
                 messages=anthropic_messages,
             ) as stream:
                 for text in stream.text_stream:

 # Load environment variables
 load_dotenv()
+# Load prompts from files
+def load_prompt(filename):
+    """Load prompt from text file in prompts directory"""
+    try:
+        prompts_dir = os.path.join(os.path.dirname(__file__), "prompts")
+        file_path = os.path.join(prompts_dir, filename)
+        with open(file_path, 'r', encoding='utf-8') as f:
+            return f.read().strip()
+    except FileNotFoundError:
+        print(f"Warning: Prompt file {filename} not found. Using fallback.")
+        return ""
+def load_document(filename):
+    """Load document from text file in documents directory"""
+    try:
+        documents_dir = os.path.join(os.path.dirname(__file__), "documents")
+        file_path = os.path.join(documents_dir, filename)
+        with open(file_path, 'r', encoding='utf-8') as f:
+            return f.read().strip()
+    except FileNotFoundError:
+        print(f"Warning: Document file {filename} not found. Using fallback.")
+        return ""
+# Load prompts at startup
+SYSTEM_PROMPT_TEMPLATE = load_prompt("system_prompt.txt")
+TRANSITION_PROMPT_TEMPLATE = load_prompt("transition_prompt.txt")
+DOCUMENT = load_document("lennart.txt")
 app = FastAPI()
 # Enable CORS
     nextChunk: Optional[str] = None
     action: Optional[str] = None  # 'skip', 'understood', or None
     document: Optional[str] = None
+    user_goal: Optional[str] = None # User's goal for the chat, if applicable
 @app.post("/api/chat")
 async def chat_endpoint(request: ChatRequest):
     current_chunk = request.currentChunk or request.chunk or "No specific chunk provided"
     next_chunk = request.nextChunk or ""
     action = request.action
+    user_goal = request.user_goal or "Understanding GRPO (equation 3) and why does this make sense in contrast to PPO?"
+    # Only include full document on first message or transitions to provide initial context
+    include_document = len(request.messages) <= 1 or action in ['skip', 'understood']
+    document = DOCUMENT if include_document else ""
     # Create system prompt for research paper tutor with transition support
     is_transition = action in ['skip', 'understood']
     if is_transition:
+        system_prompt = TRANSITION_PROMPT_TEMPLATE.format(
+            action=action,
+            current_chunk=current_chunk,
+            next_chunk=next_chunk,
+            document=document,
+        )
     else:
+        system_prompt = SYSTEM_PROMPT_TEMPLATE.format(
+            current_chunk=current_chunk,
+            document=document,
+            user_goal=user_goal or "No specific goal provided"
+        )
     anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
     if not anthropic_api_key:
     next_chunk = request.nextChunk or ""
     action = request.action
+    user_goal = request.user_goal or "Understanding GRPO (equation 3) and why does this make sense in contrast to PPO?"
+    # Only include full document on first message or transitions to provide initial context
+    # After that, the conversation history maintains context
+    include_document = len(request.messages) <= 1 or action in ['skip', 'understood']
+    document = DOCUMENT if include_document else ""
     # Create system prompt for research paper tutor with transition support
     is_transition = action in ['skip', 'understood']
+    print("🤖 Creating system prompt...")
+    print(f"include_document: {include_document} (messages: {len(request.messages)}, action: {action})")
+    print(f"current_chunk: {current_chunk[:100] if current_chunk else 'None'}")
+    print(f"next_chunk: {next_chunk[:100] if next_chunk else 'None'}")
+    if is_transition:
+        system_prompt = TRANSITION_PROMPT_TEMPLATE.format(
+            action=action,
+            current_chunk=current_chunk,
+            next_chunk=next_chunk,
+            document=document
+        )
+        print(f"Transition system prompt: {system_prompt[:200]}...")
     else:
+        system_prompt = SYSTEM_PROMPT_TEMPLATE.format(
+            current_chunk=current_chunk,
+            document=document,
+            user_goal=user_goal or "No specific goal provided"
+        )
+        print(f"System prompt: {system_prompt[:200]}...")
     anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
     if not anthropic_api_key:
                 model="claude-sonnet-4-20250514",
                 max_tokens=10000,
                 system=system_prompt,  # system prompt here
+                thinking={
+                    "type": "enabled",
+                    "budget_tokens": 1024,
+                },
                 messages=anthropic_messages,
             ) as stream:
                 for text in stream.text_stream:

backend/documents/lennart.txt ADDED Viewed

	@@ -0,0 +1,497 @@

+      ![](_page_0_Picture_0.jpeg)
+# **DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models**
+Zhihong Shao<sup>1,2\*†</sup>, Peiyi Wang<sup>1,3\*†</sup>, Qihao Zhu<sup>1,3\*†</sup>, Runxin Xu<sup>1</sup>, Junxiao Song<sup>1</sup> Xiao Bi<sup>1</sup>, Haowei Zhang<sup>1</sup>, Mingchuan Zhang<sup>1</sup>, Y.K. Li<sup>1</sup>, Y. Wu<sup>1</sup>, Daya Guo<sup>1</sup>
+<sup>1</sup>DeepSeek-AI, <sup>2</sup>Tsinghua University, <sup>3</sup>Peking University
+{zhihongshao,wangpeiyi,zhuqh,guoday}@deepseek.com https://github.com/deepseek-ai/DeepSeek-Math
+## Abstract
+Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
+![](_page_0_Figure_7.jpeg)
+Figure 1 | Top1 accuracy of open-source models on the competition-level MATH benchmark (Hendrycks et al., 2021) without the use of external toolkits and voting techniques.
+Core contributors.
+<sup>&</sup>lt;sup>†</sup> Work done during internship at DeepSeek-AI.
+#### 1. Introduction
+Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021) and the geometry reasoning benchmark (Trinh et al., 2024). Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023). However, cutting-edge models such as GPT-4 (OpenAI, 2023) and Gemini-Ultra (Anil et al., 2023) are not publicly available, and the currently accessible open-source models considerably trail behind in performance.
+In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeek-Math Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin et al., 2016). In the initial iteration, the classifier is trained using instances from OpenWebMath (Paster et al., 2023) as positive examples, while incorporating a diverse selection of other web pages to serve as negative examples. Subsequently, we employ the classifier to mine additional positive instances from the CC, which are further refined through human annotation. The classifier is then updated with this enhanced dataset to improve its performance. The evaluation results indicate that the large-scale corpus is of high quality, as our base model DeepSeekMath-Base 7B achieves 64.2% on GSM8K (Cobbe et al., 2021) and 36.2% on the competition-level MATH dataset (Hendrycks et al., 2021), outperforming Minerva 540B (Lewkowycz et al., 2022a). In addition, the DeepSeekMath Corpus is multilingual, so we notice an improvement in Chinese mathematical benchmarks (Wei et al., 2023; Zhong et al., 2023). We believe that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future.
+DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020) and BBH benchmarks (Suzgun et al., 2022), indicating it does not only enhance the model's mathematical abilities but also amplifies general reasoning capabilities.
+After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022), program-of-thought (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning (Gou et al., 2023) data. The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models.
+Furthermore, we introduce the Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9%  $\rightarrow$  88.2%, MATH: 46.8%  $\rightarrow$  51.7%) and out-of-domain mathematical tasks (e.g., CMATH:  $84.6\% \rightarrow 88.8\%$ ) during the reinforcement learning phase. We also provide a unified paradigm to understand different methods, such as Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a), Direct Preference Optimization (DPO) (Rafailov et al., 2023), PPO and GRPO. Based on such a unified paradigm, we find that all these methods are conceptualized as either direct or simplified RL techniques. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative RL and so on,
+to deeply investigate the essential elements of this paradigm. At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm.
+## 1.1. Contributions
+Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning.
+## **Math Pre-Training at Scale**
+- Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023).
+- Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B (Lewkowycz et al., 2022a), indicating the number of parameters is not the only key factor in mathematical reasoning capability. A smaller model pre-trained on high-quality data could achieve strong performance as well.
+- We share our findings from math training experiments. Code training prior to math training improves models' ability to solve mathematical problems both with and without tool use. This offers a partial answer to the long-standing question: does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.
+- Although training on arXiv papers is common, especially in many math-related papers, it brings no notable improvements on all mathematical benchmarks adopted in this paper.
+### **Exploration and Analysis of Reinforcement Learning**
+- We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO).
+- We demonstrate that GRPO significantly enhances the performance of our instructiontuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process.
+- We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm.
+- Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.
+### **1.2. Summary of Evaluations and Metrics**
+ English and Chinese Mathematical Reasoning: We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), MMLU-STEM (Hendrycks et al., 2020). Chinese benchmarks include MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023), and Gaokao-MathQA (Zhong et al., 2023). We evaluate models' ability to generate self-contained text solutions without tool use, and also the ability to solve problems using Python.
+On English benchmarks, DeepSeekMath-Base is competitive with the closed-source Minerva 540B (Lewkowycz et al., 2022a), and surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023) and Llemma-34B (Azerbayev et al., 2023)), regardless of whether they've undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don't follow previous works (Azerbayev et al., 2023; Lewkowycz et al., 2022a) to collect English-only math pre-training data, and also include high-quality non-English ones. With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community.
+- Formal Mathematics: We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022) on miniF2F (Zheng et al., 2021) with Isabelle (Wenzel et al., 2008) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance.
+- Natural Language Understanding, Reasoning, and Code: To build a comprehensive profile of models' general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022) which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance.
+## 2. Math Pre-Training
+## 2.1. Data Collection and Decontamination
+In this section, we will outline the process of constructing the DeepSeekMath Corpus from Common Crawl. As depicted in Figure 2, we present an iterative pipeline that demonstrates how to systematically gather a large-scale mathematical corpus from Common Crawl, starting with a seed corpus (e.g., a small but high-quality collection of math-related dataset). It's worth noting that this approach is also applicable to other domains, such as coding.
+First, we choose OpenWebMath (Paster et al., 2023), a collection of high-quality mathematical web texts, as our initial seed corpus. Using this corpus, we train a fastText model (Joulin et al., 2016) to recall more OpenWebMath-like mathematical web pages. Specifically, we randomly select 500,000 data points from the seed corpus as positive training examples and another  $500,000$  web pages from Common Crawl as negative ones. We employ an open-source library<sup>1</sup> for training, configuring the vector dimension to 256, learning rate to 0.1, the maximum length
+ $^{1}$ https://fasttext.cc
+![](_page_4_Figure_0.jpeg)
+Figure 2 | An iterative pipeline that collects mathematical web pages from Common Crawl.
+of word n-gram to 3, the minimum number of word occurrences to 3, and the number of training epochs to 3. To reduce the size of the original Common Crawl, we employ URL-based deduplication and near-deduplication techniques, resulting in 40B HTML web pages. We then recall mathematical web pages from deduplicated Common Crawl with the fastText model. To filter out low-quality mathematical content, we rank the collected pages according to their scores predicted by the fastText model, and only preserve the top-ranking ones. The volume of data preserved is assessed through pre-training experiments on the top 40B, 80B, 120B, and 160B tokens. In the first iteration, we choose to keep the top 40B tokens.
+After the first iteration of data collection, numerous mathematical web pages remain uncollected, mainly because the fastText model is trained on a set of positive examples that lacks sufficient diversity. We therefore identify additional mathematical web sources to enrich the seed corpus, so that we can optimize the fastText model. Specifically, we first organize the entire Common Crawl into disjoint domains; a domain is defined as web pages sharing the same base URL. For each domain, we calculate the percentage of web pages that are collected in the first iteration. Domains where over 10% of the web pages have been collected are classified as math-related (e.g., mathoverflow.net). Subsequently, we manually annotate the URLs associated with mathematical content within these identified domains (e.g., mathoverflow.net/questions). Web pages linked to these URLs, yet uncollected, will be added to the seed corpus. This approach enables us to gather more positive examples, thereby training an improved fastText model capable of recalling more mathematical data in the subsequent iteration. After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens. In the fourth iteration, we notice that nearly 98% of the data has already been collected in the third iteration, so we decide to cease data collection.
+To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
+#### 2.2. Validating the Quality of the DeepSeekMath Corpus
+We run pre-training experiments to investigate how the DeepSeekMath Corpus is compared with the recently released math-training corpora:
+- **MathPile** (Wang et al., 2023c): a multi-source corpus (8.9B tokens) aggregated from textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, and arXiv, with the majority (over 85%) sourced from arXiv;
+- **OpenWebMath** (Paster et al., 2023): CommonCrawl data filtered for mathematical content, totaling 13.6B tokens;
+- **Proof-Pile-2** (Azerbayev et al., 2023): a mathematical corpus consisting of OpenWeb-Math, AlgebraicStack (10.3B tokens of mathematical code), and arXiv papers (28.0B tokens). When experimenting on Proof-Pile-2, we follow Azerbayev et al. (2023) to use an arXiv:Web:Code ratio of 2:4:1.
+#### 2.2.1. Training Setting
+We apply math training to a general pre-trained language model with 1.3B parameters, which shares the same framework as the DeepSeek LLMs (DeepSeek-AI, 2024), denoted as DeepSeek-LLM 1.3B. We separately train a model on each mathematical corpus for 150B tokens. All experiments are conducted using the efficient and light-weight HAI-LLM (High-flyer, 2023) training framework. Following the training practice of DeepSeek LLMs, we use the AdamW optimizer (Loshchilov and Hutter, 2017) with  $\beta_1 = 0.9$ ,  $\beta_2 = 0.95$ , and weight\_decay = 0.1, along with a multi-step learning rate schedule where the learning rate reaches the peak after 2,000 warmup steps, decreases to its 31.6% after 80% of the training process, and further decreases to 10.0% of the peak after 90% of the training process. We set the maximum value of learning rate to 5.3e-4, and use a batch size of 4M tokens with a 4K context length.
+| Math Corpus         | Size   | English Benchmarks |       |      |       | Chinese Benchmarks |       |                  |               |
+|---------------------|--------|--------------------|-------|------|-------|--------------------|-------|------------------|---------------|
+|                     |        | GSM8K              | MATH  | OCW  | SAT   | MMLU STEM          | CMATH | Gaokao MathCloze | Gaokao MathQA |
+| No Math Training    | N/A    | 2.9%               | 3.0%  | 2.9% | 15.6% | 19.5%              | 12.3% | 0.8%             | 17.9%         |
+| MathPile            | 8.9B   | 2.7%               | 3.3%  | 2.2% | 12.5% | 15.7%              | 1.2%  | 0.0%             | 2.8%          |
+| OpenWebMath         | 13.6B  | 11.5%              | 8.9%  | 3.7% | 31.3% | 29.6%              | 16.8% | 0.0%             | 14.2%         |
+| Proof-Pile-2        | 51.9B  | 14.3%              | 11.2% | 3.7% | 43.8% | 29.2%              | 19.9% | 5.1%             | 11.7%         |
+| DeepSeekMath Corpus | 120.2B | 23.8%              | 13.6% | 4.8% | 56.3% | 33.1%              | 41.5% | 5.9%             | 23.6%         |
+Table 1 | Performance of DeepSeek-LLM 1.3B trained on different mathematical corpora, evaluated using few-shot chain-of-thought prompting. Corpus sizes are calculated using our tokenizer with a vocabulary size of 100K.
+#### 2.2.2. Evaluation Results
+The DeepSeekMath Corpus is of high quality, covers multilingual mathematical content, and is the largest in size.
+• **High-quality**: We evaluate downstream performance on 8 mathematical benchmarks using few-shot chain-of-thought prompting Wei et al. (2022). As shown in Table 1, there is a clear performance lead of the model trained on the DeepSeekMath Corpus. Figure 3 shows that the model trained on the DeepSeekMath Corpus demonstrates better performance than
+![](_page_6_Figure_0.jpeg)
+Figure 3 | Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.
+Proof-Pile-2 at 50B tokens (1 full epoch of Proof-Pile-2), indicating the average quality of DeepSeekMath Corpus is higher.
+- **Multilingual**: The DeepSeekMath Corpus encompasses data in multiple languages, predominantly featuring English and Chinese as the two most represented languages. As shown in Table 1, training on the DeepSeekMath Corpus enhances mathematical reasoning performance in both English and Chinese. In contrast, existing mathematical corpora, which are primarily English-centric, show limited improvement and may even hinder performance in Chinese mathematical reasoning.
+- **Large-scale**: The DeepSeekMath Corpus is several times larger than existing mathematical corpora. As depicted in Figure 3, DeepSeek-LLM 1.3B, when trained on the DeepSeek-Math Corpus, shows a steeper learning curve along with more lasting improvements. In contrast, the baseline corpora are much smaller, and have already been repeated multiple rounds during training, with the resulting model performance quickly reaching a plateau.
+#### 2.3. Training and Evaluating DeepSeekMath-Base 7B
+In this section, we introduce DeepSeekMath-Base 7B, a base model with strong reasoning abilities, especially in mathematics. Our model is initialized with DeepSeek-Coder-Base-v1.5 7B
+(Guo et al., 2024) and trained for 500B tokens. The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from arXiv, 20% is Github code, and the remaining 10% is natural language data from Common Crawl in both English and Chinese. We mainly adopt the training setting specified in Section 2.2.1, except that we set the maximum value of the learning rate to 4.2e-4 and use a batch size of 10M tokens.
+We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMath-Base 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve mathematical problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance of natural language understanding, reasoning, and programming skills.
+**Mathematical Problem Solving with Step-by-Step Reasoning** We evaluate DeepSeekMath-Base's performance of solving mathematical problems using few-shot chain-of-thought prompting (Wei et al., 2022), across eight benchmarks in English and Chinese. These benchmarks encompass quantitative reasoning (e.g., GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMATH (Wei et al., 2023)) and multiple-choice problems (e.g., MMLU-STEM (Hendrycks et al., 2020) and Gaokao-MathQA (Zhong et al., 2023)), covering diverse fields of mathematics from elementary to college-level complexity.
+As shown in Table 2, DeepSeekMath-Base 7B leads in performance across all eight benchmarks among the open-source base models (including the widely-used general model Mistral 7B (Jiang et al., 2023) and the recently released Llemma 34B (Azerbayev et al., 2023) which underwent math training on Proof-Pile-2 (Azerbayev et al., 2023)). Notably, on the competitionlevel MATH dataset, DeepSeekMath-Base surpasses existing open-source base models by over 10% absolute, and outperforms Minerva 540B (Lewkowycz et al., 2022a), a closed-source base model 77 times larger which builds on PaLM (Lewkowycz et al., 2022b) and is further trained on mathematical texts.
+| Model                    | Size | English Benchmarks |       |       |       | Chinese Benchmarks |       |                  |               |
+|--------------------------|------|--------------------|-------|-------|-------|--------------------|-------|------------------|---------------|
+|                          |      | GSM8K              | MATH  | OCW   | SAT   | MMLU STEM          | CMATH | Gaokao MathCloze | Gaokao MathQA |
+| Closed-Source Base Model |      |                    |       |       |       |                    |       |                  |               |
+| Minerva                  | 7B   | 16.2%              | 14.1% | 7.7%  | -     | 35.6%              | -     | -                | -             |
+| Minerva                  | 62B  | 52.4%              | 27.6% | 12.0% | -     | 53.9%              | -     | -                | -             |
+| Minerva                  | 540B | 58.8%              | 33.6% | 17.6% | -     | 63.9%              | -     | -                | -             |
+| Open-Source Base Model   |      |                    |       |       |       |                    |       |                  |               |
+| Mistral                  | 7B   | 40.3%              | 14.3% | 9.2%  | 71.9% | 51.1%              | 44.9% | 5.1%             | 23.4%         |
+| Llemma                   | 7B   | 37.4%              | 18.1% | 6.3%  | 59.4% | 43.1%              | 43.4% | 11.9%            | 23.6%         |
+| Llemma                   | 34B  | 54.0%              | 25.3% | 10.3% | 71.9% | 52.9%              | 56.1% | 11.9%            | 26.2%         |
+| DeepSeekMath-Base        | 7B   | 64.2%              | 36.2% | 15.4% | 84.4% | 56.5%              | 71.7% | 20.3%            | 35.3%         |
+Table 2 | Comparisons between DeepSeekMath-Base 7B and strong base models on English and Chinese mathematical benchmarks. Models are evaluated with chain-of-thought prompting. Minerva results are quoted from Lewkowycz et al. (2022a).
+**Mathematical Problem Solving with Tool Use** We evaluate program-aided mathematical reasoning on GSM8K and MATH using few-shot program-of-thought prompting (Chen et al., 2022; Gao et al., 2023). Models are prompted to solve each problem by writing a Python program where libraries such as *math* and *sympy* can be utilized for intricate computations. The execution result of the program is evaluated as the answer. As shown in Table 3, DeepSeekMath-Base 7B outperforms the prior state-of-the-art Llemma 34B.
+| Model             | Size | Problem Solving w/ Tools |             | Informal-to-Formal Proving |              |
+|-------------------|------|--------------------------|-------------|----------------------------|--------------|
+|                   |      | GSM8K+Python             | MATH+Python | miniF2F-valid              | miniF2F-test |
+| Mistral           | 7B   | 48.5%                    | 18.2%       | 18.9%                      | 18.0%        |
+| CodeLlama         | 7B   | 27.1%                    | 17.2%       | 16.3%                      | 17.6%        |
+| CodeLlama         | 34B  | 52.7%                    | 23.5%       | 18.5%                      | 18.0%        |
+| Llemma            | 7B   | 41.0%                    | 18.6%       | 20.6%                      | 22.1%        |
+| Llemma            | 34B  | 64.6%                    | 26.3%       | 21.0%                      | 21.3%        |
+| DeepSeekMath-Base | 7B   | 66.9%                    | 31.4%       | 25.8%                      | 24.6%        |
+Table 3 | Few-shot evaluation of base models' ability to solve mathematical problems using tools and the ability to conduct informal-to-formal theorem proving in Isabelle.
+**Formal Mathematics** Formal proof automation is beneficial to ensure the accuracy and reliability of mathematical proofs and enhance efficiency, with increasing attention in recent years. We evaluate DeepSeekMath-Base 7B on the task of informal-to-formal proving from (Jiang et al., 2022) which is to generate a formal proof based on an informal statement, a formal counterpart of the statement, and an informal proof. We evaluate on miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics, and generate a formal proof in Isabelle for each problem with few-shot prompting. Following Jiang et al. (2022), we leverage models to generate proof sketches, and execute the off-the-shelf automated prover Sledgehammer (Paulson, 2010) to fill in the missing details. As shown in Table 3, DeepSeekMath-Base 7B demonstrates strong performance in proof autoformalization.
+| Model                     | Size | MMLU  | BBH   | HumanEval (Pass@1) | MBPP (Pass@1) |
+|---------------------------|------|-------|-------|--------------------|---------------|
+| Mistral                   | 7B   | 62.4% | 55.7% | 28.0%              | 41.4%         |
+| DeepSeek-Coder-Base-v1.5† | 7B   | 42.9% | 42.9% | 40.2%              | 52.6%         |
+| DeepSeek-Coder-Base-v1.5  | 7B   | 49.1% | 55.2% | 43.2%              | 60.4%         |
+| DeepSeekMath-Base         | 7B   | 54.9% | 59.5% | 40.9%              | 52.6%         |
+Table 4 | Evaluation on natural language understanding, reasoning, and code benchmarks. DeepSeek-Coder-Base-v1.5<sup> $\dagger$ </sup> is the checkpoint right before learning rate decay, which is used to train DeepSeekMath-Base. On MMLU and BBH, we use few-shot chain-of-thought prompting. On HumanEval and MBPP, we evaluate model performance under the zero-shot setting and a few-shot setting, respectively.
+Natural Language Understanding, Reasoning, and Code We evaluate model performance of natural language understanding on MMLU (Hendrycks et al., 2020), reasoning on BBH (Suzgun et al., 2022), and coding capabilities on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). As shown in Table 4, DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH over its precursor, DeepSeek-Coder-Base-v1.5 (Guo et al., 2024), illustrating the positive impact of math training on language understanding and reasoning. Additionally, by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks. Overall, DeepSeekMath-Base 7B significantly outperforms the general model Mistral 7B (Jiang et al., 2023) on the three reasoning and coding benchmarks.
+## 3. Supervised Fine-Tuning
+## 3.1. SFT Data Curation
+We construct a mathematical instruction-tuning dataset covering English and Chinese problems from different mathematical fields and of varying complexity levels: problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023). The total number of training examples is 776K.
+- **English mathematical datasets**: We annotate GSM8K and MATH problems with toolintegrated solutions, and adopt a subset of MathInstruct (Yue et al., 2023) along with the training set of Lila-OOD (Mishra et al., 2022) where problems are solved with CoT or PoT. Our English collection covers diverse fields of mathematics, e.g., algebra, probability, number theory, calculus, and geometry.
+- **Chinese mathematical datasets**: We collect Chinese K-12 mathematical problems spanning 76 sub-topics such as linear equations, with solutions annotated in both CoT and toolintegrated reasoning format.
+## 3.2. Training and Evaluating DeepSeekMath-Instruct 7B
+In this section, we introduce DeepSeekMath-Instruct 7B which undergoes mathematical instruction tuning based on DeepSeekMath-Base. Training examples are randomly concatenated until reaching a maximum context length of 4K tokens. We train the model for 500 steps with a batch size of 256 and a constant learning rate of 5e-5.
+We evaluate models' mathematical performance both without and with tool use, on 4 quantitative reasoning benchmarks in English and Chinese. We benchmark our model against the leading models of the time:
+- Closed-source models include: (1) the GPT family among which GPT-4 (OpenAI, 2023) and GPT-4 Code Interpreter <sup>2</sup> are the most capable ones, (2) Gemini Ultra and Pro (Anil et al., 2023), (3) Inflection-2 (Inflection AI, 2023), (4) Grok- $1^3$ , as well as models recently released by Chinese companies including (5) Baichuan-3 $^{4}$ , (6) the latest GLM-4 $^{5}$  from the GLM family (Du et al., 2022). These models are for general purposes, most of which have undergone a series of alignment procedures.
+- Open-source models include: general models like (1) DeepSeek-LLM-Chat 67B (DeepSeek-AI, 2024), (2) Qwen 72B (Bai et al., 2023), (3) SeaLLM-v2 7B (Nguyen et al., 2023), and (4)
+<sup>&</sup>lt;sup>2</sup>https://openai.com/blog/chatgpt-plugins#code-interpreter
+ $^{3}$ https://x.ai/model-card
+<sup>&</sup>lt;sup>4</sup>https://www.baichuan-ai.com
+ $<sup>5</sup>$ https://open.bigmodel.cn/dev/api#glm-4</sup>
+ChatGLM3 6B (ChatGLM3 Team, 2023), as well as models with enhancements in mathematics including (5) InternLM2-Math 20B<sup>6</sup> which builds on InternLM2 and underwent math training followed by instruction tuning, (6) Math-Shepherd-Mistral 7B which applys PPO training (Schulman et al., 2017) to Mistral 7B (Jiang et al., 2023) with a process-supervised reward model, (7) the WizardMath series (Luo et al., 2023) which improves mathematical reasoning in Mistral 7B and Llama-2 70B (Touvron et al., 2023) using evolve-instruct (i.e., a version of instruction tuning that uses AI-evolved instructions) and PPO training with training problems primarily sourced from GSM8K and MATH, (8) MetaMath 70B (Yu et al., 2023) which is Llama-2 70B fine-tuned on an augmented version of GSM8K and MATH, (9) ToRA 34B Gou et al. (2023) which is CodeLlama 34B fine-tuned to do tool-integrated mathematical reasoning, (10) MAmmoTH 70B (Yue et al., 2023) which is Llama-2 70B instruction-tuned on MathInstruct.
+As shown in Table 5, under the evaluation setting where tool use is disallowed, DeepSeekMath-Instruct 7B demonstrates strong performance of step-by-step reasoning. Notably, on the competition-level MATH dataset, our model surpasses all open-source models and the majority of proprietary models (e.g., Inflection-2 and Gemini Pro) by at least 9% absolute. This is true even for models that are substantially larger (e.g., Qwen 72B) or have been specifically enhanced through math-focused reinforcement learning (e.g., WizardMath-v1.1 7B). While DeepSeekMath-Instruct rivals the Chinese proprietary models GLM-4 and Baichuan-3 on MATH, it still underperforms GPT-4 and Gemini Ultra.
+Under the evaluation setting where models are allowed to integrate natural language reasoning and program-based tool use for problem solving, DeepSeekMath-Instruct 7B approaches an accuracy of 60% on MATH, surpassing all existing open-source models. On the other benchmarks, our model is competitive with DeepSeek-LLM-Chat 67B, the prior state-of-the-art that is 10 times larger.
+## 4. Reinforcement Learning
+## 4.1. Group Relative Policy Optimization
+Reinforcement learning (RL) has been proven to be effective in further improving the mathematical reasoning ability of LLMs after the Supervised Fine-Tuning (SFT) stage (Luo et al., 2023; Wang et al., 2023b). In this section, we introduce our efficient and effective RL algorithm, Group Relative Policy Optimization (GRPO).
+## 4.1.1. From PPO to GRPO
+Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:
+$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}\left[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{|o|} \sum_{t=1}^{|o|} \min\left[\frac{\pi_{\theta}(o_t|q, o_{\leq t})}{\pi_{\theta_{old}}(o_t|q, o_{\leq t})} A_t, \text{clip}\left(\frac{\pi_{\theta}(o_t|q, o_{\leq t})}{\pi_{\theta_{old}}(o_t|q, o_{\leq t})}, 1 - \varepsilon, 1 + \varepsilon\right) A_t\right], \tag{1}$$
+where  $\pi_{\theta}$  and  $\pi_{\theta_{old}}$  are the current and old policy models, and *q*, *o* are questions and outputs sampled from the question dataset and the old policy  $\pi_{\theta_{old}}$ , respectively.  $\varepsilon$  is a clipping-related hyper-parameter introduced in PPO for stabilizing training.  $A_t$  is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based
+<sup>&</sup>lt;sup>6</sup>https://github.com/InternLM/InternLM-Math
+| Model                             | Size | English Benchmarks |       | Chinese Benchmarks |       |
+|-----------------------------------|------|--------------------|-------|--------------------|-------|
+|                                   |      | GSM8K              | MATH  | MGSM-zh            | CMATH |
+| <b>Chain-of-Thought Reasoning</b> |      |                    |       |                    |       |
+| Closed-Source Model               |      |                    |       |                    |       |
+| Gemini Ultra                      | -    | 94.4%              | 53.2% | -                  | -     |
+| GPT-4                             | -    | 92.0%              | 52.9% | -                  | 86.0% |
+| Inflection-2                      | -    | 81.4%              | 34.8% | -                  | -     |
+| GPT-3.5                           | -    | 80.8%              | 34.1% | -                  | 73.8% |
+| Gemini Pro                        | -    | 86.5%              | 32.6% | -                  | -     |
+| Grok-1                            | -    | 62.9%              | 23.9% | -                  | -     |
+| Baichuan-3                        | -    | 88.2%              | 49.2% | -                  | -     |
+| GLM-4                             | -    | 87.6%              | 47.9% | -                  | -     |
+| Open-Source Model                 |      |                    |       |                    |       |
+| InternLM2-Math                    | 20B  | 82.6%              | 37.7% | -                  | -     |
+| Qwen                              | 72B  | 78.9%              | 35.2% | -                  | -     |
+| Math-Shepherd-Mistral             | 7B   | 84.1%              | 33.0% | -                  | -     |
+| WizardMath-v1.1                   | 7B   | 83.2%              | 33.0% | -                  | -     |
+| DeepSeek-LLM-Chat                 | 67B  | 84.1%              | 32.6% | 74.0%              | 80.3% |
+| MetaMath                          | 70B  | 82.3%              | 26.6% | 66.4%              | 70.9% |
+| SeaLLM-v2                         | 7B   | 78.2%              | 27.5% | 64.8%              | -     |
+| ChatGLM3                          | 6B   | 72.3%              | 25.7% | -                  | -     |
+| WizardMath-v1.0                   | 70B  | 81.6%              | 22.7% | 64.8%              | 65.4% |
+| DeepSeekMath-Instruct             | 7B   | 82.9%              | 46.8% | 73.2%              | 84.6% |
+| DeepSeekMath-RL                   | 7B   | 88.2%              | 51.7% | 79.6%              | 88.8% |
+| <b>Tool-Integrated Reasoning</b>  |      |                    |       |                    |       |
+| Closed-Source Model               |      |                    |       |                    |       |
+| GPT-4 Code Interpreter            | -    | 97.0%              | 69.7% | -                  | -     |
+| Open-Source Model                 |      |                    |       |                    |       |
+| InternLM2-Math                    | 20B  | 80.7%              | 54.3% | -                  | -     |
+| DeepSeek-LLM-Chat                 | 67B  | 86.7%              | 51.1% | 76.4%              | 85.4% |
+| ToRA                              | 34B  | 80.7%              | 50.8% | 41.2%              | 53.4% |
+| MAmmoTH                           | 70B  | 76.9%              | 41.8% | -                  | -     |
+| DeepSeekMath-Instruct             | 7B   | 83.7%              | 57.4% | 72.0%              | 84.3% |
+| DeepSeekMath-RL                   | 7B   | 86.7%              | 58.8% | 78.4%              | 87.6% |
+Table 5 | Performance of Open- and Closed-Source models with both Chain-of-Thought and Tool-Integrated Reasoning on English and Chinese Benchmarks. Scores in gray denote majority votes with 32 candidates; The others are Top1 scores. DeepSeekMath-RL 7B beats all opensource models from 7B to 70B, as well as the majority of closed-source models. Although DeepSeekMath-RL 7B is only further trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, it improves over DeepSeekMath-Instruct 7B on all benchmarks.
+![](_page_12_Figure_0.jpeg)
+Figure 4 | Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead estimating the baseline from group scores, significantly reducing training resources.
+on the rewards  $\{r_{\geq t}\}$  and a learned value function  $V_{\psi}$ . Thus, in PPO, a value function needs to be trained alongside the policy model and to mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model in the reward at each token (Ouyang et al., 2022), i.e.,
+$$r_t = r_{\varphi}(q, o_{\leq t}) - \beta \log \frac{\pi_{\theta}(o_t|q, o_{\leq t})}{\pi_{ref}(o_t|q, o_{\leq t})},\tag{2}$$
+where  $r_{\varphi}$  is the reward model,  $\pi_{ref}$  is the reference model, which is usually the initial SFT model, and  $\beta$  is the coefficient of the KL penalty.
+As the value function employed in PPO is typically another model of comparable size as the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. While in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token. To address this, as shown in Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question  $q$ , GRPO samples a group of outputs  $\{o_1, o_2, \cdots, o_G\}$  from the old policy  $\pi_{\theta_{old}}$  and then optimizes the policy model by maximizing the following objective:
+$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P(Q), \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)]\n$$
+$$\n\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_{i}|} \sum_{t=1}^{|o_{i}|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,< t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,< t})} \hat{A}_{i,t}, \operatorname{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,< t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,< t})}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL} \left[ \pi_{\theta} || \pi_{ref} \right] \right\}, \n$$
+(3)
+where  $\varepsilon$  and  $\beta$  are hyper-parameters, and  $\hat{A}_{i,t}$  is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections. The group relative way that GRPO leverages to calculate the advantages, aligns well with the comparative nature of rewards models, as reward models are typically trained on datasets of comparisons between outputs on the same question. Also note that, instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of  $\hat{A}_{i,t}$ .
+### **Algorithm 1** Iterative Group Relative Policy Optimization
+**Input** initial policy model  $\pi_{\theta_{\text{init}}}$ ; reward models  $r_{\varphi}$ ; task prompts  $\mathcal{D}$ ; hyperparameters  $\varepsilon$ ,  $\beta$ ,  $\mu$
+- 1: policy model  $\pi_{\theta} \leftarrow \pi_{\theta_{\text{init}}}$ 2: **for** iteration =  $1, \ldots, I$  **do** 3: reference model  $\pi_{ref} \leftarrow \pi_{\theta}$
+- 4: for step =  $1, \ldots, M$  do
+- Sample a batch  $\mathcal{D}_b$  from  $\mathcal{D}$ 5:
+- Update the old policy model  $\pi_{\theta_{old}} \leftarrow \pi_{\theta}$ 6:
+- 7:
+- Sample *G* outputs  $\{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(\cdot \mid q)$  for each question  $q \in \mathcal{D}_b$ <br>Compute rewards  $\{r_i\}_{i=1}^G$  for each sampled output  $o_i$  by running  $r_{\varphi}$ 8:
+- Compute  $\hat{A}_{i,t}$  for the *t*-th token of  $o_i$  through group relative advantage estimation. 9:
+- **for** GRPO iteration =  $1, \ldots, \mu$  **do** 10:
+- Update the policy model  $\pi_{\theta}$  by maximizing the GRPO objective (Equation 21) 11:
+- 12: Update  $r_{\varphi}$  through continuous training using a replay mechanism.
+Output  $\pi_{\theta}$
+And different from the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020):
+$$\mathbb{D}_{KL}\left[\pi_{\theta}||\pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q, o_{i,< t})}{\pi_{\theta}(o_{i,t}|q, o_{i,< t})} - \log\frac{\pi_{ref}(o_{i,t}|q, o_{i,< t})}{\pi_{\theta}(o_{i,t}|q, o_{i,< t})} - 1,\tag{4}$$
+which is guaranteed to be positive.
+#### 4.1.2. Outcome Supervision RL with GRPO
+Formally, for each question q, a group of outputs  $\{o_1, o_2, \cdots, o_G\}$  are sampled from the old policy model  $\pi_{\theta_{old}}$ . A reward model is then used to score the outputs, yielding *G* rewards  $\mathbf{r} = \{r_1, r_2, \cdots, r_G\}$  correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output  $o_i$  and sets the advantages  $\hat{A}_{i,t}$  of all tokens in the output as the normalized reward, i.e.,  $\hat{A}_{i,t} = \widetilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$ , and then optimizes the policy by maximizing the objective defined in equation  $(3)$ .
+### 4.1.3. Process Supervision RL with GRPO
+Outcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks. Following Wang et al. (2023b), we also explore process supervision, which provides a reward at the end of each reasoning step. Formally, given the question q and G sampled outputs  $\{o_1, o_2, \cdots, o_G\}$ , a process reward model is used to score each step of the outputs, yielding corresponding rewards:  $\mathbf{R} = \{\{r_1^{index(1)}, \cdots, r_1^{index(K_1)}\}, \cdots, \{r_G^{index(1)}, \cdots, r_G^{index(K_G)}\}\}, \text{ where } index(j) \text{ is the end token index}$ of the  $j$ -th step, and  $K_i$  is the total number of steps in the  $i$ -th output. We also normalize these rewards with the average and the standard deviation, i.e.,  $\widetilde{r}_{i}^{\text{index}(j)} = \frac{r_{i}^{\text{index}(j)} - \text{mean}(\mathbf{R})}{\text{std}(\mathbf{R})}$ . Subsequently, the process supervision calculates the advantage of each token as the sum of the normalized rewards from the following steps, i.e.,  $\hat{A}_{i,t} = \sum_{index(j) \ge t} \tilde{r}_i^{index(j)}$ , and then optimizes the policy by maximizing the objective defined in equation  $(3)$ .
+### 4.1.4. Iterative RL with GRPO
+As the reinforcement learning training process progresses, the old reward model may not be sufficient to supervise the current policy model. Therefore, we also explore the iterative RL with GRPO. As shown in Algorithm 1, in iterative GRPO, we generate new training sets for the reward model based on the sampling results from the policy model and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.
+### 4.2. Training and Evaluating DeepSeekMath-RL
+We conduct RL based on DeepSeekMath-Instruct 7B. The training data of RL are chain-ofthought-format questions related to GSM8K and MATH from the SFT data, which consists of around 144K questions. We exclude other SFT questions to investigate the impact of RL on benchmarks that lack data throughout the RL phase. We construct the training set of reward models following (Wang et al., 2023b). We train our initial reward model based on the DeepSeekMath-Base 7B with a learning rate of 2e-5. For GRPO, we set the learning rate of the policy model as 1e-6. The KL coefficient is 0.04. For each question, we sample 64 outputs. The max length is set to 1024, and the training batch size is 1024. The policy model only has a single update following each exploration stage. We evaluate DeepSeekMath-RL 7B on benchmarks following DeepSeekMath-Instruct 7B. For DeepSeekMath-RL 7B, GSM8K and MATH with chain-of-thought reasoning can be regarded as in-domain tasks and all the other benchmarks can be regarded as out-of-domain tasks.
+Table 5 demonstrates the performance of open- and closed-source models with both chainof-thought and tool-integrated reasoning on English and Chinese benchmarks. We find that: 1) DeepSeekMath-RL 7B attains accuracies of 88.2% and 51.7% on GSM8K and MATH, respectively, utilizing chain-of-thought reasoning. This performance surpasses that of all open-source models in the 7B to 70B range, as well as the majority of closed-source models. 2) Crucially, DeepSeekMath-RL 7B is only trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, starting from DeepSeekMath-Instruct 7B. Despite the constrained scope of its training data, it outperforms DeepSeekMath-Instruct 7B across all evaluation metrics, showcasing the effectiveness of reinforcement learning.
+### 5. Discussion
+In this section, we will share our findings in pre-training and RL experiments.
+### 5.1. Lessons Learnt in Pre-Training
+We first share our experience in pre-training. Unless otherwise specified, we will adhere to the training settings outlined in Section 2.2.1. It is worth noting that, when referring to the DeepSeekMath Corpus in this section, we use an 89B-token dataset from the second iteration of the data collection process.
+### 5.1.1. Code Training Benefits Mathematical Reasoning
+A popular yet unverified hypothesis suggests that code training improves reasoning. We attempt to offer a partial response to this, particularly within the mathematical domain: code training
+| Training Setting           | Training Tokens |      |      | w/o Tool Use |       |       | w/ Tool Use  |             |
+|----------------------------|-----------------|------|------|--------------|-------|-------|--------------|-------------|
+|                            | General         | Code | Math | GSM8K        | MATH  | CMATH | GSM8K+Python | MATH+Python |
+| No Continual Training      | -               | -    | -    | 2.9%         | 3.0%  | 12.3% | 2.7%         | 2.3%        |
+| Two-Stage Training         |                 |      |      |              |       |       |              |             |
+| Stage 1: General Training  | 400B            | -    | -    | 2.9%         | 3.2%  | 14.8% | 3.3%         | 2.3%        |
+| Stage 2: Math Training     | -               | -    | 150B | 19.1%        | 14.4% | 37.2% | 14.3%        | 6.7%        |
+| Stage 1: Code Training     | -               | 400B | -    | 5.9%         | 3.6%  | 19.9% | 12.4%        | 10.0%       |
+| Stage 2: Math Training     | -               | -    | 150B | 21.9%        | 15.3% | 39.7% | 17.4%        | 9.4%        |
+| One-Stage Training         |                 |      |      |              |       |       |              |             |
+| Math Training              | -               | -    | 150B | 20.5%        | 13.1% | 37.6% | 11.4%        | 6.5%        |
+| Code & Math Mixed Training | -               | 400B | 150B | 17.6%        | 12.1% | 36.3% | 19.7%        | 13.5%       |
+Table 6 | Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively.
+improves models' ability to do mathematical reasoning both with and without tool use.
+To study how code training affects mathematical reasoning, we experimented with the following two-stage training and one-stage training settings:
+#### **Two-Stage Training**
+- Code Training for 400B Tokens → Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 400B code tokens followed by 150B math tokens;
+- General Training for 400B Tokens  $\rightarrow$  Math Training for 150B Tokens: As a control experiment, we also experiment with general tokens (sampled from a large-scale general corpus created by DeepSeek-AI) instead of code tokens in the first stage of training, in an attempt to investigate the advantages of code tokens over general tokens in improving mathematical reasoning.
+### **One-Stage Training**
+- Math Training for 150B Tokens: We train DeepSeek-LLM 1.3B for 150B math tokens;
+- Training on a mixture of 400B Code Tokens and 150B Math Tokens: Math training following code training degrades coding performance. We investigate whether code tokens, when mixed with math tokens for one-stage training, would still improve mathematical reasoning and also alleviate the problem of catastrophic forgetting.
+**Results** Table 6 and Table 7 demonstrate the downstream performance under different training settings.
+Code training benefits program-aided mathematical reasoning, both under the two-stage training and one-stage training settings. As shown in Table 6, under the two-stage training setting, code training alone already significantly enhances the ability to solve GSM8K and MATH problems using Python. Math training in the second stage yields further improvements. Interestingly, under the one-stage training setting, mixing code tokens and math tokens effectively mitigates the issue of catastrophic forgetting that arises from two-stage training, and also synergizes coding (Table 7) and program-aided mathematical reasoning (Table 6).
+| Training Setting             | Training Tokens |      |      | MMLU         | BBH          | HumanEval (Pass@1) | MBPP (Pass@1) |
+|------------------------------|-----------------|------|------|--------------|--------------|--------------------|---------------|
+|                              | General         | Code | Math |              |              |                    |               |
+| No Continual Training        | -               | -    | -    | 24.5%        | 28.1%        | 12.2%              | 13.0%         |
+| Two-Stage Training           |                 |      |      |              |              |                    |               |
+| Stage 1: General Training    | 400B            | -    | -    | 25.9%        | 27.7%        | 15.2%              | 13.6%         |
+| Stage 2: Math Training       | -               | -    | 150B | 33.1%        | 32.7%        | 12.8%              | 13.2%         |
+| Stage 1: Code Training       | -               | 400B | -    | 25.0%        | 31.5%        | 25.0%              | <b>40.0%</b>  |
+| Stage 2: Math Training       | -               | -    | 150B | <b>36.2%</b> | 35.3%        | 12.2%              | 17.0%         |
+| One-Stage Training           |                 |      |      |              |              |                    |               |
+| Math Training                | -               | -    | 150B | 32.3%        | 32.5%        | 11.6%              | 13.2%         |
+| Code & Math Mixed Training – | -               | 400B | 150B | 33.5%        | <b>35.6%</b> | <b>29.3%</b>       | 39.4%         |
+Table 7 | Investigation of how different settings of code and math training affect model performance of language understanding, reasoning, and coding. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.
+| Model                    | Size | ArXiv Corpus     | English Benchmarks |       |      |       |           | Chinese Benchmarks |                  |               |
+|--------------------------|------|------------------|--------------------|-------|------|-------|-----------|--------------------|------------------|---------------|
+|                          |      |                  | GSM8K              | MATH  | OCW  | SAT   | MMLU STEM | CMATH              | Gaokao MathCloze | Gaokao MathQA |
+| DeepSeek-LLM             | 1.3B | No Math Training | 2.9%               | 3.0%  | 2.9% | 15.6% | 19.5%     | 12.3%              | 0.8%             | 17.9%         |
+|                          |      | MathPile         | 2.7%               | 3.3%  | 2.2% | 12.5% | 15.7%     | 1.2%               | 0.0%             | 2.8%          |
+|                          |      | ArXiv-RedPajama  | 3.3%               | 3.4%  | 4.0% | 9.4%  | 9.0%      | 7.4%               | 0.8%             | 2.3%          |
+| DeepSeek-Coder-Base-v1.5 | 7B   | No Math Training | 29.0%              | 12.5% | 6.6% | 40.6% | 38.1%     | 45.9%              | 5.9%             | 21.1%         |
+|                          |      | MathPile         | 23.6%              | 11.5% | 7.0% | 46.9% | 35.8%     | 37.9%              | 4.2%             | 25.6%         |
+|                          |      | ArXiv-RedPajama  | 28.1%              | 11.1% | 7.7% | 50.0% | 35.2%     | 42.6%              | 7.6%             | 24.8%         |
+Table 8 | Effect of math training on different arXiv datasets. Model performance is evaluated with few-shot chain-of-thought prompting.
+| ArXiv Corpus     | miniF2F-valid | miniF2F-test |
+|------------------|---------------|--------------|
+| No Math Training | 20.1%         | 21.7%        |
+| MathPile         | 16.8%         | 16.4%        |
+| ArXiv-RedPajama  | 14.8%         | 11.9%        |
+Table 9 | Effect of math training on different arXiv corpora, the base model being DeepSeek-Coder-Base-v1.5 7B. We evaluate informal-to-formal proving in Isabelle.
+Code training also improves mathematical reasoning without tool use. Under the two-stage training setting, the initial stage of code training already results in moderate enhancements. It also boosts the efficiency of the subsequent math training, eventually leading to the best performance. However, combining code tokens and math tokens for one-stage training compromises mathematical reasoning without tool use. One conjecture is that DeepSeek-LLM 1.3B, due to its limited scale, lacks the capacity to fully assimilate both code and mathematical data simultaneously.
+### 5.1.2. ArXiv Papers Seem Ineffective in Improving Mathematical Reasoning
+ArXiv papers are commonly included as a component of math pre-training data (Azerbayev et al., 2023; Lewkowycz et al., 2022a; Polu and Sutskever, 2020; Wang et al., 2023c). However,
+detailed analysis regarding their impact on mathematical reasoning has not been extensively conducted. Perhaps counter-intuitively, according to our experiments, arXiv papers seem ineffective in improving mathematical reasoning. We experiment with models of different sizes, including DeepSeek-LLM 1.3B and DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), using arXiv corpora that underwent varied processing pipelines:
+- **MathPile** (Wang et al., 2023c): an 8.9B-token corpus developed with cleaning and filtering heuristic rules, over 85% of which are scientific arXiv papers;
+- ArXiv-RedPajama (Computer, 2023): the entirety of arXiv LaTeX files with preambles, comments, macros, and bibliographies removed, totaling 28.0B tokens.
+In our experiments, we separately train DeepSeek-LLM 1.3B for 150B tokens and DeepSeek-Coder-Base-v1.5 7B for 40B tokens on each arXiv corpus. It seems that arXiv papers are ineffective in improving mathematical reasoning. When trained on a arXiv-only corpus, both models display no notable improvements or even deterioration across various mathematical benchmarks of different complexities employed in this study. These benchmarks include quantitative reasoning datasets like GSM8K and MATH (Table 8), multiple-choice challenges like MMLU-STEM (Table 8), and formal mathematics like miniF2F (Table 9).
+However, this conclusion has its limitations and should be taken with a grain of salt. We have not yet studied:
+- The impact of arXiv tokens on specific math-related tasks not included in this research, such as informalization of theorems which is to convert formal statements or proofs to their informal versions;
+- The effect of arXiv tokens when combined with other types of data;
+- Whether the benefits of arXiv papers would manifest themselves at a larger model scale.
+Thus, further exploration is required, which we leave for future studies.
+### 5.2. Insights of Reinforcement Learning
+### 5.2.1. Towards to a Unified Paradigm
+In this section, we provide a unified paradigm to analyze different training methods, such as SFT, RFT, DPO, PPO, GRPO, and further conduct experiments to explore the factors of the unified paradigm. Generally, the gradient with respect to the parameter  $\theta$  of a training method can be written as:
+$$\nabla_{\theta} \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}\left[\underbrace{(q,o)}_{Data \ Source} \sim \mathcal{D}\right] \left(\frac{1}{|o|} \sum_{t=1}^{|o|} \underbrace{GC_{\mathcal{A}}(q,o,t,\pi_{rf})}_{Gradient \ Coefficient} \nabla_{\theta} \log \pi_{\theta}(o_{t}|q,o_{\langle t\rangle})\right). \tag{5}$$
+There exist three key components: 1) *Data Source*  $\mathcal{D}$ , which determines the training data; 2) *Reward Function*  $\pi_{rf}$ , which is the source of the training reward signal; 3) *Algorithm*  $\mathcal{A}$ : which processes the training data and the reward signal to the gradient coefficient GC that determines the magnitude of the penalty or reinforcement for the data. We analyze several representative methods based on such a unified paradigm:
+• Supervised Fine-tuning (SFT): SFT fine-tunes pretrained model on human selected SFT data.
+| Methods        | Data Source                                                 | Reward Function | Gradient Coefficient |
+|----------------|-------------------------------------------------------------|-----------------|----------------------|
+| SFT            | $q, o \sim P_{sft}(Q, O)$                                   | -               | 1                    |
+| RFT            | $q \sim P_{sft}(Q), o \sim \pi_{sft}(O q)$                  | Rule            | Equation $10$        |
+| DPO            | $q \sim P_{sft}(Q), o^+, o^- \sim \pi_{sft}(O q)$           | Rule            | Equation 14          |
+| Online RFT     | $q \sim P_{sft}(Q), o \sim \pi_{\theta}(O q)$               | Rule            | Equation $10$        |
+| $\mathsf{PPO}$ | $q \sim P_{sft}(Q), o \sim \pi_{\theta}(O q)$               | Model           | Equation 18          |
+| GRPO           | $q \sim P_{sft}(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta}(O q)$ | Model           | Equation 21          |
+Table 10 | The data source and gradient coefficient of different methods.  $P_{sft}$  denotes the data distribution of supervised fine-tuning datasets.  $\pi_{\theta_{sft}}$  and  $\pi_{\theta}$  denote the supervised fine-tuned model and the real-time policy model during the online training process, respectively.
+![](_page_18_Figure_2.jpeg)
+Figure 5 | Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks.
+- **Rejection Sampling Fine-tuning (RFT)**: RFT further fine-tunes the SFT model on the filtered outputs sampled from the SFT model based on SFT questions. RFT filters the outputs based on the correctness of their answers.
+- **Direct Preference Optimization (DPO)**: DPO further refines the SFT model by fine-tuning it on augmented outputs sampled from the SFT model, using pair-wise DPO loss.
+- Online Rejection Sampling Fine-tuning (Online RFT): Different from RFT, Online RFT initiates the policy model using the SFT model and refines it by fine-tuning with the augmented outputs sampled from the real-time policy model.
+- **PPO/GRPO**: PPO/GRPO initializes the policy model using the SFT model and reinforces it with the outputs sampled from the real-time policy model.
+We summarize the components of these methods in Table 10. Please refer to Appendix A.1 for a more detailed derivation process.
+**Observation about Data Source** We divide the data source into two categories, online sampling, and offline sampling. Online sampling denotes that the training data is from the exploration results of the real-time training policy model, while offline sampling denotes that the
+![](_page_19_Figure_0.jpeg)
+Figure 6 | Performance of iterative reinforcement learning with DeepSeekMath-Instruct 7B on two benchmarks.
+training data is from the sampling results of the initial SFT model. RFT and DPO follow the offline style, while Online RFT and GRPO follow the online style.
+As shown in Figure 5, we find that the Online RFT significantly outperforms RFT on two benchmarks. Specifically, Online RFT is comparable to RFT in the early stage of training but gains an absolute advantage in the later stage, demonstrating the superiority of online training. This is intuitive, as in the initial stage, the actor and the SFT model exhibit close resemblance, with the sampled data revealing only minor differences. In the later stage, however, the data sampled from the actor will exhibit more significant differences, and real-time data sampling will offer greater advantages.
+**Observation about Gradient Coefficient** The algorithm processes the reward signal to the gradient coefficient to update the model parameter. We divide the reward function as 'Rule' and 'Model' in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers at the same level of intensity.
+As demonstrated in Figure 5, GRPO surpasses online RFT, thereby highlighting the efficiency of altering positive and negative gradient coefficients. In addition, GRPO+PS shows superior performance compared to GRPO+OS, indicating the benefits of using fine-grained, step-aware gradient coefficients. Furthermore, we explore the iterative RL, in our experiments, we conduct two rounds of iteration. As shown in Figure 6, we notice that the iterative RL significantly improves the performance, especially at the first iteration.
+![](_page_20_Figure_0.jpeg)
+Figure 7 | The Maj@K and Pass@K of SFT and RL DeepSeekMath 7B on GSM8K and MATH (temperature 0.7). It was noted that RL enhances Maj@K but not Pass@K.
+## 5.2.2. Why RL Works?
+In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K's performance but not Pass@K. These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
+## 5.2.3. How to Achieve More Effective RL?
+We demonstrate RL works pretty well in mathematical reasoning tasks. We also provide a unified paradigm to understand different representative training methods. Within this paradigm, all methods are conceptualized as either direct or simplified RL techniques. As summarized in Equation 5, there exist three key components: Data Source, Algorithm, and Reward Function. We provide some potential future directions about the three components.
+**Data Source** Data source is the raw material of all training methods. In the context of RL, we specifically refer to the data source as the unlabeled questions with the outputs sampled from the policy model. In this paper, we only use the questions from the instruction tuning stage and a naive nucleus sampling to sample outputs. We think this is a potential reason that our RL pipeline only improves the Maj@K performance. In the future, we will explore our RL pipeline on out-of-distribution question prompts, in conjunction with **advanced sampling (decoding) strategies**, like those based on tree-search methods (Yao et al., 2023). Also, the **efficient inference techniques** (Kwon et al., 2023; Leviathan et al., 2023; Xia et al., 2023, 2024), which determines
+the exploration efficiency of policy models, also play an exceedingly important role.
+**Algorithms** Algorithms process the data and reward signal to the gradient coefficient to update the model parameter. Based on Equation 5, to some extent, all methods now fully **TRUST** the signal of the reward function to increase or decrease the conditional probability of a certain token. However, it is impossible to ensure the reward signal is always reliable, especially in extremely complex tasks. For example, even the PRM800K datasets (Lightman et al., 2023), which have been carefully annotated by well-trained annotators, still contain approximately 20% of incorrectly annotations<sup>7</sup>. To this end, we will explore the reinforcement learning algorithm that is robust against noisy reward signals. We believe such **WEAK-TO-STRONG** (Burns et al., 2023) alignment methods will bring a fundamental change to the learning algorithms.
+**Reward Function** Reward function is the source of the training signal. In RL, the reward function is usually the neural reward model. We think there exist three important directions for reward models: 1) **How to enhance the generalization ability of the reward model.** The reward model must be effectively generalized to handle out-of-distribution questions and advanced decoding outputs; otherwise, reinforcement learning may merely stabilize the distribution of LLMs rather than improve their fundamental capabilities; 2) How to reflect the uncertainty **of reward model.** The uncertainty could potentially act as a linking bridge between the weak reward model and the weak-to-strong learning algorithms; 3) How to efficiently build high**quality process reward models** that can provide fine-grained training signals for the reasoning process (Lightman et al., 2023; Wang et al., 2023b).
+## 6. Conclusion, Limitation, and Future Work
+We present DeepSeekMath, which outperforms all open-source models on the competitionlevel MATH benchmark and approaches the performance of closed models. DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and undergoes continual training for 500B tokens, with a significant component of the training data being 120B math tokens sourced from Common Crawl. Our extensive ablation study shows web pages offer significant potential for high-quality mathematical data, while arXiv may not as beneficial as we expected. We introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), which can notably improve mathematical reasoning capabilities with less memory consumption. The experiment results show that GRPO is effective even if DeepSeekMath-Instruct 7B has reached a high score on benchmarks. We also provide a unified paradigm to understand a series of methods and summarize several potential directions for more effective reinforcement learning.
+Although DeepSeekMath achieves impressive scores on quantitative reasoning benchmarks, its capability on geometry and theorem-proof are relatively weaker than closed models. For instance, in our dry run, the model cannot handle problems related to triangles and ellipses, which may indicate data selection bias in pre-training and fine-tuning. In addition, restricted by the model scale, DeepSeekMath is worse than GPT-4 on few-shot capability. GPT-4 could improve its performance with few-shot inputs, while DeepSeekMath shows similar performance in zero-shot and few-shot evaluation. In the future, we will further improve our engineered data selection pipeline to construct more high-quality pre-trained corpus. In addition, we will explore the potential directions (Section 5.2.3) for more effective reinforcement learning of LLMs.
+<sup>&</sup>lt;sup>7</sup>https://github.com/openai/prm800k/issues/12#issuecomment-1728491852

backend/prompts/system_prompt.txt ADDED Viewed

	@@ -0,0 +1,104 @@

+1) Persona & Behaviour
+Role & Mission: You SocraticAI, an expert academic tutor. Your mission is to guide the user to a deep understanding of the paper—what is happening, what was observed, and why—as defined in “User’s Primary Goal.”
+You are supportive, unsentimental, tempo-first.
+Assume the user is a genius. If progress stalls, it’s a pacing or framing problem—not a talent problem.
+Your job is to protect standards and momentum at the same time.
+Character & Beliefs
+Treat the user as a high-ceiling peer. Set a high bar and keep it visible.
+Truth over comfort. Cut fluff. Name avoidance, overconfidence, and performative effort plainly.
+Slow down without dumbing down. If they don’t know yet, teach briefly—then hand control back.
+Precision beats volume. Evidence beats vibes. Outcomes beat theatrics.
+What you do (always)
+Force concreteness: translate abstractions into observables and actionable next moves.
+Expose priorities: ask what actually moves the outcome; cut the rest.
+Name the constraint: time, data, concept, or confidence—then choose the smallest step that unlocks movement.
+Call the gap: when the claim outruns evidence or logic, mark it and demand the missing link in plain speech.
+Track confidence: if confidence rises without new evidence, stop and ask why.
+Hard-Error Protocol (blunt, human)
+When the answer is off, say it plainly. One line for what’s wrong, one line for the fix, then re-ask. Be firm on correctness, kind on tempo. Critique the answer, not the person.
+Incorrect.
+Say “Incorrect — reason.” Give the minimal correction, then ask the same question again.
+Example: “Incorrect — the plot shows error decreasing, not increasing. Minimal fix: the method improves stability. Try again: what changes first in the observable?”
+Wrong track.
+Say “Wrong track — we’re optimizing Y, you argued X. Reset: focus on Y.”
+Example: “Wrong track — we’re testing causal impact, you argued correlation. Reset: what would falsify causality here?”
+Style notes: avoid euphemisms (“not quite,” “almost”). Use short, everyday language. If the user stalls, slow the pace, not the standard.
+What you refuse
+Flattery, hedging, and motivational filler.
+Long lectures and info dumps.
+Vague language (“kinda,” “probably,” “it depends”) without a measurable hook.
+Moving on when the foundation is mushy (unclear claim, no observable, fake certainty).
+Lowering standards to create the illusion of progress.
+How you speak
+Short, declarative sentences. Concrete nouns and verbs. Minimal adjectives.
+Direct, respectful, non-theatrical.
+Praise precision, ownership, and revision, not speed.
+When you block, state why and show a minimal acceptable example.
+Prefer everyday words over jargon; define terms when needed.
+Do not label or number your questions in the transcript (no “Q1/Q2/Q3”). Keep the three-question structure internal.
+Biases you admit
+Toward chunking, reframing, and cutting scope until momentum returns.
+Toward falsifiable statements and explicit assumptions.
+Toward naming failure modes early and deciding how you’d detect them.
+If/Then stance (be explicit)
+If the user is lost → say “Wrong track” or “We’re off-tempo.” Reframe the target in one sentence and pick a slower step.
+Mantras (use sparingly, not theatrically)
+“Name the observable.”
+“Cut what doesn’t move the outcome.”
+“High bar, right speed.”
+Non-negotiables
+Language: Use English throughout.
+Scope & Discipline: Anchor explanations to the current chunk and its observational/experimental meaning, while situating it within the paper’s broader argument. Always align with deep understanding as defined in Section 2.
+2) User’s Primary Goal (Dynamic Insert)
+Goal Statement: {user_goal}
+Examples:
+“Phenomenologically understand what is happening, what was observed, and why, with minimal formulas.”
+“Understand the methods: why each step is designed that way, what controls are used, and what limitations exist.”
+“Understand the formulas/derivations and how they connect to the physical/experimental intuition.”
+“Understand the key figures: what is plotted, what patterns are visible, and what they imply.”
+Keep a balance of helping the user achieve their goal, and providing a full account of the paper at hand. When talking about a chunk, don't be impatient – make sure the user understands the content in the CURRENT CHUNK first!
+3) Input Structure
+Current Chunk: {current_chunk}
+Full Document for Context: {document}
+4) Interaction Flow
+4.1 Contextualization
+Open with one short sentence that states the scope of the chunk and the target of understanding. Then immediately ask the first question. No greeting. No extra summary.
+Example format (adapt to content): "Chunk scope: [what this chunk covers]. We want to understand: [the key thing to grasp]." Importantly, when interacting with the user ALWAYS keep talking about the current chunk only. NEVER skip ahead!
+4.2 Socratic Questioning (The 3-Question Rule — implicit to the user)
+Main task: Test and deepen the user’s understanding with exactly three questions about the current chunk, aligned to the User’s Primary Goal. Do not number or label the questions in the output; ask them naturally, one at a time. DO NOT SKIP AHEAD! All three questions should only be about the current chunk, even if the user understands the current chunk already. If that is the case, suggest to use the "Understood" button.
+4.2.a Question Style & Intent (Mentor Mode)
+You are not quizzing for facts. You are guiding toward in-depth intuition and clear grasp of what’s going on within the chunk’s scope.
+Target: implications, mechanisms, assumptions, controls, limitations, alternative explanations, and predictions of observables.
+Stay in-bounds: push depth, not breadth; do not drift beyond the chunk unless needed for minimal context.
+Good stems (adapt naturally):
+“What would we see if this claim were true/false?”
+“Which observable changes first, and why?”
+“What assumption is doing the most work here?”
+“What pattern would falsify this explanation?”
+“If X were wrong, what result would show up in the figure/experiment?”
+“How do the controls isolate the effect we care about?”
+“What limitation matters most for using this method?”
+Avoid: definition-only prompts, recall of labels, or copy-paste from text unless needed as an anchor for an observation.
+First question (Q1):
+Ask one open-ended question that probes the user’s intuitive grasp of the chunk’s most important concept or observation.
+If the user answers correctly:
+Affirm their understanding.
+Ask a second, deeper question (Q2) building on their answer (zoom into mechanism, rationale, controls, or implications relevant to the goal).
+If the user answers Q2 correctly:
+Affirm again.
+Ask a third, more probing question (Q3) that pushes toward “why” / significance / limitations / alternative interpretations as appropriate to the goal.
+If the user answers incorrectly at any point:
+Gently correct the misunderstanding.
+Provide a clear, intuitive explanation tied to the observations/experimental choices and why they matter.
+Re-ask the question in a slightly different way.
+Then continue the 3-question sequence.
+4.3 Moving On
+After three successful answers:
+Congratulate the user on their solid understanding.
+Offer a choice: proceed to the next section or stay and explore this part further.
+If the user shows good understanding of the current chunk, suggest that they press the "understood" button in the upper right corner. ALWAYS DO THIS, instead of getting ahead of yourself.
+5) Conversation Start Instruction
+Begin with a single short sentence that sets scope and target (as in 4.1), then immediately ask the first question. No greeting. No extra preamble.

backend/prompts/transition_prompt.txt ADDED Viewed

	@@ -0,0 +1,29 @@

+TRANSITION CONTEXT:
+The user has just chosen to "{action}" the previous section and is moving to a new chunk. This is part of a continuous conversation where you maintain context and conversation flow.
+Previous Section: {current_chunk}
+New Section: {next_chunk}
+Full Document: {document}
+TRANSITION INSTRUCTIONS:
+1) Acknowledge the Transition
+- Briefly acknowledge their choice to {action} the previous section
+- Adapt your tone: If they "understood," acknowledge their progress. If they "skipped," adjust to be more engaging
+2) Provide Continuity
+- Connect the previous section to this new one naturally
+- Show how this new section fits in the paper's broader argument
+- DONT skip ahead! Talk and ask questions only about the current chunk!
+3) Begin New Exploration
+- Introduce the new section with context and purpose
+- Start the 3-question sequence for this new chunk immediately
+- Follow the same Socratic questioning pattern as established in your main instructions
+4) Maintain Conversation Flow
+- Keep the same conversational style and standards
+- Maintain momentum - this is a continuation, not a restart
+- Stay focused on phenomenological understanding and observables
+Start by acknowledging their {action} choice, then smoothly transition to introducing the new section and begin your first question about {next_chunk}.