FrictionAI / SokratesAI

Alleinzellgaenger committed on Aug 18, 2025

Commit

3ecd1bf

1 Parent(s): 23bdcc2

Add welcome screen, scroll to highlight, lennart version

Files changed (4)

frontend/src/components/DocumentProcessor.jsx +0 -0
frontend/src/components/DocumentViewer.jsx +45 -9
frontend/src/components/WelcomeScreen.jsx +84 -0
frontend/src/hooks/useDocumentProcessor.js +20 -95

frontend/src/components/DocumentProcessor.jsx CHANGED

The diff for this file is too large to render. See raw diff

frontend/src/components/DocumentViewer.jsx CHANGED

@@ -46,18 +46,42 @@ const MyHighlightContainer = () => {
   return component;
 };
-const DocumentViewer = ({ selectedFile, documentData, onPageChange, preloadedHighlights = null, currentChatId = null }) => {
     const [highlights, setHighlights] = useState([]);
     const [pdfUrl, setPdfUrl] = useState(null);
     /** Refs for PdfHighlighter utilities */
     const highlighterUtilsRef = useRef();
     // Utility function to normalize highlight data
     const normalizeHighlight = (highlightData) => {
         // Ensure the highlight has the required structure
         if (!highlightData.id || !highlightData.position || !highlightData.content) {
-            console.warn('Invalid highlight data:', highlightData);
             return null;
         }
@@ -83,19 +107,19 @@ const DocumentViewer = ({ selectedFile, documentData, onPageChange, preloadedHig
         }
     }, [selectedFile]);
-    // Load preloaded highlights when component mounts or when currentChatId changes
     useEffect(() => {
         if (preloadedHighlights) {
             let highlightsToLoad = [];
-            if (currentChatId !== null && currentChatId !== undefined && preloadedHighlights[currentChatId]) {
-                // Load highlights for specific chat
-                highlightsToLoad = preloadedHighlights[currentChatId];
             } else if (Array.isArray(preloadedHighlights)) {
                 // Load all highlights if it's an array
                 highlightsToLoad = preloadedHighlights;
             } else if (typeof preloadedHighlights === 'object') {
-                // If it's an object without chatId, take all values
                 highlightsToLoad = Object.values(preloadedHighlights).flat();
             }
@@ -104,13 +128,24 @@ const DocumentViewer = ({ selectedFile, documentData, onPageChange, preloadedHig
                 .map(normalizeHighlight)
                 .filter(Boolean);
-            console.log(`🎨 Loading ${validHighlights.length} preloaded highlights${currentChatId ? ` for chat ${currentChatId}` : ''}`);
             setHighlights(validHighlights);
         } else {
             // Clear highlights if no preloaded data
             setHighlights([]);
         }
-    }, [preloadedHighlights, currentChatId]);
     // Handle selection - log coordinates and add debugging
     const handleSelection = (selection) => {
@@ -159,6 +194,7 @@ const DocumentViewer = ({ selectedFile, documentData, onPageChange, preloadedHig
                             pdfDocument={pdfDocument}
                             utilsRef={(_pdfHighlighterUtils) => {
                                 highlighterUtilsRef.current = _pdfHighlighterUtils;
                             }}
                             highlights={highlights}
                             onSelection={handleSelection}

   return component;
 };
+const DocumentViewer = ({ selectedFile, documentData, onPageChange, preloadedHighlights = null, currentChunkIndex = null, onDocumentReady = null }) => {
     const [highlights, setHighlights] = useState([]);
     const [pdfUrl, setPdfUrl] = useState(null);
     /** Refs for PdfHighlighter utilities */
     const highlighterUtilsRef = useRef();
+    const documentReadyCalledRef = useRef(false);
+    // Function to scroll to a specific chunk's highlight
+    const scrollToChunk = (chunkIndex) => {
+        if (highlighterUtilsRef.current && preloadedHighlights) {
+            const chunkHighlights = preloadedHighlights[chunkIndex];
+            if (chunkHighlights && chunkHighlights.length > 0) {
+                const firstHighlightInChunk = chunkHighlights[0];
+                highlighterUtilsRef.current.scrollToHighlight(firstHighlightInChunk);
+            }
+        }
+    };
+    // Function to scroll to the first highlight (for backwards compatibility)
+    const scrollToFirstChunk = () => {
+        scrollToChunk(0);
+    };
+    // Call onDocumentReady only once when utils become available
+    const callOnDocumentReady = () => {
+        if (onDocumentReady && !documentReadyCalledRef.current && highlighterUtilsRef.current) {
+            documentReadyCalledRef.current = true;
+            onDocumentReady({ scrollToFirstChunk });
+        }
+    };
     // Utility function to normalize highlight data
     const normalizeHighlight = (highlightData) => {
         // Ensure the highlight has the required structure
         if (!highlightData.id || !highlightData.position || !highlightData.content) {
             return null;
         }
         }
     }, [selectedFile]);
+    // Load preloaded highlights when component mounts or when currentChunkIndex changes
     useEffect(() => {
         if (preloadedHighlights) {
             let highlightsToLoad = [];
+            if (currentChunkIndex !== null && currentChunkIndex !== undefined && preloadedHighlights[currentChunkIndex]) {
+                // Load highlights for specific chunk
+                highlightsToLoad = preloadedHighlights[currentChunkIndex];
             } else if (Array.isArray(preloadedHighlights)) {
                 // Load all highlights if it's an array
                 highlightsToLoad = preloadedHighlights;
             } else if (typeof preloadedHighlights === 'object') {
+                // If it's an object without chunkIndex, take all values
                 highlightsToLoad = Object.values(preloadedHighlights).flat();
             }
                 .map(normalizeHighlight)
                 .filter(Boolean);
+            console.log(`🎨 Loading ${validHighlights.length} preloaded highlights${currentChunkIndex !== null ? ` for chunk ${currentChunkIndex}` : ''}`);
             setHighlights(validHighlights);
         } else {
             // Clear highlights if no preloaded data
             setHighlights([]);
         }
+    }, [preloadedHighlights, currentChunkIndex]);
+    // Auto-scroll to current chunk when currentChunkIndex changes (only on navigation, not during streaming)
+    useEffect(() => {
+        // Only auto-scroll if we have highlighter utils and this is a valid chunk navigation
+        if (highlighterUtilsRef.current && currentChunkIndex !== null && currentChunkIndex !== undefined && currentChunkIndex >= 0) {
+            // Small delay to ensure highlights are loaded
+            setTimeout(() => {
+                scrollToChunk(currentChunkIndex);
+            }, 200);
+        }
+    }, [currentChunkIndex]); // Only depend on currentChunkIndex, not preloadedHighlights
     // Handle selection - log coordinates and add debugging
     const handleSelection = (selection) => {
                             pdfDocument={pdfDocument}
                             utilsRef={(_pdfHighlighterUtils) => {
                                 highlighterUtilsRef.current = _pdfHighlighterUtils;
+                                callOnDocumentReady();
                             }}
                             highlights={highlights}
                             onSelection={handleSelection}
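
The highlight-loading effect in this file picks its input in a fixed branch order: a per-chunk entry first, then a plain array, then all values of an object. A minimal sketch of that order as a pure function may make it easier to test outside React. The names `selectHighlights`, `isValidHighlight`, and `loadValidHighlights` are hypothetical, and `isValidHighlight` only checks the three fields visible in the diff (`id`, `position`, `content`); the real `normalizeHighlight` continues past the rendered hunk and may do more.

```javascript
// Hypothetical helper mirroring the branch order of the effect in DocumentViewer.jsx.
function selectHighlights(preloadedHighlights, currentChunkIndex = null) {
  if (!preloadedHighlights) return [];

  const hasIndex = currentChunkIndex !== null && currentChunkIndex !== undefined;
  if (hasIndex && preloadedHighlights[currentChunkIndex]) {
    // Highlights for one specific chunk.
    return preloadedHighlights[currentChunkIndex];
  }
  if (Array.isArray(preloadedHighlights)) {
    // Already a list: use it as-is.
    return preloadedHighlights;
  }
  if (typeof preloadedHighlights === 'object') {
    // Object keyed by chunk index, no index given: take every value.
    return Object.values(preloadedHighlights).flat();
  }
  return [];
}

// Simplified stand-in for normalizeHighlight: only the required-field check.
function isValidHighlight(highlight) {
  return Boolean(highlight && highlight.id && highlight.position && highlight.content);
}

function loadValidHighlights(preloadedHighlights, currentChunkIndex = null) {
  return selectHighlights(preloadedHighlights, currentChunkIndex).filter(isValidHighlight);
}
```

One thing the sketch makes visible: if `preloadedHighlights` is a flat array of highlight objects and a non-null index is passed, the first branch returns a single object rather than a list, and the following `.map`/`.filter` step would presumably fail. The code appears to assume that indexed access always yields an array of highlights.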

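The auto-scroll effect above starts a 200 ms `setTimeout` on every `currentChunkIndex` change but never clears it, so a fast sequence of navigations can queue several scrolls, and a timer can still fire after the component unmounts. A minimal sketch of a cancellable variant follows. `scheduleScroll` and the injectable `timers` object are hypothetical names introduced only for illustration and testing; they are not part of this commit.

```javascript
// Hypothetical helper: schedule a scroll and return a cancel function,
// which has the same shape as a useEffect cleanup.
const defaultTimers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (id) => clearTimeout(id),
};

function scheduleScroll(scroll, delayMs = 200, timers = defaultTimers) {
  const id = timers.set(scroll, delayMs);
  // Calling the returned function cancels the pending scroll, if it has not fired yet.
  return () => timers.clear(id);
}
```

Inside the effect this could be used as `return scheduleScroll(() => scrollToChunk(currentChunkIndex));`, so that React cancels the previous timer before running the effect again or on unmount.
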
frontend/src/components/WelcomeScreen.jsx ADDED

@@ -0,0 +1,84 @@

+import { useState } from 'react';
+const WelcomeScreen = ({ onGetStarted }) => {
+  return (
+    <div className="h-full flex flex-col items-center justify-center p-8 bg-gradient-to-br from-blue-50 to-indigo-100">
+      <div className="max-w-lg text-center space-y-6">
+        <div className="space-y-4">
+          <h1 className="text-4xl font-bold text-gray-900">
+            Welcome to SokratesAI
+          </h1>
+          <p className="text-lg text-gray-600 leading-relaxed">
+            Master complex documents without the overwhelm.
+            Your document becomes your tutor, questioning you to deepen understanding.
+          </p>
+        </div>
+        <div className="space-y-6">
+          <div className="text-sm text-gray-700">
+            <h3 className="font-semibold text-gray-900 mb-3">How it works:</h3>
+            <div className="space-y-2">
+              <div className="flex items-start space-x-3">
+                <div className="w-2 h-2 bg-blue-500 rounded-full mt-1.5"></div>
+                <span>Document appears highlighted in digestible sections</span>
+              </div>
+              <div className="flex items-start space-x-3">
+                <div className="w-2 h-2 bg-green-500 rounded-full mt-1.5"></div>
+                <span>AI tutor questions <em>you</em> about each chunk</span>
+              </div>
+              <div className="flex items-start space-x-3">
+                <div className="w-2 h-2 bg-purple-500 rounded-full mt-1.5"></div>
+                <span>Progress only when you truly understand</span>
+              </div>
+            </div>
+          </div>
+          <div className="text-sm text-gray-600 bg-gray-50 p-4 rounded-lg">
+            <div className="grid grid-cols-2 gap-4">
+              <div className="flex items-center space-x-2">
+                <div className="p-1.5 rounded-full bg-green-100">
+                  <svg className="w-4 h-4 text-green-600" fill="currentColor" viewBox="0 0 20 20">
+                    <path fillRule="evenodd" d="M16.707 5.293a1 1 0 010 1.414l-8 8a1 1 0 01-1.414 0l-4-4a1 1 0 011.414-1.414L8 12.586l7.293-7.293a1 1 0 011.414 0z" clipRule="evenodd" />
+                  </svg>
+                </div>
+                <span>Master current topic</span>
+              </div>
+              <div className="flex items-center space-x-2">
+                <div className="p-1.5 rounded-full bg-gray-100">
+                  <svg className="w-4 h-4 text-gray-600" fill="currentColor" viewBox="0 0 20 20">
+                    <path fillRule="evenodd" d="M7.293 14.707a1 1 0 010-1.414L10.586 10 7.293 6.707a1 1 0 011.414-1.414l4 4a1 1 0 010 1.414l-4 4a1 1 0 01-1.414 0z" clipRule="evenodd" />
+                    <path fillRule="evenodd" d="M12.293 14.707a1 1 0 010-1.414L15.586 10l-3.293-3.293a1 1 0 011.414-1.414l4 4a1 1 0 010 1.414l-4 4a1 1 0 01-1.414 0z" clipRule="evenodd" />
+                  </svg>
+                </div>
+                <span>Focus elsewhere</span>
+              </div>
+              <div className="flex items-center space-x-2">
+                <div className="p-1.5 rounded-full bg-gray-100">
+                  <svg className="w-4 h-4 text-gray-600" fill="currentColor" viewBox="0 0 20 20">
+                    <path fillRule="evenodd" d="M12.707 5.293a1 1 0 010 1.414L9.414 10l3.293 3.293a1 1 0 01-1.414 1.414l-4-4a1 1 0 010-1.414l4-4a1 1 0 011.414 0z" clipRule="evenodd" />
+                  </svg>
+                </div>
+                <span>Review previous sections</span>
+              </div>
+              <div className="flex items-center space-x-2">
+                <div className="w-6 h-2 bg-blue-200 rounded-full overflow-hidden">
+                  <div className="w-2/3 h-full bg-blue-500 rounded-full"></div>
+                </div>
+                <span>Track your journey</span>
+              </div>
+            </div>
+          </div>
+        </div>
+        <button
+          onClick={onGetStarted}
+          className="bg-blue-600 hover:bg-blue-700 text-white font-semibold py-4 px-8 rounded-lg transition-all duration-200 transform hover:scale-105 shadow-lg hover:shadow-xl"
+        >
+          Let's Start
+        </button>
+      </div>
+    </div>
+  );
+};
+export default WelcomeScreen;
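
The diff for `DocumentProcessor.jsx` is not rendered, so how `WelcomeScreen`'s `onGetStarted` and `DocumentViewer`'s `onDocumentReady` are wired together is not visible here. The following is only a guess at one possible arrangement, written as plain JavaScript so it runs without React: the parent keeps the `scrollToFirstChunk` handle it receives from `onDocumentReady` and calls it when the user clicks "Let's Start". `createWelcomeFlow` and all of its method names are hypothetical.

```javascript
// Hypothetical glue between WelcomeScreen and DocumentViewer.
// Assumption: the parent wants to scroll to the first chunk once the welcome screen is dismissed.
function createWelcomeFlow() {
  let welcomeVisible = true;      // whether WelcomeScreen is still shown
  let scrollToFirstChunk = null;  // handle delivered through onDocumentReady
  let scrollPending = false;      // user clicked before the viewer was ready

  return {
    // Would be passed as DocumentViewer's onDocumentReady prop.
    handleDocumentReady(api) {
      scrollToFirstChunk = api.scrollToFirstChunk;
      if (scrollPending) {
        scrollPending = false;
        scrollToFirstChunk();
      }
    },
    // Would be passed as WelcomeScreen's onGetStarted prop.
    handleGetStarted() {
      welcomeVisible = false;
      if (scrollToFirstChunk) {
        scrollToFirstChunk();
      } else {
        scrollPending = true;
      }
    },
    isWelcomeVisible() {
      return welcomeVisible;
    },
  };
}
```

Deferring the scroll covers the case where the button is clicked before the PDF highlighter utilities exist; whether the actual app needs that is not something this diff shows.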

frontend/src/hooks/useDocumentProcessor.js CHANGED

@@ -47,119 +47,44 @@ export const useDocumentProcessor = () => {
             // Use hardcoded chunks for the document
             const hardcodedChunks = [
   {
-    "topic": "The Dominance of Recurrent Models",
-    "text": "Recurrent neural networks, long short-term memory [\\\[13\\\]](#page-10-0) and gated recurrent [\\\[7\\\]](#page-10-1) neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [\\\[35,](#page-11-0) [2,](#page-9-0) [5\\\]](#page-10-2). Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [\\\[38,](#page-11-1) [24,](#page-10-3) [15\\\]](#page-10-4).",
-    "page": 2
   },
   {
-    "topic": "The Sequential Bottleneck of RNNs",
-    "text": "Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−<sup>1</sup> and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [\\\[21\\\]](#page-10-5) and conditional computation [\\\[32\\\]](#page-11-2), while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.",
-    "page": 2
   },
   {
-    "topic": "The Rise of Attention Mechanisms",
-    "text": "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [\\\[2,](#page-9-0) [19\\\]](#page-10-6). In all but a few cases [\\\[27\\\]](#page-11-3), however, such attention mechanisms are used in conjunction with a recurrent network.",
-    "page": 2
   },
   {
-    "topic": "Alternative Architectures to Reduce Sequential Computation",
-    "text": "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [\\\[16\\\]](#page-10-7), ByteNet [\\\[18\\\]](#page-10-8) and ConvS2S [\\\[9\\\]](#page-10-9), all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [\\\[12\\\]](#page-10-10). In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section [3.2.](#page-2-0)",
-    "page": 2
   },
   {
-    "topic": "Self-Attention (Intra-Attention)",
-    "text": "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [\\\[4,](#page-9-1) [27,](#page-11-3) [28,](#page-11-4) [22\\\]](#page-10-11). To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [\\\[17,](#page-10-12) [18\\\]](#page-10-8) and [\\\[9\\\]](#page-10-9).",
-    "page": 2
   },
   {
-    "topic": "The Standard Encoder-Decoder Structure",
-    "text": "Most competitive neural sequence transduction models have an encoder-decoder structure [\\\[5,](#page-10-2) [2,](#page-9-0) [35\\\]](#page-11-0). Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [\\\[10\\\]](#page-10-13), consuming the previously generated symbols as additional input when generating the next.",
-    "page": 2
   },
   {
-    "topic": "Inside the Encoder and Decoder Stacks",
-    "text": "### 3.1 Encoder and Decoder Stacks\n\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [\\\[11\\\]](#page-10-14) around each of the two sub-layers, followed by layer normalization [\\\[1\\\]](#page-9-2). That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.\n\nDecoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.",
-    "page": 3
   },
   {
-    "topic": "The Attention Function: Query, Key, Value",
-    "text": "### 3.2 Attention\n\nAn attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n\n![](_page_3_Figure_0.jpeg)\n\n<span id=\"page-3-0\"></span><span id=\"page-3-0\"></span>Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.\n\nof the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.",
-    "page": 3
   },
   {
-    "topic": "Core Mechanism: Scaled Dot-Product Attention",
-    "text": "### 3.2.1 Scaled Dot-Product Attention\n\nWe call our particular attention \"Scaled Dot-Product Attention\" (Figure 2). The input consists of queries and keys of dimension  $d_k$ , and values of dimension  $d_v$ . We compute the dot products of the query with all keys, divide each by  $\\sqrt{d_k}$ , and apply a softmax function to obtain the weights on the values.\n\nIn practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix  $Q$ . The keys and values are also packed together into matrices  $K$  and  $V$ . We compute the matrix of outputs as:\n\n$$Attention(Q, K, V) = softmax(\\frac{QK^{T}}{\\sqrt{d_{k}}})V$$\n(1)",
-    "page": 4
   },
   {
-    "topic": "Why Scaling is Crucial",
-    "text": "The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of  $\\frac{1}{\\sqrt{d_k}}$ . Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.\n\nWhile for small values of  $d_k$  the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of  $d_k$  [3]. We suspect that for large values of  $d_k$ , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients<sup>4</sup>. To counteract this effect, we scale the dot products by  $\\frac{1}{\\sqrt{d}}$ .",
-    "page": 4
-  },
-  {
-    "topic": "Innovation: Multi-Head Attention",
-    "text": "### 3.2.2 Multi-Head Attention\n\nInstead of performing a single attention function with  $d_{\\text{model}}$ -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values  $h$  times with different, learned linear projections to  $d_k$ ,  $d_k$  and  $d_v$  dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding  $d_v$ -dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure [2.](#page-3-0)",
-    "page": 4
-  },
-  {
-    "topic": "The Power of Multi-Head Attention",
-    "text": "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.\n\n$$\\begin{aligned} \\text{MultiHead}(Q, K, V) &= \\text{Concat}(\\text{head}_1, ..., \\text{head}_h)W^O \\\\ \\text{where } \\text{head}_i &= \\text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \\end{aligned}$$\n\nWhere the projections are parameter matrices W Q <sup>i</sup> <sup>∈</sup> <sup>R</sup> <sup>d</sup>model×d<sup>k</sup> , W <sup>K</sup> <sup>i</sup> ∈ R <sup>d</sup>model×d<sup>k</sup> , W<sup>V</sup> <sup>i</sup> ∈ R dmodel×d<sup>v</sup> and W<sup>O</sup> ∈ R hdv×dmodel .\n\nIn this work we employ h = 8 parallel attention layers, or heads. For each of these we use d<sup>k</sup> = d<sup>v</sup> = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.",
-    "page": 5
-  },
-  {
-    "topic": "Three Uses of Attention in the Model",
-    "text": "### 3.2.3 Applications of Attention in our Model\n\nThe Transformer uses multi-head attention in three different ways:\n\n- In \"encoder-decoder attention\" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [\\\[38,\\\](#page-11-1) [2,\\\](#page-9-0) [9\\\]](#page-10-9).\n- The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.\n- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure [2.](#page-3-0)",
-    "page": 5
-  },
-  {
-    "topic": "The Role of Position-wise Feed-Forward Networks",
-    "text": "### 3.3 Position-wise Feed-Forward Networks\n\nIn addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.\n\n$$FFN(x) = \\max(0, xW_1 + b_1)W_2 + b_2 \\tag{2}$$\n\nWhile the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality df f = 2048.",
-    "page": 5
-  },
-  {
-    "topic": "Input/Output: Embeddings and Softmax",
-    "text": "### 3.4 Embeddings and Softmax\n\nSimilarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [\\\[30\\\]](#page-11-6). In the embedding layers, we multiply those weights by <sup>√</sup> dmodel.",
-    "page": 5
-  },
-  {
-    "topic": "Solving Sequence Order: Positional Encodings",
-    "text": "### 3.5 Positional Encoding\n\nSince our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add \"positional encodings\" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension  $d_{\\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].",
-    "page": 6
-  },
-  {
-    "topic": "The Sinusoidal Positional Encoding Function",
-    "text": "In this work, we use sine and cosine functions of different frequencies:\n\n$$PE_{(pos,2i)} = sin(pos/10000^{2i/d_{\\text{model}}})$$\n$$PE_{(pos,2i+1)} = cos(pos/10000^{2i/d_{\\text{model}}})$$\n\nwhere  $pos$  is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from  $2\\pi$  to  $10000 \\cdot 2\\pi$ . We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset  $k$ ,  $PE_{pos+k}$  can be represented as a linear function of  $PE_{pos}$ .\n\nWe also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row  $(\\bar{E})$ ). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.",
-    "page": 6
-  },
-  {
-    "topic": "Why Self-Attention? The Three Desiderata",
-    "text": "#### Why Self-Attention 4\n\nIn this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations  $(x_1,...,x_n)$  to another sequence of equal length  $(z_1,...,z_n)$ , with  $x_i,z_i\\in\\mathbb{R}^d$ , such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.\n\nOne is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.\n\nThe third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.",
-    "page": 6
-  },
-  {
-    "topic": "Comparing Layer Types by Key Metrics",
-    "text": "As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires  $O(n)$  sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence\n\n<span id=\"page-5-0\"></span><span id=\"page-5-0\"></span>Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types.  $n$  is the sequence length,  $d$  is the representation dimension,  $k$  is the kernel size of convolutions and  $r$  the size of the neighborhood in restricted self-attention.\n\n| Layer Type                  | Complexity per Layer     | Sequential<br>Operations | Maximum Path Length |\n|-----------------------------|--------------------------|--------------------------|---------------------|\n| Self-Attention              | $O(n^2 \\cdot d)$         | O(1)                     | O(1)                |\n| Recurrent                   | $O(n \\cdot d^2)$         | O(n)                     | O(n)                |\n| Convolutional               | $O(k \\cdot n \\cdot d^2)$ | O(1)                     | $O(log_k(n))$       |\n| Self-Attention (restricted) | $O(r \\cdot n \\cdot d)$   | $\\mathcal{O}(1)$         | O(n/r)              |\n\nlength n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [\\\[38\\\]](#page-11-1) and byte-pair [\\\[31\\\]](#page-11-7) representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.",
-    "page": 6
-  },
-  {
-    "topic": "A Side Benefit: Interpretability",
-    "text": "As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.",
-    "page": 7
-  },
-  {
-    "topic": "Training Data, Batching, and Hardware",
-    "text": "### 5.1 Training Data and Batching\n\nWe trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [\\\[3\\\]](#page-9-3), which has a shared sourcetarget vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [\\\[38\\\]](#page-11-1). Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.\n\n### 5.2 Hardware and Schedule\n\nWe trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table [3\\)](#page-8-0), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).",
-    "page": 7
-  },
-  {
-    "topic": "The Adam Optimizer and Learning Rate Schedule",
-    "text": "### 5.3 Optimizer\n\nWe used the Adam optimizer [\\\[20\\\]](#page-10-16) with $\\beta_1 = 0.9$, $\\beta_2 = 0.98$ and $\\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:\n\n$$lrate = d_{\\text{model}}^{-0.5} \\cdot \\min(\\text{step\\_num}^{-0.5}, \\text{step\\_num} \\cdot \\text{warmup\\_steps}^{-1.5}) \\tag{3}$$\n\nThis corresponds to increasing the learning rate linearly for the first warmup\\_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup\\_steps = 4000.",
-    "page": 7
-  },
-  {
-    "topic": "Regularization Techniques",
-    "text": "### 5.4 Regularization\n\nWe employ three types of regularization during training:\n\n**Residual Dropout** We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of  $P_{drop} = 0.1.$\n\n**Label Smoothing** During training, we employed label smoothing of value  $\\epsilon_{ls} = 0.1$  [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.",
-    "page": 7
   }
 ];

             // Use hardcoded chunks for the document
             const hardcodedChunks = [
   {
+    "topic": "The Foundation: Proximal Policy Optimization (PPO)",
+    "text": "### 4.1.1. From PPO to GRPO\n\nProximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang et al., 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:\n\n$$\\mathcal{J}_{PPO}(\\theta) = \\mathbb{E}\\left[q \\sim P(Q), o \\sim \\pi_{\\theta_{old}}(O|q)\\right] \\frac{1}{|o|} \\sum_{t=1}^{|o|} \\min\\left[\\frac{\\pi_{\\theta}(o_t|q, o_{< t})}{\\pi_{\\theta_{old}}(o_t|q, o_{< t})} A_t, \\text{clip}\\left(\\frac{\\pi_{\\theta}(o_t|q, o_{< t})}{\\pi_{\\theta_{old}}(o_t|q, o_{< t})}, 1 - \\varepsilon, 1 + \\varepsilon\\right) A_t\\right], \\tag{1}$$\n\nwhere $\\pi_{\\theta}$ and $\\pi_{\\theta_{old}}$ are the current and old policy models, and $q$, $o$ are questions and outputs sampled from the question dataset and the old policy $\\pi_{\\theta_{old}}$, respectively. $\\varepsilon$ is a clipping-related hyper-parameter introduced in PPO for stabilizing training. $A_t$ is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman et al., 2015), based on the rewards $\\{r_{\\geq t}\\}$ and a learned value function $V_{\\psi}$. Thus, in PPO, a value function needs to be trained alongside the policy model. To mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model to the reward at each token (Ouyang et al., 2022), i.e.,\n\n$$r_t = r_{\\varphi}(q, o_{\\leq t}) - \\beta \\log \\frac{\\pi_{\\theta}(o_t|q, o_{< t})}{\\pi_{ref}(o_t|q, o_{< t})},\\tag{2}$$\n\nwhere $r_{\\varphi}$ is the reward model, $\\pi_{ref}$ is the reference model, which is usually the initial SFT model, and $\\beta$ is the coefficient of the KL penalty."
   },
   {
+    "topic": "The Problem with PPO: Why a New Approach is Needed",
+    "text": "As the value function employed in PPO is typically another model comparable in size to the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. In the LLM context, however, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token."
   },
   {
+    "topic": "The Solution: Introducing Group Relative Policy Optimization (GRPO)",
+    "text": "To address this, as shown in Figure 4, we propose Group Relative Policy Optimization (GRPO), which obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline.\n\n![](_page_12_Figure_0.jpeg)\n\nFigure 4 | Demonstration of PPO and our GRPO. GRPO forgoes the value model, instead estimating the baseline from group scores, significantly reducing training resources."
   },
   {
+    "topic": "The GRPO Objective Function (Equation 3)",
+    "text": "More specifically, for each question $q$, GRPO samples a group of outputs $\\{o_1, o_2, \\cdots, o_G\\}$ from the old policy $\\pi_{\\theta_{old}}$ and then optimizes the policy model by maximizing the following objective:\n\n$$\\begin{aligned} \\mathcal{J}_{GRPO}(\\theta) &= \\mathbb{E}\\left[q \\sim P(Q), \\{o_{i}\\}_{i=1}^{G} \\sim \\pi_{\\theta_{old}}(O|q)\\right] \\\\ &\\quad \\frac{1}{G} \\sum_{i=1}^{G} \\frac{1}{|o_{i}|} \\sum_{t=1}^{|o_{i}|} \\left\\{ \\min \\left[ \\frac{\\pi_{\\theta}(o_{i,t}|q, o_{i,< t})}{\\pi_{\\theta_{old}}(o_{i,t}|q, o_{i,< t})} \\hat{A}_{i,t}, \\operatorname{clip} \\left( \\frac{\\pi_{\\theta}(o_{i,t}|q, o_{i,< t})}{\\pi_{\\theta_{old}}(o_{i,t}|q, o_{i,< t})}, 1 - \\varepsilon, 1 + \\varepsilon \\right) \\hat{A}_{i,t} \\right] - \\beta \\mathbb{D}_{KL} \\left[ \\pi_{\\theta} || \\pi_{ref} \\right] \\right\\}, \\end{aligned} \\tag{3}$$\n\nwhere $\\varepsilon$ and $\\beta$ are hyper-parameters, and $\\hat{A}_{i,t}$ is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections."
   },
   {
+    "topic": "Key Feature 1: Group Relative Advantage Calculation",
+    "text": "GRPO's group-relative way of calculating the advantages aligns well with the comparative nature of reward models, as reward models are typically trained on datasets of comparisons between outputs on the same question. Also note that, instead of adding a KL penalty to the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, which avoids complicating the calculation of $\\hat{A}_{i,t}$."
   },
   {
+    "topic": "Key Feature 2: KL Divergence as a Direct Penalty (Equation 4)",
+    "text": "Unlike the KL penalty term used in (2), we estimate the KL divergence with the following unbiased estimator (Schulman, 2020):\n\n$$\\mathbb{D}_{KL}\\left[\\pi_{\\theta}||\\pi_{ref}\\right] = \\frac{\\pi_{ref}(o_{i,t}|q, o_{i,< t})}{\\pi_{\\theta}(o_{i,t}|q, o_{i,< t})} - \\log\\frac{\\pi_{ref}(o_{i,t}|q, o_{i,< t})}{\\pi_{\\theta}(o_{i,t}|q, o_{i,< t})} - 1,\\tag{4}$$\n\nwhich is guaranteed to be non-negative, since $x - \\log x - 1 \\geq 0$ for all $x > 0$, with equality only at $x = 1$."
   },
   {
+    "topic": "Application 1: Outcome Supervision RL with GRPO",
+    "text": "### 4.1.2. Outcome Supervision RL with GRPO\n\nFormally, for each question $q$, a group of outputs $\\{o_1, o_2, \\cdots, o_G\\}$ is sampled from the old policy model $\\pi_{\\theta_{old}}$. A reward model is then used to score the outputs, yielding $G$ rewards $\\mathbf{r} = \\{r_1, r_2, \\cdots, r_G\\}$ correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output $o_i$ and sets the advantages $\\hat{A}_{i,t}$ of all tokens in the output as the normalized reward, i.e., $\\hat{A}_{i,t} = \\widetilde{r}_i = \\frac{r_i - \\text{mean}(\\mathbf{r})}{\\text{std}(\\mathbf{r})}$, and then optimizes the policy by maximizing the objective defined in equation (3)."
   },
   {
+    "topic": "Application 2: Process Supervision RL with GRPO",
+    "text": "### 4.1.3. Process Supervision RL with GRPO\n\nOutcome supervision only provides a reward at the end of each output, which may not be sufficient and efficient to supervise the policy in complex mathematical tasks. Following Wang et al. (2023b), we also explore process supervision, which provides a reward at the end of each reasoning step. Formally, given the question $q$ and $G$ sampled outputs $\\{o_1, o_2, \\cdots, o_G\\}$, a process reward model is used to score each step of the outputs, yielding corresponding rewards: $\\mathbf{R} = \\{\\{r_1^{index(1)}, \\cdots, r_1^{index(K_1)}\\}, \\cdots, \\{r_G^{index(1)}, \\cdots, r_G^{index(K_G)}\\}\\}$, where $index(j)$ is the end token index of the $j$-th step, and $K_i$ is the total number of steps in the $i$-th output. We also normalize these rewards with the average and the standard deviation, i.e., $\\widetilde{r}_{i}^{index(j)} = \\frac{r_{i}^{index(j)} - \\text{mean}(\\mathbf{R})}{\\text{std}(\\mathbf{R})}$. Subsequently, the process supervision calculates the advantage of each token as the sum of the normalized rewards from the following steps, i.e., $\\hat{A}_{i,t} = \\sum_{index(j) \\ge t} \\widetilde{r}_i^{index(j)}$, and then optimizes the policy by maximizing the objective defined in equation (3)."
   },
   {
+    "topic": "The Full Training Loop: Iterative RL with GRPO",
+    "text": "### 4.1.4. Iterative RL with GRPO\n\nAs the reinforcement learning training process progresses, the old reward model may not be sufficient to supervise the current policy model. Therefore, we also explore the iterative RL with GRPO. As shown in Algorithm 1, in iterative GRPO, we generate new training sets for the reward model based on the sampling results from the policy model and continually train the old reward model using a replay mechanism that incorporates 10% of historical data. Then, we set the reference model as the policy model, and continually train the policy model with the new reward model.\n\n### **Algorithm 1** Iterative Group Relative Policy Optimization\n\n**Input** initial policy model $\\pi_{\\theta_{\\text{init}}}$; reward models $r_{\\varphi}$; task prompts $\\mathcal{D}$; hyperparameters $\\varepsilon$, $\\beta$, $\\mu$\n\n- 1: policy model $\\pi_{\\theta} \\leftarrow \\pi_{\\theta_{\\text{init}}}$\n- 2: **for** iteration = $1, \\ldots, I$ **do**\n  - 3: reference model $\\pi_{ref} \\leftarrow \\pi_{\\theta}$\n  - 4: **for** step = $1, \\ldots, M$ **do**\n    - 5: Sample a batch $\\mathcal{D}_b$ from $\\mathcal{D}$\n    - 6: Update the old policy model $\\pi_{\\theta_{old}} \\leftarrow \\pi_{\\theta}$\n    - 7: Sample $G$ outputs $\\{o_i\\}_{i=1}^G \\sim \\pi_{\\theta_{old}}(\\cdot \\mid q)$ for each question $q \\in \\mathcal{D}_b$\n    - 8: Compute rewards $\\{r_i\\}_{i=1}^G$ for each sampled output $o_i$ by running $r_{\\varphi}$\n    - 9: Compute $\\hat{A}_{i,t}$ for the $t$-th token of $o_i$ through group relative advantage estimation.\n    - 10: **for** GRPO iteration = $1, \\ldots, \\mu$ **do**\n      - 11: Update the policy model $\\pi_{\\theta}$ by maximizing the GRPO objective (Equation 21)\n  - 12: Update $r_{\\varphi}$ through continuous training using a replay mechanism.\n\n**Output** $\\pi_{\\theta}$"
   },
   {
+    "topic": "Why GRPO Makes Sense: The Benefit of a Graded Reward Signal",
+    "text": "The algorithm converts the reward signal into the gradient coefficient used to update the model parameters. We divide the reward function into 'Rule' and 'Model' in our experiments. Rule refers to judging the quality of a response based on the correctness of the answer, and Model denotes that we train a reward model to score each response. The training data of the reward model is based on the rule judgment. Equations 10 and 21 highlight a key difference between GRPO and Online RFT: GRPO uniquely adjusts its gradient coefficient based on the reward value provided by the reward model. This allows for differential reinforcement and penalization of responses according to their varying magnitudes. In contrast, Online RFT lacks this feature; it does not penalize incorrect responses and uniformly reinforces all responses with correct answers at the same level of intensity.\n\nAs demonstrated in Figure 5, GRPO surpasses Online RFT, thereby highlighting the efficiency of altering positive and negative gradient coefficients. In addition, GRPO+PS shows superior performance compared to GRPO+OS, indicating the benefits of using fine-grained, step-aware gradient coefficients. Furthermore, we explore iterative RL; in our experiments, we conduct two rounds of iteration. As shown in Figure 6, we notice that the iterative RL significantly improves the performance, especially at the first iteration.\n\n![](_page_18_Figure_2.jpeg)\n\nFigure 5 | Performance of the DeepSeekMath-Instruct 1.3B model, which was further trained using various methods, on two benchmarks."
   }
 ];
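The outcome-supervision advantage and the KL estimator of Equation (4) quoted in the chunks above can be sketched in a few lines. This is only an illustration in JavaScript, to match this file: the helper names `groupRelativeAdvantages` and `klEstimate` are hypothetical and not part of `useDocumentProcessor.js`, and both the population standard deviation and the small `eps` guard are assumptions, since the quoted text does not specify either.

```javascript
// Hypothetical helpers for illustration only; not part of this commit.

// Outcome supervision: every token of output i shares one advantage,
// (r_i - mean(r)) / std(r), computed within a single group of G outputs.
// Assumptions: population std (divide by G) and an eps guard for the
// case where all rewards in the group are equal.
function groupRelativeAdvantages(rewards, eps = 1e-8) {
  const G = rewards.length;
  const mean = rewards.reduce((sum, r) => sum + r, 0) / G;
  const variance = rewards.reduce((sum, r) => sum + (r - mean) ** 2, 0) / G;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + eps));
}

// Per-token KL estimate from Equation (4): ratio - log(ratio) - 1,
// where ratio = pi_ref / pi_theta for the sampled token.
// Inputs are log-probabilities of that token under each model.
function klEstimate(logProbTheta, logProbRef) {
  const logRatio = logProbRef - logProbTheta;
  return Math.exp(logRatio) - logRatio - 1;
}
```

With binary rule-based rewards such as `[1, 0, 1, 0]`, correct outputs receive an advantage near `+1` and incorrect ones near `-1`, so no learned value function is needed to form the baseline.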