| ## Appendix | |
| ### Details on the experiments | |
| For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. We train on 64 NVIDIA H100 80GB GPUs across 8 nodes using pure data parallelism (DP=64) with a global batch size of 512 and a sequence length of 4,096 tokens, which amounts to approximately 21 billion tokens over 10,000 steps. We use the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training uses bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but because the number of completion tokens varies widely by prompt, we sometimes end up with fewer. In these cases we train on some of the data twice. | |
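The token budget quoted above follows directly from the batch size, sequence length, and step count. The following is a minimal sketch of that arithmetic; the class and field names are illustrative and not taken from the training code, and the per-GPU figure assumes the global batch is split evenly across data-parallel ranks (how it is divided into micro-batches is not stated here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Hypothetical container for the numbers quoted in the text."""
    num_gpus: int = 64            # DP=64
    global_batch_size: int = 512  # sequences per optimizer step
    seq_len: int = 4096           # tokens per sequence
    total_steps: int = 10_000

    @property
    def tokens_per_step(self) -> int:
        return self.global_batch_size * self.seq_len

    @property
    def total_tokens(self) -> int:
        return self.tokens_per_step * self.total_steps

    @property
    def sequences_per_gpu(self) -> int:
        # Assumes an even split of the global batch across ranks.
        return self.global_batch_size // self.num_gpus


setup = TrainSetup()
print(f"tokens per step:   {setup.tokens_per_step:,}")   # 2,097,152
print(f"total tokens:      {setup.total_tokens:,}")      # 20,971,520,000
print(f"sequences per GPU: {setup.sequences_per_gpu}")   # 8
```

The total of roughly 20.97B tokens is the "approximately 21 billion" figure in the text.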
| ### Prompts | |
| #### BeyondWeb | |
| ##### continue | |
| ```text | |
| Continue the following text in the same style as the original. Start with the continuation directly. | |
| Text: | |
| [TEXT] | |
| ``` | |
| ##### summarize | |
| ```text | |
| Summarize the following text. Write a standalone summary without referencing the text. Directly start with the summary. Do not say anything else. | |
| Text: | |
| [TEXT] | |
| Summary: | |
| ``` | |
| #### Format | |
| ##### article | |
| ```text | |
| Transform the document into a magazine-style feature article. Open with an engaging lead, then blend narrative storytelling with factual explanation. Maintain an accessible yet polished tone suitable for a general but informed readership. Output only the feature article, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### commentary | |
| ```text | |
| Summarize the document in a concise paragraph that captures its central arguments or findings. Then, write an expert commentary that critically reflects on its implications, limitations, or broader context. Maintain an analytical and professional tone throughout. Output only the summary and the commentary, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### discussion | |
| ```text | |
| Reformulate the document as a dialogue between a teacher and a student. The teacher should guide the student toward understanding the key points while clarifying complex concepts. Keep the exchange natural, informative, and faithful to the original content. Output only the dialogue, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### faq | |
| ```text | |
| Rewrite the document as a comprehensive FAQ (Frequently Asked Questions). Extract or infer the key questions a reader would have about this topic, then provide clear, direct answers. Order questions logically—from foundational to advanced, or by topic area. Each answer should be self-contained and understandable without reference to other answers. Ensure the FAQ works as a standalone document. Output only the FAQ, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### math | |
| ```text | |
| Rewrite the document to create a mathematical word problem based on the numerical data or relationships in the text. Provide a step-by-step solution that shows the calculation process clearly. Create a problem that requires multi-step reasoning and basic arithmetic operations. It should include the question followed by a detailed solution showing each calculation step. Output only the problem and solution, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### table | |
| ```text | |
| Rewrite the document as a structured table that organizes the key information, then generate one question-answer pair based on the table. First extract the main data points and organize them into a clear table format with appropriate headers using markdown table syntax with proper alignment. After the table, generate one insightful question that can be answered using the table data. Provide a clear, concise answer to the question based on the information in the table. Output only the table followed by the question-answer pair, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| ##### tutorial | |
| ```text | |
| Rewrite the document as a clear, step-by-step tutorial or instructional guide. Use numbered steps or bullet points where appropriate to enhance clarity. Preserve all essential information while ensuring the style feels didactic and easy to follow. Output only the tutorial, nothing else. | |
| Document: | |
| [TEXT] | |
| ``` | |
| #### Nemotron | |
| ##### distill | |
| ```text | |
| Your task is to read and paraphrase the provided text following these instructions: | |
| - Aim to create a condensed but accurate and informative version of the original text, not a simplistic summary. | |
| - Capture and preserve the crucial information, key concepts, important values, and factual details in the original text, while making it more readable and accessible. | |
| - Retain technical terms, specialized vocabulary, and complex concepts. | |
| - Retain examples, explanations of reasoning processes, and supporting evidence to maintain the text's depth and context. | |
| - Only include information that is present in the original text. Do not adding new or unsubstantiated claims. | |
| - Write in plain text. | |
| Here is the text: | |
| [TEXT] | |
| Task: | |
| After thoroughly reading the above text, paraphrase it in high-quality and clear English following the instructions. | |
| ``` | |
| ##### diverse_qa_pairs | |
| ```text | |
| Task: Read the text, ask questions and answer them. | |
| Follow these instructions: | |
| 1. Ask diverse questions that require different cognitive skills or cover different aspects of the text. | |
| 1. Ask questions in various forms such as: | |
| - Yes/No questions that require determining whether a statement is true or false. | |
| - Open-ended questions that begin with words like what, how, when, where, why and who. | |
| - Multi-choice questions that offers two or more options to choose from. Include the options in the question. | |
| - Comparison questions that compare two quantities or objects and determine the relationship between them. | |
| - Reading comprehension questions that test the ability to understand and analyze the text. | |
| - Problem-solving questions that test the ability to solve mathematical, physical, or logical problems. | |
| 1. Focus on asking questions about factual information, important knowledge, or concrete details in the text. | |
| 1. Write questions and answers using clear and concise language. | |
| 1. Use plain text. Do not use Markdown. | |
| 1. Each question and answer pair should be on a separate line. Tag the question with "Question:" and the answer with "Answer:". | |
| Text: | |
| [TEXT] | |
| Task: | |
| After reading the above text, ask up to 8 questions and provide the correct answers following the instructions. Give your response in this format: | |
| Here are the questions and answers based on the provided text: | |
| - Question: [first question] Answer: [first answer] | |
| - Question: [second question] Answer: [second answer] | |
| .... | |
| ``` | |
| ##### extract_knowledge | |
| ```text | |
| Your task is to rewrite knowledge from the provided text following these instructions: | |
| - Rewrite the text as a passage or passages using easy-to-understand and high-quality English like sentences in textbooks and Wikipedia. | |
| - Focus on content in disciplines such as humanities, social sciences, natural sciences, technology, engineering, math, law and legal, business, management, art, education, agricultural sciences, politics, and history. | |
| - Disregard content that does not contain useful facts or knowledge. | |
| - Retain examples, explanations of reasoning processes, and supporting evidence to maintain the text's depth and context. | |
| - Do not add or alter details. Only restate what is already in the text. | |
| - Write in plain text. | |
| - Do not add titles, subtitles, note, or comment. | |
| Text: | |
| [TEXT] | |
| Task: | |
| Rewrite facts and knowledge from the above text as a passage or passages following the instructions. | |
| ``` | |
| ##### knowledge_list | |
| ```text | |
| Review the text and extract the key information. Follow these instructions: | |
| - Carefully read the above text and provide a concise and organized list of factual information, concrete details, key concepts, and important numbers and statistics extracted from the text. | |
| - Ensure each point is clear, specific, and supported by the original text. | |
| - Ensure the extract text is information-dense and easier to learn from. | |
| - Do not add titles or headings. | |
| Text: | |
| [TEXT] | |
| Task: | |
| Extract the factual information, concrete details, and key concepts from the above text following the instructions. | |
| ``` | |
| ##### wikipedia_style_rephrasing | |
| ```text | |
| For the following paragraph give me a diverse paraphrase of the same in high quality English language as in sentences on Wikipedia. Begin your answer on a separate line with "Here is a paraphrased version:". | |
| Text: | |
| [TEXT] | |
| ``` | |
| #### REWIRE | |
| ##### guided_rewrite_improved | |
| ```text | |
| Below is a draft from an AI Assistant when trying to accomplish a task or solve a problem. Analyze and understand the task and problem(s) to be solved. Then pretend to be the expert who is most skillful to accomplish this task, and use detailed thinking and internal reasoning to identify a strategy and develop a plan about how to solve this problem. Experts usually apply meta-reasoning and planning to reason about how to best accomplish the task before jumping to a solution. | |
| Deliberate meta-reasoning also involves reflection which can help identify issues and take a step back to explore other paths. Below are some generic examples of starting questions experts could ask themselves during the meta-reasoning process. The expert will come up with the most relevant questions that can help with their thinking process, which are also very specific to the task. | |
| Consider these questions during your internal reasoning process: | |
| - What is the core issue or problem that needs to be addressed? What are the key assumptions underlying this problem? | |
| - How can I break down this problem into smaller, more manageable parts? How can I simplify the problem so that it is easier to solve? | |
| - What kinds of solutions are typically produced for this kind of problem specification? Given the problem specification and the current best solution, what other possible solutions exist? If the current best solution is totally wrong, what other ways are there to think about the problem specifically? | |
| - What is the best way to modify this current best solution, given what you know about these kinds of problem specifications? | |
| - Am I on the right track? Check your progress so far. | |
| - Develop a step by step plan internally. | |
| Finally, rewrite the original content from the author's perspective, maintaining their voice and intent while making substantial improvements. Take information and details from the original draft whenever they are useful. The rewritten content should not be shorter than the original response. The improved version should have significantly better formatting and readability, with more coherent and in-depth reasoning, enhanced clarity, stronger structure, and removal of any noise or digression. Write as if you are the original author meaningfully improving their own work - not just making minor edits. | |
| IMPORTANT: Your output must be ONLY the actual rewritten content itself - nothing else. Do NOT include any analysis, commentary, description, summary, or explanation about the improvements made. Do NOT add any meta-commentary like "This version improves..." or similar statements. Do NOT reference "the original draft" or "the draft" in your output. Output ONLY the content as if it were the final published piece that readers would see, with absolutely no additional text before or after it. | |
| Original Draft: | |
| [TEXT] | |
| ``` | |
| ##### guided_rewrite_original | |
| ```text | |
| Below is a draft from an AI Assistant when trying to accomplish task or solving a problem. Analyze and understand the task and problem(s) to be solved. Then pretend to be the expert who is most skillful to acomplish this task, write down the detailed thinking process and internal monologue that went into identifying a strategy and lay out a plan about how to solve this problem. Experts usually apply meta-reasoning and planning to reason about how to best accomplish the task before jumping to solution. | |
| Deliberate meta-reasoning also involves reflection which can help identify issues and take a step back to explore other paths. Below are some generic examples of starting questions experts could ask themselves during meta-reasoning process. The expert will come up with the most relevant questions that can help with their thinking process, which are also very specific to the task. | |
| Let's first try to understand the task and exactly what problem(s) to be solved. What is the core issue or problem that needs to be addressed? What are the key assumptions underlying this problem? | |
| How can I break down this problem into smaller, more manageable parts? How can I simplify the problem so that it is easier to solve? | |
| What kinds of solution typically are produced for this kind of problem specification? Given the problem specification and the current best solution, have a guess about other possible solutions. Let's imagine the current best solution is totally wrong, what other ways are there to think about the problem specific | |
| What is the best way to modify this current best solution, given what you know about these kinds of problem specification? | |
| Am I on the right track? Let's check our progress so far. | |
| Let's make a step by step plan and implement it with good notion and explanation. | |
| Finally, write an improved response after thinking about how to accomplish the task. Take information and details from the original draft whenever they are useful. Therefore, the improved response should not be shorter than the original response. The improved response should have better formatting and readability, with more coherent and in-depth reasoning, while removing any noise or digression. Note that the best experts chosen to answer each prompt may be different, so please make sure the you do not sound like the same expert for all tasks. | |
| IMPORTANT: Start your analysis and thinking right away. DO NOT add any filler text, explanations or notes about your response. Put the thinking and planning between <thinking starts> and <thinking ends>, and the improved response between <improved response starts> and <improved response ends>. | |
| Original Draft: [TEXT] | |
| ``` | |
| ### Decay vs Scratch | |
We explored two distinct training paradigms. In the **from-scratch** setup (`decay_exp=false`), models were trained for the full 10,000 steps (~21B tokens) on a single dataset or mixture of datasets. In contrast, the **decay** experiments (`decay_exp=true`) aimed to obtain a quicker signal with fewer rephrased tokens by using a two-stage training approach. These decay experiments resumed training from a checkpoint at step 9,000 of a model previously trained on lower-quality data (FineWeb-Edu-LQ), then continued training with a new dataset (or mixture) for the final 1,000 steps (~2B tokens) during the learning rate decay phase. We selected FineWeb-Edu-LQ for the first training phase so that the effects of the ablated data mixtures would show up more clearly. This design allowed us to evaluate the impact of high-quality rephrased or synthetic data more efficiently: it requires around 2B rephrased tokens rather than the full 21B needed for from-scratch training, reducing computational costs by roughly 90% per experimental condition while still providing meaningful signal about data quality effects. To enable the decay experiments, we used a warmup-stable-decay (WSD) [@minicpm] learning rate schedule with 1% warmup (100 steps), 89% stable training, and 10% linear decay (1,000 steps) to a minimum of 5×10⁻⁵.
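The WSD schedule described above can be sketched as a plain function of the step index. This is an illustration rather than the training code: the function name and argument names are made up, and the linear shape of the warmup is an assumption (the text only specifies that the decay is linear).

```python
def wsd_lr(
    step: int,
    total_steps: int = 10_000,
    peak_lr: float = 5e-4,
    min_lr: float = 5e-5,
    warmup_frac: float = 0.01,
    decay_frac: float = 0.10,
) -> float:
    """Warmup-stable-decay learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_frac)  # 100
    decay_steps = round(total_steps * decay_frac)    # 1,000
    decay_start = total_steps - decay_steps          # 9,000

    if step < warmup_steps:
        # Assumed linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr over the last decay_steps.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 99, 5_000, 9_000, 9_500, 10_000):
    print(s, f"{wsd_lr(s):.2e}")
```

With these defaults the decay phase begins at step 9,000, which is the checkpoint the decay experiments resume from, so swapping the dataset at that point only affects the final 1,000 steps.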
| #### Variance across seeds and data seeds | |
| The seed parameter sets the global random seed for the training experiment, ensuring reproducibility for model weight initialization and other global operations across different runs. The data-seed parameter specifically controls the randomness of the data pipeline, such as dataset shuffling and sampling, ensuring reproducible data ordering across different training runs. | |
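To illustrate the separation of the two seeds, here is a toy sketch using only Python's standard library. The helper names are hypothetical, and the real pipeline presumably seeds framework-level generators instead; the point is only that data ordering depends on the data seed alone, while initialization depends on the global seed alone.

```python
import random


def init_weights(seed: int, n: int = 4) -> list[float]:
    """Stand-in for weight initialization, controlled by the global seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]


def data_order(data_seed: int, num_docs: int = 8) -> list[int]:
    """Stand-in for dataset shuffling, controlled by the data seed."""
    order = list(range(num_docs))
    random.Random(data_seed).shuffle(order)
    return order


# Same data seed gives the same document order, whatever the global seed is.
run_a = (init_weights(seed=1), data_order(data_seed=2))
run_b = (init_weights(seed=3), data_order(data_seed=2))
print(run_a[1] == run_b[1])  # True: identical data order
print(run_a[0] == run_b[0])  # False: different initialization
```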
As a first validation of the decay experiment, we were interested in the variance across runs. We therefore ran a 3x3 grid of seeds (1, 2, 3) and data seeds (1, 2, 3) for three datasets: vanilla FineWeb-Edu-HQ, mix-fw_edu_hq-continue_1b_hq, and mix-fw_edu_hq-tutorial_12b_hq. Overall we found the variance to be fairly small, giving us early confidence in the setup.
When decaying with FineWeb-Edu-HQ, the macro-averaged score ranges from 10.73 to 11.05 across the 3x3 grid of seeds and data seeds. Decaying with mix-fw_edu_hq-continue_1b_hq gives a range of 12.9 to 13.21, and decaying with mix-fw_edu_hq-tutorial_12b_hq gives a range of 13.25 to 13.43.
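As a rough illustration, the grid and the resulting score spreads can be written out as follows. Only the minimum and maximum scores quoted above are used; the individual per-run scores are not reproduced here, and the variable names are illustrative.

```python
from itertools import product

seeds = (1, 2, 3)
data_seeds = (1, 2, 3)
datasets = (
    "FineWeb-Edu-HQ",
    "mix-fw_edu_hq-continue_1b_hq",
    "mix-fw_edu_hq-tutorial_12b_hq",
)

# One decay run per (dataset, seed, data seed) combination.
grid = list(product(datasets, seeds, data_seeds))
print(len(grid))  # 27

# (min, max) macro-averaged scores reported in the text.
score_range = {
    "FineWeb-Edu-HQ": (10.73, 11.05),
    "mix-fw_edu_hq-continue_1b_hq": (12.9, 13.21),
    "mix-fw_edu_hq-tutorial_12b_hq": (13.25, 13.43),
}

# Spread (max - min) per dataset, rounded to the reported precision.
spread = {name: round(hi - lo, 2) for name, (lo, hi) in score_range.items()}
for name, value in spread.items():
    print(f"{name}: {value}")
```

The spreads come out at 0.32, 0.31, and 0.18 points respectively, and the three reported ranges do not overlap.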
| #### Correlation to runs from scratch | |
From scratch, the ranking is DCLM (13.77) > Nemotron-HQ-Synth (13.54) > FineWeb-Edu-HQ (11.82) > Cosmopedia (10.33) > SYNTH (10.03). For decay, the ranking is Nemotron-HQ-Synth (12.35) > DCLM (11.80) > FineWeb-Edu-HQ (10.66) > Cosmopedia (10.57) > SYNTH (10.50). So while we see a meaningful difference between FineWeb-Edu-HQ and Cosmopedia/SYNTH from scratch, they are very close in the decay setup. Additionally, DCLM and Nemotron-HQ-Synth swap places. The decay setup can therefore serve as a fast vibe-check of whether a dataset is useful, but it does not reproduce the from-scratch ranking exactly.
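One way to put a number on this agreement is a Spearman rank correlation between the two sets of scores. The sketch below uses the scores quoted above and the textbook formula for untied ranks; this statistic is added for illustration and is not part of the original analysis.

```python
from_scratch = {
    "DCLM": 13.77,
    "Nemotron-HQ-Synth": 13.54,
    "FineWeb-Edu-HQ": 11.82,
    "Cosmopedia": 10.33,
    "SYNTH": 10.03,
}
decay = {
    "Nemotron-HQ-Synth": 12.35,
    "DCLM": 11.80,
    "FineWeb-Edu-HQ": 10.66,
    "Cosmopedia": 10.57,
    "SYNTH": 10.50,
}


def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank 1 = highest score. Assumes there are no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation for untied ranks: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n**2 - 1))


rho = spearman(from_scratch, decay)
print(f"Spearman rho = {rho:.2f}")  # 0.90
```

Only the top two datasets swap places, so the correlation stays high at 0.9. With five datasets this is a coarse indicator, and it says nothing about how compressed the score gaps are in the decay setup.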