[
  {
    "id": "onboarding__note__Users_hungting_Desktop_Lab_Projects_notes_experiments_onboarding_EXPERIMENT_README_md",
    "experiment_id": "onboarding",
    "title": "EXPERIMENT_README.md",
    "filename": "EXPERIMENT_README.md",
    "relative_path": "/Users/hungting/Desktop/Lab/Projects/notes/experiments/onboarding/EXPERIMENT_README.md",
    "content_md": "# Welcome to RACA\n\nThis is a sample experiment to show you how the dashboard works. You're looking at the **Overview** tab right now \u2014 it displays the experiment's README (this file).\n\nEverything you see here is generated from plain files in `notes/experiments/onboarding/`. You can browse them in your editor anytime.\n\n## How This Dashboard Works\n\nEach experiment has several tabs at the top. Here's what they do:\n\n### Overview (you are here)\n\nDisplays the experiment's README and any notes you've written in the `user/` folder. This is the main landing page for each experiment \u2014 a summary of what the experiment is, what you're investigating, and what you found.\n\n### Red Team Brief\n\nBefore any experiment runs, RACA reviews the design for potential problems \u2014 wrong evaluation metrics, truncated outputs, missing baselines, wasted compute. The brief lives at `red_team_brief.md`. This tab will be empty until you run your first real experiment.\n\n### Timeline\n\nA chronological log of everything that happened: when jobs were submitted, when artifacts were uploaded, when bugs were found and fixed. This is auto-generated from `activity_log.jsonl` \u2014 RACA writes to it as events happen.\n\n### Runs\n\nTracks each job submission \u2014 which model, which cluster, what status (pending, running, completed, failed), and links to the HuggingFace dataset with the results. Empty until you run something.\n\n### Artifacts\n\nLinks to all HuggingFace datasets produced by this experiment \u2014 canary runs, partial results, final data. Each artifact has metadata about what generated it. Empty until artifacts are uploaded.\n\n### Files\n\nAll the markdown and YAML files in the experiment folder. Click any file to read it. 
This is a quick way to browse the experiment's configuration and notes without leaving the dashboard.\n\n## Folder Structure\n\n```\nnotes/experiments/onboarding/\n  EXPERIMENT_README.md    \u2190 this file (shows in Overview tab)\n  experiment.yaml         \u2190 config: hypothesis, models, tasks\n  flow_state.json         \u2190 current phase (design/running/complete)\n  HUGGINGFACE_REPOS.md    \u2190 links to all uploaded datasets\n  questions.md            \u2190 research questions (read-only)\n  red_team_brief.md       \u2190 created during preflight review\n  activity_log.jsonl      \u2190 timeline entries (auto-generated)\n  user/                   \u2190 YOUR notes \u2014 RACA doesn't touch these\n    README.md             \u2190 your interpretation and observations\n    FINDINGS.md           \u2190 key results and surprises\n    DECISIONS.md          \u2190 design decisions and rationale\n    summary.md            \u2190 one-paragraph summary when done\n```\n\n**Most of this is automated.** RACA creates and updates the experiment files, uploads artifacts, and keeps the timeline current. The only files you write are in `user/` \u2014 that's your space for notes, findings, and decisions.\n\n## What's Next\n\nThis sample experiment hasn't been run yet \u2014 it's just here to show you the structure. When you're ready to run a real experiment, just tell RACA:\n\n> *I want to test whether Qwen3-8B follows complex instructions better than Llama-3.1-8B*\n\nOr try the full guided tutorial:\n\n> */raca:experiment-tutorial*\n",
    "created": "",
    "updated": ""
  },
  {
    "id": "onboarding__note__Users_hungting_Desktop_Lab_Projects_notes_experiments_onboarding_HUGGINGFACE_REPOS_md",
    "experiment_id": "onboarding",
    "title": "HUGGINGFACE_REPOS.md",
    "filename": "HUGGINGFACE_REPOS.md",
    "relative_path": "/Users/hungting/Desktop/Lab/Projects/notes/experiments/onboarding/HUGGINGFACE_REPOS.md",
    "content_md": "# HuggingFace Repositories\n\n| Dataset | Date | Rows | Purpose |\n|---------|------|------|---------|\n| [onboarding-countdown-qwen3-1.7b \u2014 10 rows, 10/10 correct (2026-04-06)](https://huggingface.co/datasets/timchen0618/onboarding-countdown-qwen3-1.7b) | 2026-04-06 | 10 | Qwen3-1.7B on Countdown (3-4 operands, targets 1-99, seed=42) |\n",
    "created": "",
    "updated": ""
  },
  {
    "id": "onboarding__note__Users_hungting_Desktop_Lab_Projects_notes_experiments_onboarding_questions_md",
    "experiment_id": "onboarding",
    "title": "questions.md",
    "filename": "questions.md",
    "relative_path": "/Users/hungting/Desktop/Lab/Projects/notes/experiments/onboarding/questions.md",
    "content_md": "# Research Questions\n\n1. Can Qwen3-1.7B solve basic Countdown problems (4 numbers, targets < 100)?\n2. What reasoning strategies does the model use (trial-and-error, systematic search, pattern matching)?\n3. Where does the model fail \u2014 wrong arithmetic, giving up, or invalid expressions?\n",
    "created": "",
    "updated": ""
  },
  {
    "id": "onboarding__note__Users_hungting_Desktop_Lab_Projects_notes_experiments_onboarding_red_team_brief_md",
    "experiment_id": "onboarding",
    "title": "red_team_brief.md",
    "filename": "red_team_brief.md",
    "relative_path": "/Users/hungting/Desktop/Lab/Projects/notes/experiments/onboarding/red_team_brief.md",
    "content_md": "# Red Team Brief \u2014 onboarding\n\n**Experiment:** Qwen3-1.7B on Countdown (tutorial/canary)\n**Reviewer:** agent\n**Date:** 2026-04-06\n**Status:** PASS\n\n---\n\n## Hypothesis\n\n> Qwen3-1.7B can solve basic Countdown arithmetic problems (>50% accuracy on 3-4 operand problems)\n\nThis is an exploratory baseline \u2014 a reasonable first question. No prior published result exists for Qwen3-1.7B specifically on this exact config, so the hypothesis is testable and non-trivial.\n\n---\n\n## Config Review\n\n### Model\n- **Qwen/Qwen3-1.7B** \u2014 small instruct model, appropriate for a tutorial canary\n- Known to handle arithmetic reasoning; 1.7B is at the edge of competence for Countdown (expect 20-50% accuracy)\n\n### max_tokens: 4096\n- **Status: PASS (marginal)**\n- Reference minimum is 2048-4096. We are at the ceiling of \"minimum\". For Qwen3-1.7B reasoning traces, 4096 should be sufficient for 3-4 operand problems.\n- **Watch for truncation in outputs** \u2014 if `finish_reason == \"length\"` appears in any response, flag immediately.\n\n### Prompt Format\n- **Must use the CoT + in-context examples prompt from the reference** (not the TinyZero `<think>` variant, since we are not RL-training)\n- Template:\n  ```\n  Answer the following problem. Explain your reasoning step by step. When you are finished, give your answer in this format: <answer>(your answer)</answer>.\n  \n  # Problem\n  Using the numbers in the list [{numbers}], create an equation that equals {target}. ...\n  ```\n- Answer extraction: regex on `<answer>...</answer>` tags\n\n### Evaluation Method\n- **Must use `CountdownJudge.validate_countdown_solution()` \u2014 NOT string matching**\n- Located at: `packages/custom_evaluations/custom_evaluations/sources/countdown/countdown_judge.py`\n- Handles: AST-based evaluation, Unicode operator normalization, multiset number validation\n- **Failure mode if string match is used:** \"3 + 5\" and \"5 + 3\" would score differently. 
This is wrong.\n\n### Dataset Generation\n- 10 samples (canary size \u2014 fine)\n- Use forward generation (inline, no dependencies): 3-4 operands, targets 1-99\n- Do NOT require all numbers used \u2014 subset is sufficient (this is the easier, more standard setting)\n\n---\n\n## Failure Modes & Mitigations\n\n| Risk | Severity | Mitigation |\n|------|----------|------------|\n| Truncated outputs (`finish_reason=length`) | HIGH | Check every row \u2014 flag if any truncation occurs |\n| String matching instead of equation eval | HIGH | Use `CountdownJudge` \u2014 verified AST-based |\n| Model outputs `\u00d7` instead of `*` | MEDIUM | `CountdownJudge` handles Unicode normalization |\n| `<answer>` tag missing in output | MEDIUM | Log separately as \"format failures\" vs \"wrong answer\" |\n| Duplicate problems in 10 samples | LOW | Use `random.seed(42)` for reproducibility |\n\n---\n\n## Expected Results\n\nBased on typical results from the reference file:\n- Qwen3-1.7B is smaller than Qwen2.5-3B (which achieves 70-80% after GRPO training)\n- For a **pre-GRPO instruct model** at this size, expect roughly **20-50% accuracy**\n- If accuracy is <10%, suspect prompt format issue or evaluation bug \u2014 investigate before scaling\n\n---\n\n## Output Schema\n\nExpected columns in the HuggingFace dataset:\n\n| Column | Type | Description |\n|--------|------|-------------|\n| `prompt` | str | Full prompt sent to the model |\n| `model_response` | str | Full untruncated model output |\n| `model` | str | `Qwen/Qwen3-1.7B` |\n| `numbers` | list[int] | Available operands |\n| `target` | int | Target value |\n| `correct` | bool | Whether the answer evaluates correctly |\n| `extracted_answer` | str | Parsed content from `<answer>` tags (null if missing) |\n| `finish_reason` | str | `stop` or `length` \u2014 flag any `length` rows |\n\n---\n\n## Validation Criteria (for data-validator)\n\nA healthy artifact must satisfy ALL of the following:\n1. 
All rows have non-empty `model_response`\n2. Zero rows with `finish_reason == \"length\"` (no truncation)\n3. `correct` column is boolean, computed via equation evaluation (not string match)\n4. `extracted_answer` is null only when model didn't produce `<answer>` tags\n5. Row count matches n_samples (10)\n6. No degenerate outputs (empty strings, repeated tokens, only whitespace)\n",
    "created": "",
    "updated": ""
  }
]