[ { "id": "onboarding", "name": "Onboarding", "research_project": "", "hypothesis": { "statement": "Qwen3-1.7B can solve basic Countdown arithmetic problems", "type": "exploratory", "status": "active", "success_criteria": "Model produces valid arithmetic expressions that reach the target number on >50% of problems" }, "stage": "active", "completeness": 4, "models": [], "tasks": [], "tags": [ "countdown", "reasoning", "onboarding", "tutorial" ], "hf_repos": [ { "repo": "timchen0618/onboarding-countdown-qwen3-1.7b", "description": "onboarding-countdown-qwen3-1.7b \u2014 10 rows, 10/10 correct (2026-04-06)", "date": "" } ], "wandb_url": "", "notes": "# Welcome to RACA\n\nThis is a sample experiment to show you how the dashboard works. You're looking at the **Overview** tab right now \u2014 it displays the experiment's README (this file).\n\nEverything you see here is generated from plain files in `notes/experiments/onboarding/`. You can browse them in your editor anytime.\n\n## How This Dashboard Works\n\nEach experiment has several tabs at the top. Here's what they do:\n\n### Overview (you are here)\n\nDisplays the experiment's README and any notes you've written in the `user/` folder. This is the main landing page for each experiment \u2014 a summary of what the experiment is, what you're investigating, and what you found.\n\n### Red Team Brief\n\nBefore any experiment runs, RACA reviews the design for potential problems \u2014 wrong evaluation metrics, truncated outputs, missing baselines, wasted compute. The brief lives at `red_team_brief.md`. This tab will be empty until you run your first real experiment.\n\n### Timeline\n\nA chronological log of everything that happened: when jobs were submitted, when artifacts were uploaded, when bugs were found and fixed. 
This is auto-generated from `activity_log.jsonl` \u2014 RACA writes to it as events happen.\n\n### Runs\n\nTracks each job submission \u2014 which model, which cluster, what status (pending, running, completed, failed), and links to the HuggingFace dataset with the results. Empty until you run something.\n\n### Artifacts\n\nLinks to all HuggingFace datasets produced by this experiment \u2014 canary runs, partial results, final data. Each artifact has metadata about what generated it. Empty until artifacts are uploaded.\n\n### Files\n\nAll the markdown and YAML files in the experiment folder. Click any file to read it. This is a quick way to browse the experiment's configuration and notes without leaving the dashboard.\n\n## Folder Structure\n\n```\nnotes/experiments/onboarding/\n  EXPERIMENT_README.md \u2190 this file (shows in Overview tab)\n  experiment.yaml \u2190 config: hypothesis, models, tasks\n  flow_state.json \u2190 current phase (design/running/complete)\n  HUGGINGFACE_REPOS.md \u2190 links to all uploaded datasets\n  questions.md \u2190 research questions (read-only)\n  red_team_brief.md \u2190 created during preflight review\n  activity_log.jsonl \u2190 timeline entries (auto-generated)\n  user/ \u2190 YOUR notes \u2014 RACA doesn't touch these\n    README.md \u2190 your interpretation and observations\n    FINDINGS.md \u2190 key results and surprises\n    DECISIONS.md \u2190 design decisions and rationale\n    summary.md \u2190 one-paragraph summary when done\n```\n\n**Most of this is automated.** RACA creates and updates the experiment files, uploads artifacts, and keeps the timeline current. The only files you write are in `user/` \u2014 that's your space for notes, findings, and decisions.\n\n## What's Next\n\nThis sample experiment hasn't been run yet \u2014 it's just here to show you the structure. 
When you're ready to run a real experiment, just tell RACA:\n\n> *I want to test whether Qwen3-8B follows complex instructions better than Llama-3.1-8B*\n\nOr try the full guided tutorial:\n\n> */raca:experiment-tutorial*\n", "zayne_summary": "# Summary\n\n_Write a one-paragraph summary of the experiment and its outcome when you're done._\n\n## Status: active\n\n## Next Steps\n\n_What to do next based on findings._", "zayne_readme": "# Onboarding Experiment \u2014 Your Notes\n\n## What I'm investigating\n\nThis is the tutorial experiment \u2014 testing Qwen3-1.7B on Countdown to learn the RACA pipeline.\n\n## Key observations\n\n_Fill this in as you review the results._\n\n## Open questions\n\n_Anything you want to follow up on._", "zayne_findings": "# Welcome to Your Dashboard\n\nThis is a sample experiment to show you how the dashboard works. Everything you see here is generated from plain files in `notes/experiments/onboarding/`.\n\n## Dashboard Tabs\n\nEach experiment has tabs at the top:\n\n- **Overview** \u2014 the experiment's README and your notes (you're reading this now)\n- **Red Team Brief** \u2014 RACA reviews experiment designs for problems before running. Empty until your first real experiment.\n- **Timeline** \u2014 chronological log of everything that happened (auto-generated from `activity_log.jsonl`)\n- **Runs** \u2014 tracks each job submission: model, cluster, status, HuggingFace dataset links\n- **Artifacts** \u2014 links to all HuggingFace datasets produced by this experiment\n- **Files** \u2014 browse all experiment files without leaving the dashboard\n\n## What's Automated vs What You Write\n\nMost of this is automated. 
RACA creates and updates experiment files, uploads artifacts, and keeps the timeline current.\n\nThe `user/` folder is yours \u2014 RACA doesn't touch it:\n- `user/FINDINGS.md` \u2014 key results and surprises (this file)\n- `user/README.md` \u2014 your interpretation and observations\n- `user/DECISIONS.md` \u2014 design decisions and rationale\n- `user/summary.md` \u2014 one-paragraph summary when done\n\n## What's Next\n\nThis sample experiment hasn't been run yet \u2014 it's here to show you the structure. When you're ready:\n\n> *I want to test whether Qwen3-8B follows complex instructions better than Llama-3.1-8B*\n\nOr try the full guided tutorial: `/raca:experiment-tutorial`", "zayne_decisions": "# Decisions\n\n| Date | Decision | Rationale |\n|------|----------|-----------|", "red_team_brief": "# Red Team Brief \u2014 onboarding\n\n**Experiment:** Qwen3-1.7B on Countdown (tutorial/canary)\n**Reviewer:** agent\n**Date:** 2026-04-06\n**Status:** PASS\n\n---\n\n## Hypothesis\n\n> Qwen3-1.7B can solve basic Countdown arithmetic problems (>50% accuracy on 3-4 operand problems)\n\nThis is an exploratory baseline \u2014 a reasonable first question. No prior published result exists for Qwen3-1.7B specifically on this exact config, so the hypothesis is testable and non-trivial.\n\n---\n\n## Config Review\n\n### Model\n- **Qwen/Qwen3-1.7B** \u2014 small instruct model, appropriate for a tutorial canary\n- Known to handle arithmetic reasoning; 1.7B is at the edge of competence for Countdown (expect 20-50% accuracy)\n\n### max_tokens: 4096\n- **Status: PASS (marginal)**\n- Reference minimum is 2048-4096. We are at the ceiling of \"minimum\". 
For Qwen3-1.7B reasoning traces, 4096 should be sufficient for 3-4 operand problems.\n- **Watch for truncation in outputs** \u2014 if `finish_reason == \"length\"` appears in any response, flag immediately.\n\n### Prompt Format\n- **Must use the CoT + in-context examples prompt from the reference** (not the TinyZero variant, since we are not RL-training)\n- Template:\n ```\n Answer the following problem. Explain your reasoning step by step. When you are finished, give your answer in this format: <answer>(your answer)</answer>.\n \n # Problem\n Using the numbers in the list [{numbers}], create an equation that equals {target}. ...\n ```\n- Answer extraction: regex on `<answer>...</answer>` tags\n\n### Evaluation Method\n- **Must use `CountdownJudge.validate_countdown_solution()` \u2014 NOT string matching**\n- Located at: `packages/custom_evaluations/custom_evaluations/sources/countdown/countdown_judge.py`\n- Handles: AST-based evaluation, Unicode operator normalization, multiset number validation\n- **Failure mode if string match is used:** \"3 + 5\" and \"5 + 3\" would score differently. 
This is wrong.\n\n### Dataset Generation\n- 10 samples (canary size \u2014 fine)\n- Use forward generation (inline, no dependencies): 3-4 operands, targets 1-99\n- Do NOT require all numbers used \u2014 subset is sufficient (this is the easier, more standard setting)\n\n---\n\n## Failure Modes & Mitigations\n\n| Risk | Severity | Mitigation |\n|------|----------|------------|\n| Truncated outputs (`finish_reason=length`) | HIGH | Check every row \u2014 flag if any truncation occurs |\n| String matching instead of equation eval | HIGH | Use `CountdownJudge` \u2014 verified AST-based |\n| Model outputs `\u00d7` instead of `*` | MEDIUM | `CountdownJudge` handles Unicode normalization |\n| `<answer>` tag missing in output | MEDIUM | Log separately as \"format failures\" vs \"wrong answer\" |\n| Duplicate problems in 10 samples | LOW | Use `random.seed(42)` for reproducibility |\n\n---\n\n## Expected Results\n\nBased on typical results from the reference file:\n- Qwen3-1.7B is smaller than Qwen2.5-3B (which achieves 70-80% after GRPO training)\n- For a **pre-GRPO instruct model** at this size, expect roughly **20-50% accuracy**\n- If accuracy is <10%, suspect a prompt format issue or evaluation bug \u2014 investigate before scaling\n\n---\n\n## Output Schema\n\nExpected columns in the HuggingFace dataset:\n\n| Column | Type | Description |\n|--------|------|-------------|\n| `prompt` | str | Full prompt sent to the model |\n| `model_response` | str | Full untruncated model output |\n| `model` | str | `Qwen/Qwen3-1.7B` |\n| `numbers` | list[int] | Available operands |\n| `target` | int | Target value |\n| `correct` | bool | Whether the answer evaluates correctly |\n| `extracted_answer` | str | Parsed content from `<answer>` tags (null if missing) |\n| `finish_reason` | str | `stop` or `length` \u2014 flag any `length` rows |\n\n---\n\n## Validation Criteria (for data-validator)\n\nA healthy artifact must satisfy ALL of the following:\n1. All rows have non-empty `model_response`\n2. 
Zero rows with `finish_reason == \"length\"` (no truncation)\n3. `correct` column is boolean, computed via equation evaluation (not string match)\n4. `extracted_answer` is null only when the model didn't produce `<answer>` tags\n5. Row count matches n_samples (10)\n6. No degenerate outputs (empty strings, repeated tokens, only whitespace)\n", "created": "", "updated": "" } ]