lsnu committed on
Commit df456b3 · verified · 1 Parent(s): 8d1e257

Clarify y_ready caveat and oven task structure

State that y_ready should not be treated as a decisive metric for the current oven benchmark and that the oven phase handoff is highly structured, so this task is mainly a smoke test / base-finetune comparison rather than strong evidence of general reveal-and-retrieve reasoning.

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -77,7 +77,9 @@ The main new fix in `iter24` is the assisted-door contact scoring inside `p_pre`
 
 The current repo state should therefore be treated as the repaired benchmark snapshot with geometry-aware door assistance, not the final metric design.
 
-Brief caveat: the current `y_ready` label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so `y_ready` can still switch later than the true reveal-to-retrieve boundary. Treat that as a known label-design limitation in the current artifacts.
+Brief caveat: the current `y_ready` label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so `y_ready` can still switch later than the true reveal-to-retrieve boundary. For the current oven benchmark, `y_ready` should therefore not be treated as a decisive validation metric or a trusted phase-switch target.
+
+The oven task also has a highly structured reveal-to-retrieve handoff in the expert demos: both arms reposition, the revealer opens and clears the door, then the retriever commits. Because that phase pattern is so standardized, good results on this task are most useful as a task-specific smoke test or a "does the adaptor beat a base finetune here?" check, not as strong evidence of general reveal-and-retrieve reasoning.
 
 ## What Is In This Upload
 
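
Reviewer note: the `y_ready` caveat in this change can be illustrated with a small sketch. This is not the repo's labeling code. The function name `first_ready_index`, the `persist_steps` window, and the `speed_thresh` value are all hypothetical, chosen only to mirror the described rule (feasibility must persist, and door angular speed must be low). It shows how a door nudge during an already-committed retrieval could push the label switch later.

```python
from typing import Optional, Sequence


def first_ready_index(
    feasible: Sequence[bool],
    door_speed: Sequence[float],
    persist_steps: int = 3,      # hypothetical persistence window (frames)
    speed_thresh: float = 0.05,  # hypothetical angular-speed gate (rad/s)
) -> Optional[int]:
    """Return the first frame where extraction feasibility has held for
    `persist_steps` consecutive frames AND door speed is below the gate."""
    run = 0
    for t, (ok, w) in enumerate(zip(feasible, door_speed)):
        run = run + 1 if ok else 0  # length of the current feasible streak
        if run >= persist_steps and abs(w) < speed_thresh:
            return t
    return None


# Same feasibility trace in both cases; only the door motion differs.
feasible = [False, True, True, True, True, True, True]
still_door = [0.3, 0.2, 0.01, 0.01, 0.01, 0.01, 0.01]
nudged_door = [0.3, 0.2, 0.01, 0.2, 0.2, 0.01, 0.01]  # retriever nudges the door

print(first_ready_index(feasible, still_door))
print(first_ready_index(feasible, nudged_door))
```

Under these assumed values the nudged trace switches two frames later than the still-door trace even though feasibility is identical, which is the kind of lag the caveat warns about.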