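# Capability Data Foundation: Reuse Rules + Self-Production Rules

> **Task source**: Task #241 (A0). This document defines, for the Capability
> framework, which data can be shared across capabilities, who reads and who
> writes, and how a capability produces a new data foundation (DF) of its own.
>
> Companion implementation tasks:
>
> - **B2**: external-knowledge adapter, already landed (`internal-experience` really reads the self-produced DF)
> - **B7**: self-produced DF distill / reuse / table schema (to be implemented)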

---

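## Chapter 1 Reuse rules (down to table name + join key + read/write permission boundary)

> **Guiding principle**: each table has exactly one **true writer**; every other
> capability is read-only. Scattering write permission means losing the
> source of truth, and an audit can no longer replay what happened.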

### 1.1 The four causal network tables: `local_causal_nets` / `causal_edges` / `literature_buckets` / `causal_buckets`

| Table | True writer | Read access | Join key | Notes |
|---|---|---|---|---|
| `local_causal_nets` | research-engine M3 (`pipelines/runners.py:M3`) | any capability | `paper_id` | per-paper causal graph, persisted by M3 fulltext_upgrade |
| `causal_edges` | research-engine M3 + M5 | any capability | `(source_node_id, target_node_id, paper_id)` | raw edge-level data |
| `literature_buckets` | research-engine M5 (`networks/causal.py:215-217`) | any capability | `bucket_key k(ε)` | bucket aggregation |
| `causal_buckets` | research-engine M5 | any capability | `bucket_key` | same as above, with causal coefficient + CI |

- **Write gate**: rows are persisted only after M3/M5 has finished a paper; **capabilities must not INSERT directly**.
  The write path is frozen in `docs/architecture/causal-network.md`.
- **Worked example (target_discovery)**: `causal_buckets` LEFT JOIN
  `disease_target_mapping` yields "how many papers support this target for this disease".

### 1.2 The drug network four-piece set: `drug_nodes` / `target_nodes` / `outcome_nodes` / `*_edges`

| Table | True writer | Read access | Join key | Notes |
|---|---|---|---|---|
| `drug_nodes` | `PostgresDrugNetwork.bootstrap` (`drug_store.py:29-595`) | any capability | `drug_id` | 2336 drugs (measured) |
| `target_nodes` | same as above | same as above | `target_id` | |
| `outcome_nodes` | same as above | same as above | `outcome_id` | |
| Three edge tables (`drug_target_edges` / `drug_outcome_edges` / `trial_edges`) | same as above | same as above | composite key | 6069 trials (measured) |

- **Write gate**: `drug_trial_mapping.csv` is the only source; bootstrap runs once per startup.
- **Worked example (ligand_screening)**: intersect `target_nodes` with `drug_target_edges` to find
  known active ligands for query expansion; **writing syntheticActive into `drug_nodes`
  is not allowed** (this is exactly the deception path that CONT-001 sealed off).

### 1.3 DuckDB literature index + literature snapshot

| Resource | True writer | Read access | Notes |
|---|---|---|---|
| `papers.duckdb` (snapshot) | Dataset Download workflow (local manual run / CI) | any capability (read-only ATTACH) | mtime change → BM25 index rebuild |
| `fts_main_papers` index | built lazily by `tool_handlers.py:_fts_get_or_build` | same as above | in-memory, per-process |
| `pmid_paper_lookup` | written by research-engine, initialized at container startup | any capability | PMID → paper_id reverse lookup |

- **Write gate**: the snapshot file is generated by a separate dataset workflow and attached read-only;
  a capability running its inner loop **must not write** to `papers.duckdb`.
- Actual read path: `external-knowledge/literature.adapter.ts` (three-way fallback,
  origin=`'duckdb:bm25'` / `'duckdb:cooccur'` / `'epmc'`).
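
The three-way fallback can be sketched as below. This is only an illustration, not the contents of `literature.adapter.ts`: the `LiteratureBackend` type, `searchWithFallback`, and the stub backends are hypothetical names, and the rule that an empty or failing backend falls through to the next one is an assumption. Only the three origin strings come from this document.

```ts
type LiteratureOrigin = 'duckdb:bm25' | 'duckdb:cooccur' | 'epmc';

interface LiteratureHit {
  paperId: string;
  score: number;
}

interface LiteratureResult {
  origin: LiteratureOrigin;
  hits: LiteratureHit[];
}

// Hypothetical backend shape. Real backends would be async; they are kept
// synchronous here so the sketch stays self-contained.
interface LiteratureBackend {
  origin: LiteratureOrigin;
  search: (query: string) => LiteratureHit[];
}

// Try each backend in order. A thrown error or an empty hit list falls
// through to the next backend (assumed rule). Returns null if none has hits.
function searchWithFallback(
  backends: LiteratureBackend[],
  query: string,
): LiteratureResult | null {
  for (const backend of backends) {
    try {
      const hits = backend.search(query);
      if (hits.length > 0) {
        return { origin: backend.origin, hits };
      }
    } catch {
      // Backend unavailable: fall through to the next one.
    }
  }
  return null;
}

// Usage with stub backends: BM25 fails, co-occurrence is empty, EPMC answers.
const stubBackends: LiteratureBackend[] = [
  { origin: 'duckdb:bm25', search: () => { throw new Error('index not built'); } },
  { origin: 'duckdb:cooccur', search: () => [] },
  { origin: 'epmc', search: (q) => [{ paperId: `epmc:${q}`, score: 1 }] },
];
const fallbackResult = searchWithFallback(stubBackends, 'BACE1');
```

Carrying `origin` on the result lets a caller tell which backend actually answered, which is what the origin tags above are for.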

### 1.4 `submission_feedback_ledger`: organizer ground-truth feedback (loop closed)

| Table | True writer | Read access | Join key | Notes |
|---|---|---|---|---|
| `submission_feedback_ledger` | `routes/competition-feedback-import.ts` (operator upload / HMAC bridge sync) | reviewer.fitness | `(submission_id, metric_name)` | metric: ef1 / ef5 / auc; value already normalized to [0,1] |

- **Closed-loop path** (diagnosis 1.2, channel 6): the reviewer's `external_truth` channel
  does a LEFT JOIN for each metric row; on a match it upgrades to the `with-truth` weight set,
  otherwise it falls back with `truthIsFallback=true`.
- **Worked example (admet_prediction)**: experimentally measured PK data is fed back through the same table,
  with `metric_name` set to `clearance` / `t_half`; the reviewer does the same LEFT JOIN.
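
The LEFT JOIN with fallback described above can be mimicked in memory as follows. This is a sketch under assumptions: `LedgerRow`, `TruthLookup`, and `joinExternalTruth` are hypothetical names, and the real reviewer presumably runs the join in SQL against `submission_feedback_ledger`.

```ts
interface LedgerRow {
  submissionId: string;
  metricName: string; // e.g. 'ef1', 'ef5', 'auc', 'clearance'
  value: number;      // already normalized to [0, 1]
}

interface TruthLookup {
  metricName: string;
  truthValue: number | null;
  truthIsFallback: boolean; // true when no ledger row matched
}

// In-memory equivalent of a LEFT JOIN on (submission_id, metric_name):
// every requested metric yields exactly one output row, matched or not.
function joinExternalTruth(
  submissionId: string,
  metricNames: string[],
  ledger: LedgerRow[],
): TruthLookup[] {
  const byKey = new Map<string, number>();
  for (const row of ledger) {
    byKey.set(`${row.submissionId}|${row.metricName}`, row.value);
  }
  return metricNames.map((metricName) => {
    const value = byKey.get(`${submissionId}|${metricName}`);
    return value === undefined
      ? { metricName, truthValue: null, truthIsFallback: true }
      : { metricName, truthValue: value, truthIsFallback: false };
  });
}

// Usage: only ef1 has organizer feedback, so auc falls back.
const truthRows = joinExternalTruth(
  'sub-1',
  ['ef1', 'auc'],
  [{ submissionId: 'sub-1', metricName: 'ef1', value: 0.42 }],
);
```

Rows with `truthIsFallback: false` are the ones that would be scored with the `with-truth` weight set.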

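### 1.5 `user_memory_facts`: cross-session user experience

| Table | True writer | Read access | Join key | Notes |
|---|---|---|---|---|
| `user_memory_facts` | `routes/messages.ts` (chat-side distill) | any capability (current user only) | `(user_id, fact_kind)` | long-term memory across sessions |

- **Write gate**: rows are persisted only when the chat flow explicitly calls `remember_fact`; **capabilities
  do not write directly** (this avoids clashes between the writing styles of different capabilities).
- Read access: in the `observe` step each capability may fetch the current user's facts for prompt
  enrichment, but **reading across users is not allowed**.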

### 1.6 Reuse matrix quick reference

| Table / resource | ligand_screening | target_discovery | admet_prediction | meta_planner |
|---|---|---|---|---|
| `causal_*` | read (target→outcome paths) | read (disease→target evidence) | read (drug→outcome side effects) | read (rationale summary) |
| `drug_*` | read + query expansion | read (known interventions) | read (known PK) | read |
| `papers.duckdb` | read (target literature) | read (disease mechanism) | read (model literature) | read |
| `submission_feedback_ledger` | read (EF1% feedback) | read (target hit rate feedback) | read (measured PK feedback) | read (overall thumb-up feedback) |
| `user_memory_facts` | read (user preferences) | same | same | read + routing preferences |
| `capability_data_foundation` (B7) | read + write | read + write | read + write | read |
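
One way to make the matrix enforceable is to encode it as a typed lookup and check it before any write. The sketch below is illustrative only; `REUSE_MATRIX`, `canWrite`, and `assertCanWrite` are hypothetical names, not existing code in this repo.

```ts
type AccessLevel = 'read' | 'read+write';
type CapabilityKind =
  | 'ligand_screening' | 'target_discovery' | 'admet_prediction' | 'meta_planner';
type FoundationResource =
  | 'causal_*' | 'drug_*' | 'papers.duckdb'
  | 'submission_feedback_ledger' | 'user_memory_facts'
  | 'capability_data_foundation';

// The five existing foundations are read-only for every capability.
const READ_ONLY: Record<CapabilityKind, AccessLevel> = {
  ligand_screening: 'read',
  target_discovery: 'read',
  admet_prediction: 'read',
  meta_planner: 'read',
};

const REUSE_MATRIX: Record<FoundationResource, Record<CapabilityKind, AccessLevel>> = {
  'causal_*': READ_ONLY,
  'drug_*': READ_ONLY,
  'papers.duckdb': READ_ONLY,
  submission_feedback_ledger: READ_ONLY,
  user_memory_facts: READ_ONLY,
  // Only the self-produced table is writable, and not by meta_planner.
  capability_data_foundation: {
    ligand_screening: 'read+write',
    target_discovery: 'read+write',
    admet_prediction: 'read+write',
    meta_planner: 'read',
  },
};

function canWrite(capability: CapabilityKind, resource: FoundationResource): boolean {
  return REUSE_MATRIX[resource][capability] === 'read+write';
}

// Throws before a forbidden write reaches the database.
function assertCanWrite(capability: CapabilityKind, resource: FoundationResource): void {
  if (!canWrite(capability, resource)) {
    throw new Error(`${capability} may not write ${resource}`);
  }
}
```

Because the resource and capability names are union types, a typo in either is caught at compile time rather than at write time.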

---

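## Chapter 2 Self-production rules (basis for the B7 implementation)

> Chapter 1 covers the reusable foundations that already exist; this chapter covers what a capability
> can produce after finishing a run, how it is persisted, and who can read it.

### 2.1 Namespace convention

- `<capability_id>_<entity_kind>`: every distill row must be traceable back to its capability
  source.
- Examples (written here as `capability_id / entity_kind`):
  - `cap_ligand_screening_drugclip / param-fingerprint`
  - `cap_target_discovery_omics / disease-target-prior`
  - `cap_admet_prediction_qsar / model-error-pattern`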

### 2.2 Draft table schema: `capability_data_foundation`

```sql
CREATE TABLE capability_data_foundation (
  id              VARCHAR PRIMARY KEY DEFAULT gen_random_uuid(),
  capability_id   VARCHAR NOT NULL,                    -- writer, FK to capabilities.id
  kind            VARCHAR NOT NULL,                    -- entity_kind within the namespace
  payload         JSONB   NOT NULL,                    -- actual content; schema is determined by kind (see Chapter 3)
  payload_schema_version  INTEGER NOT NULL DEFAULT 1, -- payload schema version, +1 whenever B7 evolves it
  produced_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  source_run_id   VARCHAR NOT NULL,                    -- FK to capability_run_audit.run_id
  source_version  VARCHAR NOT NULL,                    -- active_version_id at production time, for audit replay
  retention_days  INTEGER NOT NULL DEFAULT 90,         -- TTL; expired rows are deleted by the cleanup job
  is_failure_distill BOOLEAN NOT NULL DEFAULT false,   -- true means distilled from a failure path
  CONSTRAINT cap_df_kind_payload_check CHECK (jsonb_typeof(payload) = 'object'),
  CONSTRAINT cap_df_capability_kind_idx UNIQUE (capability_id, kind, source_run_id)
);

CREATE INDEX cap_df_lookup ON capability_data_foundation (capability_id, kind, produced_at DESC);
CREATE INDEX cap_df_source_run ON capability_data_foundation (source_run_id);
```

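**Key constraints**:
- `payload_schema_version` is required: whenever the payload of a `kind` evolves, the version is bumped by 1 as a whole,
  and the B2 `internal-experience.adapter.ts` adapts by version on read.
- `source_run_id` + `source_version` are required: any distill can be replayed against the capability
  configuration in effect when it was produced.
- `is_failure_distill`: failure paths can be distilled too (see Section 2.4).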
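
"Adapting by version on read" could take the form of stepwise payload migrations, as sketched below. Everything here is an assumption made for illustration: the `PAYLOAD_MIGRATIONS` registry, the `adaptPayload` helper, and in particular the `top_k` to `topK` rename, which is invented and does not describe a real schema change.

```ts
type DistillPayload = Record<string, unknown>;
type PayloadMigration = (payload: DistillPayload) => DistillPayload;

// Hypothetical registry: PAYLOAD_MIGRATIONS[kind][v] upgrades a payload of
// that kind from schema version v to v + 1.
const PAYLOAD_MIGRATIONS: Record<string, Record<number, PayloadMigration>> = {
  'param-fingerprint': {
    // Illustrative only: pretend version 2 renamed `top_k` to `topK`.
    1: ({ top_k, ...rest }) => ({ ...rest, topK: top_k }),
  },
};

// Apply migrations one version at a time until targetVersion is reached.
// Throws when a step is missing, so a reader never silently misinterprets
// an old payload.
function adaptPayload(
  kind: string,
  version: number,
  payload: DistillPayload,
  targetVersion: number,
): DistillPayload {
  let current = payload;
  for (let v = version; v < targetVersion; v++) {
    const step = PAYLOAD_MIGRATIONS[kind]?.[v];
    if (!step) {
      throw new Error(`no migration for ${kind} v${v} -> v${v + 1}`);
    }
    current = step(current);
  }
  return current;
}

// Usage: a version-1 row is upgraded to version 2 before being returned.
const upgradedPayload = adaptPayload('param-fingerprint', 1, { top_k: 50, shingleK: 3 }, 2);
```

Failing loudly on a missing migration step is a design choice of this sketch; an adapter could equally skip such rows.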

### 2.3 Distill timing (success path)

After every cap.run completes (whether or not it is promoted), apply the following decision:

```
if (run.reviewerScore > capability.distillThreshold):
    write_distill(kind='param-fingerprint', payload=run.config_summary)
if (run.has_novel_evidence):  # the external_truth channel fed back new ground truth
    write_distill(kind='evidence-bucket-prior', payload=run.evidence_summary)
```

- **distillThreshold** defaults to 0.7 (each capability can override it in evolution_config).
- Distill is **fire-and-forget**: a distill failure does not block cap.run.
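
The pseudocode above translates into TypeScript roughly as follows. This is a sketch, not the B7 implementation: `RunSummary`, `DistillRequest`, `planSuccessDistills`, and `fireAndForget` are hypothetical names, and the decision is kept as a pure function so it can be tested without a database.

```ts
interface RunSummary {
  runId: string;
  reviewerScore: number;     // 0..1
  hasNovelEvidence: boolean; // external_truth fed back new ground truth
  configSummary: Record<string, unknown>;
  evidenceSummary: Record<string, unknown>;
}

interface DistillRequest {
  kind: 'param-fingerprint' | 'evidence-bucket-prior';
  payload: Record<string, unknown>;
  sourceRunId: string;
}

const DEFAULT_DISTILL_THRESHOLD = 0.7;

// Decide which distill rows a finished run should produce.
// The score comparison is strict (>), matching the pseudocode.
function planSuccessDistills(
  run: RunSummary,
  distillThreshold: number = DEFAULT_DISTILL_THRESHOLD,
): DistillRequest[] {
  const requests: DistillRequest[] = [];
  if (run.reviewerScore > distillThreshold) {
    requests.push({ kind: 'param-fingerprint', payload: run.configSummary, sourceRunId: run.runId });
  }
  if (run.hasNovelEvidence) {
    requests.push({ kind: 'evidence-bucket-prior', payload: run.evidenceSummary, sourceRunId: run.runId });
  }
  return requests;
}

// Fire-and-forget: start the async write, never await it, and swallow any
// rejection so a distill failure cannot block or fail cap.run.
function fireAndForget(write: () => Promise<void>): void {
  void write().catch(() => {
    // Intentionally ignored; a real implementation would likely log here.
  });
}
```

A caller would loop over `planSuccessDistills(run)` and wrap each write in `fireAndForget`, so the run returns before any distill row is persisted.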

### 2.4 Distill timing (failure path) ⭐

A failed cap.run **has learning value too**: failure modes are meta-knowledge that helps later runs avoid the same pitfalls:

```
if (run.judge.fallback_rate > 0.5):
    write_distill(kind='failure-pattern',
                  payload={ failed_channels, root_cause_diagnosis, retry_advice },
                  is_failure_distill=true)
if (run.quarantine_hits.length > 0):
    write_distill(kind='quarantine-trigger',
                  payload={ contamination_ids, run_input_fingerprint },
                  is_failure_distill=true)
```

When a peer capability later retrieves through the `internal-experience` adapter, failure distills are
**not returned by default**; they are returned only when `kind` explicitly includes `failure-pattern` or
`evidence.bias.confidence > 0.7`.
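
A TypeScript sketch of both halves, the failure-path write decision and the default-hidden read filter, is given below. All type and function names are hypothetical. This document does not say where `evidence.bias.confidence` lives, so the sketch assumes it arrives as an optional number on the read query.

```ts
interface FailedRun {
  runId: string;
  judgeFallbackRate: number; // run.judge.fallback_rate
  quarantineHits: string[];  // contamination ids, e.g. 'CONT-001'
  failedChannels: string[];
  rootCauseDiagnosis: string;
  retryAdvice: Record<string, unknown>;
  inputFingerprint: string;
}

interface FailureDistill {
  kind: 'failure-pattern' | 'quarantine-trigger';
  payload: Record<string, unknown>;
  isFailureDistill: true;
}

// Write side: mirrors the two conditions in the pseudocode above.
function planFailureDistills(run: FailedRun): FailureDistill[] {
  const out: FailureDistill[] = [];
  if (run.judgeFallbackRate > 0.5) {
    out.push({
      kind: 'failure-pattern',
      payload: {
        failed_channels: run.failedChannels,
        root_cause_diagnosis: run.rootCauseDiagnosis,
        retry_advice: run.retryAdvice,
      },
      isFailureDistill: true,
    });
  }
  if (run.quarantineHits.length > 0) {
    out.push({
      kind: 'quarantine-trigger',
      payload: { contamination_ids: run.quarantineHits, run_input_fingerprint: run.inputFingerprint },
      isFailureDistill: true,
    });
  }
  return out;
}

interface StoredDistill { kind: string; isFailureDistill: boolean }
interface ReadQuery {
  kinds: string[];
  biasConfidence?: number; // assumed location of evidence.bias.confidence
}

// Read side: failure distills stay hidden unless the query opts in.
function visibleRows<T extends StoredDistill>(rows: T[], query: ReadQuery): T[] {
  const allowFailures =
    query.kinds.includes('failure-pattern') || (query.biasConfidence ?? 0) > 0.7;
  return rows.filter((row) => allowFailures || !row.isFailureDistill);
}
```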

### 2.5 Peer capabilities read through the `internal-experience` adapter

`external-knowledge/internal-experience.adapter.ts` was wired up for real in B2:

```ts
queryExternalKnowledge({
  kind: 'internal_experience',
  capabilityId: 'cap_target_discovery_omics_v1',
  query: { kind: 'disease-target-prior', filters: { disease: 'Alzheimer' } },
})
// → KnowledgeResult[]
// Internally runs: SELECT payload FROM capability_data_foundation
//         WHERE kind = $1 AND payload @> $2
//         AND produced_at > now() - interval '90 days'
//         ORDER BY produced_at DESC LIMIT 20
```

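**Key constraints**:
- Reading another capability's self-produced DF **does not trigger the budget guard** (it is our own system,
  not an external API); however, **distill calls that write the self-produced DF are limited by distill_budget**
  (default: 100 rows per hour per capability).
- `excludedSelf` defaults to true: a capability does not read its own distills (to avoid an echo
  chamber). `excludedSelf=false` is enabled only when `meta_planner` reuses historical routing decisions.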
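
The two constraints can be sketched as a per-capability rate limiter plus a row filter. This is a minimal in-memory illustration: `DistillBudget` and `applyExcludedSelf` are hypothetical names, and the sliding one-hour window is an assumption (the rule only says "per hour").

```ts
const HOUR_MS = 60 * 60 * 1000;

// Sliding-window limiter: at most `limitPerHour` distill writes per
// capability within any trailing hour (windowing strategy is assumed).
class DistillBudget {
  private readonly limitPerHour: number;
  // capabilityId -> timestamps (ms) of accepted writes in the last hour
  private readonly writes = new Map<string, number[]>();

  constructor(limitPerHour = 100) {
    this.limitPerHour = limitPerHour;
  }

  // Records the write and returns true if the capability is under budget;
  // returns false and records nothing otherwise.
  tryConsume(capabilityId: string, nowMs: number): boolean {
    const recent = (this.writes.get(capabilityId) ?? []).filter((t) => nowMs - t < HOUR_MS);
    const allowed = recent.length < this.limitPerHour;
    if (allowed) {
      recent.push(nowMs);
    }
    this.writes.set(capabilityId, recent);
    return allowed;
  }
}

interface OwnedRow {
  capabilityId: string;
}

// With excludedSelf=true (the default) the reader's own rows are dropped.
function applyExcludedSelf<T extends OwnedRow>(
  rows: T[],
  readerId: string,
  excludedSelf = true,
): T[] {
  return excludedSelf ? rows.filter((row) => row.capabilityId !== readerId) : rows;
}
```

Passing the current time in as `nowMs` keeps the limiter deterministic under test; a production version would more likely count rows in the database than keep state in process memory.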

---

## Chapter 3 Examples: concrete self-produced schemas for 2 capabilities

### 3.1 ligand_screening: `kind='param-fingerprint'`

```ts
interface ParamFingerprintPayload {
  pipeline_version: string;          // pipeline_skeleton version at production time
  config: {
    shingleK: number;
    topK: number;
    fingerprintBackend: 'jaccard' | 'rdkit_morgan';  // rdkit_morgan only exists after B8
    rerank: boolean;
  };
  outcome_metrics: {
    ef1: number | null;              // null means not computed (no actives annotation)
    ef5: number | null;
    auc_roc: number | null;
    reviewer_score: number;          // 0..1
  };
  novel_actives: Array<{             // top-K ligands from this run that had not been seen before
    smiles: string;
    rank: number;
    score: number;
  }>;
}
```

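- **Reuse scenario**: on its next cold_start, the same capability's `cap.observe` step pulls
  the historically best config from here as a prior; during fan-out, `meta_planner` sorts by
  `outcome_metrics.ef1 desc` and picks the capability instance that performed best over the last 7 days.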
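
The `meta_planner` selection could look like the sketch below. `FingerprintRow` and `pickBestByEf1` are hypothetical names, and skipping rows whose `ef1` is null is an assumption, since the ordering rule does not say how uncomputed metrics rank.

```ts
interface FingerprintRow {
  capabilityId: string;
  producedAt: Date;
  ef1: number | null; // outcome_metrics.ef1; null when not computed
}

const SELECTION_WINDOW_DAYS = 7;

// Return the capability whose param-fingerprint has the highest ef1 among
// rows produced within the window, or null if no row qualifies.
function pickBestByEf1(
  rows: FingerprintRow[],
  now: Date,
  windowDays: number = SELECTION_WINDOW_DAYS,
): string | null {
  const cutoffMs = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  let best: { capabilityId: string; ef1: number } | null = null;
  for (const row of rows) {
    // Skip rows without a computed ef1 (assumed) and rows outside the window.
    if (row.ef1 === null || row.producedAt.getTime() < cutoffMs) {
      continue;
    }
    if (best === null || row.ef1 > best.ef1) {
      best = { capabilityId: row.capabilityId, ef1: row.ef1 };
    }
  }
  return best === null ? null : best.capabilityId;
}
```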

### 3.2 ligand_screening: `kind='failure-pattern'` (failure distill)

```ts
interface FailurePatternPayload {
  failed_channels: string[];         // e.g. ['intent_coverage', 'traceability']
  root_cause_diagnosis: string;      // produced by the diagnoser-role LLM (B6)
  retry_advice: {
    config_changes: Record<string, unknown>;  // suggested config changes
    avoid_inputs: string[];                    // e.g. ["candidate_count > 10000"]
  };
  contamination_hits: string[];      // list of CONT-XXX ids
}
```

### 3.3 target_discovery: `kind='disease-target-prior'`

```ts
interface DiseaseTargetPriorPayload {
  disease_term: string;              // e.g. 'Alzheimer disease'
  disease_mesh_id: string | null;    // normalized identifier
  candidate_targets: Array<{
    target_id: string;
    target_symbol: string;
    evidence_score: number;          // 0..1, combining causal + literature
    evidence_sources: {
      causal_bucket_count: number;
      paper_count: number;
      drug_intervention_count: number;
    };
    novelty_flag: boolean;           // whether this is a new candidate (not in the known disease-target library)
  }>;
  generated_at: string;              // ISO timestamp
}
```

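- **Reuse scenario**: `cap_ligand_screening_drugclip` can read this in its plan step and use
  `candidate_targets[0].target_id` directly for query expansion;
  `cap_admet_prediction` reads it in its observe step and filters out targets with
  novelty_flag=true (little experimental data exists for them).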
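
Both consumers can be sketched against a trimmed-down version of the payload. The helper names `queryExpansionTarget` and `establishedTargets` are hypothetical, and the interfaces keep only the fields these two readers touch.

```ts
interface CandidateTarget {
  target_id: string;
  evidence_score: number; // 0..1
  novelty_flag: boolean;
}

interface DiseaseTargetPrior {
  disease_term: string;
  candidate_targets: CandidateTarget[];
}

// ligand_screening, plan step: take the first candidate as the query
// expansion target, or null when the prior lists no candidates.
function queryExpansionTarget(prior: DiseaseTargetPrior): string | null {
  const first = prior.candidate_targets[0];
  return first === undefined ? null : first.target_id;
}

// admet_prediction, observe step: drop novel targets, for which little
// experimental data exists.
function establishedTargets(prior: DiseaseTargetPrior): CandidateTarget[] {
  return prior.candidate_targets.filter((t) => !t.novelty_flag);
}

// Usage with made-up target ids.
const examplePrior: DiseaseTargetPrior = {
  disease_term: 'Alzheimer disease',
  candidate_targets: [
    { target_id: 'T-001', evidence_score: 0.9, novelty_flag: false },
    { target_id: 'T-002', evidence_score: 0.7, novelty_flag: true },
  ],
};
const expansionTarget = queryExpansionTarget(examplePrior);
const knownTargets = establishedTargets(examplePrior);
```

Note that using index 0 only yields the strongest candidate if the producer writes `candidate_targets` in descending `evidence_score` order, which the payload definition does not state.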

### 3.4 target_discovery: `kind='evidence-bucket-prior'`

```ts
interface EvidenceBucketPriorPayload {
  bucket_key: string;                // from causal_buckets, k(ε)
  source_node: { id: string, label: string };
  target_node: { id: string, label: string };
  pooled_estimate: number;           // pooled causal effect
  ci: { lower: number, upper: number };
  paper_evidence_count: number;
  validation_status: 'unvalidated' | 'validated_v2' | 'rejected_v2';
}
```

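### 3.5 Cold-start pitfalls: the DF is empty on the first run, so how is a circular dependency avoided?

**Problem**: when a capability first enters `COLD_START`, `capability_data_foundation`
has no rows at all; the `internal-experience` adapter returns an empty array, and the inner loop's
`observe` step gets no prior.

**Solution** (B7 must implement):

1. **Empty queries return an explicit marker**: on 0 rows, `internal-experience.adapter.ts`
   returns `{ origin: 'internal:empty', results: [] }` and **skips the cache path**;
   when the inner loop sees `origin='internal:empty'` it takes the "no prior" branch.
2. **Cross-capability seeding**: when a new capability is registered, the operator can manually specify
   `seed_from_capabilities: string[]`; during cold_start the capability may then pull N=10 rows from
   sibling capabilities' `param-fingerprint` as a cold-start prior (read-only, nothing is written to its own DF).
   This path must be explicitly enabled in `capabilities.config.cold_start_seeds` and is off by default.
3. **The first round relies 100% on external knowledge**: during cold_start the router forces
   the `internal_experience` adapter budget to 0, so all candidate generation depends on
   the three real backends (literature / causal / drug) plus operator-specified seed data.
4. **Self-produced reads open only after graduation**: one of the conditions for the `COLD_START → GRADUATED` transition is
   "at least 30 `param-fingerprint` rows produced" (they need not all score high; having data is enough).
   After GRADUATED the `internal_experience` budget returns to its default and the self-produced DF enters reuse.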
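
The four rules can be condensed into a few small predicates, sketched below. All names are hypothetical; the constants 30 and 10 and the `internal:empty` marker come from the list above, while the function shapes are assumptions.

```ts
type LifecycleState = 'COLD_START' | 'GRADUATED';

interface InternalExperienceResult {
  origin: string; // 'internal:empty' marks a query that matched 0 rows
  results: unknown[];
}

// Rule 1: the inner loop takes the "no prior" branch on the empty marker.
function hasPrior(res: InternalExperienceResult): boolean {
  return res.origin !== 'internal:empty' && res.results.length > 0;
}

// Rule 2: seeding applies only during cold start, and only when the operator
// listed seed capabilities (an empty list means the feature is off).
const SEED_ROWS_PER_RUN = 10;
function seedRowLimit(state: LifecycleState, coldStartSeeds: string[]): number {
  return state === 'COLD_START' && coldStartSeeds.length > 0 ? SEED_ROWS_PER_RUN : 0;
}

// Rule 3: the internal_experience budget is forced to 0 during cold start.
function internalExperienceBudget(state: LifecycleState, defaultBudget: number): number {
  return state === 'COLD_START' ? 0 : defaultBudget;
}

// Rule 4: one of the graduation conditions (others may exist) is having
// produced at least 30 param-fingerprint rows, regardless of their scores.
const REQUIRED_FINGERPRINTS = 30;
function meetsFingerprintQuota(paramFingerprintCount: number): boolean {
  return paramFingerprintCount >= REQUIRED_FINGERPRINTS;
}
```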

---

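## Chapter 4 Table schema rollout checklist (B7 work items)

The table below lists the schema changes that need db:push when B7 is actually implemented:

| Table / field | Type | Notes |
|---|---|---|
| `capability_data_foundation` (whole table) | see 2.2 | new table, with 4 indexes (counting the primary key and the unique constraint alongside the two explicit `CREATE INDEX` statements) |
| `capabilities.config.distillThreshold` | number inside a jsonb field | default 0.7 |
| `capabilities.config.distill_budget_per_hour` | number inside a jsonb field | default 100 |
| `capabilities.config.cold_start_seeds` | string[] inside a jsonb field | default `[]` |
| `capability_run_audit.distilled_count` | integer | number of rows distilled by this run |
| cleanup cron `cap_df_ttl_cleanup` | sql job | deletes rows where `produced_at < now() - make_interval(days => retention_days)` |

**Note on ID safety**: `capability_data_foundation.id` uses `varchar` + a UUID default
(matching this repo's existing `varchar` primary-key style; see the
`newId('qhit')` style in `lib/quarantine/index.ts`). **Do not use serial** (see important_database_safety_rules).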
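
The TTL rule behind `cap_df_ttl_cleanup` (a row expires once `produced_at` lies more than `retention_days` days in the past) can be written as a pure predicate, which is handy for unit-testing the cleanup logic without a database. `RetainedRow`, `isExpired`, and `selectExpiredIds` are hypothetical names, and a day is approximated as exactly 24 hours.

```ts
interface RetainedRow {
  id: string;
  producedAt: Date;
  retentionDays: number; // per-row TTL, default 90 in the schema
}

const DAY_MS = 24 * 60 * 60 * 1000;

// A row is expired when produced_at is earlier than now minus its own TTL.
function isExpired(row: RetainedRow, now: Date): boolean {
  return row.producedAt.getTime() < now.getTime() - row.retentionDays * DAY_MS;
}

// Ids the cleanup job would delete at time `now`.
function selectExpiredIds(rows: RetainedRow[], now: Date): string[] {
  return rows.filter((row) => isExpired(row, now)).map((row) => row.id);
}
```

Because the TTL is per row, two rows produced on the same day can expire on different days.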

---

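## Chapter 5 Conclusions, one sentence each

1. **The 5 existing data foundations** (causal network / drug network / DuckDB literature /
   submission_feedback_ledger / user_memory_facts) each have a **single true writer**;
   capabilities read them but never write.
2. **Newly self-produced foundations** go into a single table, `capability_data_foundation`, carrying capability_id +
   kind + payload jsonb + retention_days.
3. **Failure paths are distilled too** (`is_failure_distill=true`); failure modes are meta-knowledge.
4. **Cold start requires an explicit empty marker**: the `internal:empty` origin sends the inner loop
   down the "no prior" branch, so no fallback pretends to have data.
5. **Cross-capability seeding is off by default**; once enabled, the cold_start phase may pull N=10 rows from
   sibling capabilities' param-fingerprint.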