---
title: Tice
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
base_path: /web
---
### Check out the blog post at: https://huggingface.co/spaces/fierce74/Tice/blob/main/Blog.md

# TICE

Tumor Immune Control Environment (TICE) is an OpenEnv reinforcement learning environment where an LLM learns to coordinate immune cells against an evolving tumor under partial observability.


This makes TICE a compact but meaningful test of three things current LLMs still struggle with:

1. long-horizon planning
2. subsystem coordination
3. reasoning under hidden state

## The Problem

Most LLM environments are short-horizon, fully observable, and forgiving. TICE targets a harder capability gap:

- planning over 50-step episodes
- coordinating interdependent subsystems
- acting under hidden state and noisy signals
- managing the tradeoff between aggression and resource conservation

The biological theme is intentional, but the real target is general agent capability. The orchestrator never sees the tumor directly. It has to infer what is happening from indirect signals, then decide how to use a detection subsystem and an attack subsystem before the tumor adapts.

## The Environment

Every episode begins with a tumor sampled from distributions inspired by The Cancer Genome Atlas (TCGA) and derived from real cancer cohorts. TICE currently exposes three archetypes:

- `immune_hot`: easier to detect, but still dangerous if you waste time
- `immune_cold`: hard to detect and harder to attack effectively
- `high_mutation`: adapts quickly and punishes rigid policies

Concretely, tumor parameters are sampled from archetype-specific distributions inspired by:

- `immune_hot`: Skin Cutaneous Melanoma (SKCM), \(n=440\)
- `immune_cold`: Glioblastoma Multiforme (GBM), \(n=397\)
- `high_mutation`: top 25% TMB cohort from SKCM, \(n=110\)

The sampler draws tumor mutational burden (TMB) and mutation count from lognormal distributions, and genomic instability from a normal distribution (clipped to \([0, 1]\)). Difficulty (`easy`, `medium`, `hard`) scales downstream dynamics, and additional quantities like visibility and suppression are derived from these draws with noise.
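The sampling step described above can be sketched as follows. This is a minimal illustration, not the code in `data/sampler.py`: the parameter values, the `ARCHETYPE_PARAMS` table, `TumorDraw`, and `sample_tumor` are all hypothetical names and numbers chosen only to show the shape of the draw (lognormal TMB and mutation count, clipped normal genomic instability).

```python
import random
from dataclasses import dataclass

# Hypothetical parameters for illustration only; the real values live in
# data/tcga_params.py and are not reproduced here.
ARCHETYPE_PARAMS = {
    "immune_hot": {"tmb": (2.5, 0.8), "mut": (5.5, 0.9), "gi": (0.45, 0.15)},
    "immune_cold": {"tmb": (1.0, 0.6), "mut": (4.0, 0.7), "gi": (0.35, 0.15)},
    "high_mutation": {"tmb": (3.5, 0.5), "mut": (6.5, 0.6), "gi": (0.60, 0.15)},
}


@dataclass
class TumorDraw:
    archetype: str
    tmb: float                  # tumor mutational burden
    mutation_count: int
    genomic_instability: float  # always within [0, 1]


def sample_tumor(archetype: str, rng: random.Random) -> TumorDraw:
    """Draw one tumor: lognormal TMB and mutation count, clipped normal instability."""
    p = ARCHETYPE_PARAMS[archetype]
    tmb = rng.lognormvariate(*p["tmb"])
    mutation_count = round(rng.lognormvariate(*p["mut"]))
    instability = min(1.0, max(0.0, rng.gauss(*p["gi"])))
    return TumorDraw(archetype, tmb, mutation_count, instability)


if __name__ == "__main__":
    rng = random.Random(0)
    for name in ARCHETYPE_PARAMS:
        print(sample_tumor(name, rng))
```

Passing an explicit `random.Random` instance keeps episodes reproducible for a given seed. Difficulty scaling and the derived quantities (visibility, suppression) are left out because the text does not specify how they are computed.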

The agent controls two linked subsystems:

- **B cells** build and maintain tumor detection
- **T cells** attack the tumor

The action space is deliberately compact: 4 B-cell actions × 4 T-cell actions = 16 joint decisions per step.

### What the agent sees

The agent does not get true tumor state. It only sees a partial observation:

- tumor trend
- noisy detection signal
- T-cell effectiveness bucket
- resource level
- B-cell fatigue
- T-cell fatigue
- recent outcome
- timestep and episode phase
- archetype and difficulty

This makes TICE a world-modeling problem, not just a control problem.
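As a rough sketch of what such a partial observation could look like as a data structure, the block below mirrors the list above field by field. The class name, field names, types, and example values are assumptions for illustration; the actual schema is defined in `models.py` and may differ.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical partial observation; note there is no true tumor-state field."""
    tumor_trend: str           # e.g. "growing", "stable", "shrinking"
    detection_signal: float    # noisy estimate, not the true detection level
    t_cell_effectiveness: str  # coarse bucket, e.g. "low", "medium", "high"
    resource_level: float
    b_cell_fatigue: float
    t_cell_fatigue: float
    recent_outcome: str
    timestep: int
    episode_phase: str
    archetype: str
    difficulty: str


def render_observation(obs: Observation) -> str:
    """Flatten the observation into one 'key: value' line per field for a text prompt."""
    return "\n".join(f"{key}: {value}" for key, value in asdict(obs).items())


if __name__ == "__main__":
    example = Observation(
        tumor_trend="growing",
        detection_signal=0.42,
        t_cell_effectiveness="medium",
        resource_level=0.8,
        b_cell_fatigue=0.1,
        t_cell_fatigue=0.3,
        recent_outcome="attack_partially_effective",
        timestep=12,
        episode_phase="mid",
        archetype="immune_cold",
        difficulty="medium",
    )
    print(render_observation(example))
```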

### What the agent does

At each timestep, the orchestrator chooses:

- a B-cell command: `INCREASE_HIGH`, `INCREASE_LOW`, `MAINTAIN`, or `REDUCE`
- a T-cell command: `ATTACK_HIGH`, `ATTACK_MEDIUM`, `ATTACK_LOW`, or `REST`
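One way to represent the two command sets and enumerate the joint action space is shown below. The command strings come from the lists above; the enum and helper names are illustrative and not necessarily what `models.py` uses.

```python
from enum import Enum
from itertools import product


class BCellAction(Enum):
    INCREASE_HIGH = "INCREASE_HIGH"
    INCREASE_LOW = "INCREASE_LOW"
    MAINTAIN = "MAINTAIN"
    REDUCE = "REDUCE"


class TCellAction(Enum):
    ATTACK_HIGH = "ATTACK_HIGH"
    ATTACK_MEDIUM = "ATTACK_MEDIUM"
    ATTACK_LOW = "ATTACK_LOW"
    REST = "REST"


# Every (B-cell, T-cell) pair the orchestrator can choose at one timestep.
JOINT_ACTIONS = list(product(BCellAction, TCellAction))


def parse_joint_action(b_cmd: str, t_cmd: str) -> tuple[BCellAction, TCellAction]:
    """Map raw command strings (e.g. from LLM output) to enum members.

    Raises KeyError if either command is not one of the listed names.
    """
    return BCellAction[b_cmd.strip().upper()], TCellAction[t_cmd.strip().upper()]


if __name__ == "__main__":
    print(len(JOINT_ACTIONS))
    print(parse_joint_action("maintain", " attack_low "))
```

Validating model output against a closed set like this makes malformed commands fail loudly instead of silently becoming a default action.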

These actions matter only through the system dynamics:

- detection boosts downstream attack quality
- fatigue reduces effectiveness
- exhaustion downgrades T-cell aggression
- the tumor can mutate into stealth, resistance, faster growth, or stronger suppression

### What the agent is rewarded for

The reward function pushes toward coordinated, efficient immune control:

- positive reward for tumor reduction
- large bonus for eradication
- large penalty for tumor escape
- penalties for B-cell fatigue, T-cell fatigue, tissue damage, wasted energy, and time

That means naive policies fail for understandable reasons:

- pure aggression burns out T cells
- pure passivity lets the tumor escape
- building detection without converting it into attacks wastes resources
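The reward terms listed above could be combined along the following lines. This is only a sketch of the structure: every weight, the `StepOutcome` fields, and `step_reward` are hypothetical, and the actual shaping in `core/reward.py` may use different terms and magnitudes.

```python
from dataclasses import dataclass

# Hypothetical weights chosen for illustration; not taken from core/reward.py.
W_REDUCTION = 10.0
ERADICATION_BONUS = 50.0
ESCAPE_PENALTY = 50.0
W_B_FATIGUE = 0.5
W_T_FATIGUE = 0.5
W_TISSUE_DAMAGE = 1.0
W_WASTED_ENERGY = 0.5
TIME_PENALTY = 0.1


@dataclass
class StepOutcome:
    tumor_before: float
    tumor_after: float
    eradicated: bool = False
    escaped: bool = False
    b_fatigue: float = 0.0
    t_fatigue: float = 0.0
    tissue_damage: float = 0.0
    wasted_energy: float = 0.0


def step_reward(o: StepOutcome) -> float:
    """Combine the reward terms for a single timestep."""
    # Positive reward only when the tumor actually shrank this step.
    reward = W_REDUCTION * max(0.0, o.tumor_before - o.tumor_after)
    if o.eradicated:
        reward += ERADICATION_BONUS
    if o.escaped:
        reward -= ESCAPE_PENALTY
    # Costs that discourage burnout, collateral damage, waste, and stalling.
    reward -= W_B_FATIGUE * o.b_fatigue
    reward -= W_T_FATIGUE * o.t_fatigue
    reward -= W_TISSUE_DAMAGE * o.tissue_damage
    reward -= W_WASTED_ENERGY * o.wasted_energy
    reward -= TIME_PENALTY
    return reward


if __name__ == "__main__":
    print(step_reward(StepOutcome(0.5, 0.4, t_fatigue=0.6)))
```

Under weights like these, a step that does nothing still costs the time penalty, which is one way passivity ends up being punished.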

## What changed after training?

We trained a compact text-only model (`Qwen/Qwen2.5-1.5B-Instruct`) in three stages:

1. **Base model**: no task-specific adaptation
2. **SFT**: supervised fine-tuning on planner-generated demonstrations (6,381 examples in total: 5,742 for training and 639 for validation)
3. **GRPO**: reward-driven refinement using TICE reward tables
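For readers unfamiliar with the GRPO stage, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows one common formulation of that group-relative normalization, assuming the per-completion rewards have already been computed. The function name and the epsilon value are illustrative and not taken from the training notebook.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat their siblings get positive advantages and the rest get negative ones.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four completions sampled from one observation.
    print(group_relative_advantages([-3.0, 1.5, 0.0, 4.0]))
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.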

![Training curves](training.png)

SFT loss drops quickly and stabilizes, while GRPO reward is noisier step to step but reflects the same shift toward better environment-specific behavior.

![TICE results](tice_results.png)

### Evaluation summary

Across 27 held-out episodes:

| Policy | Avg return | Win rate | Loss rate | Timeout rate | Avg final tumor |
|---|---:|---:|---:|---:|---:|
| Planner teacher | -9.37 | 29.6% | 0.0% | 70.4% | 0.612 |
| GRPO model | -31.52 | 25.9% | 55.6% | 18.5% | 0.676 |
| SFT model | -36.40 | 25.9% | 48.1% | 25.9% | 0.630 |
| Random | -64.96 | 0.0% | 77.8% | 22.2% | 0.927 |
| Base model | -73.54 | 3.7% | 63.0% | 33.3% | 0.852 |

### The key improvement story

- **SFT vs base**
  - return improved by `+37.13`
  - win rate improved by `+22.2 percentage points`
  - average final tumor size dropped by `0.222`

- **GRPO vs base**
  - return improved by `+42.02`
  - win rate improved by `+22.2 percentage points`
  - timeout rate dropped by `14.8 percentage points`
  - average final tumor size dropped by `0.176`

- **GRPO vs SFT**
  - return improved by `+4.88`
  - timeout rate dropped by `7.4 percentage points`

The strongest takeaway is not that the learned policy beats the planner teacher. It does not. The takeaway is that a compact text policy moved from near-failure (`3.7%` wins) to meaningful competence (`25.9%` wins) in a partially observable, multi-step control setting with coherent reward shaping.
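The deltas quoted above can be recomputed from the evaluation summary table. The snippet below copies the table values verbatim; the dictionary layout and the `delta` helper are illustrative. Because it works from the rounded table entries, a result can differ in the last digit from a figure computed on unrounded data (the SFT vs base return comes out as 37.14 here).

```python
# Values copied from the evaluation summary table (27 held-out episodes).
# Rates are in percent; "final_tumor" is the average final tumor size.
RESULTS = {
    "Planner teacher": {"avg_return": -9.37, "win": 29.6, "loss": 0.0, "timeout": 70.4, "final_tumor": 0.612},
    "GRPO model": {"avg_return": -31.52, "win": 25.9, "loss": 55.6, "timeout": 18.5, "final_tumor": 0.676},
    "SFT model": {"avg_return": -36.40, "win": 25.9, "loss": 48.1, "timeout": 25.9, "final_tumor": 0.630},
    "Random": {"avg_return": -64.96, "win": 0.0, "loss": 77.8, "timeout": 22.2, "final_tumor": 0.927},
    "Base model": {"avg_return": -73.54, "win": 3.7, "loss": 63.0, "timeout": 33.3, "final_tumor": 0.852},
}


def delta(policy: str, baseline: str, metric: str) -> float:
    """Return policy minus baseline for one metric, rounded to three decimals."""
    return round(RESULTS[policy][metric] - RESULTS[baseline][metric], 3)


if __name__ == "__main__":
    for policy, baseline in [
        ("SFT model", "Base model"),
        ("GRPO model", "Base model"),
        ("GRPO model", "SFT model"),
    ]:
        changes = {m: delta(policy, baseline, m) for m in ("avg_return", "win", "timeout", "final_tumor")}
        print(f"{policy} vs {baseline}: {changes}")
```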

## Why this matters

TICE is useful for anyone interested in training agents that must:

- plan across long horizons
- coordinate multiple subsystems instead of picking one-shot answers
- infer hidden state from noisy observations
- adapt when the world changes underneath them

That makes it relevant beyond the biological theme. The same structure shows up in operations, robotics, scientific workflows, and decision support systems where the agent cannot directly observe the full system state and still has to act well.

## Why this environment is interesting

From a benchmark design perspective, TICE combines four things that are rarely present together:

- a compact action space that compact policies can learn
- meaningful hidden state
- nontrivial reward shaping
- real-data-inspired episode diversity


## Try it

Run the environment locally:

```bash
cd tice
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Run the LLM-driven inference client (set the required environment variables first):

```bash
python inference.py
```

Training and evaluation notebooks:

- `tice_sft_grpo_training_final.ipynb`



## Where the core logic lives

If you want to understand the “theory” of the environment from code, these are the main files:

- `data/tcga_params.py` + `data/sampler.py`: TCGA-inspired archetype distributions and episode sampling
- `core/tumor.py`: tumor growth, resistance, mutation, and suppression dynamics
- `core/b_cell.py` + `core/t_cell.py`: detection and attack subsystem dynamics (including fatigue/energy)
- `core/reward.py`: reward shaping (tumor reduction, eradication/escape terms, and resource/damage costs)
- `server/tice_environment.py`: ties dynamics, observations, actions, and termination into the OpenEnv environment

## Repo structure

```text
tice/
├── core/                   # tumor, B-cell, T-cell, reward logic
├── data/                   # TCGA-inspired parameter sampling
├── server/                 # OpenEnv server implementation (FastAPI)
├── tice_benchmark_outputs/ # evaluation CSVs / harness outputs
├── client.py               # OpenEnv client (Python)
├── inference.py            # LLM-driven agent loop
├── models.py               # action/observation schema (Pydantic)
├── Dockerfile              # container build for server deployment
├── openenv.yaml            # OpenEnv/HF metadata
├── pyproject.toml          # dependencies (uv/PEP 621)
└── tice_sft_grpo_training_final.ipynb
```