DecentSanage committed (verified)
Commit 0e2d163 · Parent(s): c2629fe

Upload folder using huggingface_hub
Dockerfile CHANGED
@@ -65,12 +65,17 @@ COPY --from=builder /app/env/.venv /app/.venv
 # Copy the environment code
 COPY --from=builder /app/env /app/env
 
+# Copy README to /app/README.md where the OpenEnv web UI looks for it
+COPY --from=builder /app/env/README.md /app/README.md
+
 # Set PATH to use the virtual environment
 ENV PATH="/app/.venv/bin:$PATH"
 
 # Set PYTHONPATH so imports work correctly
 ENV PYTHONPATH="/app/env:$PYTHONPATH"
 
+# Tell the web UI where to find the README (belt-and-suspenders)
+ENV ENV_README_PATH="/app/README.md"
 ENV ENABLE_WEB_INTERFACE='true'
 
 # HF Spaces uses port 7860 by default; override with PORT env var for local use
README.md CHANGED
@@ -12,7 +12,6 @@ base_path: /web
 
 # Constraint Environment
 
-
 This is the environment for training LLMs to learn a specific DSL made for timetable scheduling, so that the model can output constraints directly from natural language. Why is this needed? Timetable generation is usually an NP-hard problem; for humans it could take weeks to produce a conflict-free timetable. To solve this, tools were created that generate timetables in reasonable time, one example being CP-SAT: users write hardcoded constraints and the solver generates a timetable that satisfies them. But what happens when you want to add a new constraint? You have to change the code directly. What if you could define constraints in natural language and the solver understood them automatically? That is what this project attempts. An LLM may not be good at scheduling a timetable with dozens of constraints, but it is good at understanding natural language. For the specific purpose of defining constraints for university timetables, a DSL was created whose specification is as follows:
 
 ```
@@ -67,6 +66,41 @@ number ::= digit { digit }
 
 The model outputs a JSON document in the above format, which can be converted directly into CP-SAT constraints.
 
+## Action and Observation
+
+The dataset used for training is in this format:
+
+```python
+{
+    "prompt": (
+        "No classes should be scheduled on Saturday."
+    ),
+    "target_ast": {
+        "type": "hard",
+        "name": "no_saturday_classes",
+        "forall": [
+            {"var": "b", "domain": "branches"},
+            {"var": "sub", "domain": "subjects"},
+            {"var": "d", "domain": "days"},
+            {"var": "s", "domain": "slots"},
+        ],
+        "where": "d == 5",
+        "assert": "schedule(b, sub, d, s) == 0",
+    },
+}
+```
+
+The environment serves the prompt to the model, the model guesses the target_ast, and the reward reflects how close the guess is:
+
+| Output             | Reward |
+| ------------------ | ------ |
+| Valid JSON         | 0.125  |
+| Correct structure  | 0.250  |
+| Same as the target | 0.625  |
+
 ## Development & Testing
 
 ### Direct Environment Testing
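The tiered rewards in the table above can be sketched as a small grading function. This is a hypothetical illustration only: the helper name `tiered_reward` and the key-based structure check are assumptions, not the environment's actual grading code (which lives in `server/constraint_env_environment.py`).

```python
import json

def tiered_reward(output: str, target_ast: dict) -> float:
    """Illustrative tiered grading: valid JSON, then structure, then exact match."""
    try:
        guess = json.loads(output)          # tier 1: parseable JSON -> 0.125
    except json.JSONDecodeError:
        return 0.0
    reward = 0.125
    # Structure check is an assumption, based on the keys in the sample AST above.
    required = {"type", "name", "forall", "where", "assert"}
    if isinstance(guess, dict) and required <= guess.keys():
        reward += 0.250                     # tier 2: correct structure
        if guess == target_ast:
            reward += 0.625                 # tier 3: identical to the target
    return reward
```

The tiers are cumulative, so an exact match earns 0.125 + 0.250 + 0.625 under this reading of the table.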
inference.py CHANGED
@@ -210,7 +210,8 @@ def _run_task(task_id: str, env_url: str = "http://localhost:8000") -> None:
         step_result = await env.step(action)
 
         step_count = 1
-        reward = float(step_result.reward or 0.0)
+        raw_reward = float(step_result.reward or 0.0)
+        reward = max(0.01, min(0.99, raw_reward))  # clamp [0.01, 0.99]
         done = bool(step_result.done)
         last_error = step_result.observation.info.get("error") if step_result.observation else None
         rewards.append(reward)
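The clamping expression added above is easy to sanity-check in isolation. A minimal sketch (`clamp_reward` is a local name for illustration; the diff uses the expression inline):

```python
def clamp_reward(raw: float) -> float:
    """Keep rewards strictly inside (0, 1): floor at 0.01, cap at 0.99."""
    # Same expression as in the diff: max(0.01, min(0.99, raw_reward))
    return max(0.01, min(0.99, raw))

# A negative penalty floors at 0.01, a perfect 1.0 caps at 0.99,
# and values already inside the band pass through unchanged.
```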
openenv.yaml CHANGED
@@ -13,12 +13,15 @@ tasks:
   - id: easy
     description: Single quantifier, direct assert, no WHERE clause
     difficulty: easy
+    grader: constraint_env.server.constraint_env_environment:ConstraintEnvironment
   - id: medium
     description: Two quantifiers with a WHERE filter clause and combined assert
     difficulty: medium
+    grader: constraint_env.server.constraint_env_environment:ConstraintEnvironment
   - id: hard
     description: Multiple quantifiers, nested WHERE with AND/OR, minimize objective
     difficulty: hard
+    grader: constraint_env.server.constraint_env_environment:ConstraintEnvironment
 tags:
   - openenv
   - scheduling
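The `grader` values use a `module.path:ClassName` entry-point string. How the OpenEnv runtime actually loads these is not shown in this commit; a resolver for this common format typically looks like the following sketch (`resolve_entry_point` is a hypothetical name):

```python
import importlib

def resolve_entry_point(spec: str):
    """Resolve a 'package.module:Attr' entry-point string to the object itself."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)  # import the module part
    return getattr(module, attr)                   # then fetch the attribute
```

For example, `resolve_entry_point("json:JSONDecoder")` returns the `JSONDecoder` class from the standard library.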
server/constraint_env_environment.py CHANGED
@@ -59,6 +59,11 @@ _PENALTY_BAD_STRUCTURE = -0.250
 _PENALTY_INVALID_JSON = -0.250
 
 
+def _clamp_reward(raw: float) -> float:
+    """Clamp reward to [0.01, 0.99]; the autograder rejects exact 0.00 / 1.00."""
+    return round(max(0.01, min(0.99, float(raw))), 4)
+
+
 # ---------------------------------------------------------------------------
 # Environment
 # ---------------------------------------------------------------------------
@@ -135,7 +140,7 @@ class ConstraintEnvironment(Environment):
         return ConstraintObservation(
             prompt=self._current_sample["prompt"],
             done=True,
-            reward=round(reward, 4),
+            reward=_clamp_reward(reward),
             info={**info, "error": "invalid_json"},
         )
 
@@ -157,7 +162,7 @@ class ConstraintEnvironment(Environment):
         return ConstraintObservation(
             prompt=self._current_sample["prompt"],
             done=True,
-            reward=round(reward, 4),
+            reward=_clamp_reward(reward),
             info=info,
         )
168