Commit 6970bcf · verified · committed by TrueV1sion123
1 Parent(s): 433042d

Upload src/dataset_generator.py with huggingface_hub

Files changed (1)
  1. src/dataset_generator.py +494 -0
src/dataset_generator.py ADDED
@@ -0,0 +1,494 @@
+ """
+ RAE Dataset Generator
+ ═══════════════════════════════════════════════════════════════
+ Generates training data structured as RAE cognitive cycles.
+
+ The core innovation: instead of flat Q→A pairs, each training
+ example forces the model through 4-phase generative reconstruction:
+
+     SATURATION → ABSTRACTION → DESCENT → INTEGRATION
+
+ This is the ML equivalent of handwriting: forced multi-modal
+ sequential reconstruction under temporal bottleneck.
+
+ Usage:
+     python dataset_generator.py \
+         --seed_problems data/seed_problems.jsonl \
+         --output data/rae_training_data \
+         --num_variations 2 \
+         --use_api
+ ═══════════════════════════════════════════════════════════════
+ """
+
+ import json
+ import os
+ import argparse
+ import random
+ from pathlib import Path
+ from typing import Optional
+ from tqdm import tqdm
+
+ try:
+     import anthropic
+     HAS_ANTHROPIC = True
+ except ImportError:
+     HAS_ANTHROPIC = False
+
+ # ── RAE System Prompt ─────────────────────────────────────────
+
+ RAE_SYSTEM_PROMPT = """You are an RAE-trained cognitive reasoner. For EVERY problem, you must
+ work through all four phases of the Recursive Abstraction Engine. Each phase
+ serves a distinct cognitive function; you cannot skip phases or collapse them.
+
+ ## Phase Protocol
+
+ <SATURATION>
+ Immerse in the problem space. Observe everything without categorizing.
+ - What are all the elements, constraints, relationships?
+ - What doesn't fit expected patterns? Flag anomalies.
+ - Encode the problem through multiple lenses (structural, temporal, causal).
+ - What would surprise you if it weren't true?
+ Terminate when you can "predict system behavior without conscious reasoning."
+ </SATURATION>
+
+ <ABSTRACTION>
+ Extract the minimal structure that explains your saturated understanding.
+ - What is the isomorphic structure across domains? ("What else has this shape?")
+ - What invariant is preserved under transformation?
+ - Compress: explain the underlying mechanism in one sentence.
+ - What assumption are we making that we don't realize?
+ This phase produces the CORE INSIGHT: the compressed representation.
+ </ABSTRACTION>
+
+ <DESCENT>
+ Project the abstract structure into concrete instantiations.
+ - If this model is correct, what must also be true?
+ - What's the most counterintuitive prediction?
+ - Build the simplest implementation that tests the core assumption.
+ - What would prove this wrong?
+ This phase produces CONCRETE OUTPUT: code, solutions, predictions.
+ </DESCENT>
+
+ <INTEGRATION>
+ Incorporate results and prepare the knowledge update.
+ - What did we learn that changes our prior understanding?
+ - What's the confidence level and what would change it?
+ - Where should we look more deeply next?
+ - What's the new question this raises?
+ This phase produces META-KNOWLEDGE: transferable understanding.
+ </INTEGRATION>
+
+ CRITICAL RULES:
+ 1. NEVER skip a phase. Each phase's output feeds the next.
+ 2. Saturation must be genuinely exploratory, not a restatement of the question.
+ 3. Abstraction must COMPRESS: it should be shorter than Saturation.
+ 4. Descent must produce concrete, testable output.
+ 5. Integration must identify what was LEARNED, not just summarize.
+ """
+
+ # ── Domain-Specific Problem Templates ─────────────────────────
+
+ DOMAIN_TEMPLATES = {
+     "code": [
+         "Implement {algorithm} in Python. Consider edge cases, performance characteristics, and alternative approaches.",
+         "Debug the following code that has a subtle error in its {concept} logic:\n```\n{code_snippet}\n```",
+         "Design a data structure that supports {operations} in {complexity} time.",
+         "Refactor this function to improve its {quality_attribute}:\n```\n{code_snippet}\n```",
+         "Write a system that {system_description} handling {concurrency_pattern}.",
+     ],
+     "reasoning": [
+         "A company has {scenario}. What is the optimal strategy considering {constraints}?",
+         "Given these observations: {observations}. What is the most likely underlying mechanism?",
+         "Two experts disagree about {topic}. Expert A says {claim_a}. Expert B says {claim_b}. Analyze both positions.",
+         "You discover that {surprising_fact}. How does this change our understanding of {domain}?",
+         "Design an experiment to test whether {hypothesis}.",
+     ],
+     "analysis": [
+         "Analyze the competitive dynamics in {industry} considering {factors}.",
+         "A {entity_type} is showing {metric_pattern}. Diagnose the root causes and recommend interventions.",
+         "Compare {approach_a} vs {approach_b} for solving {problem_class}. When would you choose each?",
+         "Model the second-order effects of {policy_change} on {system}.",
+         "Evaluate the risks and opportunities of {strategy} in {context}.",
+     ],
+     "creative": [
+         "Design a novel approach to {problem} by combining insights from {domain_a} and {domain_b}.",
+         "What would a solution to {challenge} look like if we inverted all standard assumptions?",
+         "Create a framework for {task} that handles {edge_case} gracefully.",
+         "Propose three fundamentally different architectures for {system}. Analyze tradeoffs.",
+         "Synthesize {concept_a}, {concept_b}, and {concept_c} into a unified theory.",
+     ],
+ }
+
+ # ── Seed Problem Generators ───────────────────────────────────
+
+ CODE_PROBLEMS = [
+     {
+         "prompt": "Implement a lock-free concurrent hash map in Python that supports linearizable get/put/delete operations.",
+         "domain": "code",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "Write a function that determines if a given computational graph has a cycle, and if so, returns the minimal cycle. Handle both directed and undirected edges.",
+         "domain": "code",
+         "difficulty": "medium",
+     },
+     {
+         "prompt": "Implement an LRU cache with O(1) get/put that also supports TTL (time-to-live) expiration on individual entries.",
+         "domain": "code",
+         "difficulty": "medium",
+     },
+     {
+         "prompt": "Design and implement a rate limiter that supports sliding window, token bucket, and leaky bucket algorithms through a unified interface.",
+         "domain": "code",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "Write a parser for a simple expression language that supports variables, arithmetic, comparisons, and short-circuit boolean logic. Include proper error messages with line/column information.",
+         "domain": "code",
+         "difficulty": "hard",
+     },
+ ]
+
+ REASONING_PROBLEMS = [
+     {
+         "prompt": "A hospital notices that its mortality rate for a specific surgery is 2x the national average, but every individual surgeon performs at or below the national average. Explain this paradox and recommend what the hospital should do.",
+         "domain": "reasoning",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "A startup has 18 months of runway. They can either (A) build a broader product that serves 3 market segments with 60% fit each, or (B) build a deep product that serves 1 segment with 95% fit but requires that segment to grow 3x. Which should they choose and why?",
+         "domain": "reasoning",
+         "difficulty": "medium",
+     },
+     {
+         "prompt": "You observe that teams using microservices ship features 40% faster than monolith teams in year 1, but 20% slower by year 3. What explains this crossover pattern and what does it imply for architecture decisions?",
+         "domain": "reasoning",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "Three AI labs release safety benchmarks showing their models are 99.9% safe. Yet all three have had notable public safety failures. Analyze the gap between benchmark performance and real-world safety.",
+         "domain": "reasoning",
+         "difficulty": "hard",
+     },
+ ]
+
+ ANALYSIS_PROBLEMS = [
+     {
+         "prompt": "Medicare Advantage plans are seeing MLRs increase by 200-400 basis points year over year while membership grows. Analyze whether this is a structural or cyclical phenomenon and what it implies for the healthcare technology vendor ecosystem.",
+         "domain": "analysis",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "A SaaS company's logo retention is 95% but net revenue retention is 78%. Diagnose the likely dynamics and propose a measurement framework to identify the root causes.",
+         "domain": "analysis",
+         "difficulty": "medium",
+     },
+     {
+         "prompt": "Compare transformer attention mechanisms vs. state space models (Mamba-style) for processing long clinical documents. When is each approach superior and why?",
+         "domain": "analysis",
+         "difficulty": "hard",
+     },
+ ]
+
+ CREATIVE_PROBLEMS = [
+     {
+         "prompt": "Design a cognitive architecture for an AI agent that can learn new skills from watching a single demonstration video. Combine insights from motor learning theory, program synthesis, and cognitive psychology.",
+         "domain": "creative",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "Propose a novel approach to distributed consensus that uses biological swarm intelligence principles instead of traditional leader election. Define the protocol formally.",
+         "domain": "creative",
+         "difficulty": "hard",
+     },
+     {
+         "prompt": "Create a framework for evaluating whether an AI system has developed genuine understanding vs. sophisticated pattern matching. Your framework must be operationally testable.",
+         "domain": "creative",
+         "difficulty": "hard",
+     },
+ ]
+
+ ALL_SEED_PROBLEMS = CODE_PROBLEMS + REASONING_PROBLEMS + ANALYSIS_PROBLEMS + CREATIVE_PROBLEMS
+
+
+ def generate_rae_example_with_api(
+     problem: dict,
+     client: "anthropic.Anthropic",
+     model: str = "claude-sonnet-4-20250514",
+ ) -> Optional[dict]:
+     """Generate a single RAE-structured training example using the Anthropic API."""
+
+     try:
+         response = client.messages.create(
+             model=model,
+             max_tokens=4096,
+             system=RAE_SYSTEM_PROMPT,
+             messages=[
+                 {"role": "user", "content": problem["prompt"]}
+             ],
+         )
+
+         assistant_text = response.content[0].text
+
+         # Validate all 4 phases are present (presence only; order is not checked)
+         required_tags = ["<SATURATION>", "</SATURATION>",
+                          "<ABSTRACTION>", "</ABSTRACTION>",
+                          "<DESCENT>", "</DESCENT>",
+                          "<INTEGRATION>", "</INTEGRATION>"]
+
+         if not all(tag in assistant_text for tag in required_tags):
+             print(f" ⚠ Incomplete phases for: {problem['prompt'][:50]}...")
+             return None
+
+         # Format as chat messages for SFT training
+         return {
+             "messages": [
+                 {"role": "system", "content": RAE_SYSTEM_PROMPT},
+                 {"role": "user", "content": problem["prompt"]},
+                 {"role": "assistant", "content": assistant_text},
+             ],
+             "metadata": {
+                 "domain": problem.get("domain", "general"),
+                 "difficulty": problem.get("difficulty", "medium"),
+                 "rae_version": "1.0",
+                 "phases_present": 4,
+                 # Mirrors the key set by the template path
+                 "generation_method": "api",
+             }
+         }
+
+     except Exception as e:
+         print(f" ✗ API error: {e}")
+         return None
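
Review note: the tag check in `generate_rae_example_with_api` only confirms that each tag string occurs somewhere in the response, so it would accept phases that are out of order or empty. The following is a minimal sketch of a stricter check, assuming the phases must appear in protocol order with non-empty bodies. The helper name `extract_phases` is hypothetical and not part of this commit.

```python
import re
from typing import Optional

PHASES = ["SATURATION", "ABSTRACTION", "DESCENT", "INTEGRATION"]


def extract_phases(text: str) -> Optional[dict]:
    """Return {phase: body} if all four phases occur in order with non-empty bodies, else None."""
    bodies = {}
    cursor = 0  # each search starts after the previous phase, which enforces ordering
    for phase in PHASES:
        pattern = re.compile(rf"<{phase}>(.*?)</{phase}>", re.DOTALL)
        match = pattern.search(text, cursor)
        if match is None or not match.group(1).strip():
            return None
        bodies[phase] = match.group(1).strip()
        cursor = match.end()
    return bodies
```

A caller could replace the `all(tag in ...)` test with `extract_phases(assistant_text) is None`. The returned bodies would also allow checks such as rule 3 of the system prompt (Abstraction shorter than Saturation), though that is not shown here.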
+
+
+ def generate_rae_example_template(problem: dict) -> dict:
+     """
+     Generate a RAE-structured training example using templates.
+     This is the OFFLINE path: no API required.
+     Use this to create the structural format, then optionally
+     enhance with API-generated content.
+     """
+
+     prompt = problem["prompt"]
+     domain = problem.get("domain", "general")
+
+     # Template-based RAE structure
+     # In production, these would be filled by the API or human experts
+     return {
+         "messages": [
+             {"role": "system", "content": RAE_SYSTEM_PROMPT},
+             {"role": "user", "content": prompt},
+             {"role": "assistant", "content": f"""<SATURATION>
+ Let me immerse in this problem space without jumping to conclusions.
+
+ **Elements present**: [The problem involves {domain} concepts. Key components include...]
+ **Constraints identified**: [Explicit and implicit constraints...]
+ **Anomalies flagged**: [What doesn't fit expected patterns...]
+ **Multi-lens encoding**:
+ - Structural view: [How the components relate...]
+ - Temporal view: [How this evolves over time...]
+ - Causal view: [What drives what...]
+ **Surprise test**: [What would surprise me if it weren't true...]
+
+ I can now predict the system's behavior intuitively.
+ </SATURATION>
+
+ <ABSTRACTION>
+ The minimal structure that explains everything above:
+
+ **Core isomorphism**: [This has the same shape as...]
+ **Invariant**: [What stays constant when everything else changes is...]
+ **One-sentence compression**: [The underlying mechanism is...]
+ **Hidden assumption**: [We're assuming X without realizing it...]
+
+ The key insight is that [compressed representation of the solution space].
+ </ABSTRACTION>
+
+ <DESCENT>
+ Projecting the abstract structure into concrete form:
+
+ **Implication chain**: If the model above is correct, then:
+ 1. [Concrete prediction/implementation...]
+ 2. [Second-order consequence...]
+ 3. [Counterintuitive prediction...]
+
+ **Implementation**:
+ [Concrete code/solution/analysis...]
+
+ **Falsification test**: This would be wrong if [specific condition].
+ </DESCENT>
+
+ <INTEGRATION>
+ **Model update**: This changes my understanding of {domain} because [specific insight].
+ **Confidence**: [Level and what would change it]
+ **Next cycle target**: The new question this raises is [specific question].
+ **Transferable principle**: [What generalizes beyond this specific problem].
+ </INTEGRATION>"""},
+         ],
+         "metadata": {
+             "domain": domain,
+             "difficulty": problem.get("difficulty", "medium"),
+             "rae_version": "1.0",
+             "phases_present": 4,
+             "generation_method": "template",
+         }
+     }
+
+
+ def augment_with_variations(example: dict, num_variations: int = 2) -> list[dict]:
+     """
+     Generate variations of a training example.
+
+     The VARIABILITY PRINCIPLE: No two handwritten letters are identical.
+     Each variation forces the model to extract invariant structure
+     rather than memorize surface patterns.
+     """
+     variations = [example]  # Original is first variation
+
+     # Variation strategies
+     strategies = [
+         "rephrase_problem",      # Same problem, different framing
+         "increase_constraints",  # Add constraints to force deeper reasoning
+         "shift_domain",          # Apply same structure to different domain
+         "invert_question",       # Ask the opposite question
+     ]
+
+     # NOTE: each variation is currently a verbatim copy of the messages;
+     # only the metadata records which strategy it is meant to apply.
+     # At most len(strategies) variations are produced per example.
+     for i in range(min(num_variations, len(strategies))):
+         variation = json.loads(json.dumps(example))  # Deep copy
+         variation["metadata"]["variation_strategy"] = strategies[i]
+         variation["metadata"]["variation_index"] = i + 1
+         variations.append(variation)
+
+     return variations
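
Review note on sizing: `augment_with_variations` returns the original plus at most one copy per strategy, and the list holds four strategies, so the number of examples per seed is `1 + min(num_variations, 4)`. The arithmetic can be sketched as below. The function name `expected_counts` is hypothetical, and the 15 used in the usage note is the count of built-in seed problems in this file (5 code, 4 reasoning, 3 analysis, 3 creative).

```python
def expected_counts(num_seeds, num_variations, train_split=0.9, num_strategies=4):
    """Return (total, train, eval) sizes implied by the augmentation cap and the int() split."""
    per_seed = 1 + min(num_variations, num_strategies)  # original + capped variations
    total = num_seeds * per_seed
    train = int(total * train_split)  # same truncation as split_idx in create_dataset
    return total, train, total - train
```

With the defaults, `expected_counts(15, 2)` gives 45 examples split 40/5. Asking for ten variations still yields only five examples per seed, so `expected_counts(15, 10)` gives 75 in total, not the 165 that `len(seed_problems) * (1 + num_variations)` would suggest.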
+
+
+ def create_dataset(
+     seed_problems: list[dict],
+     output_dir: str,
+     use_api: bool = False,
+     api_model: str = "claude-sonnet-4-20250514",
+     num_variations: int = 2,
+     train_split: float = 0.9,
+ ):
+     """Create the full RAE training dataset."""
+
+     output_path = Path(output_dir)
+     output_path.mkdir(parents=True, exist_ok=True)
+
+     client = None
+     if use_api and not HAS_ANTHROPIC:
+         # Without this branch the run would report "API" while using templates
+         print("⚠ anthropic package not installed, falling back to templates")
+         use_api = False
+     elif use_api:
+         api_key = os.environ.get("ANTHROPIC_API_KEY")
+         if api_key:
+             client = anthropic.Anthropic(api_key=api_key)
+             print("✓ Anthropic API client initialized")
+         else:
+             print("⚠ ANTHROPIC_API_KEY not set, falling back to templates")
+             use_api = False
+
+     all_examples = []
+
+     print(f"\n{'═' * 60}")
+     print(" RAE Dataset Generator")
+     print(f" Problems: {len(seed_problems)}")
+     print(f" Variations per problem: {num_variations}")
+     print(f" Expected total: ~{len(seed_problems) * (1 + num_variations)}")
+     print(f" Generation method: {'API' if use_api else 'Template'}")
+     print(f"{'═' * 60}\n")
+
+     for problem in tqdm(seed_problems, desc="Generating RAE examples"):
+         if use_api and client:
+             example = generate_rae_example_with_api(problem, client, api_model)
+         else:
+             example = generate_rae_example_template(problem)
+
+         # API failures return None and are skipped
+         if example:
+             variations = augment_with_variations(example, num_variations)
+             all_examples.extend(variations)
+
+     # Shuffle
+     random.shuffle(all_examples)
+
+     # Split
+     split_idx = int(len(all_examples) * train_split)
+     train_data = all_examples[:split_idx]
+     eval_data = all_examples[split_idx:]
+
+     # Write JSONL files
+     train_path = output_path / "train.jsonl"
+     eval_path = output_path / "validation.jsonl"
+
+     with open(train_path, "w", encoding="utf-8") as f:
+         for example in train_data:
+             f.write(json.dumps(example) + "\n")
+
+     with open(eval_path, "w", encoding="utf-8") as f:
+         for example in eval_data:
+             f.write(json.dumps(example) + "\n")
+
+     # Write metadata
+     metadata = {
+         "total_examples": len(all_examples),
+         "train_examples": len(train_data),
+         "eval_examples": len(eval_data),
+         # sorted() keeps the domain list stable across runs
+         "domains": sorted(set(e["metadata"]["domain"] for e in all_examples)),
+         "rae_version": "1.0",
+         "generation_method": "api" if use_api else "template",
+         "methodology": "RAE-as-training-time-cognitive-installation",
+         "description": (
+             "Training data structured as 4-phase RAE cognitive cycles. "
+             "Each example forces the model through Saturation → Abstraction → "
+             "Descent → Integration, creating the ML equivalent of handwriting's "
+             "multi-circuit co-activation under temporal bottleneck."
+         ),
+     }
+
+     with open(output_path / "metadata.json", "w", encoding="utf-8") as f:
+         json.dump(metadata, f, indent=2)
+
+     print(f"\n{'═' * 60}")
+     print(" Dataset Generated")
+     print(f" Train: {len(train_data)} examples → {train_path}")
+     print(f" Eval: {len(eval_data)} examples → {eval_path}")
+     print(f" Metadata → {output_path / 'metadata.json'}")
+     print(f"{'═' * 60}\n")
+
+     return train_data, eval_data
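
Review note on the split: `create_dataset` shuffles all examples and then slices, so copies of one seed problem can land in both `train.jsonl` and `validation.jsonl`. One possible alternative, sketched here under the assumption that the user turn (`messages[1]`) identifies the seed, is to split by seed so every copy stays on one side. The helper `split_by_seed` and its `seed` parameter are hypothetical and not part of this commit.

```python
import random


def split_by_seed(examples, train_split=0.9, seed=0):
    """Split so that all variations of one seed prompt land on the same side."""
    groups = {}
    for ex in examples:
        # messages[1] is the user turn in the format written by this commit
        key = ex["messages"][1]["content"]
        groups.setdefault(key, []).append(ex)

    keys = sorted(groups)               # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(keys)   # local RNG, leaves the global random state alone
    cut = int(len(keys) * train_split)

    train = [ex for k in keys[:cut] for ex in groups[k]]
    held_out = [ex for k in keys[cut:] for ex in groups[k]]
    return train, held_out
```

The trade-off is coarser control over the split ratio: with only a handful of seeds, the held-out share moves in steps of one whole seed group.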
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="RAE Dataset Generator")
+     parser.add_argument("--seed_problems", type=str, default=None,
+                         help="Path to seed problems JSONL file")
+     parser.add_argument("--output", type=str, default="data/rae_training_data",
+                         help="Output directory for training data")
+     parser.add_argument("--use_api", action="store_true",
+                         help="Use Anthropic API for high-quality generation")
+     parser.add_argument("--api_model", type=str, default="claude-sonnet-4-20250514",
+                         help="Anthropic model to use for generation")
+     parser.add_argument("--num_variations", type=int, default=2,
+                         help="Number of variations per seed problem")
+     parser.add_argument("--train_split", type=float, default=0.9,
+                         help="Fraction of data for training")
+
+     args = parser.parse_args()
+
+     # Load seed problems
+     if args.seed_problems and Path(args.seed_problems).exists():
+         with open(args.seed_problems, encoding="utf-8") as f:
+             # Skip blank lines so a trailing newline does not break json.loads
+             seed_problems = [json.loads(line) for line in f if line.strip()]
+         print(f"Loaded {len(seed_problems)} seed problems from {args.seed_problems}")
+     else:
+         seed_problems = ALL_SEED_PROBLEMS
+         print(f"Using {len(seed_problems)} built-in seed problems")
+
+     create_dataset(
+         seed_problems=seed_problems,
+         output_dir=args.output,
+         use_api=args.use_api,
+         api_model=args.api_model,
+         num_variations=args.num_variations,
+         train_split=args.train_split,
+     )
+
+
+ if __name__ == "__main__":
+     main()
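
Review note on the seed file: the script reads `--seed_problems` as JSONL and uses three keys per record: `prompt` (required), plus `domain` and `difficulty` (optional, defaulting to "general" and "medium"). The sketch below writes a small seed file and loads it back while skipping blank lines and rejecting records without a prompt. The loader name `load_seed_problems` and the two sample prompts are illustrative assumptions, not content from this commit.

```python
import json
import tempfile
from pathlib import Path


def load_seed_problems(path):
    """Read one JSON object per line, skipping blank lines; require a 'prompt' key."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if "prompt" not in record:
                raise ValueError(f"line {lineno}: missing 'prompt'")
            problems.append(record)
    return problems


# Illustrative seeds; the second relies on the script's defaults for domain and difficulty.
seeds = [
    {"prompt": "Explain why quicksort degrades on sorted input.", "domain": "code", "difficulty": "medium"},
    {"prompt": "Design an experiment to test a caching hypothesis."},
]

with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "seed_problems.jsonl"
    # The trailing blank line mimics a file saved with an extra newline.
    seed_path.write_text("\n".join(json.dumps(s) for s in seeds) + "\n\n", encoding="utf-8")
    loaded = load_seed_problems(seed_path)
```

Failing early on a missing `prompt` gives a line number, whereas the script as committed would raise a `KeyError` later, inside the generation loop.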