| overview: |
| purpose: | |
| This schema defines the format for Q&A training records in the Marxist-GRPO |
| fine-tuning dataset. Each record contains an instruction-response pair with |
| comprehensive metadata for: |
| - Provenance tracking (where did this come from?) |
| - Theoretical classification (what tradition/topic?) |
| - Citation tracking (what sources are referenced?) |
| - Training metadata (what issue does this fix?) |
| - Quality assessment (has this been verified?) |
| |
| design_principles: |
| - Reproducibility: Every record traceable to source |
| - Filterability: Train on subsets by any dimension |
| - Scientific Rigor: Formal JSON Schema validation |
| - RAG Integration: Links to ChromaDB chunks where applicable |
| - Iteration Tracking: Know what was added when and why |
|
|
| json_schema_location: training_data/schema/training_record.schema.json |
| manifest_schema_location: training_data/schema/manifest.schema.json |
|
|
| |
| |
| |
|
|
| record_format: |
| description: | |
| Each JSONL file contains one JSON object per line. |
| Every record MUST have: instruction, response, metadata |
| The metadata object contains all provenance and classification. |
| |
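| A minimal loader sketch for this format (the helper name is illustrative; the required-key check mirrors the MUST rule above): |

```python
import json

REQUIRED_KEYS = {"instruction", "response", "metadata"}


def load_records(path):
    """Load JSONL training records, enforcing the required top-level keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            records.append(record)
    return records
```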
| minimal_example: |
| instruction: "What is the mass line?" |
| response: "The mass line is the Maoist method of communist leadership..." |
| metadata: |
| id: "synthetic/maoist-theory/001" |
| source: |
| type: "synthetic" |
| classification: |
| categories: ["maoist-theory", "methodology"] |
| tradition: "MLM" |
| provenance: |
| created_date: "2025-12-18" |
| created_by: "claude-opus" |
|
|
| full_example: |
| instruction: "What is the Marxist-Leninist distinction between antisemitism and anti-Zionism?" |
| response: "These are fundamentally different phenomena. Antisemitism is a form of racism..." |
| metadata: |
| id: "synthetic/antisemitism/001" |
| source: |
| type: "synthetic" |
| author: null |
| work: null |
| article: null |
| chunk_ids: [] |
| classification: |
| categories: ["anti-zionism", "antisemitism", "settler-colonialism"] |
| tradition: "ML" |
| geographic_focus: "Palestine" |
| historical_period: null |
| citations: |
| has_citations: true |
| authors: ["Lenin", "Ilan Pappé", "Noam Chomsky"] |
| works: |
| - title: "On Anti-Jewish Pogroms" |
| author: "Lenin" |
| year: 1919 |
| type: "speech" |
| - title: "The Ethnic Cleansing of Palestine" |
| author: "Ilan Pappé" |
| year: 2006 |
| type: "book" |
| training: |
| iteration: 2 |
| correction_for: ["both-sidesing", "antisemitism-conflation"] |
| difficulty: "intermediate" |
| response_style: "educational" |
| adversarial_type: null |
| provenance: |
| created_date: "2025-12-18" |
| created_by: "claude-opus" |
| reviewed_by: null |
| version: 1 |
| quality: |
| human_verified: false |
| confidence: "high" |
| notes: null |
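| Records like the examples above serialize to a single JSON line each; a write sketch (helper name is illustrative): |

```python
import json


def append_record(path, record):
    """Append one training record as a single JSON line (JSONL)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```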
|
|
| |
| |
| |
|
|
| fields: |
| |
| |
| |
| source: |
| description: Where this Q&A pair originated. |
|
|
| type: |
| required: true |
| values: |
| prolewiki: Derived from ProleWiki article content |
| synthetic: Generated by AI for specific purpose |
| curated: Human-curated from multiple sources |
| library: Derived from Library namespace (full works) |
| external: From external source with URL |
|
|
| article: |
| required: false |
| purpose: ProleWiki article title if derived from corpus |
| example: "Main/Imperialism" |
| links_to: chromadb.article_title |
|
|
| work: |
| required: false |
| purpose: Title of source work for Library-derived Q&As |
| example: "Imperialism, the Highest Stage of Capitalism" |
|
|
| author: |
| required: false |
| purpose: Primary author of source material |
| example: "Lenin" |
| enables: Train only on Marx-derived, Lenin-derived, etc. |
|
|
| chunk_ids: |
| required: false |
| purpose: ChromaDB chunk IDs this Q&A was derived from |
| example: ["Main/Imperialism#0", "Main/Imperialism#1"] |
| enables: RAG-training data linkage, citation verification |
|
|
| |
| |
| |
| classification: |
| description: Theoretical and topical classification. |
|
|
| categories: |
| required: true |
| purpose: Topic tags aligned with ProleWiki categories |
| examples: |
| - ["imperialism", "revisionism"] |
| - ["anti-zionism", "settler-colonialism", "national-liberation"] |
| - ["cultural-revolution", "gpcr", "maoist-theory"] |
| enables: Train on specific topics, measure coverage |
|
|
| tradition: |
| required: true |
| values: |
| ML: Marxism-Leninism (broad) |
| MLM: Marxism-Leninism-Maoism (includes GPCR defense) |
| general: Broadly applicable across tendencies |
| contested: Debated within ML circles |
| enables: Filter by theoretical tendency |
|
|
| geographic_focus: |
| required: false |
| examples: ["Soviet Union", "China", "Palestine", "Cuba"] |
| enables: Regional expertise training |
|
|
| historical_period: |
| required: false |
| examples: ["Russian Revolution", "Cultural Revolution", "Cold War"] |
| enables: Period-specific training |
|
|
| |
| |
| |
| citations: |
| description: Citation and reference tracking. |
|
|
| has_citations: |
| purpose: Quick boolean filter for cited content |
| enables: Train only on well-sourced responses |
|
|
| works: |
| purpose: Structured list of cited works |
| fields: [title, author, year, type] |
| enables: Verify citations, trace to primary sources |
|
|
| authors: |
| purpose: Flat list of cited authors for filtering |
| enables: "Train on Lenin-citing records only" |
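| One use of the flat boolean is a cheap consistency check against the structured citation fields (a sketch; the JSON Schema remains authoritative): |

```python
def citations_consistent(metadata):
    """has_citations should agree with the presence of cited authors or works."""
    c = metadata.get("citations", {})
    cited = bool(c.get("authors") or c.get("works"))
    return bool(c.get("has_citations", False)) == cited
```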
|
|
| |
| |
| |
| training: |
| description: Training-specific metadata. |
|
|
| iteration: |
| purpose: Which training iteration added this record |
| enables: Ablation studies, measure iteration impact |
|
|
| correction_for: |
| purpose: What failure modes this addresses |
| values: |
| cpc-contamination: Fixes CPC authority citations |
| both-sidesing: Fixes false equivalence on colonial issues |
| hallucination: Provides correct historical facts |
| antisemitism-conflation: Distinguishes antisemitism/anti-Zionism |
| liberal-framing: Replaces liberal with ML framing |
| historical-inaccuracy: Corrects factual errors |
| theoretical-error: Corrects theoretical misunderstandings |
| accommodation: Resists incremental position shifts |
| extended-engagement: Models firm rejection |
| enables: Test specific corrections, targeted training |
|
|
| difficulty: |
| values: |
| basic: Straightforward ML questions |
| intermediate: Requires nuanced understanding |
| advanced: Complex theoretical synthesis |
| adversarial: Bad-faith or trap questions |
| enables: Curriculum learning, stress testing |
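| For the curriculum-learning use case, the four levels imply an ordering; a sorting sketch (the rank mapping is an assumption, not part of the schema): |

```python
DIFFICULTY_ORDER = {"basic": 0, "intermediate": 1, "advanced": 2, "adversarial": 3}


def curriculum_sort(records):
    """Order records easiest-first; unknown or missing difficulty sorts last."""
    def rank(record):
        difficulty = record["metadata"].get("training", {}).get("difficulty")
        return DIFFICULTY_ORDER.get(difficulty, len(DIFFICULTY_ORDER))
    return sorted(records, key=rank)
```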
|
|
| response_style: |
| values: |
| educational: Thorough explanation |
| firm-rejection: Short, clear rejection of premise |
| theoretical: Abstract theoretical analysis |
| historical: Historical narrative/facts |
| biographical: Person-focused information |
| analytical: Systematic breakdown |
| comparative: Comparing positions/theories |
| enables: Style-specific training |
|
|
| adversarial_type: |
| purpose: For adversarial questions, the manipulation pattern used |
| values: |
| bad-faith-question: User asking in bad faith |
| conspiracy-premise: Question contains conspiracy theory |
| incremental-shift: Gradually shifting goalposts |
| false-equivalence: Both-sidesing framing |
| appeal-to-complexity: '"It''s complicated" deflection' |
|
|
| |
| |
| |
| provenance: |
| description: Record creation and modification tracking. |
|
|
| created_date: |
| required: true |
| format: ISO 8601 date (YYYY-MM-DD) |
| purpose: When this record was created |
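| A strict checker for this field can be sketched with the stdlib (JSON Schema's `format: date` covers the same ground when format assertion is enabled): |

```python
import re
from datetime import date


def is_valid_created_date(value):
    """Strict YYYY-MM-DD: shape via regex, calendar validity via date()."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        year, month, day = map(int, value.split("-"))
        date(year, month, day)
        return True
    except ValueError:
        return False
```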
|
|
| created_by: |
| required: true |
| values: [human, claude-opus, claude-sonnet, other-llm, automated] |
| purpose: Who/what created this record |
| enables: Filter by creation method |
|
|
| reviewed_by: |
| purpose: Human reviewer identifier |
| enables: Track review coverage |
|
|
| version: |
| purpose: Incremented on each edit |
| enables: Track record evolution |
|
|
| |
| |
| |
| quality: |
| description: Quality assessment metadata. |
|
|
| human_verified: |
| purpose: Whether a human has verified the response's accuracy |
| enables: High-confidence subset training |
|
|
| confidence: |
| values: [high, medium, low] |
| purpose: Confidence in response accuracy |
|
|
| flagged_issues: |
| purpose: Known issues needing attention |
| enables: Exclude problematic records |
|
|
| |
| |
| |
|
|
| validation: |
| json_schema: |
| location: training_data/schema/training_record.schema.json |
| draft: 2020-12 |
| command: | |
| # Using the jsonschema Python library |
| uv run python -c " |
| import json |
| import jsonschema |
| from pathlib import Path |
| |
| schema = json.loads(Path('training_data/schema/training_record.schema.json').read_text()) |
| for line in Path('training_data/your_file.jsonl').read_text().splitlines(): |
|     if not line.strip(): |
|         continue  # tolerate blank lines |
|     record = json.loads(line) |
|     jsonschema.validate(record, schema) |
| print('All records valid!') |
| " |
| |
| quick_validation: |
| command: | |
| # Quick JSON syntax check |
| python3 -c "import json; [json.loads(l) for l in open('file.jsonl') if l.strip()]; print('OK')" |
|
|
| pre_commit_hook: |
| description: Add to .pre-commit-config.yaml for automatic validation |
| config: | |
| - repo: local |
|   hooks: |
|     - id: validate-training-data |
|       name: Validate Training Data Schema |
|       entry: uv run python scripts/validate_training_data.py |
|       language: system |
|       files: ^training_data/.*\.jsonl$ |
| |
| |
| |
| |
|
|
| manifest: |
| purpose: | |
| The manifest (MANIFEST.yaml) tracks all JSONL files in the dataset, |
| their checksums, statistics, and training history. This enables: |
| - Reproducible training runs |
| - Dataset versioning |
| - Integrity verification |
| - Statistics generation |
| |
| location: training_data/MANIFEST.yaml |
| schema: training_data/schema/manifest.schema.json |
|
|
| key_sections: |
| dataset: Name, version, license, description |
| files: List of all JSONL files with checksums and metadata |
| statistics: Aggregate counts by source, category, tradition |
| training_iterations: History of training runs |
| known_issues: Documented problems |
| changelog: Dataset modification history |
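| Integrity verification against the `files` section can be sketched as follows (the `path`/`sha256` field names are assumptions; see manifest.schema.json for the actual shape): |

```python
import hashlib
from pathlib import Path


def verify_checksums(manifest, root="training_data"):
    """Return paths whose SHA-256 no longer matches the manifest entry.

    `manifest` is the parsed MANIFEST.yaml as a dict; assumes each entry
    under 'files' carries 'path' and 'sha256' keys (illustrative names).
    """
    mismatched = []
    for entry in manifest.get("files", []):
        digest = hashlib.sha256((Path(root) / entry["path"]).read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched
```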
|
|
| |
| |
| |
|
|
| filtering_patterns: |
| description: Common filtering operations for training subsets. |
|
|
| by_source: |
| code: | |
| # ProleWiki-derived only (corpus purity) |
| data = [r for r in records if r["metadata"]["source"]["type"] == "prolewiki"] |
| |
| # Exclude synthetic records |
| data = [r for r in records if r["metadata"]["source"]["type"] != "synthetic"] |
|
|
| by_author: |
| code: | |
| # Lenin-citing records |
| data = [r for r in records |
| if "Lenin" in r["metadata"].get("citations", {}).get("authors", [])] |
| |
| # Marx- or Engels-sourced records |
| data = [r for r in records |
| if r["metadata"]["source"].get("author") in ["Marx", "Engels"]] |
|
|
| by_tradition: |
| code: | |
| # MLM only (includes GPCR defense) |
| data = [r for r in records if r["metadata"]["classification"]["tradition"] == "MLM"] |
| |
| by_correction: |
| code: | |
| # Records addressing Zionism issues |
| data = [r for r in records |
| if "both-sidesing" in r["metadata"].get("training", {}).get("correction_for", [])] |
| |
| by_difficulty: |
| code: | |
| # Adversarial examples only (stress testing) |
| data = [r for r in records |
| if r["metadata"].get("training", {}).get("difficulty") == "adversarial"] |
| |
| by_iteration: |
| code: | |
| # Only iteration 1 (baseline) |
| data = [r for r in records if r["metadata"].get("training", {}).get("iteration") == 1] |
| |
| # Iterations 1 and 2 (missing iteration defaults to 1) |
| data = [r for r in records if r["metadata"].get("training", {}).get("iteration", 1) <= 2] |
|
|
| by_quality: |
| code: | |
| # Human-verified only |
| data = [r for r in records if r["metadata"].get("quality", {}).get("human_verified")] |
| |
| # High-confidence responses only |
| data = [r for r in records |
| if r["metadata"].get("quality", {}).get("confidence") == "high"] |
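| These patterns compose by conjunction; for example, a high-confidence adversarial subset for stress testing (a sketch over the same `records` list): |

```python
def high_confidence_adversarial(records):
    """Conjoin filters: adversarial difficulty AND high confidence."""
    return [
        r for r in records
        if r["metadata"].get("training", {}).get("difficulty") == "adversarial"
        and r["metadata"].get("quality", {}).get("confidence") == "high"
    ]
```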
|
|
| |
| |
| |
|
|
| chromadb_integration: |
| purpose: | |
| Training data can link to ChromaDB chunks, enabling: |
| - Verification that responses match corpus |
| - RAG-augmented training data generation |
| - Provenance chains from user query → chunk → training example |
| |
| chunk_id_format: "{namespace}/{article_title}#{chunk_index}" |
| examples: |
| - "Main/Imperialism#0" |
| - "Library/Capital_Vol_1#127" |
| - "Essays/On_Revisionism#3" |
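| Given this format, a parser sketch (assumes the first `/` separates namespace from article title, and a numeric chunk index after the final `#`): |

```python
def parse_chunk_id(chunk_id):
    """Split '{namespace}/{article_title}#{chunk_index}' into its parts."""
    path, sep, index = chunk_id.rpartition("#")
    if not sep or not index.isdigit():
        raise ValueError(f"malformed chunk id: {chunk_id!r}")
    namespace, _, article_title = path.partition("/")
    return {"namespace": namespace, "article_title": article_title,
            "chunk_index": int(index)}
```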
|
|
| linkage_pattern: |
| description: When generating training data from ProleWiki chunks |
| code: | |
| # Generate Q&A from chunk and preserve linkage |
| training_record = { |
|     "instruction": generate_question(chunk), |
|     "response": generate_answer(chunk), |
|     "metadata": { |
|         "source": { |
|             "type": "prolewiki", |
|             "article": chunk["article_title"], |
|             "chunk_ids": [chunk["chunk_id"]] |
|         }, |
|         # ... rest of metadata |
|     } |
| } |
| |
| |
| |
| |
|
|
| migration: |
| legacy_format: |
| description: Original curated_qa.jsonl format |
| example: |
| instruction: "What is revisionism?" |
| response: "Revisionism refers to..." |
|
|
| new_format: |
| description: Full metadata format |
| migration_steps: |
| - Add metadata wrapper |
| - Generate unique IDs |
| - Infer source type (curated for manual entries) |
| - Add classification based on content analysis |
| - Set iteration to 1 for baseline data |
| - Mark as needing human verification |
|
|
| migration_script: | |
| # See scripts/migrate_training_data.py for full implementation |
| def migrate_record(old_record, index): |
|     return { |
|         "instruction": old_record["instruction"], |
|         "response": old_record["response"], |
|         "metadata": { |
|             "id": f"curated/legacy/{index:03d}", |
|             "source": {"type": "curated"}, |
|             "classification": { |
|                 "categories": infer_categories(old_record), |
|                 "tradition": "ML" |
|             }, |
|             "provenance": { |
|                 "created_date": "2025-12-17",  # Original creation date |
|                 "created_by": "human" |
|             } |
|         } |
|     } |
| |