# Training Data Schema Reference
# Purpose: Human-readable documentation for Marxist-GRPO training data format
# Formal Schema: training_data/schema/training_record.schema.json
# Updated: 2025-12-18
# =============================================================================
# OVERVIEW
# =============================================================================
overview:
purpose: |
This schema defines the format for Q&A training records in the Marxist-GRPO
fine-tuning dataset. Each record contains an instruction-response pair with
comprehensive metadata for:
- Provenance tracking (where did this come from?)
- Theoretical classification (what tradition/topic?)
- Citation tracking (what sources are referenced?)
- Training metadata (what issue does this fix?)
- Quality assessment (has this been verified?)
design_principles:
- Reproducibility: Every record traceable to source
- Filterability: Train on subsets by any dimension
- Scientific Rigor: Formal JSON Schema validation
- RAG Integration: Links to ChromaDB chunks where applicable
- Iteration Tracking: Know what was added when and why
json_schema_location: training_data/schema/training_record.schema.json
manifest_schema_location: training_data/schema/manifest.schema.json
# =============================================================================
# RECORD FORMAT
# =============================================================================
record_format:
description: |
Each JSONL file contains one JSON object per line.
Every record MUST have: instruction, response, metadata
The metadata object contains all provenance and classification.
minimal_example:
instruction: "What is the mass line?"
response: "The mass line is the Maoist method of communist leadership..."
metadata:
id: "synthetic/maoist-theory/001"
source:
type: "synthetic"
classification:
categories: ["maoist-theory", "methodology"]
tradition: "MLM"
provenance:
created_date: "2025-12-18"
created_by: "claude-opus"
full_example:
instruction: "What is the Marxist-Leninist distinction between antisemitism and anti-Zionism?"
response: "These are fundamentally different phenomena. Antisemitism is a form of racism..."
metadata:
id: "synthetic/antisemitism/001"
source:
type: "synthetic"
author: null
work: null
article: null
chunk_ids: []
classification:
categories: ["anti-zionism", "antisemitism", "settler-colonialism"]
tradition: "ML"
geographic_focus: "Palestine"
historical_period: null
citations:
has_citations: true
authors: ["Lenin", "Ilan Pappé", "Noam Chomsky"]
works:
- title: "On Anti-Jewish Pogroms"
author: "Lenin"
year: 1919
type: "speech"
- title: "The Ethnic Cleansing of Palestine"
author: "Ilan Pappé"
year: 2006
type: "book"
training:
iteration: 2
correction_for: ["both-sidesing", "antisemitism-conflation"]
difficulty: "intermediate"
response_style: "educational"
adversarial_type: null
provenance:
created_date: "2025-12-18"
created_by: "claude-opus"
reviewed_by: null
version: 1
quality:
human_verified: false
confidence: "high"
notes: null
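Before reaching for the full JSON Schema, the required top-level shape (instruction, response, metadata) can be checked with a stdlib-only loader. A minimal sketch; the helper name is ours, not part of any shipped tooling:

```python
import json

REQUIRED_TOP_LEVEL = ("instruction", "response", "metadata")

def load_records(path):
    """Load a JSONL file, skipping blank lines and checking required keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_TOP_LEVEL if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing keys {missing}")
            records.append(record)
    return records
```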
# =============================================================================
# METADATA FIELDS REFERENCE
# =============================================================================
fields:
# ---------------------------------------------------------------------------
# SOURCE PROVENANCE
# ---------------------------------------------------------------------------
source:
description: Where this Q&A pair originated.
type:
required: true
values:
prolewiki: Derived from ProleWiki article content
synthetic: Generated by AI for specific purpose
curated: Human-curated from multiple sources
library: Derived from Library namespace (full works)
external: From external source with URL
article:
required: false
purpose: ProleWiki article title if derived from corpus
example: "Main/Imperialism"
links_to: chromadb.article_title
work:
required: false
purpose: Title of source work for Library-derived Q&As
example: "Imperialism, the Highest Stage of Capitalism"
author:
required: false
purpose: Primary author of source material
example: "Lenin"
enables: Train only on Marx-derived, Lenin-derived, etc.
chunk_ids:
required: false
purpose: ChromaDB chunk IDs this Q&A was derived from
example: ["Main/Imperialism#0", "Main/Imperialism#1"]
enables: RAG-training data linkage, citation verification
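The source block makes composition audits cheap. A sketch (helper name ours) tallying records by source.type, using the field path shown in the examples above:

```python
from collections import Counter

def source_type_counts(records):
    """Tally records by metadata.source.type (prolewiki, synthetic, etc.)."""
    return Counter(r["metadata"]["source"]["type"] for r in records)
```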
# ---------------------------------------------------------------------------
# CLASSIFICATION
# ---------------------------------------------------------------------------
classification:
description: Theoretical and topical classification.
categories:
required: true
purpose: Topic tags aligned with ProleWiki categories
examples:
- ["imperialism", "revisionism"]
- ["anti-zionism", "settler-colonialism", "national-liberation"]
- ["cultural-revolution", "gpcr", "maoist-theory"]
enables: Train on specific topics, measure coverage
tradition:
required: true
values:
ML: Marxism-Leninism (broad)
MLM: Marxism-Leninism-Maoism (includes GPCR defense)
general: Broadly applicable across tendencies
contested: Debated within ML circles
enables: Filter by theoretical tendency
geographic_focus:
required: false
examples: ["Soviet Union", "China", "Palestine", "Cuba"]
enables: Regional expertise training
historical_period:
required: false
examples: ["Russian Revolution", "Cultural Revolution", "Cold War"]
enables: Period-specific training
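To "measure coverage" as the categories field promises, category and tradition counts can be aggregated in one pass. A sketch; the helper name is our choice:

```python
from collections import Counter

def coverage_report(records):
    """Aggregate category tags and tradition labels for coverage checks."""
    categories, traditions = Counter(), Counter()
    for r in records:
        cls = r["metadata"]["classification"]
        categories.update(cls["categories"])
        traditions[cls["tradition"]] += 1
    return {"categories": categories, "traditions": traditions}
```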
# ---------------------------------------------------------------------------
# CITATIONS
# ---------------------------------------------------------------------------
citations:
description: Citation and reference tracking.
has_citations:
purpose: Quick boolean filter for cited content
enables: Train only on well-sourced responses
works:
purpose: Structured list of cited works
fields: [title, author, year, type]
enables: Verify citations, trace to primary sources
authors:
purpose: Flat list of cited authors for filtering
enables: "Train on Lenin-citing records only"
# ---------------------------------------------------------------------------
# TRAINING METADATA
# ---------------------------------------------------------------------------
training:
description: Training-specific metadata.
iteration:
purpose: Which training iteration added this record
enables: Ablation studies, measure iteration impact
correction_for:
purpose: What failure modes this addresses
values:
cpc-contamination: Fixes CPC authority citations
both-sidesing: Fixes false equivalence on colonial issues
hallucination: Provides correct historical facts
antisemitism-conflation: Distinguishes antisemitism/anti-Zionism
liberal-framing: Replaces liberal with ML framing
historical-inaccuracy: Corrects factual errors
theoretical-error: Corrects theoretical misunderstandings
accommodation: Resists incremental position shifts
extended-engagement: Models firm rejection
enables: Test specific corrections, targeted training
difficulty:
values:
basic: Straightforward ML questions
intermediate: Requires nuanced understanding
advanced: Complex theoretical synthesis
adversarial: Bad-faith or trap questions
enables: Curriculum learning, stress testing
response_style:
values:
educational: Thorough explanation
firm-rejection: Short, clear rejection of premise
theoretical: Abstract theoretical analysis
historical: Historical narrative/facts
biographical: Person-focused information
analytical: Systematic breakdown
comparative: Comparing positions/theories
enables: Style-specific training
adversarial_type:
purpose: For adversarial questions, which manipulation pattern the question follows
values:
bad-faith-question: User asking in bad faith
conspiracy-premise: Question contains conspiracy theory
incremental-shift: Gradually shifting goalposts
false-equivalence: Both-sidesing framing
appeal-to-complexity: '"It''s complicated" deflection'
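For curriculum learning, the difficulty values above imply an ordering. A sketch that sorts records basic → adversarial; the rank mapping is our assumption, and records without a difficulty default to intermediate:

```python
# Assumed rank order for curriculum-style training (not defined by the schema)
DIFFICULTY_ORDER = {"basic": 0, "intermediate": 1, "advanced": 2, "adversarial": 3}

def curriculum_sort(records):
    """Order records from easiest to hardest; sorted() keeps ties stable."""
    return sorted(
        records,
        key=lambda r: DIFFICULTY_ORDER.get(
            r["metadata"].get("training", {}).get("difficulty"), 1
        ),
    )
```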
# ---------------------------------------------------------------------------
# PROVENANCE
# ---------------------------------------------------------------------------
provenance:
description: Record creation and modification tracking.
created_date:
required: true
format: ISO 8601 date (YYYY-MM-DD)
purpose: When this record was created
created_by:
required: true
values: [human, claude-opus, claude-sonnet, other-llm, automated]
purpose: Who/what created this record
enables: Filter by creation method
reviewed_by:
purpose: Human reviewer identifier
enables: Track review coverage
version:
purpose: Increment on edits
enables: Track record evolution
# ---------------------------------------------------------------------------
# QUALITY
# ---------------------------------------------------------------------------
quality:
description: Quality assessment metadata.
human_verified:
purpose: Has a human verified accuracy?
enables: High-confidence subset training
confidence:
values: [high, medium, low]
purpose: Confidence in response accuracy
flagged_issues:
purpose: Known issues needing attention
enables: Exclude problematic records
# =============================================================================
# VALIDATION
# =============================================================================
validation:
json_schema:
location: training_data/schema/training_record.schema.json
draft: 2020-12
command: |
# Using the jsonschema Python library
uv run python -c "
import json
import jsonschema
from pathlib import Path
schema = json.loads(Path('training_data/schema/training_record.schema.json').read_text())
for line in Path('training_data/your_file.jsonl').read_text().splitlines():
    if not line.strip():
        continue  # tolerate blank lines
    record = json.loads(line)
    jsonschema.validate(record, schema)
print('All records valid!')
"
quick_validation:
command: |
# Quick JSON syntax check (skips blank lines)
python3 -c "import json; [json.loads(l) for l in open('file.jsonl') if l.strip()]; print('OK')"
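Schema validation catches per-record shape errors but not cross-record problems such as duplicate metadata.id values. A stdlib-only sketch (helper name ours):

```python
import json

def find_duplicate_ids(lines):
    """Return metadata.id values that occur more than once across JSONL lines."""
    seen, dupes = set(), set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        rid = json.loads(line)["metadata"]["id"]
        if rid in seen:
            dupes.add(rid)
        seen.add(rid)
    return sorted(dupes)
```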
pre_commit_hook:
description: Add to .pre-commit-config.yaml for automatic validation
config: |
- repo: local
  hooks:
    - id: validate-training-data
      name: Validate Training Data Schema
      entry: uv run python scripts/validate_training_data.py
      language: system
      files: ^training_data/.*\.jsonl$
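The contents of scripts/validate_training_data.py are not reproduced here; a hypothetical stdlib-only sketch of what such a pre-commit entry point could look like (the real script may additionally run full JSON Schema validation):

```python
import json
import sys

REQUIRED = ("instruction", "response", "metadata")

def validate_file(path):
    """Collect structural errors for one JSONL file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"{path}:{lineno}: invalid JSON ({exc})")
                continue
            for key in REQUIRED:
                if key not in record:
                    errors.append(f"{path}:{lineno}: missing {key!r}")
    return errors

def main(paths):
    """Pre-commit entry point: print errors, non-zero exit code on failure."""
    errors = [e for p in paths for e in validate_file(p)]
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0
```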
# =============================================================================
# MANIFEST
# =============================================================================
manifest:
purpose: |
The manifest (MANIFEST.yaml) tracks all JSONL files in the dataset,
their checksums, statistics, and training history. This enables:
- Reproducible training runs
- Dataset versioning
- Integrity verification
- Statistics generation
location: training_data/MANIFEST.yaml
schema: training_data/schema/manifest.schema.json
key_sections:
dataset: Name, version, license, description
files: List of all JSONL files with checksums and metadata
statistics: Aggregate counts by source, category, tradition
training_iterations: History of training runs
known_issues: Documented problems
changelog: Dataset modification history
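Integrity verification against the manifest amounts to recomputing checksums. A sketch assuming each file entry carries path and sha256 fields; the real MANIFEST.yaml field names may differ:

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 (suitable for large JSONL files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def verify_manifest(files):
    """Return paths whose recorded checksum no longer matches the file on disk."""
    return [e["path"] for e in files if sha256_of(e["path"]) != e["sha256"]]
```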
# =============================================================================
# FILTERING PATTERNS
# =============================================================================
filtering_patterns:
description: Common filtering operations for training subsets.
by_source:
code: |
# ProleWiki-derived only (corpus purity)
data = [r for r in records if r["metadata"]["source"]["type"] == "prolewiki"]
# Exclude synthetic for ablation
data = [r for r in records if r["metadata"]["source"]["type"] != "synthetic"]
by_author:
code: |
# Lenin-citing records
data = [r for r in records
        if "Lenin" in r["metadata"].get("citations", {}).get("authors", [])]
# Marx or Engels sourced
data = [r for r in records
        if r["metadata"]["source"].get("author") in ["Marx", "Engels"]]
by_tradition:
code: |
# MLM only (includes GPCR defense)
data = [r for r in records if r["metadata"]["classification"]["tradition"] == "MLM"]
by_correction:
code: |
# Records addressing both-sidesing on colonial questions
data = [r for r in records
        if "both-sidesing" in r["metadata"].get("training", {}).get("correction_for", [])]
by_difficulty:
code: |
# Adversarial examples only (stress testing)
data = [r for r in records
        if r["metadata"].get("training", {}).get("difficulty") == "adversarial"]
by_iteration:
code: |
# Only iteration 1 (baseline)
data = [r for r in records if r["metadata"].get("training", {}).get("iteration") == 1]
# Iterations 1-2 combined
data = [r for r in records if r["metadata"].get("training", {}).get("iteration", 1) <= 2]
by_quality:
code: |
# Human-verified only
data = [r for r in records if r["metadata"].get("quality", {}).get("human_verified")]
# High confidence
data = [r for r in records
        if r["metadata"].get("quality", {}).get("confidence") == "high"]
# =============================================================================
# INTEGRATION WITH CHROMADB
# =============================================================================
chromadb_integration:
purpose: |
Training data can link to ChromaDB chunks, enabling:
- Verification that responses match corpus
- RAG-augmented training data generation
- Provenance chains from user query → chunk → training example
chunk_id_format: "{namespace}/{article_title}#{chunk_index}"
examples:
- "Main/Imperialism#0"
- "Library/Capital_Vol_1#127"
- "Essays/On_Revisionism#3"
linkage_pattern:
description: When generating training data from ProleWiki chunks
code: |
# Generate Q&A from chunk and preserve linkage
training_record = {
    "instruction": generate_question(chunk),
    "response": generate_answer(chunk),
    "metadata": {
        "source": {
            "type": "prolewiki",
            "article": chunk["article_title"],
            "chunk_ids": [chunk["chunk_id"]]
        },
        # ... rest of metadata
    }
}
# =============================================================================
# MIGRATION FROM LEGACY FORMAT
# =============================================================================
migration:
legacy_format:
description: Original curated_qa.jsonl format
example:
instruction: "What is revisionism?"
response: "Revisionism refers to..."
new_format:
description: Full metadata format
migration_steps:
- Add metadata wrapper
- Generate unique IDs
- Infer source type (curated for manual entries)
- Add classification based on content analysis
- Set iteration to 1 for baseline data
- Mark as needing human verification
migration_script: |
# See scripts/migrate_training_data.py for the full implementation
def migrate_record(old_record, index):
    return {
        "instruction": old_record["instruction"],
        "response": old_record["response"],
        "metadata": {
            "id": f"curated/legacy/{index:03d}",
            "source": {"type": "curated"},
            "classification": {
                # infer_categories is defined in the full script
                "categories": infer_categories(old_record),
                "tradition": "ML"
            },
            "provenance": {
                "created_date": "2025-12-17",  # Original creation date
                "created_by": "human"
            }
        }
    }