prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

Overview

This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:

  • Multi-Layer Reward System: 17+ reward functions designed to resist reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
  • Headless Training: Docker container for automated RunPod deployment with auto-shutoff
  • Jupyter Notebook: Production-ready notebook optimized for A40/A100 GPUs
  • Comprehensive Tests: Unit and integration tests for all components

Quick Start

RunPod Deployment (Recommended)

# 1. Build Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to registry and deploy on RunPod
# Use A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)

Local Development

# Install dependencies
uv sync --group dev

# Download spaCy model (required for rewards)
python -m spacy download en_core_web_sm

# Run tests
uv run pytest -m "not slow and not gpu"

Repository Structure

prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py        # Multi-layer reward functions
│   ├── train_headless.py      # Headless training script
│   ├── export_grpo_dataset.py # Dataset conversion
│   └── wandb_logging.py       # W&B integration
├── docker/
│   ├── Dockerfile             # Training container
│   ├── start.sh               # Entrypoint with auto-shutoff
│   └── .env.example           # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                  # Unit tests
│   ├── integration/           # Shell script tests
│   └── fixtures/              # Mock commands
└── training_data/
    └── grpo_dataset.jsonl     # Training data
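The training data lives in grpo_dataset.jsonl, one JSON object per line. As a minimal sketch of how such a file is read, assuming a simple prompt/answer record shape (the actual field names produced by export_grpo_dataset.py may differ):

```python
import json

# Hypothetical record shape for grpo_dataset.jsonl -- the real schema
# is defined by export_grpo_dataset.py; this only illustrates the
# one-object-per-line JSONL layout.
record = {
    "prompt": "What is the law of value?",
    "answer": "Commodities exchange in proportion to the socially "
              "necessary labor time embodied in them.",
}

def load_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Round-trip a single record to show the on-disk format.
serialized = json.dumps(record)
assert json.loads(serialized) == record
```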

Reward Functions

The reward system uses multiple layers to ensure quality responses:

Layer  Function                      Purpose
1      match_format_exactly          Validate <think>...</think> tags
2      nli_coherence_reward          Response entails ground truth (BART-MNLI)
3      self_consistency_reward       No internal contradictions
4      structural_coherence_reward   Terms in proper syntactic roles (spaCy)
5      topic_relevance_reward        Answer addresses the question
6      interconnection_depth_reward  Rewards analysis, penalizes buzzword salad

Use full_coherence_reward() for the complete 6-layer check, or robust_coherence_reward() for a faster 3-layer version.
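As a rough illustration of the first layer, a format gate like match_format_exactly can be written as a regex over each completion. This is a sketch in the TRL reward-function style (a batch of completions in, one score per completion out); the exact signature and score values in grpo_rewards.py are assumptions:

```python
import re

# Sketch of a format-gate reward (layer 1). The 1.0/0.0 scoring and
# the **kwargs signature mirror TRL's reward-function convention; the
# real implementation in grpo_rewards.py may differ.
THINK_RE = re.compile(r"^<think>.*?</think>.*\S", re.DOTALL)

def match_format_exactly(completions, **kwargs):
    """Reward 1.0 when a completion opens with a well-formed
    <think>...</think> block followed by a visible answer."""
    return [1.0 if THINK_RE.match(c) else 0.0 for c in completions]

scores = match_format_exactly([
    "<think>reasoning here</think>The answer.",
    "No think tags at all.",
])
# scores == [1.0, 0.0]
```

A hard gate like this is what lets the later, more expensive layers (NLI, spaCy parsing) assume well-formed input.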

Training Configuration

Key environment variables for train_headless.py:

Variable       Default                            Description
MODEL_NAME     unsloth/DeepSeek-R1-0528-Qwen3-8B  Base model
MAX_STEPS      500                                Training steps
BATCH_SIZE     2                                  Per-device batch size
LEARNING_RATE  5e-6                               Learning rate
REWARD_MODE    FULL                               FULL, ROBUST, or LEGACY
HF_REPO        prolewiki/marxist-grpo-lora        Upload destination
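A headless script typically resolves these variables once at startup. A minimal sketch, with defaults mirroring the table above (the parsing details of the actual train_headless.py are assumptions):

```python
import os

# Resolve training settings from the environment, falling back to the
# documented defaults. load_config is a hypothetical helper, not the
# actual code in train_headless.py.
def load_config(env=None):
    env = os.environ if env is None else env
    return {
        "model_name": env.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
        "max_steps": int(env.get("MAX_STEPS", "500")),
        "batch_size": int(env.get("BATCH_SIZE", "2")),
        "learning_rate": float(env.get("LEARNING_RATE", "5e-6")),
        "reward_mode": env.get("REWARD_MODE", "FULL"),
        "hf_repo": env.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
    }

cfg = load_config({"MAX_STEPS": "100"})
# cfg["max_steps"] == 100; every other key falls back to its default.
```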

GPU Requirements

GPU       VRAM   Price     Recommendation
A40       48 GB  $0.35/hr  Best value for 8B models
A100      80 GB  $1.19/hr  Overkill for this use case
RTX 4090  24 GB  $0.34/hr  Too small for 16-bit GRPO
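A back-of-envelope estimate shows why 24 GB is tight for 16-bit GRPO on an 8B model. The overhead figures below are rough assumptions for illustration, not measurements:

```python
# Rough VRAM estimate for 16-bit GRPO on an 8B-parameter model.
# Overhead numbers are illustrative assumptions, not measurements.
params = 8e9
weights_gb = params * 2 / 1e9          # bf16 base weights: ~16 GB
lora_optimizer_gb = 2                  # assumed: LoRA adapters + Adam state
activations_kv_gb = 8                  # assumed: GRPO samples multiple completions per prompt

total_gb = weights_gb + lora_optimizer_gb + activations_kv_gb
print(f"~{total_gb:.0f} GB")  # ~26 GB: over a 24 GB RTX 4090, comfortable on a 48 GB A40
```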

Critical Notes

  1. torch.compile must be disabled on RunPod/Jupyter; it causes hangs
  2. load_in_4bit=False is required for GRPO, which trains 16-bit LoRA adapters
  3. Set use_gradient_checkpointing=True (not "unsloth") for stability
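The notes above translate into load-time arguments. This is a hedged configuration sketch using Unsloth-style argument names, not a verbatim excerpt from the training script; only the three flagged values are pinned down by the notes:

```python
# Kwargs implied by the critical notes. Names follow Unsloth's
# FastLanguageModel API and transformers' TrainingArguments; values
# other than the three flagged ones are illustrative assumptions.
model_kwargs = dict(
    model_name="unsloth/DeepSeek-R1-0528-Qwen3-8B",
    load_in_4bit=False,               # note 2: GRPO needs 16-bit LoRA adapters
)
peft_kwargs = dict(
    use_gradient_checkpointing=True,  # note 3: plain True, not "unsloth"
)
trainer_kwargs = dict(
    torch_compile=False,              # note 1: torch.compile hangs on RunPod/Jupyter
)
```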

Related Projects

  • ProleWiki - The Marxist-Leninist encyclopedia
  • pw-mcp - MCP server for ProleWiki semantic search

License

AGPL-3.0-only
