---
language:
- en
license: agpl-3.0
library_name: transformers
tags:
- grpo
- rlhf
- fine-tuning
- marxism
- political-theory
- lora
- deepseek
- qwen
datasets:
- prolewiki/qa-corpus
base_model: unsloth/DeepSeek-R1-0528-Qwen3-8B
pipeline_tag: text-generation
---

# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview

This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:

- **Multi-Layer Reward System**: 17+ reward functions that prevent reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components

## Quick Start

### RunPod Deployment (Recommended)

```bash
# 1. Build the Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to a registry and deploy on RunPod
#    Use an A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on the pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```

### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download the spaCy model (required by the reward functions)
python -m spacy download en_core_web_sm

# Run the test suite
uv run pytest -m "not slow and not gpu"
```

## Repository Structure

```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py          # Multi-layer reward functions
│   ├── train_headless.py        # Headless training script
│   ├── export_grpo_dataset.py   # Dataset conversion
│   └── wandb_logging.py         # W&B integration
├── docker/
│   ├── Dockerfile               # Training container
│   ├── start.sh                 # Entrypoint with auto-shutoff
│   └── .env.example             # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Shell script tests
│   └── fixtures/                # Mock commands
└── training_data/
    └── grpo_dataset.jsonl       # Training data
```

## Reward Functions

The reward system uses multiple layers to ensure quality responses:

| Layer | Function | Purpose |
|-------|----------|---------|
| 1 | `match_format_exactly` | Validate `...` tags |
| 2 | `nli_coherence_reward` | Response entails the ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms appear in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |

Use `full_coherence_reward()` for the complete 6-layer check, or `robust_coherence_reward()` for a faster 3-layer version.
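A minimal sketch of how such layered rewards can compose. The function names mirror the table, but the tag regex, signatures, and gating/weighting scheme are illustrative assumptions, not the repository's actual implementation in `grpo_rewards.py`:

```python
import re
from typing import Callable

# Hypothetical tag pattern; the real layer-1 check validates the model's
# specific response tags, which are elided in the table above.
FORMAT_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def match_format_exactly(response: str) -> float:
    """Layer 1: full reward only when the expected tags are present."""
    return 1.0 if FORMAT_RE.search(response) else 0.0


def combine_layers(
    response: str,
    layers: list[Callable[[str], float]],
    weights: list[float],
) -> float:
    """Weighted sum of per-layer scores, gated on the format check.

    Gating means a malformed response scores zero no matter how good its
    content is, which removes the incentive to reward-hack the soft layers.
    """
    if match_format_exactly(response) == 0.0:
        return 0.0
    return sum(w * layer(response) for w, layer in zip(weights, layers))
```

A gated design like this is one common way to keep cheap structural checks from being traded off against expensive semantic ones (e.g. the NLI layer).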
## Training Configuration

Key environment variables for `train_headless.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |

## GPU Requirements

| GPU | VRAM | Price | Recommendation |
|-----|------|-------|----------------|
| **A40** | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |

## Critical Notes

1. **`torch.compile` must be disabled** on RunPod/Jupyter (it causes hangs)
2. **`load_in_4bit=False`** is required for GRPO (16-bit LoRA adapters)
3. **`use_gradient_checkpointing=True`** (not `"unsloth"`) for stability

## Related Projects

- [ProleWiki](https://en.prolewiki.org/) - The Marxist-Leninist encyclopedia
- [pw-mcp](https://github.com/prolewiki/pw-mcp) - MCP server for ProleWiki semantic search

## License

AGPL-3.0-only
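For reference, the Training Configuration table above maps onto a launch script like the following. The values shown are the documented defaults, and the `python -m prolewiki_llm.train_headless` invocation is inferred from the repository layout, so check `docker/start.sh` for the exact entrypoint:

```shell
# Hypothetical launch sketch: all variables below are read by
# train_headless.py (defaults shown; secrets come from your accounts).
export HF_TOKEN="..."                  # required
export WANDB_API_KEY="..."             # required
export MODEL_NAME="unsloth/DeepSeek-R1-0528-Qwen3-8B"
export MAX_STEPS=500
export BATCH_SIZE=2
export LEARNING_RATE=5e-6
export REWARD_MODE=FULL                # FULL, ROBUST, or LEGACY
export HF_REPO="prolewiki/marxist-grpo-lora"

python -m prolewiki_llm.train_headless
```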