---
language:
- en
license: agpl-3.0
library_name: transformers
tags:
- grpo
- rlhf
- fine-tuning
- marxism
- political-theory
- lora
- deepseek
- qwen
datasets:
- prolewiki/qa-corpus
base_model: unsloth/DeepSeek-R1-0528-Qwen3-8B
pipeline_tag: text-generation
---

# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview

This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:

- **Multi-Layer Reward System**: 17+ reward functions that prevent reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components

## Quick Start

### RunPod Deployment (Recommended)

```bash
# 1. Build the Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push the image to a registry and deploy it on RunPod
#    Use an A40 (48GB, $0.35/hr) for the best cost/performance

# 3. Set environment variables on the pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```

### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download the spaCy model (required by the reward functions)
python -m spacy download en_core_web_sm

# Run tests
uv run pytest -m "not slow and not gpu"
```

## Repository Structure

```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py         # Multi-layer reward functions
│   ├── train_headless.py       # Headless training script
│   ├── export_grpo_dataset.py  # Dataset conversion
│   └── wandb_logging.py        # W&B integration
├── docker/
│   ├── Dockerfile              # Training container
│   ├── start.sh                # Entrypoint with auto-shutoff
│   └── .env.example            # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                   # Unit tests
│   ├── integration/            # Shell script tests
│   └── fixtures/               # Mock commands
└── training_data/
    └── grpo_dataset.jsonl      # Training data
```

## Reward Functions

The reward system uses multiple layers to ensure quality responses:

| Layer | Function | Purpose |
|-------|----------|---------|
| 1 | `match_format_exactly` | Validate `<think>...</think>` tags |
| 2 | `nli_coherence_reward` | Response entails ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |

Use `full_coherence_reward()` for the complete 6-layer check, or `robust_coherence_reward()` for a faster 3-layer version.
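As a rough illustration of the layer-1 check, here is a hypothetical re-implementation of a `<think>`-tag format reward. This is a sketch, not the actual `match_format_exactly` from `src/prolewiki_llm/grpo_rewards.py`, which may score partial matches differently:

```python
import re

# Hypothetical format check: a completion must open with a <think>...</think>
# reasoning block to earn the reward. (Sketch only; the repository's
# match_format_exactly may differ.)
THINK_RE = re.compile(r"^<think>.*?</think>", re.DOTALL)

def format_reward(completions):
    """Return 1.0 for completions that open with a <think>...</think> block, else 0.0."""
    return [1.0 if THINK_RE.match(c.strip()) else 0.0 for c in completions]

print(format_reward([
    "<think>Trace the class character of the state.</think> The state is...",
    "The state is an instrument of class rule.",  # no reasoning block
]))  # [1.0, 0.0]
```

Binary format rewards like this are cheap to compute, which is why structural checks usually run before the heavier NLI and spaCy layers.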

## Training Configuration

Key environment variables for `train_headless.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |
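The table above can be read as an env-var-with-fallback pattern. The sketch below is illustrative (not the script's actual code); the variable names and defaults mirror the table:

```python
import os

# Illustrative sketch of how train_headless.py could read its configuration.
# Defaults are the documented ones; set the environment variable to override.
config = {
    "model_name": os.environ.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
    "max_steps": int(os.environ.get("MAX_STEPS", "500")),
    "batch_size": int(os.environ.get("BATCH_SIZE", "2")),
    "learning_rate": float(os.environ.get("LEARNING_RATE", "5e-6")),
    "reward_mode": os.environ.get("REWARD_MODE", "FULL"),
    "hf_repo": os.environ.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
}
```

Numeric variables are parsed explicitly since environment values always arrive as strings.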

## GPU Requirements

| GPU | VRAM | Price | Recommendation |
|-----|------|-------|----------------|
| **A40** | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |

## Critical Notes

1. **torch.compile must be disabled** on RunPod/Jupyter (it causes hangs)
2. **load_in_4bit=False** is required for GRPO (16-bit LoRA adapters)
3. **use_gradient_checkpointing=True** (not `"unsloth"`) for stability
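A minimal sketch of how these three constraints might translate into an Unsloth model setup. This is an assumption, not the repository's actual code: `max_seq_length` and the LoRA rank are placeholder values, and `train_headless.py` is authoritative for the real arguments:

```python
import os

# Note 1: disable torch.compile before torch is imported (sketch; placeholder approach)
os.environ["TORCHDYNAMO_DISABLE"] = "1"

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-0528-Qwen3-8B",
    max_seq_length=2048,   # placeholder value
    load_in_4bit=False,    # Note 2: 16-bit weights are required for GRPO
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                              # placeholder LoRA rank
    use_gradient_checkpointing=True,   # Note 3: plain True, not "unsloth"
)
```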

## Related Projects

- [ProleWiki](https://en.prolewiki.org/) - The Marxist-Leninist encyclopedia
- [pw-mcp](https://github.com/prolewiki/pw-mcp) - MCP server for ProleWiki semantic search

## License

AGPL-3.0-only
|