---
language:
  - en
license: agpl-3.0
library_name: transformers
tags:
  - grpo
  - rlhf
  - fine-tuning
  - marxism
  - political-theory
  - lora
  - deepseek
  - qwen
datasets:
  - prolewiki/qa-corpus
base_model: unsloth/DeepSeek-R1-0528-Qwen3-8B
pipeline_tag: text-generation
---

# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview

This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:

- **Multi-Layer Reward System**: 17+ reward functions designed to resist reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components

## Quick Start

### RunPod Deployment (Recommended)

```bash
# 1. Build Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to registry and deploy on RunPod
# Use A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```

### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download spaCy model (required for rewards)
python -m spacy download en_core_web_sm

# Run tests
uv run pytest -m "not slow and not gpu"
```

## Repository Structure

```
prolewiki-llm/
β”œβ”€β”€ src/prolewiki_llm/
β”‚   β”œβ”€β”€ grpo_rewards.py       # Multi-layer reward functions
β”‚   β”œβ”€β”€ train_headless.py     # Headless training script
β”‚   β”œβ”€β”€ export_grpo_dataset.py # Dataset conversion
β”‚   └── wandb_logging.py      # W&B integration
β”œβ”€β”€ docker/
β”‚   β”œβ”€β”€ Dockerfile            # Training container
β”‚   β”œβ”€β”€ start.sh              # Entrypoint with auto-shutoff
β”‚   └── .env.example          # Environment reference
β”œβ”€β”€ notebooks/
β”‚   └── Marxist_GRPO_RunPod_Optimized.ipynb
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ unit/                 # Unit tests
β”‚   β”œβ”€β”€ integration/          # Shell script tests
β”‚   └── fixtures/             # Mock commands
└── training_data/
    └── grpo_dataset.jsonl    # Training data
```

## Reward Functions

The reward system uses multiple layers to ensure quality responses:

| Layer | Function | Purpose |
|-------|----------|---------|
| 1 | `match_format_exactly` | Validate `<think>...</think>` tags |
| 2 | `nli_coherence_reward` | Response entails ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |

Use `full_coherence_reward()` for the complete 6-layer check, or `robust_coherence_reward()` for a faster 3-layer version.
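The layers combine into a single scalar per response. A minimal sketch of that pattern, using a toy Layer 1 check (signatures here are illustrative, not the actual `grpo_rewards.py` API):

```python
import re

# Illustrative sketch only; the real functions in grpo_rewards.py may differ.
def match_format_exactly(response: str) -> float:
    """Layer 1: reward responses whose reasoning is wrapped in <think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def combined_reward(response: str, layers, weights=None) -> float:
    """Weighted sum over reward layers (equal weights by default)."""
    weights = weights or [1.0] * len(layers)
    return sum(w * fn(response) for w, fn in zip(weights, layers))

score = combined_reward("<think>Class analysis here.</think> Answer.",
                        [match_format_exactly])
```

The real NLI and spaCy layers are model-backed rather than regex-based, but the aggregation shape is the same: each layer maps a response to a score, and the trainer sees one combined number.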

## Training Configuration

Key environment variables for `train_headless.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |
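A hypothetical sketch of how a script might read this table from the environment (the defaults mirror the table; the actual parsing in `train_headless.py` may differ):

```python
import os

# Hypothetical config loader; variable names and defaults follow the table above.
def load_config(env=None):
    env = os.environ if env is None else env
    return {
        "model_name": env.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
        "max_steps": int(env.get("MAX_STEPS", "500")),
        "batch_size": int(env.get("BATCH_SIZE", "2")),
        "learning_rate": float(env.get("LEARNING_RATE", "5e-6")),
        "reward_mode": env.get("REWARD_MODE", "FULL"),
        "hf_repo": env.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
    }

cfg = load_config({"MAX_STEPS": "100", "REWARD_MODE": "ROBUST"})
```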

## GPU Requirements

| GPU | VRAM | Price | Recommendation |
|-----|------|-------|----------------|
| **A40** | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |

## Critical Notes

1. **torch.compile must be disabled** on RunPod/Jupyter (causes hangs)
2. **load_in_4bit=False** is required for GRPO (16-bit LoRA adapters)
3. **use_gradient_checkpointing=True** (not `"unsloth"`) for stability
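As a minimal sketch, these constraints look like the following (the kwarg names mirror the Unsloth-style options named above; `TORCHDYNAMO_DISABLE` is a standard PyTorch switch, though the project's scripts may disable compilation differently):

```python
import os

# Note 1: keep torch.compile/dynamo off on RunPod/Jupyter before torch loads.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Hypothetical settings dict reflecting notes 2 and 3; the actual model-loading
# call in train_headless.py may take other arguments.
GRPO_SAFE_KWARGS = {
    "load_in_4bit": False,               # Note 2: GRPO needs 16-bit LoRA adapters
    "use_gradient_checkpointing": True,  # Note 3: boolean True, not "unsloth"
}
```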

## Related Projects

- [ProleWiki](https://en.prolewiki.org/) - The Marxist-Leninist encyclopedia
- [pw-mcp](https://github.com/prolewiki/pw-mcp) - MCP server for ProleWiki semantic search

## License

AGPL-3.0-only