---
license: mit
language:
- en
tags:
- sat-solver
- variable-selection
- dpo
- direct-preference-optimization
- qwen3
- combinatorial-optimization
- neural-symbolic
base_model: Qwen/Qwen3-4B
datasets:
- Yale-ROSE/SAT-CNF-Rewards
pipeline_tag: text-classification
---

# Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO

A neural SAT variable selector fine-tuned with **Direct Preference Optimization (DPO)** on cube-score-aligned preferences. This model predicts which variable to branch on next when solving SAT problems, optimized for **actual solving performance** rather than imitation of expert choices.

## Model Description

This model is a `QwenVarClassifier` built on the Qwen3-4B backbone, trained in two stages:

1. **Stage 1 - SFT**: Supervised fine-tuning on expert variable selections from MiniZinc/CaDiCaL, with symmetry augmentation
2. **Stage 2 - DPO**: Direct Preference Optimization using cube-score-aligned preference pairs

### Key Innovation: Cube-Score Alignment

Traditional supervised learning imitates expert choices, but experts don't always choose optimally. DPO training aligns the model with **cube scores**: the actual reward signal indicating how much a variable choice helps the SAT solver make progress.

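As a concrete sketch of how such pairs can be formed (a hypothetical helper; the field names are illustrative, but the margin filter matches the training data described below):

```python
def make_preference_pair(cnf_text, cube_scores, margin=0.3):
    """cube_scores: dict mapping each candidate variable to its cube score."""
    ranked = sorted(cube_scores, key=cube_scores.get, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if cube_scores[chosen] - cube_scores[rejected] < margin:
        return None  # filtered out: the preference signal is too weak
    return {"cnf": cnf_text, "chosen_var": chosen, "rejected_var": rejected}
```
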
## Performance

| Model | Avg Reward (Cube Score) | Reward Gap | Top-1 Accuracy |
|-------|-------------------------|------------|----------------|
| **DPO-7500** (this model) | **1.489** | **0.239** | 18.51% |
| SFT Baseline | 1.368 | 0.218 | 25.06% |
| GRPO-5x | 1.418 | 0.221 | 16.51% |

**Key Results:**
- **+8.8% improvement** in average cube score over the SFT baseline
- DPO trades accuracy (agreement with expert labels) for better actual solving performance
- When DPO disagrees with the expert, it often picks variables with higher cube scores

### Why Accuracy Drops But Performance Improves

Top-1 accuracy measures agreement with expert labels, but experts don't always choose the variable with the highest cube score. DPO optimizes for cube scores directly, so it may disagree with experts when better options exist; this is **desirable behavior**.

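A toy example with made-up numbers makes the trade-off concrete:

```python
# Illustration only: the numbers are made up.
expert_var, model_var = 4, 7
cube_scores = {4: 1.30, 7: 1.62}  # hypothetical per-variable cube scores

top1_hit = model_var == expert_var                              # False: hurts top-1 accuracy
reward_gain = cube_scores[model_var] - cube_scores[expert_var]  # +0.32: raises avg reward
print(top1_hit, reward_gain)
```
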
## Architecture

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class QwenVarClassifier(nn.Module):
    """
    Input: CNF formula in DIMACS format
    Output: Logits for each variable (1 to max_vars; index 0 is unused)
    """
    def __init__(self, model_name="Qwen/Qwen3-4B", max_vars=600):
        super().__init__()
        self.qwen = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.qwen.config.hidden_size
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, max_vars + 1)
        )

    def forward(self, input_ids, attention_mask):
        # Run the backbone and keep all hidden states
        outputs = self.qwen(input_ids=input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1]  # (batch, seq, hidden)

        # Pool using the last non-padding token of each sequence
        seq_lengths = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(input_ids.size(0), device=input_ids.device)
        last_token_hidden = last_hidden[batch_indices, seq_lengths]

        # Classify over variable indices
        logits = self.head(last_token_hidden)
        return logits
```

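A quick shape check (a minimal sketch, assuming the tokenizer loaded in the Usage section below):

```python
# The head emits one logit per variable slot; index 0 is unused.
model = QwenVarClassifier("Qwen/Qwen3-4B", max_vars=600)
batch = tokenizer("p cnf 3 2\n1 -2 0\n2 3 0\n", return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
assert logits.shape == (1, 601)
```
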
## Usage

### Loading the Model

```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO")

# Load the fine-tuned weights (a state dict saved as model.pt)
checkpoint = torch.load("model.pt", map_location="cuda")

# Initialize the model architecture and restore the weights
from sft_qwen_var_classifier import QwenVarClassifier
model = QwenVarClassifier("Qwen/Qwen3-4B", max_vars=600)
model.load_state_dict(checkpoint)
model.to("cuda")
model.eval()
```

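If the checkpoint has not been downloaded yet, it can be fetched from the Hub first (a sketch assuming the weights are stored as `model.pt` at the repo root):

```python
from huggingface_hub import hf_hub_download

# Fetch the checkpoint file, then load it exactly as above
ckpt_path = hf_hub_download(
    repo_id="Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO",
    filename="model.pt",  # assumed filename, matching the snippet above
)
checkpoint = torch.load(ckpt_path, map_location="cuda")
```
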
### Making Predictions

```python
def cnf_valid_mask(cnf_text: str, max_vars: int = 600):
    """Minimal stand-in for the repo's mask helper: entry v is 1 iff variable v occurs in the CNF."""
    mask = [0] * (max_vars + 1)
    for line in cnf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "pc":  # skip the header and comment lines
            continue
        for lit in line.split():
            v = abs(int(lit))
            if 0 < v <= max_vars:
                mask[v] = 1
    return mask

def predict_variable(cnf_text: str, model, tokenizer, device="cuda"):
    """Predict the best variable to branch on."""
    # Tokenize
    inputs = tokenizer(cnf_text, return_tensors="pt", max_length=8192, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the valid variable mask from the CNF
    valid_mask = torch.tensor(cnf_valid_mask(cnf_text, max_vars=600), device=device)

    # Predict
    with torch.no_grad():
        logits = model(**inputs)
        # Mask out variables that do not occur in the formula
        logits = logits.masked_fill(~valid_mask.bool(), float('-inf'))
        pred_var = logits.argmax(dim=-1).item()

    return pred_var

# Example CNF (DIMACS format)
cnf = """p cnf 5 3
1 -2 3 0
-1 2 0
2 -3 0
"""

pred = predict_variable(cnf, model, tokenizer)
print(f"Predicted variable: {pred}")
```

### Input Format

The model expects CNF formulas in DIMACS format:

```
p cnf <num_vars> <num_clauses>
<literal1> <literal2> ... 0
<literal1> <literal2> ... 0
...
```

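If you build inputs programmatically, a small helper (hypothetical; not part of the released code) can serialize clause lists into this format:

```python
def to_dimacs(clauses, num_vars):
    """Serialize clauses (lists of signed ints) into DIMACS CNF text."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines) + "\n"

# Reproduces the example CNF from the prediction snippet above
print(to_dimacs([[1, -2, 3], [-1, 2], [2, -3]], num_vars=5))
```
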
## Training Details

### DPO Training Configuration

```yaml
# Training hyperparameters
beta: 0.1              # DPO temperature
learning_rate: 1e-6    # Lower LR for DPO
epochs: 3
batch_size: 1
gradient_accumulation: 16   # effective batch size 16
max_length: 8192

# Data
train_pairs: 48982     # filtered (margin ≥ 0.3)
valid_pairs: 5393
```

### DPO Loss Function

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss over variable logits for the classifier architecture."""
    policy_diff = policy_chosen - policy_rejected
    ref_diff = ref_chosen - ref_rejected
    logits = beta * (policy_diff - ref_diff)
    return -F.logsigmoid(logits).mean()
```

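For context, here is a minimal sketch of one update using this loss. The frozen reference model `ref` (a copy of the SFT checkpoint), the batch field names, and the accumulation wiring are assumptions based on the configuration above:

```python
import torch

def dpo_step(policy, ref, batch, beta=0.1, accum=16):
    """One DPO micro-step; `ref` is a frozen copy of the SFT model (assumed setup)."""
    logits_p = policy(batch["input_ids"], batch["attention_mask"])  # (B, max_vars + 1)
    with torch.no_grad():
        logits_r = ref(batch["input_ids"], batch["attention_mask"])
    idx = torch.arange(logits_p.size(0), device=logits_p.device)
    loss = dpo_loss(
        policy_chosen=logits_p[idx, batch["chosen_var"]],
        policy_rejected=logits_p[idx, batch["rejected_var"]],
        ref_chosen=logits_r[idx, batch["chosen_var"]],
        ref_rejected=logits_r[idx, batch["rejected_var"]],
        beta=beta,
    )
    (loss / accum).backward()  # optimizer.step() fires every `accum` micro-batches
    return loss.item()
```
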
### Training Progress

The model was trained for 9,000 steps. Checkpoint-7500 achieved the best average reward on validation.

| Checkpoint | Avg Reward | Reward Gap |
|------------|------------|------------|
| 1000 | 1.335 | 0.192 |
| 3500 | 1.427 | 0.226 |
| 5000 | 1.460 | 0.235 |
| **7500** | **1.489** | **0.239** |
| 9000 | 1.461 | 0.234 |

## Training Data

- **SFT Data**: `Yale-ROSE/SAT-CNF-Rewards` (sym_augmented split)
- **DPO Preferences**: `Yale-ROSE/SAT-CNF-Rewards` (dpo/ folder)
  - Generated using cube scores from the CaDiCaL solver
  - Filtered for margin ≥ 0.3

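To pull the data (a sketch; the split and folder names are taken from the descriptions above and may need adjusting to how the dataset is published):

```python
from datasets import load_dataset

sft_data = load_dataset("Yale-ROSE/SAT-CNF-Rewards", split="sym_augmented")  # assumed split name
dpo_pairs = load_dataset("Yale-ROSE/SAT-CNF-Rewards", data_dir="dpo")        # assumed layout
```
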
## Related Models

- **SFT Base**: [Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug](https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug)
- **GRPO Variant**: [Yale-ROSE/Qwen3-4B-SAT-VarSelector-GRPO-5x](https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-GRPO-5x)

## Limitations

- Optimized for SAT problems with up to 600 variables
- Input length limited to 8192 tokens
- Cube scores computed with a 100 ms solver timeout

## Citation

```bibtex
@misc{sat-var-dpo-2026,
  title={Neural SAT Variable Selection with Direct Preference Optimization},
  author={Yale ROSE Lab},
  year={2026},
  howpublished={HuggingFace Models},
  url={https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO}
}
```

## License

MIT License