---
license: mit
language:
- en
tags:
- sat-solver
- variable-selection
- dpo
- direct-preference-optimization
- qwen3
- combinatorial-optimization
- neural-symbolic
base_model: Qwen/Qwen3-4B
datasets:
- Yale-ROSE/SAT-CNF-Rewards
pipeline_tag: text-classification
---

# Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO

A neural SAT variable selector fine-tuned with **Direct Preference Optimization (DPO)** on cube-score-aligned preferences. This model predicts which variable to branch on next when solving SAT problems, optimized for **actual solving performance** rather than imitation of expert choices.

## Model Description

This model is a `QwenVarClassifier` built on the Qwen3-4B backbone, trained in two stages:

1. **Stage 1 - SFT**: Supervised fine-tuning on expert variable selections from MiniZinc/CaDiCaL, with symmetry augmentation
2. **Stage 2 - DPO**: Direct Preference Optimization using cube-score-aligned preference pairs

### Key Innovation: Cube-Score Alignment

Traditional supervised learning imitates expert choices, but experts don't always choose optimally. DPO training aligns the model with **cube scores**: the actual reward signal indicating how much a variable choice helps the SAT solver make progress.

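As a concrete sketch of how such pairs can be formed (a hypothetical helper; the field names are illustrative, but the margin filter matches the training data described below):

```python
def make_preference_pair(cnf_text, cube_scores, margin=0.3):
    """cube_scores: dict mapping each candidate variable to its cube score."""
    ranked = sorted(cube_scores, key=cube_scores.get, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if cube_scores[chosen] - cube_scores[rejected] < margin:
        return None  # filtered out: the preference signal is too weak
    return {"cnf": cnf_text, "chosen_var": chosen, "rejected_var": rejected}
```
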
## Performance

| Model | Avg Reward (Cube Score) | Reward Gap | Top-1 Accuracy |
|-------|-------------------------|------------|----------------|
| **DPO-7500** (this model) | **1.489** | **0.239** | 18.51% |
| SFT Baseline | 1.368 | 0.218 | 25.06% |
| GRPO-5x | 1.418 | 0.221 | 16.51% |

**Key Results:**
- **+8.8% improvement** in average cube score over the SFT baseline
- DPO trades accuracy (agreement with expert labels) for better actual solving performance
- When DPO disagrees with the expert, it often picks variables with higher cube scores

### Why Accuracy Drops But Performance Improves

Top-1 accuracy measures agreement with expert labels, but experts don't always choose the variable with the highest cube score. DPO optimizes for cube scores directly, so it may disagree with experts when better options exist; this is **desirable behavior**.

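A toy example with made-up numbers makes the trade-off concrete:

```python
# Illustration only: the numbers are made up.
expert_var, model_var = 4, 7
cube_scores = {4: 1.30, 7: 1.62}  # hypothetical per-variable cube scores

top1_hit = model_var == expert_var                              # False: hurts top-1 accuracy
reward_gain = cube_scores[model_var] - cube_scores[expert_var]  # +0.32: raises avg reward
print(top1_hit, reward_gain)
```
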
## Architecture

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class QwenVarClassifier(nn.Module):
    """
    Input: CNF formula in DIMACS format
    Output: Logits for each variable (1 to max_vars; index 0 is unused)
    """
    def __init__(self, model_name="Qwen/Qwen3-4B", max_vars=600):
        super().__init__()
        self.qwen = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.qwen.config.hidden_size
        self.head = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, max_vars + 1)
        )

    def forward(self, input_ids, attention_mask):
        # Run the backbone and keep all hidden states
        outputs = self.qwen(input_ids=input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1]  # (batch, seq, hidden)

        # Pool using the last non-padding token of each sequence
        seq_lengths = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(input_ids.size(0), device=input_ids.device)
        last_token_hidden = last_hidden[batch_indices, seq_lengths]

        # Classify over variable indices
        logits = self.head(last_token_hidden)
        return logits
```

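A quick shape check (a minimal sketch, assuming the tokenizer loaded in the Usage section below):

```python
# The head emits one logit per variable slot; index 0 is unused.
model = QwenVarClassifier("Qwen/Qwen3-4B", max_vars=600)
batch = tokenizer("p cnf 3 2\n1 -2 0\n2 3 0\n", return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
assert logits.shape == (1, 601)
```
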
## Usage

### Loading the Model

```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO")

# Load the fine-tuned weights (a state dict saved as model.pt)
checkpoint = torch.load("model.pt", map_location="cuda")

# Initialize the model architecture and restore the weights
from sft_qwen_var_classifier import QwenVarClassifier
model = QwenVarClassifier("Qwen/Qwen3-4B", max_vars=600)
model.load_state_dict(checkpoint)
model.to("cuda")
model.eval()
```

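If the checkpoint has not been downloaded yet, it can be fetched from the Hub first (a sketch assuming the weights are stored as `model.pt` at the repo root):

```python
from huggingface_hub import hf_hub_download

# Fetch the checkpoint file, then load it exactly as above
ckpt_path = hf_hub_download(
    repo_id="Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO",
    filename="model.pt",  # assumed filename, matching the snippet above
)
checkpoint = torch.load(ckpt_path, map_location="cuda")
```
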
### Making Predictions

```python
def cnf_valid_mask(cnf_text: str, max_vars: int = 600):
    """Minimal stand-in for the repo's mask helper: entry v is 1 iff variable v occurs in the CNF."""
    mask = [0] * (max_vars + 1)
    for line in cnf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "pc":  # skip the header and comment lines
            continue
        for lit in line.split():
            v = abs(int(lit))
            if 0 < v <= max_vars:
                mask[v] = 1
    return mask

def predict_variable(cnf_text: str, model, tokenizer, device="cuda"):
    """Predict the best variable to branch on."""
    # Tokenize
    inputs = tokenizer(cnf_text, return_tensors="pt", max_length=8192, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the valid variable mask from the CNF
    valid_mask = torch.tensor(cnf_valid_mask(cnf_text, max_vars=600), device=device)

    # Predict
    with torch.no_grad():
        logits = model(**inputs)
        # Mask out variables that do not occur in the formula
        logits = logits.masked_fill(~valid_mask.bool(), float('-inf'))
        pred_var = logits.argmax(dim=-1).item()

    return pred_var

# Example CNF (DIMACS format)
cnf = """p cnf 5 3
1 -2 3 0
-1 2 0
2 -3 0
"""

pred = predict_variable(cnf, model, tokenizer)
print(f"Predicted variable: {pred}")
```

### Input Format

The model expects CNF formulas in DIMACS format:

```
p cnf <num_vars> <num_clauses>
<literal1> <literal2> ... 0
<literal1> <literal2> ... 0
...
```

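If you build inputs programmatically, a small helper (hypothetical; not part of the released code) can serialize clause lists into this format:

```python
def to_dimacs(clauses, num_vars):
    """Serialize clauses (lists of signed ints) into DIMACS CNF text."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines) + "\n"

# Reproduces the example CNF from the prediction snippet above
print(to_dimacs([[1, -2, 3], [-1, 2], [2, -3]], num_vars=5))
```
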
## Training Details

### DPO Training Configuration

```yaml
# Training hyperparameters
beta: 0.1              # DPO temperature
learning_rate: 1e-6    # Lower LR for DPO
epochs: 3
batch_size: 1
gradient_accumulation: 16   # effective batch size 16
max_length: 8192

# Data
train_pairs: 48982     # filtered (margin ≥ 0.3)
valid_pairs: 5393
```

### DPO Loss Function

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss over variable logits for the classifier architecture."""
    policy_diff = policy_chosen - policy_rejected
    ref_diff = ref_chosen - ref_rejected
    logits = beta * (policy_diff - ref_diff)
    return -F.logsigmoid(logits).mean()
```

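For context, here is a minimal sketch of one update using this loss. The frozen reference model `ref` (a copy of the SFT checkpoint), the batch field names, and the accumulation wiring are assumptions based on the configuration above:

```python
import torch

def dpo_step(policy, ref, batch, beta=0.1, accum=16):
    """One DPO micro-step; `ref` is a frozen copy of the SFT model (assumed setup)."""
    logits_p = policy(batch["input_ids"], batch["attention_mask"])  # (B, max_vars + 1)
    with torch.no_grad():
        logits_r = ref(batch["input_ids"], batch["attention_mask"])
    idx = torch.arange(logits_p.size(0), device=logits_p.device)
    loss = dpo_loss(
        policy_chosen=logits_p[idx, batch["chosen_var"]],
        policy_rejected=logits_p[idx, batch["rejected_var"]],
        ref_chosen=logits_r[idx, batch["chosen_var"]],
        ref_rejected=logits_r[idx, batch["rejected_var"]],
        beta=beta,
    )
    (loss / accum).backward()  # optimizer.step() fires every `accum` micro-batches
    return loss.item()
```
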
### Training Progress

The model was trained for 9,000 steps. Checkpoint-7500 achieved the best average reward on validation.

| Checkpoint | Avg Reward | Reward Gap |
|------------|------------|------------|
| 1000 | 1.335 | 0.192 |
| 3500 | 1.427 | 0.226 |
| 5000 | 1.460 | 0.235 |
| **7500** | **1.489** | **0.239** |
| 9000 | 1.461 | 0.234 |

## Training Data

- **SFT Data**: `Yale-ROSE/SAT-CNF-Rewards` (sym_augmented split)
- **DPO Preferences**: `Yale-ROSE/SAT-CNF-Rewards` (dpo/ folder)
  - Generated using cube scores from the CaDiCaL solver
  - Filtered for margin ≥ 0.3

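To pull the data (a sketch; the split and folder names are taken from the descriptions above and may need adjusting to how the dataset is published):

```python
from datasets import load_dataset

sft_data = load_dataset("Yale-ROSE/SAT-CNF-Rewards", split="sym_augmented")  # assumed split name
dpo_pairs = load_dataset("Yale-ROSE/SAT-CNF-Rewards", data_dir="dpo")        # assumed layout
```
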
## Related Models

- **SFT Base**: [Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug](https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug)
- **GRPO Variant**: [Yale-ROSE/Qwen3-4B-SAT-VarSelector-GRPO-5x](https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-GRPO-5x)

## Limitations

- Optimized for SAT problems with up to 600 variables
- Input length limited to 8192 tokens
- Cube scores computed with a 100 ms solver timeout

## Citation

```bibtex
@misc{sat-var-dpo-2026,
  title={Neural SAT Variable Selection with Direct Preference Optimization},
  author={Yale ROSE Lab},
  year={2026},
  howpublished={HuggingFace Models},
  url={https://huggingface.co/Yale-ROSE/Qwen3-4B-SAT-VarSelector-Sym-Aug-DPO}
}
```

## License

MIT License