FOL SLM — Natural Language → First-Order Logic
A small language model that translates plain-English premises and questions into First-Order Logic (FOL), then hands the FOL to a clingo ASP solver to derive a verifiable answer with a step-by-step proof.
Live demo: Venkatdatta/fol-slm-demo
Architecture
```
NL premises + question
          │
          ▼
T5-base Encoder (frozen, 768d, 12L)
          │  cross-attention
          ▼
Translation Decoder (4L, 512d)
          │
          ▼
FOL premises + FOL question
          │
          ▼
clingo ASP solver (symbolic reasoner)
          │
          ▼
Answer (True / False / Unknown) + proof chain
```
- Encoder: `t5-base` encoder, all weights frozen. Encodes the NL input and provides rich contextual representations to the decoder via cross-attention.
- Translation Decoder: 4-layer, 512-dimensional Transformer decoder trained to autoregressively generate the FOL translation. Uses `<extra_id_0>` to separate premises from the question in the encoder input; generation stops at `<extra_id_3>`.
- Reasoner: clingo Answer Set Programming solver with provenance tracking. Takes the generated FOL and produces the answer plus a human-readable proof chain. There is no neural component in the reasoning step.
Total trainable parameters: ~22M (decoder only; encoder is frozen).
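Because the rules are Horn-clause-like (single consequent, universally quantified), the generated FOL maps almost one-to-one onto ASP syntax for clingo. A minimal sketch of that mapping, assuming this restricted shape (`fol_to_asp` and its regexes are illustrative, not the repository's actual converter):

```python
import re

def fol_to_asp(fol: str) -> str:
    """Map a Horn-clause FOL string to a clingo ASP rule.

    Illustrative sketch only. Handles exactly two shapes:
      Pred(const)                      ->  pred(const).
      forall x (A(x) & B(x) -> C(x))   ->  c(X) :- a(X), b(X).
    """
    def atom(a: str) -> str:
        # Lowercase the predicate and uppercase the variable, per ASP convention.
        return re.sub(r"(\w+)\(x\)", lambda m: m.group(1).lower() + "(X)", a.strip())

    rule = re.match(r"forall x \((.+) -> (.+)\)$", fol)
    if rule:
        body, head = rule.groups()
        parts = [atom(p) for p in re.split(r"\s*&\s*", body)]
        return f"{atom(head)} :- {', '.join(parts)}."
    # Otherwise a ground fact: lowercase the predicate name.
    fact = re.match(r"(\w+)\((\w+)\)$", fol)
    return f"{fact.group(1).lower()}({fact.group(2)})."

print(fol_to_asp("Kind(anne)"))                      # kind(anne).
print(fol_to_asp("forall x (Kind(x) -> Furry(x))"))  # furry(X) :- kind(X).
```

clingo then grounds and solves these rules; the provenance tracking that produces the proof chain sits on top of the solver.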
Training
- Dataset: Venkatdatta/fol-data — 229k ProofWriter examples preprocessed to add FOL annotations and vocabulary augmentation
- Vocabulary augmentation: per-question random substitution from NLTK pools (7,372 entity names, 13,006 property words, 7,463 relation words) — forces structural generalisation over surface memorisation
- Optimizer: AdamW, lr=1.9028e-3, weight_decay=3e-3
- Schedule: linear decay with linear warmup (3,340 warmup steps = 16.7% of 20,000 total steps)
- Batch: 16 examples × 4 gradient accumulation = effective batch 64
- Precision: bfloat16 (loss in float32)
- Weighted sampler: boosts rare classes (`pos_False`, `neg_True`) ~11× to match the dominant class frequency
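The learning-rate schedule above is fully determined by the three stated numbers; writing it out makes the shape explicit. A sketch in plain Python (`lr_at` is a hypothetical helper, not the training code — in practice a framework scheduler would produce the same curve):

```python
PEAK_LR = 1.9028e-3   # from the AdamW settings above
WARMUP_STEPS = 3_340  # 16.7% of total
TOTAL_STEPS = 20_000

def lr_at(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(0))       # 0.0
print(lr_at(3_340))   # peak: 1.9028e-3
print(lr_at(20_000))  # 0.0
```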
Results (translation decoder, step 20k)
| Metric | Score |
|---|---|
| Premises FOL accuracy (fuzzy) | 85.8% |
| Question FOL exact match | 91.1% (60,621 / 66,556) |
Evaluated on the full test split (66,556 examples) with completely unseen vocabulary, confirming that the model learns a structural FOL mapping rather than memorising surface forms.
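The exact fuzzy criterion for premise accuracy is not spelled out here; one plausible reading is a whitespace-normalised string-similarity threshold per premise. A sketch under that assumption (`fuzzy_match` and the 0.9 threshold are illustrative, not the repository's metric):

```python
from difflib import SequenceMatcher

def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """Treat a predicted FOL premise as correct if it is nearly identical to the
    gold premise after collapsing whitespace. Illustrative only."""
    def norm(s: str) -> str:
        return " ".join(s.split())
    return SequenceMatcher(None, norm(pred), norm(gold)).ratio() >= threshold

# Tolerates spacing differences, rejects a different predicate:
print(fuzzy_match("forall x (Kind(x) -> Furry(x))",
                  "forall x ( Kind(x) -> Furry(x) )"))  # True
print(fuzzy_match("Kind(anne)", "Green(anne)"))         # False
```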
Usage
Via the Gradio Space (recommended)
Go to Venkatdatta/fol-slm-demo and enter premises + question in the text boxes.
Programmatic usage
```python
from huggingface_hub import hf_hub_download
from scripts.pipeline import load_model, run

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Venkatdatta/fol-slm", filename="checkpoint_final.pt")

# Load (once)
load_model(ckpt_path, config_path="configs/v12_translation.yaml")

# Run
result = run(
    nl_premises="Venkat is perseverant. Venkat is curious. Venkat is intuitive. "
                "If someone is curious and intuitive they explore. "
                "If someone is perseverant and explores they discover.",
    nl_question="Venkat discovers.",
)

print(result["answer"])        # "True"
print(result["fol_premises"])  # list of FOL premise strings
print(result["fol_question"])  # FOL question string
for step in result["proof"]:
    print(step)
```
Clone the source repository for scripts/pipeline.py, src/, and configs/.
Example
Premises:
Anne is kind. Bob is furry. If someone is kind then they are furry. If someone is furry then they are green.
Question:
Anne is green.
FOL Translation:
```
Kind(anne)
Furry(bob)
forall x (Kind(x) -> Furry(x))
forall x (Furry(x) -> Green(x))
```
Question: `Green(anne)`
Answer: ✅ True
Proof:
```
Kind(anne) and forall x (Kind(x) -> Furry(x)) -> therefore Furry(anne)
Furry(anne) and forall x (Furry(x) -> Green(x)) -> therefore Green(anne)
```
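This proof chain is ordinary forward chaining over ground Horn clauses. A self-contained sketch that reproduces it in pure Python, standing in for the clingo step (the rule encoding here is illustrative, limited to single-variable rules):

```python
# Ground facts and rules from the example, as (predicate, constant) pairs
# and (body predicates, head predicate) pairs.
facts = {("Kind", "anne"), ("Furry", "bob")}
rules = [(("Kind",), "Furry"), (("Furry",), "Green")]
consts = {c for _, c in facts}

proof, changed = [], True
while changed:  # repeat until no rule can fire (a fixpoint)
    changed = False
    for body, head in rules:
        for c in consts:
            # Fire the rule for constant c if every body atom holds and the
            # head is not yet known.
            if all((b, c) in facts for b in body) and (head, c) not in facts:
                facts.add((head, c))
                proof.append(" and ".join(f"{b}({c})" for b in body)
                             + f" -> therefore {head}({c})")
                changed = True

print(("Green", "anne") in facts)  # True — so "Anne is green." is answered True
for step in proof:
    print(step)
```

Note that forward chaining also derives facts the question never asks about (here `Green(bob)`); the provenance tracking in the real pipeline selects only the steps relevant to the query.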
Limitations
- Trained on ProofWriter-style synthetic reasoning (OWA splits). The FOL used is a Horn-clause-like subset — rules have a single consequent, no existential quantifiers, no disjunctions, and no nested quantifiers. This is not full FOL and the model does not generalise beyond this pattern set.
- Degraded accuracy on large premise sets. Over 90% of training examples have QDep ≤ 2 (at most two chained rule applications), so the model has limited exposure to densely connected rule sets. When a large number of rules is present, translation accuracy drops — the model tends to confuse predicates across rules or swap arguments, producing structurally plausible but semantically incorrect FOL.
- Proof correctness depends on the translation decoder producing well-formed FOL — if the FOL is malformed, clingo may return Unknown even for True/False cases.
- Vocabulary substitution covers 7k+ names but the model can still struggle with highly unusual names or multi-word predicates.
- `pos_False` and `neg_True` cases (a non-negated question whose answer is False, and a negated question whose answer is True) are rare in the training data and likely harder for the model.
Citation
If you use this model, please cite:
```bibtex
@misc{fol-slm-2026,
  author = {Venkat Datta Bommena},
  title  = {FOL SLM: Natural Language to First-Order Logic with Symbolic Reasoning},
  year   = {2026},
  url    = {https://huggingface.co/Venkatdatta/fol-slm}
}
```
Please also cite the original data source:
```bibtex
@misc{mathurinache-proofwriter-kaggle,
  author = {mathurinache},
  title  = {ProofWriter},
  year   = {2021},
  url    = {https://www.kaggle.com/datasets/mathurinache/proofwriter},
  note   = {Kaggle dataset}
}
```