FOL SLM — Natural Language → First-Order Logic
A small language model that translates plain-English premises and questions into First-Order Logic (FOL), then hands the FOL to a clingo ASP solver to derive a verifiable answer with a step-by-step proof.
Live demo: Venkatdatta/fol-slm-demo
Architecture
```
NL premises + question
          │
          ▼
T5-base Encoder (frozen, 768d, 12L)
          │  cross-attention
          ▼
Translation Decoder (4L, 512d)
          │
          ▼
FOL premises + FOL question
          │
          ▼
clingo ASP solver (symbolic reasoner)
          │
          ▼
Answer (True / False / Unknown) + proof chain
```
- Encoder: `t5-base` encoder, all weights frozen. Encodes the NL input and provides rich contextual representations to the decoder via cross-attention.
- Translation Decoder: 4-layer, 512-dimensional Transformer decoder trained to autoregressively generate the FOL translation. Uses `<extra_id_0>` to separate premises from the question in the encoder input; generation stops at `<extra_id_3>`.
- Reasoner: clingo Answer Set Programming solver with provenance tracking. Takes the generated FOL and produces the answer plus a human-readable proof chain. There is no neural component in the reasoning step.
Total trainable parameters: ~22M (decoder only; encoder is frozen).
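Because the rules are Horn-clause-like (single consequent, universally quantified), the generated FOL maps almost one-to-one onto ASP syntax for clingo. A minimal sketch of that mapping, assuming this restricted shape (`fol_to_asp` and its regexes are illustrative, not the repository's actual converter):

```python
import re

def fol_to_asp(fol: str) -> str:
    """Map a Horn-clause FOL string to a clingo ASP rule.

    Illustrative sketch only. Handles exactly two shapes:
      Pred(const)                      ->  pred(const).
      forall x (A(x) & B(x) -> C(x))   ->  c(X) :- a(X), b(X).
    """
    def atom(a: str) -> str:
        # Lowercase the predicate and uppercase the variable, per ASP convention.
        return re.sub(r"(\w+)\(x\)", lambda m: m.group(1).lower() + "(X)", a.strip())

    rule = re.match(r"forall x \((.+) -> (.+)\)$", fol)
    if rule:
        body, head = rule.groups()
        parts = [atom(p) for p in re.split(r"\s*&\s*", body)]
        return f"{atom(head)} :- {', '.join(parts)}."
    # Otherwise a ground fact: lowercase the predicate name.
    fact = re.match(r"(\w+)\((\w+)\)$", fol)
    return f"{fact.group(1).lower()}({fact.group(2)})."

print(fol_to_asp("Kind(anne)"))                      # kind(anne).
print(fol_to_asp("forall x (Kind(x) -> Furry(x))"))  # furry(X) :- kind(X).
```

clingo then grounds and solves these rules; the provenance tracking that produces the proof chain sits on top of the solver.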
Training
- Dataset: Venkatdatta/fol-data — 229k ProofWriter examples preprocessed to add FOL annotations and vocabulary augmentation
- Vocabulary augmentation: per-question random substitution from NLTK pools (7,372 entity names, 13,006 property words, 7,463 relation words) — forces structural generalisation over surface memorisation
- Optimizer: AdamW, lr=1.9028e-3, weight_decay=3e-3
- Schedule: linear decay with linear warmup (3,340 warmup steps = 16.7% of 20,000 total steps)
- Batch: 16 examples × 4 gradient accumulation = effective batch 64
- Precision: bfloat16 (loss in float32)
- Weighted sampler: boosts rare classes (`pos_False`, `neg_True`) ~11× to match the dominant class frequency
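The learning-rate schedule above is fully determined by the three stated numbers; writing it out makes the shape explicit. A sketch in plain Python (`lr_at` is a hypothetical helper, not the training code — in practice a framework scheduler would produce the same curve):

```python
PEAK_LR = 1.9028e-3   # from the AdamW settings above
WARMUP_STEPS = 3_340  # 16.7% of total
TOTAL_STEPS = 20_000

def lr_at(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(0))       # 0.0
print(lr_at(3_340))   # peak: 1.9028e-3
print(lr_at(20_000))  # 0.0
```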
Results (translation decoder, step 20k)
| Metric | Score |
|---|---|
| Premises FOL accuracy (fuzzy) | 85.8% |
| Question FOL exact match | 91.1% (60,621 / 66,556) |
Evaluated on the full test split (66,556 examples) with completely unseen vocabulary, confirming that the model learns a structural FOL mapping rather than memorising surface forms.
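The exact fuzzy criterion for premise accuracy is not spelled out here; one plausible reading is a whitespace-normalised string-similarity threshold per premise. A sketch under that assumption (`fuzzy_match` and the 0.9 threshold are illustrative, not the repository's metric):

```python
from difflib import SequenceMatcher

def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """Treat a predicted FOL premise as correct if it is nearly identical to the
    gold premise after collapsing whitespace. Illustrative only."""
    def norm(s: str) -> str:
        return " ".join(s.split())
    return SequenceMatcher(None, norm(pred), norm(gold)).ratio() >= threshold

# Tolerates spacing differences, rejects a different predicate:
print(fuzzy_match("forall x (Kind(x) -> Furry(x))",
                  "forall x ( Kind(x) -> Furry(x) )"))  # True
print(fuzzy_match("Kind(anne)", "Green(anne)"))         # False
```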
Usage
Via the Gradio Space (recommended)
Go to Venkatdatta/fol-slm-demo and enter premises + question in the text boxes.
Programmatic usage
```python
from huggingface_hub import hf_hub_download
from scripts.pipeline import load_model, run

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Venkatdatta/fol-slm", filename="checkpoint_final.pt")

# Load (once)
load_model(ckpt_path, config_path="configs/v12_translation.yaml")

# Run
result = run(
    nl_premises="Venkat is perseverant. Venkat is curious. Venkat is intuitive. "
                "If someone is curious and intuitive they explore. "
                "If someone is perseverant and explores they discover.",
    nl_question="Venkat discovers.",
)

print(result["answer"])        # "True"
print(result["fol_premises"])  # list of FOL premise strings
print(result["fol_question"])  # FOL question string
for step in result["proof"]:
    print(step)
```
Clone the source repository for scripts/pipeline.py, src/, and configs/.
Example
Premises:
Anne is kind. Bob is furry. If someone is kind then they are furry. If someone is furry then they are green.
Question:
Anne is green.
FOL Translation:
```
Kind(anne)
Furry(bob)
forall x (Kind(x) -> Furry(x))
forall x (Furry(x) -> Green(x))
```
Question: `Green(anne)`
Answer: ✅ True
Proof:
```
Kind(anne) and forall x (Kind(x) -> Furry(x)) -> therefore Furry(anne)
Furry(anne) and forall x (Furry(x) -> Green(x)) -> therefore Green(anne)
```
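This proof chain is ordinary forward chaining over ground Horn clauses. A self-contained sketch that reproduces it in pure Python, standing in for the clingo step (the rule encoding here is illustrative, limited to single-variable rules):

```python
# Ground facts and rules from the example, as (predicate, constant) pairs
# and (body predicates, head predicate) pairs.
facts = {("Kind", "anne"), ("Furry", "bob")}
rules = [(("Kind",), "Furry"), (("Furry",), "Green")]
consts = {c for _, c in facts}

proof, changed = [], True
while changed:  # repeat until no rule can fire (a fixpoint)
    changed = False
    for body, head in rules:
        for c in consts:
            # Fire the rule for constant c if every body atom holds and the
            # head is not yet known.
            if all((b, c) in facts for b in body) and (head, c) not in facts:
                facts.add((head, c))
                proof.append(" and ".join(f"{b}({c})" for b in body)
                             + f" -> therefore {head}({c})")
                changed = True

print(("Green", "anne") in facts)  # True — so "Anne is green." is answered True
for step in proof:
    print(step)
```

Note that forward chaining also derives facts the question never asks about (here `Green(bob)`); the provenance tracking in the real pipeline selects only the steps relevant to the query.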
Limitations
- Trained on ProofWriter-style synthetic reasoning (OWA splits). The FOL used is a Horn-clause-like subset — rules have a single consequent, no existential quantifiers, no disjunctions, and no nested quantifiers. This is not full FOL and the model does not generalise beyond this pattern set.
- Degraded accuracy on large premise sets. Over 90% of training examples have QDep ≤ 2 (at most two chained rule applications), so the model has limited exposure to densely connected rule sets. When a large number of rules is present, translation accuracy drops — the model tends to confuse predicates across rules or swap arguments, producing structurally plausible but semantically incorrect FOL.
- Proof correctness depends on the translation decoder producing well-formed FOL — if the FOL is malformed, clingo may return Unknown even for True/False cases.
- Vocabulary substitution covers 7k+ names but the model can still struggle with highly unusual names or multi-word predicates.
- `pos_False` and `neg_True` cases (a non-negated question whose answer is False, and a negated question whose answer is True) are rare in the training data and likely harder for the model.
Citation
If you use this model, please cite:
```bibtex
@misc{fol-slm-2026,
  author = {Venkat Datta Bommena},
  title  = {FOL SLM: Natural Language to First-Order Logic with Symbolic Reasoning},
  year   = {2026},
  url    = {https://huggingface.co/Venkatdatta/fol-slm}
}
```
Please also cite the original data source:
```bibtex
@misc{mathurinache-proofwriter-kaggle,
  author = {mathurinache},
  title  = {ProofWriter},
  year   = {2021},
  url    = {https://www.kaggle.com/datasets/mathurinache/proofwriter},
  note   = {Kaggle dataset}
}
```