FOL SLM — Natural Language → First-Order Logic

A small language model that translates plain-English premises and questions into First-Order Logic (FOL), then hands the FOL to a clingo ASP solver to derive a verifiable answer with a step-by-step proof.

Live demo: Venkatdatta/fol-slm-demo


Architecture

NL premises + question
        │
        ▼
T5-base Encoder (frozen, 768d, 12L)
        │   cross-attention
        ▼
Translation Decoder (4L, 512d)
        │
        ▼
FOL premises + FOL question
        │
        ▼
clingo ASP solver (symbolic reasoner)
        │
        ▼
Answer (True / False / Unknown) + proof chain
  • Encoder: t5-base encoder, all weights frozen. Encodes the NL input; provides rich contextual representations to the decoder via cross-attention.
  • Translation Decoder: 4-layer, 512-dimensional Transformer decoder trained to autoregressively generate the FOL translation. Uses <extra_id_0> to separate premises from the question in encoder input. Generation stops at <extra_id_3>.
  • Reasoner: clingo Answer Set Programming solver with provenance tracking. Takes the generated FOL and produces the answer + a human-readable proof chain. No neural component in the reasoning step.

Total trainable parameters: ~22M (decoder only; encoder is frozen).
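The frozen-encoder + small-decoder setup can be sketched in PyTorch. This is an illustrative stand-in, not the released implementation: the head count, feed-forward width, weight tying, and the 768→512 projection for cross-attention are assumptions, and the real model consumes hidden states from the frozen t5-base encoder rather than the random tensor used here.

```python
import torch
import torch.nn as nn

D_ENC, D_DEC, N_LAYERS, VOCAB = 768, 512, 4, 32100  # t5-base vocab size

class TranslationDecoder(nn.Module):
    """Hypothetical sketch of the 4-layer, 512-d translation decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_DEC)
        self.enc_proj = nn.Linear(D_ENC, D_DEC)  # map 768-d encoder states to 512-d
        layer = nn.TransformerDecoderLayer(
            d_model=D_DEC, nhead=8, dim_feedforward=2048, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_DEC, VOCAB, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying (assumed)

    def forward(self, tgt_ids, enc_states):
        mem = self.enc_proj(enc_states)  # frozen-encoder states as cross-attn memory
        tgt = self.embed(tgt_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        return self.lm_head(self.decoder(tgt, mem, tgt_mask=mask))

dec = TranslationDecoder()
logits = dec(torch.zeros(1, 6, dtype=torch.long), torch.randn(1, 10, D_ENC))
# one distribution over FOL tokens per decoding step
print(logits.shape)
```

In training, only these decoder weights receive gradients; the encoder parameters stay fixed, which is what keeps the trainable-parameter count small.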


Training

  • Dataset: Venkatdatta/fol-data — 229k ProofWriter examples preprocessed to add FOL annotations and vocabulary augmentation
  • Vocabulary augmentation: per-question random substitution from NLTK pools (7,372 entity names, 13,006 property words, 7,463 relation words) — forces structural generalisation over surface memorisation
  • Optimizer: AdamW, lr=1.9028e-3, weight_decay=3e-3
  • Schedule: linear decay with linear warmup (3,340 warmup steps = 16.7% of 20,000 total steps)
  • Batch: 16 examples × 4 gradient accumulation = effective batch 64
  • Precision: bfloat16 (loss in float32)
  • Weighted sampler: boosts rare classes (pos_False, neg_True) ~11× to match dominant class frequency
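The ~11× boost for rare classes can be reproduced with inverse-frequency example weights, e.g. via PyTorch's `WeightedRandomSampler`. The label distribution below is synthetic for illustration; in training the counts come from the actual dataset.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Synthetic label distribution: rare classes at ~1/11 of the dominant frequency
labels = (["pos_True"] * 110 + ["neg_False"] * 110
          + ["pos_False"] * 10 + ["neg_True"] * 10)

counts = Counter(labels)
dominant = max(counts.values())
# Weight each example so every class is drawn at the dominant class's rate
weights = [dominant / counts[lbl] for lbl in labels]

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
# rare-class examples are ~11x more likely to be drawn than dominant-class ones
print(weights[labels.index("pos_False")] / weights[labels.index("pos_True")])
```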

Results (translation decoder, step 20k)

Metric                         Score
Premises FOL accuracy (fuzzy)  85.8%
Question FOL exact match       91.1% (60,621 / 66,556)

Evaluated on the full test split (66,556 examples) with completely unseen vocabulary, confirming that the model learns the structural FOL mapping rather than surface-level memorisation.
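The card does not spell out the fuzzy metric behind the 85.8% figure; a plausible stand-in is a per-premise string-similarity threshold, alongside the strict equality used for question FOL. The function below is illustrative only, not the project's evaluation code.

```python
from difflib import SequenceMatcher

def fuzzy_premise_accuracy(pred, gold, threshold=0.9):
    """Fraction of gold FOL premises matched by some predicted premise
    above a similarity threshold. Illustrative stand-in: the exact
    fuzzy metric used for the reported score is not specified."""
    if not gold:
        return 1.0
    hits = 0
    for g in gold:
        best = max((SequenceMatcher(None, g, p).ratio() for p in pred), default=0.0)
        if best >= threshold:
            hits += 1
    return hits / len(gold)

def question_exact_match(pred, gold):
    """Strict string equality, as reported for question FOL."""
    return pred == gold

# A near-miss premise ("Furry(bobb)") still counts under the fuzzy criterion
print(fuzzy_premise_accuracy(["Kind(anne)", "Furry(bobb)"],
                             ["Kind(anne)", "Furry(bob)"]))
```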


Usage

Via the Gradio Space (recommended)

Go to Venkatdatta/fol-slm-demo and enter premises + question in the text boxes.

Programmatic usage

from huggingface_hub import hf_hub_download
from scripts.pipeline import load_model, run

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Venkatdatta/fol-slm", filename="checkpoint_final.pt")

# Load (once)
load_model(ckpt_path, config_path="configs/v12_translation.yaml")

# Run
result = run(
    nl_premises="Venkat is perseverant. Venkat is curious. Venkat is intuitive. "
                "If someone is curious and intuitive they explore. "
                "If someone is perseverant and explores they discover.",
    nl_question="Venkat discovers.",
)

print(result["answer"])        # "True"
print(result["fol_premises"])  # list of FOL premise strings
print(result["fol_question"])  # FOL question string
for step in result["proof"]:
    print(step)

Clone the source repository for scripts/pipeline.py, src/, and configs/.


Example

Premises:

Anne is kind. Bob is furry. If someone is kind then they are furry. If someone is furry then they are green.

Question:

Anne is green.

FOL Translation:

Kind(anne)
Furry(bob)
forall x (Kind(x) -> Furry(x))
forall x (Furry(x) -> Green(x))

Question: Green(anne)

Answer: ✅ True

Proof:

Kind(anne) and forall x (Kind(x) -> Furry(x)), therefore Furry(anne)
Furry(anne) and forall x (Furry(x) -> Green(x)), therefore Green(anne)
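The released system derives this proof with clingo; for intuition, the same derivation can be reproduced with a few lines of naive forward chaining. This toy chainer handles only the unary, single-variable Horn subset described under Limitations and is not the project's reasoner.

```python
def forward_chain(facts, rules):
    """Naive forward chaining over unary Horn rules.
    facts: set of (Predicate, entity) pairs;
    rules: list of (body_predicates, head_predicate)."""
    facts, proof = set(facts), []
    changed = True
    while changed:
        changed = False
        entities = {e for _, e in facts}
        for body, head in rules:
            for e in sorted(entities):
                if all((p, e) in facts for p in body) and (head, e) not in facts:
                    facts.add((head, e))
                    proof.append(" and ".join(f"{p}({e})" for p in body)
                                 + f", therefore {head}({e})")
                    changed = True
    return facts, proof

# The worked example above, in (Predicate, entity) form
facts, proof = forward_chain(
    {("Kind", "anne"), ("Furry", "bob")},
    [(("Kind",), "Furry"), (("Furry",), "Green")],
)
print(("Green", "anne") in facts)  # the question Green(anne) is entailed
for step in proof:
    print(step)
```

Note that clingo additionally handles the open-world "Unknown" verdict and negation, which this sketch does not.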

Limitations

  • Trained on ProofWriter-style synthetic reasoning (OWA splits). The FOL used is a Horn-clause-like subset — rules have a single consequent, no existential quantifiers, no disjunctions, and no nested quantifiers. This is not full FOL and the model does not generalise beyond this pattern set.
  • Degraded accuracy on large premise sets. Over 90% of training examples have QDep ≤ 2 (at most two chained rule applications), so the model has limited exposure to densely connected rule sets. When a large number of rules is present, translation accuracy drops — the model tends to confuse predicates across rules or swap arguments, producing structurally plausible but semantically incorrect FOL.
  • Proof correctness depends on the translation decoder producing well-formed FOL — if the FOL is malformed, clingo may return Unknown even for True/False cases.
  • Vocabulary substitution covers 7k+ names but the model can still struggle with highly unusual names or multi-word predicates.
  • pos_False and neg_True cases (non-negated→False, negated→True) are rare in the training data and likely harder for the model.

Citation

If you use this model, please cite:

@misc{fol-slm-2026,
  author = {Venkat Datta Bommena},
  title  = {FOL SLM: Natural Language to First-Order Logic with Symbolic Reasoning},
  year   = {2026},
  url    = {https://huggingface.co/Venkatdatta/fol-slm}
}

Please also cite the original data source:

@misc{mathurinache-proofwriter-kaggle,
  author = {mathurinache},
  title  = {ProofWriter},
  year   = {2021},
  url    = {https://www.kaggle.com/datasets/mathurinache/proofwriter},
  note   = {Kaggle dataset}
}