
METANO

Metal-Aware InChI to IUPAC Transformer with Neuro-Symbolic Oversight


Model Details

Model Description

Converting chemical identifiers such as InChI to IUPAC names is crucial in cheminformatics. While recent models excel with organic compounds, they struggle with inorganic and organometallic compounds because Standard InChI representations break metal-ligand bonds, leading to the loss of structural details.

METANO is a hybrid framework that combines the pattern-recognition strengths of transformer-based sequence-to-sequence models with chemistry-aware symbolic checks. It uses Reconnected InChI for metal-ligand connections and character-level tokenisation with custom control tokens. A symbolic oversight layer further enhances chemical name accuracy.

  • Author: Banula Perera
  • Supervisor: Mr Viraj Lakshitha Bandara
  • Institution: Informatics Institute of Technology, in collaboration with the University of Westminster
  • Degree: B.Eng. (Hons) Software Engineering (February 2026)
  • Model type: Encoder-Decoder Transformer (T5-based) with Neuro-Symbolic Search
  • Language(s) (NLP): English (IUPAC nomenclature), Chemical Representation (InChI)
  • License: MIT
  • Finetuned from model: t5-small

Uses

Direct Use

The primary use case is translating standard and reconnected InChI strings into human‑readable IUPAC names.

It is intended for:

  • Cheminformatics researchers
  • Computational chemists
  • Chemical database maintainers
  • AI-driven chemistry pipelines

The model is particularly useful for molecules containing:

  • Transition metals
  • Alkali metals
  • Lanthanides
  • Actinides

Out-of-Scope Use

The model is not intended for:

  • Generating molecular 3D structures
  • Predicting chemical properties
  • Reaction prediction
  • Translating formats other than InChI (e.g., SMILES) directly to IUPAC without conversion

Standard Hugging Face inference entry points such as:

pipeline("text2text-generation")
AutoTokenizer.from_pretrained()

will not work directly, because the model relies on a custom character-level tokenizer and structural marker tokens.


Bias, Risks, and Limitations

Although METANO integrates a neuro-symbolic scorer that enforces bracket balancing and basic chemical syntax constraints, it remains a statistical model.

Potential issues include:

  • Hallucinated nomenclature for unseen structures
  • Reduced accuracy for extremely large molecules
  • Errors for polymeric or highly unusual compounds

Training limits:

  • Maximum InChI length: 400 characters
  • Maximum IUPAC length: 150 characters
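A quick pre-flight check against these limits can catch out-of-range inputs before inference. This is a minimal sketch; the function name and constants below are illustrative and not part of the released API:

```python
# Hypothetical pre-flight check mirroring the documented training limits.
MAX_INCHI_LEN = 400  # characters, per the training filter
MIN_INCHI_LEN = 10   # characters, per the training filter

def within_training_limits(inchi: str) -> bool:
    """Return True if the InChI string falls inside the length range
    the model was trained on (10-400 characters)."""
    return MIN_INCHI_LEN <= len(inchi) <= MAX_INCHI_LEN
```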

Recommendations

Users should employ the provided predict_neurosymbolic inference method, which applies:

  • BalancedBracketsLogitsProcessor
  • symbolic chemical scoring

to filter syntactically invalid outputs.
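The intuition behind the bracket-balancing constraint can be illustrated with a plain-Python check. This is a sketch of the concept only; the actual BalancedBracketsLogitsProcessor enforces this property incrementally on decoder logits during generation:

```python
# Closing bracket -> expected opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(name: str) -> bool:
    """Check that every bracket in a candidate IUPAC name opens and
    closes in the right order -- the property the logits processor
    enforces step by step during decoding."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack
```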

For critical applications such as patent filing or regulatory documentation, outputs should be verified by domain experts.


How to Get Started with the Model

⚠️ Important: Download the metano_inference.py script included in this repository.

Example usage:

from metano_inference import load_model_from_hf, predict_neurosymbolic, SymbolicScorer, ModelConfig

repo_id = "banulaperera/metano"
model = load_model_from_hf(repo_id)

config = ModelConfig()
scorer = SymbolicScorer(metals=config.metal_elements)

test_inchi = "InChI=1/C15H16N2O3S.Na/c1-10-3-4-12(9-11(10)2)15(18)17-21(19,20)14-7-5-13(16)6-8-14;/h3-9H,16H2,1-2H3,(H,17,18);/q;+1"

out = predict_neurosymbolic(
    model=model,
    inchi=test_inchi,
    scorer=scorer,
    num_candidates=5,
    repair_num_candidates=5,
    max_repair_rounds=1
)

print("=== TEST RESULTS ===")
print(f"Predicted IUPAC: {out['predicted_iupac']}")
print(f"Hard Fail Triggered: {out['hard_fail']}")
print(f"Combined Score: {out['combined_score']:.3f}")
print(f"Symbolic Score: {out['symbolic_score']:.3f}")
print(f"Neural Score: {out['neural_score']:.3f}")

if out['reasons']:
    print(f"Penalty Reasons: {out['reasons']}")
    
print("\nTop Candidates:")
for cand in out["candidates"][1:]:
    print(f"  [{cand['combined']:.3f}] {cand['text']}")

Training Details

Training Data

The model was trained on a large-scale dataset of InChI–IUPAC pairs covering diverse chemical classes.

Training subsets include:

  • ~294K inorganic combinations
  • ~123K organometallic compounds
  • ~82K coordination complexes

Both standard and reconnected (/r) InChI strings were included.


Training Procedure

Preprocessing

Preprocessing steps included:

  • Filtering InChI strings to 10–400 characters
  • Filtering IUPAC names to 2–150 characters
  • Unicode NFKC normalization
  • Whitespace trimming
  • PIN‑safe normalization rules for η (eta) and κ (kappa) notation

InChI strings were prepended with control tokens:

<ORGANIC>
<ORGANOMETALLIC>
<INORGANIC>
<COORDINATION>

Structural metal markers such as:

<METAL_FE>
<METAL_CU>

were added using a custom CharacterLevelChemicalTokenizer.
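A simplified illustration of character-level tokenisation with prepended control tokens follows. This is a sketch only; the real CharacterLevelChemicalTokenizer, its vocabulary, and its token IDs differ:

```python
def char_tokenize(inchi: str, category: str, metals=()):
    """Prepend a category control token and any metal markers,
    then split the InChI string into single characters."""
    tokens = [f"<{category}>"]
    tokens += [f"<METAL_{m.upper()}>" for m in metals]
    tokens += list(inchi)
    return tokens

# e.g. char_tokenize("InChI=1S/Fe", "ORGANOMETALLIC", metals=["Fe"])
# starts with ["<ORGANOMETALLIC>", "<METAL_FE>", "I", "n", "C", ...]
```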


Training Hyperparameters

  • Training regime: fp16 mixed precision (AMP)
  • Optimizer: AdamW
  • Learning Rate: 3e‑4 with 10% linear warmup and linear decay
  • Weight Decay: 0.01
  • Batch Size: 128 (effective via gradient accumulation = 2)
  • Max Input Length: 410 tokens
  • Max Output Length: 160 tokens
  • Gradient Clipping: 1.0
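The learning-rate schedule (linear warmup over the first 10% of steps, then linear decay to zero) can be sketched as follows. This is illustrative only; the training code would typically use a standard scheduler rather than this hand-rolled function:

```python
def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4) -> float:
    """Linear warmup over the first 10% of steps, then linear decay
    to zero over the remaining 90%."""
    warmup = max(1, int(0.1 * total_steps))
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```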

Evaluation

Results

Evaluation was conducted on a held‑out test split containing a balanced distribution of:

  • Inorganic compounds: METANO achieves a Top-1 accuracy of 0.378, outperforming previously reported results of 0.14.
  • Organometallic compounds: METANO achieves a Top-1 accuracy of 0.364, outperforming previously reported results of 0.20.
  • Coordination compounds: METANO achieves a Top-1 accuracy of 0.394.
  • Top-K decoding: Additional gains are observed with Top-K decoding, reaching Top-5 accuracies of 0.481 (inorganic), 0.488 (organometallic), and 0.521 (coordination).
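Top-K accuracy here counts an example as correct when the reference name appears anywhere among the model's K highest-ranked candidates. A minimal sketch of the metric (exact string match assumed):

```python
def top_k_accuracy(candidates, references, k=5):
    """Fraction of examples whose reference name appears among the
    top-k candidate names (exact string match)."""
    hits = sum(ref in cands[:k] for cands, ref in zip(candidates, references))
    return hits / len(references)
```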

Visualizations & Training Metrics

Figures in the repository include:

  • Training history & loss curves
  • Top-K decoding performance comparison
  • Category performance

Technical Specifications

Model Architecture

METANO is based on the T5‑Small encoder‑decoder transformer.

Key modifications:

  • Replacement of SentencePiece tokenizer with CharacterLevelChemicalTokenizer
  • Vocabulary size resized to 187 tokens
  • Embedding layer reinitialized
  • Transformer layers retained from pretrained T5 weights

Training objective:

Cross‑entropy sequence‑to‑sequence generation, with neuro‑symbolic scoring applied during inference.
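At inference time, candidates are ranked by combining the neural sequence score with the symbolic chemistry score. A sketch of one such combination (the actual weighting used inside predict_neurosymbolic is an internal detail and may differ):

```python
def combined_score(neural: float, symbolic: float, alpha: float = 0.5) -> float:
    """Blend a normalised neural likelihood score with a symbolic
    validity score; candidates are ranked by the blended value.
    alpha is a hypothetical mixing weight, not the released default."""
    return alpha * neural + (1 - alpha) * symbolic
```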


Compute Infrastructure

Software

  • Python
  • PyTorch
  • Hugging Face Transformers
  • RDKit

Model Size

  • 44.2M parameters (F32, Safetensors format)