METANO
Metal-Aware InChI to IUPAC Transformer with Neuro-Symbolic Oversight
Model Details
Model Description
Converting chemical identifiers such as InChI to IUPAC names is crucial in cheminformatics. While recent models excel with organic compounds, they struggle with inorganic and organometallic compounds because Standard InChI representations break metal-ligand bonds, leading to the loss of structural details.
METANO is a hybrid framework that combines the pattern-recognition strengths of transformer-based sequence-to-sequence models with chemistry-aware symbolic checks. It uses Reconnected InChI for metal-ligand connections and character-level tokenisation with custom control tokens. A symbolic oversight layer further enhances chemical name accuracy.
- Author: Banula Perera
- Supervisor: Mr Viraj Lakshitha Bandara
- Institution: Informatics Institute of Technology, in collaboration with the University of Westminster
- Degree: B.Eng. (Hons) Software Engineering (February 2026)
- Model type: Encoder-Decoder Transformer (T5-based) with Neuro-Symbolic Search
- Language(s) (NLP): English (IUPAC nomenclature), Chemical Representation (InChI)
- License: MIT
- Finetuned from model: t5-small
Model Sources
- Repository: https://huggingface.co/banulaperera/metano
Uses
Direct Use
The primary use case is translating standard and reconnected InChI strings into human‑readable IUPAC names.
It is intended for:
- Cheminformatics researchers
- Computational chemists
- Chemical database maintainers
- AI-driven chemistry pipelines
The model is particularly useful for molecules containing:
- Transition metals
- Alkali metals
- Lanthanides
- Actinides
Out-of-Scope Use
The model is not intended for:
- Generating molecular 3D structures
- Predicting chemical properties
- Reaction prediction
- Translating formats other than InChI (e.g., SMILES) directly to IUPAC without conversion
Standard Hugging Face inference pipelines such as pipeline("text2text-generation") and AutoTokenizer.from_pretrained() will not work directly, because the model relies on a custom character-level tokenizer and structural markers.
Bias, Risks, and Limitations
Although METANO integrates a neuro-symbolic scorer that enforces bracket balancing and basic chemical syntax constraints, it remains a statistical model.
Potential issues include:
- Hallucinated nomenclature for unseen structures
- Reduced accuracy for extremely large molecules
- Errors for polymeric or highly unusual compounds
Training limits:
- Maximum InChI length: 400 characters
- Maximum IUPAC length: 150 characters
Recommendations
Users should employ the provided predict_neurosymbolic inference method, which applies the BalancedBracketsLogitsProcessor and symbolic chemical scoring to filter syntactically invalid outputs.
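As an illustration of the kind of constraint the symbolic layer enforces, a minimal bracket-balance check might look like the following. This is a sketch of the general technique only; the repository's BalancedBracketsLogitsProcessor operates on decoder logits during generation, and the names below are illustrative:

```python
# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(name: str) -> bool:
    """Return True if every bracket in an IUPAC name is properly nested."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # no unclosed openers may remain

brackets_balanced("bis[2-(dimethylamino)ethyl]amine")  # balanced -> True
brackets_balanced("bis[2-(dimethylamino)ethyl)amine")  # mismatched -> False
```

A candidate failing this check can be discarded or penalized before ranking.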
For critical applications such as patent filing or regulatory documentation, outputs should be verified by domain experts.
How to Get Started with the Model
⚠️ Important: Download the metano_inference.py script included in
this repository.
Example usage:
from metano_inference import load_model_from_hf, predict_neurosymbolic, SymbolicScorer, ModelConfig
repo_id = "banulaperera/metano"
model = load_model_from_hf(repo_id)
config = ModelConfig()
scorer = SymbolicScorer(metals=config.metal_elements)
test_inchi = "InChI=1/C15H16N2O3S.Na/c1-10-3-4-12(9-11(10)2)15(18)17-21(19,20)14-7-5-13(16)6-8-14;/h3-9H,16H2,1-2H3,(H,17,18);/q;+1"
out = predict_neurosymbolic(
    model=model,
    inchi=test_inchi,
    scorer=scorer,
    num_candidates=5,
    repair_num_candidates=5,
    max_repair_rounds=1,
)
print("=== TEST RESULTS ===")
print(f"Predicted IUPAC: {out['predicted_iupac']}")
print(f"Hard Fail Triggered: {out['hard_fail']}")
print(f"Combined Score: {out['combined_score']:.3f}")
print(f"Symbolic Score: {out['symbolic_score']:.3f}")
print(f"Neural Score: {out['neural_score']:.3f}")
if out['reasons']:
    print(f"Penalty Reasons: {out['reasons']}")
print("\nTop Candidates:")
for cand in out["candidates"][1:]:
    print(f"  [{cand['combined']:.3f}] {cand['text']}")
Training Details
Training Data
The model was trained on a large-scale dataset of InChI–IUPAC pairs covering diverse chemical classes.
Training subsets include:
- ~294K inorganic combinations
- ~123K organometallic compounds
- ~82K coordination complexes
Both standard and reconnected (/r) InChI strings were included.
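For intuition, an InChI string is a sequence of slash-separated layers, and the reconnected form adds an extra /r layer in which metal-ligand bonds are restored. A rough sketch of splitting the layers (illustrative only; real parsing should use an InChI library such as RDKit, and this naive version does not expand the /r layer's own sub-layers):

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI into its slash-separated layers, keyed by the
    one-letter layer prefix (c = connections, h = hydrogens, q = charge,
    r = reconnected). Naive sketch: keeps only the last occurrence of
    each prefix and does not recurse into the /r layer."""
    body = inchi.split("=", 1)[1]          # drop the "InChI" tag
    version, *layers = body.split("/")
    out = {"version": version, "formula": layers[0] if layers else ""}
    for layer in layers[1:]:
        out[layer[0]] = layer[1:]
    return out
```

Applied to the example InChI shown later in this card, this yields the formula layer C15H16N2O3S.Na and a charge layer ;+1 for the two components.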
Training Procedure
Preprocessing
Preprocessing steps included:
- Filtering InChI strings to between 10 and 400 characters
- Filtering IUPAC names to between 2 and 150 characters
- Unicode NFKC normalization
- Whitespace trimming
- PIN‑safe normalization rules for η (eta) and κ (kappa) notation
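The length-filtering and normalization steps above can be sketched as follows. This is a minimal illustration; preprocess_pair is a hypothetical helper, not the training pipeline's actual function:

```python
import unicodedata

def preprocess_pair(inchi: str, iupac: str):
    """Normalize a training pair and apply the length limits described
    above; returns None for pairs outside the training limits."""
    inchi = unicodedata.normalize("NFKC", inchi).strip()
    iupac = unicodedata.normalize("NFKC", iupac).strip()
    if not (10 <= len(inchi) <= 400):
        return None
    if not (2 <= len(iupac) <= 150):
        return None
    return inchi, iupac
```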
InChI strings were prepended with control tokens:
<ORGANIC>
<ORGANOMETALLIC>
<INORGANIC>
<COORDINATION>
Structural metal markers such as:
<METAL_FE>
<METAL_CU>
were added using a custom CharacterLevelChemicalTokenizer.
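A rough sketch of how such markers could be derived from the formula layer and prepended to the input (the metal list and the add_markers helper below are illustrative assumptions; the repository's CharacterLevelChemicalTokenizer is the authoritative implementation):

```python
import re

# Assumed subset of metal element symbols for illustration only.
METALS = {"Fe", "Cu", "Na", "Ni", "Pt", "Pd"}

def add_markers(inchi: str, compound_class: str) -> str:
    """Prepend a class control token and one <METAL_*> marker per metal
    found in the InChI formula layer."""
    formula = inchi.split("/")[1]  # formula layer, e.g. "C15H16N2O3S.Na"
    found = sorted({
        sym for sym in re.findall(r"[A-Z][a-z]?", formula) if sym in METALS
    })
    metal_tokens = "".join(f"<METAL_{m.upper()}>" for m in found)
    return f"<{compound_class.upper()}>{metal_tokens}{inchi}"
```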
Training Hyperparameters
- Training regime: fp16 mixed precision (AMP)
- Optimizer: AdamW
- Learning Rate: 3e‑4 with 10% linear warmup and linear decay
- Weight Decay: 0.01
- Batch Size: 128 (effective via gradient accumulation = 2)
- Max Input Length: 410 tokens
- Max Output Length: 160 tokens
- Gradient Clipping: 1.0
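The learning-rate schedule above (10% linear warmup followed by linear decay) can be written as a simple function of the step count. This is a generic sketch of the schedule, not code from the training scripts:

```python
def linear_warmup_decay(step, total_steps, base_lr=3e-4, warmup_frac=0.1):
    """Learning rate at a given step: linear ramp from 0 to base_lr over
    the first 10% of steps, then linear decay back to 0."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```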
Evaluation
Results
Evaluation was conducted on a held‑out test split containing a balanced distribution of:
- Inorganic compounds: METANO achieves a Top-1 accuracy of 0.378, outperforming previously reported results of 0.14.
- Organometallic compounds: METANO achieves a Top-1 accuracy of 0.364, outperforming previously reported results of 0.20.
- Coordination compounds: METANO achieves a Top-1 accuracy of 0.394.
- Top-K decoding: additional gains are observed with Top-K decoding, reaching Top-5 accuracies of 0.481 (inorganic), 0.488 (organometallic), and 0.521 (coordination).
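Top-K accuracy here means the reference name appears among the model's K best candidates. A minimal sketch of the metric, assuming exact string match between candidate and reference:

```python
def top_k_accuracy(predictions, references, k=5):
    """Fraction of examples whose reference name appears among the
    model's top-k candidates (exact string match)."""
    hits = sum(ref in cands[:k] for cands, ref in zip(predictions, references))
    return hits / len(references)
```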
Visualizations & Training Metrics
Training History & Loss Curves
Top-K Decoding Performance Comparison
Technical Specifications
Model Architecture
METANO is based on the T5‑Small encoder‑decoder transformer.
Key modifications:
- Replacement of SentencePiece tokenizer with CharacterLevelChemicalTokenizer
- Vocabulary size resized to 187 tokens
- Embedding layer reinitialized
- Transformer layers retained from pretrained T5 weights
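The embedding swap can be illustrated in plain PyTorch (with Hugging Face Transformers one would typically call model.resize_token_embeddings(187) instead; the standalone helper below is a hypothetical sketch):

```python
import torch.nn as nn

def resize_and_reinit_embedding(old_emb: nn.Embedding, new_vocab: int) -> nn.Embedding:
    """Build a fresh embedding table sized for the new character-level
    vocabulary and reinitialise its weights, leaving the pretrained
    transformer layers elsewhere in the model untouched."""
    new_emb = nn.Embedding(new_vocab, old_emb.embedding_dim)
    nn.init.normal_(new_emb.weight, mean=0.0, std=0.02)
    return new_emb
```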
Training objective:
Cross‑entropy sequence‑to‑sequence generation, with neuro‑symbolic scoring applied during inference.
Compute Infrastructure
Software
- Python
- PyTorch
- Hugging Face Transformers
- RDKit