METANO
Metal-Aware InChI to IUPAC Transformer with Neuro-Symbolic Oversight
Model Details
Model Description
Converting chemical identifiers such as InChI to IUPAC names is crucial in cheminformatics. While recent models excel with organic compounds, they struggle with inorganic and organometallic compounds because Standard InChI representations break metal-ligand bonds, leading to the loss of structural details.
METANO is a hybrid framework that combines the pattern-recognition strengths of transformer-based sequence-to-sequence models with chemistry-aware symbolic checks. It uses Reconnected InChI for metal-ligand connections and character-level tokenisation with custom control tokens. A symbolic oversight layer further enhances chemical name accuracy.
- Author: Banula Perera
- Supervisor: Mr Viraj Lakshitha Bandara
- Institution: Informatics Institute of Technology, in collaboration with the University of Westminster
- Degree: B.Eng. (Hons) Software Engineering (February 2026)
- Model type: Encoder-Decoder Transformer (T5-based) with Neuro-Symbolic Search
- Language(s) (NLP): English (IUPAC nomenclature), Chemical Representation (InChI)
- License: MIT
- Finetuned from model: t5-small
Model Sources
- Repository: https://huggingface.co/banulaperera/metano
Uses
Direct Use
The primary use case is translating standard and reconnected InChI strings into human‑readable IUPAC names.
It is intended for:
- Cheminformatics researchers
- Computational chemists
- Chemical database maintainers
- AI-driven chemistry pipelines
The model is particularly useful for molecules containing:
- Transition metals
- Alkali metals
- Lanthanides
- Actinides
Out-of-Scope Use
The model is not intended for:
- Generating molecular 3D structures
- Predicting chemical properties
- Reaction prediction
- Translating formats other than InChI (e.g., SMILES) directly to IUPAC without conversion
Standard Hugging Face inference pipelines such as pipeline("text2text-generation") and AutoTokenizer.from_pretrained() will not work directly, because the model relies on a custom character-level tokenizer and structural markers.
Bias, Risks, and Limitations
Although METANO integrates a neuro-symbolic scorer that enforces bracket balancing and basic chemical syntax constraints, it remains a statistical model.
Potential issues include:
- Hallucinated nomenclature for unseen structures
- Reduced accuracy for extremely large molecules
- Errors for polymeric or highly unusual compounds
Training limits:
- Maximum InChI length: 400 characters
- Maximum IUPAC length: 150 characters
Recommendations
Users should employ the provided predict_neurosymbolic inference method, which applies the BalancedBracketsLogitsProcessor and symbolic chemical scoring to filter syntactically invalid outputs.
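As an illustration of the kind of constraint the symbolic layer enforces, a minimal bracket-balance check might look like the following. This is a sketch of the general technique only; the repository's BalancedBracketsLogitsProcessor operates on decoder logits during generation, and the names below are illustrative:

```python
# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(name: str) -> bool:
    """Return True if every bracket in an IUPAC name is properly nested."""
    stack = []
    for ch in name:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # no unclosed openers may remain

brackets_balanced("bis[2-(dimethylamino)ethyl]amine")  # balanced -> True
brackets_balanced("bis[2-(dimethylamino)ethyl)amine")  # mismatched -> False
```

A candidate failing this check can be discarded or penalized before ranking.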
For critical applications such as patent filing or regulatory documentation, outputs should be verified by domain experts.
How to Get Started with the Model
⚠️ Important: Download the metano_inference.py script included in
this repository.
Example usage:
from metano_inference import load_model_from_hf, predict_neurosymbolic, SymbolicScorer, ModelConfig
repo_id = "banulaperera/metano"
model = load_model_from_hf(repo_id)
config = ModelConfig()
scorer = SymbolicScorer(metals=config.metal_elements)
test_inchi = "InChI=1/C15H16N2O3S.Na/c1-10-3-4-12(9-11(10)2)15(18)17-21(19,20)14-7-5-13(16)6-8-14;/h3-9H,16H2,1-2H3,(H,17,18);/q;+1"
out = predict_neurosymbolic(
    model=model,
    inchi=test_inchi,
    scorer=scorer,
    num_candidates=5,
    repair_num_candidates=5,
    max_repair_rounds=1,
)
print("=== TEST RESULTS ===")
print(f"Predicted IUPAC: {out['predicted_iupac']}")
print(f"Hard Fail Triggered: {out['hard_fail']}")
print(f"Combined Score: {out['combined_score']:.3f}")
print(f"Symbolic Score: {out['symbolic_score']:.3f}")
print(f"Neural Score: {out['neural_score']:.3f}")
if out['reasons']:
    print(f"Penalty Reasons: {out['reasons']}")
print("\nTop Candidates:")
for cand in out["candidates"][1:]:
    print(f"  [{cand['combined']:.3f}] {cand['text']}")
Training Details
Training Data
The model was trained on a large-scale dataset of InChI–IUPAC pairs covering diverse chemical classes.
Training subsets include:
- ~294K inorganic combinations
- ~123K organometallic compounds
- ~82K coordination complexes
Both standard and reconnected (/r) InChI strings were included.
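For intuition, an InChI string is a sequence of slash-separated layers, and the reconnected form adds an extra /r layer in which metal-ligand bonds are restored. A rough sketch of splitting the layers (illustrative only; real parsing should use an InChI library such as RDKit, and this naive version does not expand the /r layer's own sub-layers):

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI into its slash-separated layers, keyed by the
    one-letter layer prefix (c = connections, h = hydrogens, q = charge,
    r = reconnected). Naive sketch: keeps only the last occurrence of
    each prefix and does not recurse into the /r layer."""
    body = inchi.split("=", 1)[1]          # drop the "InChI" tag
    version, *layers = body.split("/")
    out = {"version": version, "formula": layers[0] if layers else ""}
    for layer in layers[1:]:
        out[layer[0]] = layer[1:]
    return out
```

Applied to the example InChI shown later in this card, this yields the formula layer C15H16N2O3S.Na and a charge layer ;+1 for the two components.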
Training Procedure
Preprocessing
Preprocessing steps included:
- Filtering InChI strings to between 10 and 400 characters
- Filtering IUPAC names to between 2 and 150 characters
- Unicode NFKC normalization
- Whitespace trimming
- PIN‑safe normalization rules for η (eta) and κ (kappa) notation
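The length-filtering and normalization steps above can be sketched as follows. This is a minimal illustration; preprocess_pair is a hypothetical helper, not the training pipeline's actual function:

```python
import unicodedata

def preprocess_pair(inchi: str, iupac: str):
    """Normalize a training pair and apply the length limits described
    above; returns None for pairs outside the training limits."""
    inchi = unicodedata.normalize("NFKC", inchi).strip()
    iupac = unicodedata.normalize("NFKC", iupac).strip()
    if not (10 <= len(inchi) <= 400):
        return None
    if not (2 <= len(iupac) <= 150):
        return None
    return inchi, iupac
```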
InChI strings were prepended with control tokens:
<ORGANIC>
<ORGANOMETALLIC>
<INORGANIC>
<COORDINATION>
Structural metal markers such as:
<METAL_FE>
<METAL_CU>
were added using a custom CharacterLevelChemicalTokenizer.
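A rough sketch of how such markers could be derived from the formula layer and prepended to the input (the metal list and the add_markers helper below are illustrative assumptions; the repository's CharacterLevelChemicalTokenizer is the authoritative implementation):

```python
import re

# Assumed subset of metal element symbols for illustration only.
METALS = {"Fe", "Cu", "Na", "Ni", "Pt", "Pd"}

def add_markers(inchi: str, compound_class: str) -> str:
    """Prepend a class control token and one <METAL_*> marker per metal
    found in the InChI formula layer."""
    formula = inchi.split("/")[1]  # formula layer, e.g. "C15H16N2O3S.Na"
    found = sorted({
        sym for sym in re.findall(r"[A-Z][a-z]?", formula) if sym in METALS
    })
    metal_tokens = "".join(f"<METAL_{m.upper()}>" for m in found)
    return f"<{compound_class.upper()}>{metal_tokens}{inchi}"
```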
Training Hyperparameters
- Training regime: fp16 mixed precision (AMP)
- Optimizer: AdamW
- Learning Rate: 3e‑4 with 10% linear warmup and linear decay
- Weight Decay: 0.01
- Batch Size: 128 (effective via gradient accumulation = 2)
- Max Input Length: 410 tokens
- Max Output Length: 160 tokens
- Gradient Clipping: 1.0
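The learning-rate schedule above (10% linear warmup followed by linear decay) can be written as a simple function of the step count. This is a generic sketch of the schedule, not code from the training scripts:

```python
def linear_warmup_decay(step, total_steps, base_lr=3e-4, warmup_frac=0.1):
    """Learning rate at a given step: linear ramp from 0 to base_lr over
    the first 10% of steps, then linear decay back to 0."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```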
Evaluation
Results
Evaluation was conducted on a held‑out test split containing a balanced distribution of:
- Inorganic compounds: METANO achieves a Top-1 accuracy of 0.378, outperforming previously reported results of 0.14.
- Organometallic compounds: METANO achieves a Top-1 accuracy of 0.364, outperforming previously reported results of 0.20.
- Coordination compounds: METANO achieves a Top-1 accuracy of 0.394.
- Top-K decoding: additional gains are observed with Top-K decoding, reaching Top-5 accuracies of 0.481 (inorganic), 0.488 (organometallic), and 0.521 (coordination).
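Top-K accuracy here means the reference name appears among the model's K best candidates. A minimal sketch of the metric, assuming exact string match between candidate and reference:

```python
def top_k_accuracy(predictions, references, k=5):
    """Fraction of examples whose reference name appears among the
    model's top-k candidates (exact string match)."""
    hits = sum(ref in cands[:k] for cands, ref in zip(predictions, references))
    return hits / len(references)
```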
Visualizations & Training Metrics
Training History & Loss Curves
Top-K Decoding Performance Comparison
Technical Specifications
Model Architecture
METANO is based on the T5‑Small encoder‑decoder transformer.
Key modifications:
- Replacement of SentencePiece tokenizer with CharacterLevelChemicalTokenizer
- Vocabulary size resized to 187 tokens
- Embedding layer reinitialized
- Transformer layers retained from pretrained T5 weights
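The embedding swap can be illustrated in plain PyTorch (with Hugging Face Transformers one would typically call model.resize_token_embeddings(187) instead; the standalone helper below is a hypothetical sketch):

```python
import torch.nn as nn

def resize_and_reinit_embedding(old_emb: nn.Embedding, new_vocab: int) -> nn.Embedding:
    """Build a fresh embedding table sized for the new character-level
    vocabulary and reinitialise its weights, leaving the pretrained
    transformer layers elsewhere in the model untouched."""
    new_emb = nn.Embedding(new_vocab, old_emb.embedding_dim)
    nn.init.normal_(new_emb.weight, mean=0.0, std=0.02)
    return new_emb
```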
Training objective:
Cross‑entropy sequence‑to‑sequence generation, with neuro‑symbolic scoring applied during inference.
Compute Infrastructure
Software
- Python
- PyTorch
- Hugging Face Transformers
- RDKit