## References to read

[1] Prein, T., Pan, E., Doerr, T., Olivetti, E., and Rupp, J. L. M. 2024. "MTEncoder: A Transformer-Based Framework for Materials Representation Learning." *Materials Today*. https://openreview.net/pdf?id=wug7i3O7y1
[2] Schmidt, J., Wang, H.-C., Cerqueira, T. F. T., Botti, S., Romero, A. H., and Marques, M. A. L. 2024. "Improving Machine-Learning Models in Materials Science through Large Datasets." Journal to be determined. https://www.sciencedirect.com/science/article/pii/S2542529324002360

# MTEncoder (SyntMTE)

## Overview

MTEncoder is a transformer-based model that encodes a material's elemental composition into a dense vector representation. Each material is tokenized into:

- Individual element tokens (e.g., Na, Fe, O)
- A special `Compound` token (`[CPD]`) that aggregates elemental information

These tokens are fed into a transformer encoder, which produces context-rich embeddings. The embedding of the `[CPD]` token serves as the learned representation of the material and is passed through an MLP head to predict various properties [1]. A minimal illustrative sketch of this pattern is given after the pretraining table below.

## Pretraining Tasks

MTEncoder is pretrained on the Alexandria dataset [2] across 12 tasks:

| Pretraining Objective |
|---|
| Stress |
| Band Gap (Direct) |
| Band Gap (Indirect) |
| Density of States at Fermi Level |
| Energy Above Hull |
| Formation Energy |
| Corrected Total Energy |
| Phase Separation Energy |
| Number of Atomic Sites |
| Total Magnetic Moment |
| Crystal Space Group |
| Masked Element Reconstruction (Self-Supervised) |

*Table: Pretraining objectives for MTEncoder, drawn from the Alexandria materials dataset.*
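To make the Overview concrete, here is a minimal PyTorch sketch of the composition-encoding pattern described above: element tokens plus a `[CPD]` token pass through a transformer encoder, and the contextualized `[CPD]` embedding feeds an MLP head. All names, layer sizes, and the way stoichiometric fractions enter the model are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: a special [CPD] token plus one id per chemical element.
CPD_ID = 0
ELEMENT_IDS = {"H": 1, "O": 2, "Na": 3, "Fe": 4}  # illustrative subset

class MTEncoderSketch(nn.Module):
    """Sketch of a composition encoder in the spirit of MTEncoder [1].

    Element tokens and a [CPD] token are embedded, passed through a
    transformer encoder, and the contextualized [CPD] embedding is fed
    to an MLP head. Sizes are illustrative, not the paper's values.
    """

    def __init__(self, vocab_size: int = 128, d_model: int = 256,
                 n_heads: int = 8, n_layers: int = 4, n_targets: int = 1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Assumption: stoichiometric fractions (e.g., Na at 2/3) enter as a
        # learned projection added to the token embedding.
        self.frac_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, n_targets))

    def forward(self, token_ids: torch.Tensor, fractions: torch.Tensor):
        # token_ids: (batch, seq) with token_ids[:, 0] == CPD_ID
        # fractions: (batch, seq, 1); the [CPD] position gets 0.0
        x = self.token_emb(token_ids) + self.frac_proj(fractions)
        h = self.encoder(x)
        cpd = h[:, 0]  # contextualized [CPD] embedding = material representation
        return self.head(cpd), cpd, h

# Usage: encode Na2O as {Na: 2/3, O: 1/3}, prefixed with [CPD].
model = MTEncoderSketch()
tokens = torch.tensor([[CPD_ID, ELEMENT_IDS["Na"], ELEMENT_IDS["O"]]])
fracs = torch.tensor([[[0.0], [2 / 3], [1 / 3]]])
pred, material_vec, token_states = model(tokens, fracs)
print(pred.shape, material_vec.shape)  # torch.Size([1, 1]) torch.Size([1, 256])
```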
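The 12 pretraining objectives mix scalar regression (e.g., formation energy, band gaps), classification (crystal space group, with its 230 possible classes), and self-supervised masked element reconstruction. Below is a hedged sketch of how such a multi-task loss could be assembled over the shared `[CPD]` embedding and per-token states from the encoder sketched above; equal task weighting and the head shapes are simplifying assumptions, not the published configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Sketch of multi-task pretraining heads over a shared encoder output."""

    def __init__(self, d_model: int = 256, vocab_size: int = 128,
                 regression_tasks=("formation_energy", "band_gap_direct",
                                   "energy_above_hull")):
        super().__init__()
        # One scalar regression head per continuous target (subset shown).
        self.reg_heads = nn.ModuleDict(
            {name: nn.Linear(d_model, 1) for name in regression_tasks})
        # Crystal space group classification: 230 space groups.
        self.spacegroup_head = nn.Linear(d_model, 230)
        # Masked element reconstruction: per-token logits over the vocabulary.
        self.mask_head = nn.Linear(d_model, vocab_size)

    def forward(self, cpd, token_states):
        # cpd: (B, d_model); token_states: (B, S, d_model) from the encoder
        reg = {k: head(cpd).squeeze(-1) for k, head in self.reg_heads.items()}
        return reg, self.spacegroup_head(cpd), self.mask_head(token_states)

def multitask_loss(reg_preds, reg_targets, sg_logits, sg_target,
                   mask_logits, masked_ids, mask_positions):
    # Equal weighting across tasks is an illustrative choice.
    loss = sum(F.mse_loss(reg_preds[k], reg_targets[k]) for k in reg_preds)
    loss = loss + F.cross_entropy(sg_logits, sg_target)
    # Reconstruction loss is computed only at the masked positions
    # (mask_positions: boolean (B, S); masked_ids: true ids at those spots).
    loss = loss + F.cross_entropy(mask_logits[mask_positions], masked_ids)
    return loss
```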