CremaX/PoCo

Safetensors

roformer

CremaX committed 11 days ago

Commit

d284d5d


1 Parent(s): 7e847d8

Create README.md


README.md +141 -0


	@@ -0,0 +1,141 @@

+# PoCo
+PoCo is a feature extractor for polymer structures.
+It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.
+## Prerequisites
+Install `sentence-transformers` (recommended), or `transformers` if you want to
+load the model directly with `AutoModel`. The command below installs both, plus PyTorch:
+```bash
+pip install -U sentence-transformers transformers torch
+```
+## Usage
+### Sentence Transformers (Recommended)
+The easiest way to use PoCo is through `SentenceTransformer`. This interface
+handles tokenization, padding, batching, pooling, device placement, and
+conversion to NumPy arrays.
+```python
+from sentence_transformers import SentenceTransformer
+model_id = "CremaX/PoCo"
+model = SentenceTransformer(model_id)
+polymer_smiles = [
+    "[*]CC[*]",
+    "[*]CC(c1ccccc1)[*]",
+]
+embeddings = model.encode(
+    polymer_smiles,
+    batch_size=64,
+    convert_to_numpy=True,
+    show_progress_bar=True,
+)
+print(embeddings.shape)
+# (2, 512)
+```
+For a single polymer SMILES string:
+```python
+embedding = model.encode("[*]CC[*]", convert_to_numpy=True)
+print(embedding.shape)
+# (512,)
+```
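+By default, embeddings are returned as raw (unnormalized) feature vectors. If you
+plan to compare them by cosine similarity, you can L2-normalize them so that a
+plain dot product between two embeddings equals their cosine similarity: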
+```python
+embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
+```
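As a rough illustration of why normalization helps here, the sketch below uses small made-up vectors (not real PoCo embeddings) to show that, after L2 normalization, a plain matrix product gives pairwise cosine similarities. The helper names `l2_normalize` and `cosine_similarity_matrix` are hypothetical and are not part of `sentence-transformers`.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, eps, None)


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T


# Toy stand-ins for embeddings; real PoCo vectors would have 512 dimensions.
toy = np.array([
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],  # same direction as row 0, twice the length
    [0.0, 0.0, 5.0],  # orthogonal to rows 0 and 1
])

similarities = cosine_similarity_matrix(toy, toy)
print(np.round(similarities, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 despite the different lengths, while row 2 is orthogonal to both. With real embeddings, the array returned by `model.encode(...)` would take the place of `toy`.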
+For downstream machine learning models, raw embeddings are often a good default:
+```python
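+# Requires scikit-learn (pip install scikit-learn).
+from sklearn.ensemble import RandomForestRegressor
+from sentence_transformers import SentenceTransformer
+# Placeholders you supply: train_smiles and test_smiles are lists of polymer
+# SMILES strings; y_train holds the target property values for train_smiles.
+model = SentenceTransformer("CremaX/PoCo")
+# Encode each split into an (n_samples, 512) feature matrix.
+X_train = model.encode(train_smiles, convert_to_numpy=True)
+X_test = model.encode(test_smiles, convert_to_numpy=True)
+regressor = RandomForestRegressor(random_state=0)
+regressor.fit(X_train, y_train)
+predictions = regressor.predict(X_test)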
+```
+### Hugging Face Transformers
+You can also use the model directly with `transformers`. This is useful when
+you need full control over tokenization, tensors, devices, or pooling.
+`AutoModel` returns token-level hidden states with shape
+`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
+per polymer, apply attention-mask-aware mean pooling over the token dimension.
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+model_id = "CremaX/PoCo"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModel.from_pretrained(model_id).to(device)
+model.eval()
+polymer_smiles = [
+    "[*]CC[*]",
+    "[*]CC(c1ccccc1)[*]",
+]
+encoded = tokenizer(
+    polymer_smiles,
+    padding=True,
+    truncation=True,
+    return_tensors="pt",
+)
+encoded = {key: value.to(device) for key, value in encoded.items()}
+with torch.no_grad():
+    outputs = model(**encoded)
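+token_embeddings = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
+# Expand the mask to (batch_size, sequence_length, 1) so it broadcasts over hidden_size.
+attention_mask = encoded["attention_mask"].unsqueeze(-1).float()
+# Mean pooling: sum the embeddings of real (non-padding) tokens,
+# then divide by the number of real tokens in each sequence.
+embeddings = (token_embeddings * attention_mask).sum(dim=1)
+embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
+embeddings = embeddings.cpu().numpy()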
+print(embeddings.shape)
+# (2, 512)
+```
+On its own, the `transformers` model returns token-level features, not one vector per polymer.
+For polymer-level embeddings, prefer the `SentenceTransformer` example above or
+apply the mean pooling step shown in this section.
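To make the role of the attention mask concrete, here is a minimal NumPy sketch of mask-aware mean pooling on made-up numbers. The arrays are toy values chosen for easy arithmetic, not model outputs, and the `[9.0, 9.0]` vector simply stands in for whatever hidden state a padding position might carry.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The second sequence has only 2 real tokens; its last position is padding.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # [9, 9] stands in for a padding position
])
attention_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

# Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension.
mask = attention_mask[:, :, None].astype(token_embeddings.dtype)

# Sum only the real tokens, then divide by the number of real tokens per sequence.
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
masked_mean = summed / counts

# A naive mean over all positions lets the padding vector leak into the result.
naive_mean = token_embeddings.mean(axis=1)

print(masked_mean)
print(naive_mean)
```

The two results agree for the first sequence, which has no padding, and differ for the second. That difference is why a plain `.mean(dim=1)` over `last_hidden_state` is not a safe substitute when sequences in a batch have different lengths.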
+## Input Notes
+- Polymer SMILES must use `[*]` to mark repeat-unit endpoints, not bare `*`.
+- The model does not validate whether a string is a chemically valid SMILES
+  string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
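If your data marks endpoints with a bare `*`, a small preprocessing step can rewrite them before encoding. The function below is a hypothetical helper written for illustration. It is not part of PoCo or `psmiles`, it performs no chemical validation, and it only assumes that any `*` already inside square brackets should be left alone.

```python
def bracket_wildcards(smiles: str) -> str:
    """Rewrite bare `*` as `[*]`, leaving `*` inside existing brackets untouched.

    Hypothetical helper for illustration; it does no chemical validation.
    """
    pieces = []
    inside_brackets = False
    for char in smiles:
        if char == "[":
            inside_brackets = True
        elif char == "]":
            inside_brackets = False
        if char == "*" and not inside_brackets:
            pieces.append("[*]")
        else:
            pieces.append(char)
    return "".join(pieces)


# Strings that already use `[*]` pass through unchanged.
for raw in ["*CC*", "[*]CC[*]", "*CC(c1ccccc1)*"]:
    print(raw, "->", bracket_wildcards(raw))
```

This only fixes the endpoint notation. Canonicalization is a separate step and is still best left to a dedicated library.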
+## Citation
+If you use PoCo, please cite:
+Wang, L.; Long, D. *Contrastive representation learning for polymer
+informatics*. ChemRxiv, 2026. https://doi.org/10.26434/chemrxiv.15003645/v1