	# PoCo

	PoCo is a feature extractor for polymer structures.

	It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

	## Resources

	- Paper: [Contrastive representation learning for polymer informatics](https://doi.org/10.26434/chemrxiv.15003645/v1)
	- Code: [GitHub repository](https://github.com/crema-lida/PoCo)

	## Prerequisites

	PoCo can be loaded through either `sentence-transformers` (recommended) or
	`transformers` directly. The command below installs both, along with `torch`:

	```bash
	pip install -U sentence-transformers transformers torch
	```

	## Usage

	### Sentence Transformers (Recommended)

	The easiest way to use PoCo is through `SentenceTransformer`. This interface
	handles tokenization, padding, batching, pooling, device placement, and
	conversion to NumPy arrays.

	```python
	from sentence_transformers import SentenceTransformer

	model_id = "CremaX/PoCo"
	model = SentenceTransformer(model_id)

	polymer_smiles = [
	    "[*]CC[*]",
	    "[*]CC(c1ccccc1)[*]",
	]

	embeddings = model.encode(
	    polymer_smiles,
	    batch_size=64,
	    convert_to_numpy=True,
	    show_progress_bar=True,
	)

	print(embeddings.shape)
	# (2, 512)
	```

	For a single polymer SMILES string:

	```python
	embedding = model.encode("[*]CC[*]", convert_to_numpy=True)

	print(embedding.shape)
	# (512,)
	```

	By default, embeddings are returned as raw feature vectors. If you plan to use
	cosine similarity directly, you may normalize them:

	```python
	embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
	```
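
Once embeddings are L2-normalized, cosine similarity reduces to a plain dot product. The sketch below shows the computation on a small toy array rather than real PoCo output; the helper name `cosine_similarity_matrix` and the 4-dimensional vectors are illustrative assumptions (real embeddings are 512-dimensional). The helper normalizes explicitly, so it also works on raw embeddings.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    # L2-normalize each row; the clip guards against division by zero.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # For unit vectors, the dot product equals the cosine similarity.
    return unit @ unit.T


# Toy stand-in for model output (hypothetical values, not PoCo embeddings).
toy = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],  # same direction as row 0, different magnitude
    [0.0, 3.0, 0.0, 0.0],  # orthogonal to rows 0 and 1
])

sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 despite the different magnitudes, while row 2 is orthogonal to both and scores 0.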

	For downstream machine learning models, raw embeddings are often a good default:

	```python
	from sklearn.ensemble import RandomForestRegressor
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("CremaX/PoCo")

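	# train_smiles and test_smiles are lists of polymer SMILES strings, and
	# y_train holds the target property values; supply these from your own dataset.
	X_train = model.encode(train_smiles, convert_to_numpy=True)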
	X_test = model.encode(test_smiles, convert_to_numpy=True)

	regressor = RandomForestRegressor(random_state=0)
	regressor.fit(X_train, y_train)
	predictions = regressor.predict(X_test)
	```

	### Hugging Face Transformers

	You can also use the model directly with `transformers`. This is useful when
	you need full control over tokenization, tensors, devices, or pooling.

	`AutoModel` returns token-level hidden states with shape
	`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
	per polymer, apply attention-mask-aware mean pooling over the token dimension.

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	model_id = "CremaX/PoCo"
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModel.from_pretrained(model_id).to(device)
	model.eval()

	polymer_smiles = [
	    "[*]CC[*]",
	    "[*]CC(c1ccccc1)[*]",
	]

	encoded = tokenizer(
	    polymer_smiles,
	    padding=True,
	    truncation=True,
	    return_tensors="pt",
	)
	encoded = {key: value.to(device) for key, value in encoded.items()}

	with torch.no_grad():
	    outputs = model(**encoded)

	token_embeddings = outputs.last_hidden_state
	attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

	# mean pooling
	embeddings = (token_embeddings * attention_mask).sum(dim=1)
	embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
	embeddings = embeddings.cpu().numpy()

	print(embeddings.shape)
	# (2, 512)
	```

	The Hugging Face pipeline returns token-level features.
	For polymer-level embeddings, prefer the `SentenceTransformer` example above or
	apply the mean pooling step shown in this section.
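
To see why the attention mask matters, the following toy sketch reproduces the pooling arithmetic in NumPy on hand-written values. The helper name `masked_mean_pool` and the tiny tensor shapes are assumptions for illustration only; it is not taken from the PoCo codebase.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the sequence axis, ignoring padded positions."""
    # Expand the mask from (batch, seq) to (batch, seq, 1) so it broadcasts.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sequence; the clip avoids division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Batch of 2 sequences, 3 token positions, hidden size 2 (toy values).
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(tokens, mask)
naive = tokens.mean(axis=1)  # unmasked mean, contaminated by the padding vector

print(pooled)  # rows: [2, 3] and [2, 2]
print(naive)
```

The first row of `pooled` averages only the two real tokens, whereas the naive mean is pulled far off by the padding vector. The second sequence has no padding, so both approaches agree on it.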

	## Input Notes

	- Polymer SMILES must use `[*]` to mark repeat-unit endpoints, not bare `*`.
	- The model does not validate whether a string is a chemically valid SMILES
	string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.
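
Because the model itself performs no validation, a lightweight pre-check on endpoint markers can catch formatting mistakes before encoding. The sketch below assumes the endpoint marker is the bracketed wildcard `[*]` and flags any `*` that appears outside that form. The helper `check_endpoint_markers` is hypothetical, is not part of PoCo or `psmiles`, and does not check chemical validity or handle other wildcard forms.

```python
def check_endpoint_markers(polymer_smiles: str) -> dict:
    """Count bracketed `[*]` endpoints and bare `*` characters in a string."""
    bracketed = polymer_smiles.count("[*]")
    # Remove the bracketed markers; any `*` left over is a bare wildcard.
    remainder = polymer_smiles.replace("[*]", "")
    bare = remainder.count("*")
    return {
        "bracketed": bracketed,
        "bare": bare,
        # Passes only if endpoints are present and none of them is bare.
        "ok": bracketed > 0 and bare == 0,
    }


good = check_endpoint_markers("[*]CC[*]")
bad = check_endpoint_markers("*CC*")

print(good)
print(bad)
```

A string that fails this check can be corrected by hand or passed through a canonicalization step before it reaches the model.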

	## Citation

	If you use PoCo, please cite:

	```bibtex
	@article{wang2026poco,
	title = {Contrastive representation learning for polymer informatics},
	author = {Wang, Lida and Long, Donghui},
	journal = {ChemRxiv},
	year = {2026},
	doi = {10.26434/chemrxiv.15003645/v1}
	}
	```