CremaX committed · Commit d284d5d · verified · 1 Parent(s): 7e847d8

Create README.md

Files changed (1)
  1. README.md +141 -0
README.md ADDED
@@ -0,0 +1,141 @@
 
# PoCo

PoCo is a feature extractor for polymer structures.

It takes polymer SMILES strings as input and returns 512-dimensional vectors, which can be used as polymer representations for downstream tasks such as property prediction.

## Prerequisites

Install `sentence-transformers` (recommended), or `transformers` if you want
to work with the Hugging Face pipeline. The command below installs both,
together with PyTorch:

```bash
pip install -U sentence-transformers transformers torch
```

## Usage

### Sentence Transformers (Recommended)

The easiest way to use PoCo is through `SentenceTransformer`. This interface
handles tokenization, padding, batching, pooling, device placement, and
conversion to NumPy arrays.

```python
from sentence_transformers import SentenceTransformer

model_id = "CremaX/PoCo"
model = SentenceTransformer(model_id)

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

embeddings = model.encode(
    polymer_smiles,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
)

print(embeddings.shape)
# (2, 512)
```

For a single polymer SMILES string:

```python
embedding = model.encode("[*]CC[*]", convert_to_numpy=True)

print(embedding.shape)
# (512,)
```

By default, embeddings are returned as raw feature vectors. If you plan to
compare embeddings with cosine similarity directly, pass
`normalize_embeddings=True` so that each vector is scaled to unit length:

```python
embeddings = model.encode(polymer_smiles, normalize_embeddings=True)
```

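As a rough illustration of why normalization matters for cosine similarity, the sketch below uses small placeholder vectors instead of real PoCo embeddings, so it runs without downloading the model. The helpers `l2_normalize` and `dot` are hypothetical names introduced here, not part of any library. Once vectors have unit length, a plain dot product equals their cosine similarity, whereas a raw dot product also depends on vector length.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (hypothetical helper, not a library API)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Placeholder vectors standing in for embeddings; not real PoCo output.
a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
c = [4.0, -3.0]  # orthogonal to a

# The raw dot product grows with vector length.
raw_ab = dot(a, b)

# Dot products of unit vectors are cosine similarities in [-1, 1].
cos_ab = dot(l2_normalize(a), l2_normalize(b))
cos_ac = dot(l2_normalize(a), l2_normalize(c))

print(raw_ab, round(cos_ab, 6), round(cos_ac, 6))
```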
For downstream machine learning models, raw embeddings are often a good
default. The example below additionally requires `scikit-learn`:

```python
from sklearn.ensemble import RandomForestRegressor
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CremaX/PoCo")

# train_smiles / test_smiles: your own lists of polymer SMILES strings.
# y_train: the target property values for train_smiles.
X_train = model.encode(train_smiles, convert_to_numpy=True)
X_test = model.encode(test_smiles, convert_to_numpy=True)

regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
```

### Hugging Face Transformers

You can also use the model directly with `transformers`. This is useful when
you need full control over tokenization, tensors, devices, or pooling.

`AutoModel` returns token-level hidden states with shape
`(batch_size, sequence_length, hidden_size)`. To get one 512-dimensional vector
per polymer, apply attention-mask-aware mean pooling over the token dimension.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CremaX/PoCo"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

polymer_smiles = [
    "[*]CC[*]",
    "[*]CC(c1ccccc1)[*]",
]

encoded = tokenizer(
    polymer_smiles,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
encoded = {key: value.to(device) for key, value in encoded.items()}

with torch.no_grad():
    outputs = model(**encoded)

token_embeddings = outputs.last_hidden_state
# Shape (batch_size, sequence_length, 1): 1.0 for real tokens, 0.0 for padding
attention_mask = encoded["attention_mask"].unsqueeze(-1).float()

# Mean pooling: sum the non-padding token vectors, then divide by their count
embeddings = (token_embeddings * attention_mask).sum(dim=1)
embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
embeddings = embeddings.cpu().numpy()

print(embeddings.shape)
# (2, 512)
```

The Hugging Face pipeline returns token-level features.
For polymer-level embeddings, prefer the `SentenceTransformer` example above or
apply the mean pooling step shown in this section.

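To make the attention-mask-aware mean pooling step concrete, here is a minimal pure-Python sketch on a toy sequence. The function `masked_mean_pool` and the toy values are assumptions made for illustration (hidden size 2 instead of 512, and no model involved). It shows that a padding position, which generally carries a nonzero hidden state, would skew a naive average, while the masked average ignores it.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (hypothetical helper)."""
    hidden_size = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    count = max(len(kept), 1)  # avoid division by zero for an all-padding row
    return [sum(vec[i] for vec in kept) / count for i in range(hidden_size)]


# One toy sequence: two real tokens followed by one padding position.
# hidden_size is 2 here for readability; PoCo vectors are 512-dimensional.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

# Masked mean: the padding vector does not contribute.
pooled = masked_mean_pool(tokens, mask)

# Naive mean over all positions: the padding vector is averaged in.
naive = [sum(vec[i] for vec in tokens) / len(tokens) for i in range(2)]

print(pooled)
print(naive)
```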
## Input Notes

- Polymer SMILES must use `[*]` to mark repeat-unit endpoints, not bare `*`.
- The model does not validate whether a string is a chemically valid SMILES
  string. We recommend canonicalizing polymer SMILES with the [`psmiles`](https://psmiles.readthedocs.io/) library before passing them to the model.

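Because the model does not validate its input, a lightweight string-level check before encoding can catch the endpoint-format mistake described above. The sketch below is one possible approach; the function names are hypothetical, and it assumes that every `*` in a well-formed input appears as `[*]`. It does not test chemical validity and is not a substitute for canonicalization.

```python
def count_endpoints(psmiles: str) -> int:
    """Count bracketed wildcard atoms `[*]` in a polymer SMILES string."""
    return psmiles.count("[*]")


def has_bare_star(psmiles: str) -> bool:
    """Return True if a `*` remains once every `[*]` has been removed."""
    return "*" in psmiles.replace("[*]", "")


def check_polymer_smiles(psmiles: str) -> list:
    """Collect format problems; an empty list means none were found.

    This is a string-level check only. It does not test chemical validity.
    """
    problems = []
    if has_bare_star(psmiles):
        problems.append("bare '*' found; write endpoints as '[*]'")
    if count_endpoints(psmiles) == 0:
        problems.append("no '[*]' endpoint found")
    return problems


# Try the check on a well-formed input and two malformed ones.
for candidate in ["[*]CC[*]", "*CC*", "CC"]:
    print(candidate, check_polymer_smiles(candidate))
```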
## Citation

If you use PoCo, please cite:

Wang, L.; Long, D. *Contrastive representation learning for polymer
informatics*. ChemRxiv, 2026. https://doi.org/10.26434/chemrxiv.15003645/v1