Update README.md
README.md (CHANGED)
@@ -6,15 +6,44 @@ tags:
 - feature-extraction
 - sentence-similarity
 - transformers
-
+- optoelectronics
+license: mit
+datasets:
+- CambridgeMolecularEngineering/oe-ttl-abs-303k
+language:
+- en
+base_model:
+- bert-base-uncased
 ---

-# Dingyun-Huang/oe-
+# Dingyun-Huang/oe-sroberta-embedding

 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

 <!--- Describe your model here -->

+
+**The OE-BERT model is domain adapted from bert-base-uncased over research literature in optoelectronics. The adapted model is then fine-tuned on abstracts and titles of optoelectronics research articles for embedding capabilities.**
+
+## Model Details
+
+### Model Description
+
+<!-- Provide a longer summary of what this model is. -->
+
+- **Language(s) (NLP):** English
+- **Adapted from model:** bert-base-uncased
+
+
+
+### Model Sources
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [OptoelectronicsLM-codebase (GitHub)](https://github.com/Dingyun-Huang/OptoelectronicsLM-codebase)
+- **Paper:** [
+Cost-Efficient Domain-Adaptive Pretraining of Language Models for Optoelectronics Applications](https://pubs.acs.org/doi/10.1021/acs.jcim.4c02029)
+
 ## Usage (Sentence-Transformers)

 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -29,7 +58,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]

-model = SentenceTransformer('Dingyun-Huang/oe-
+model = SentenceTransformer('Dingyun-Huang/oe-sroberta-embedding')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -55,8 +84,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']

 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('Dingyun-Huang/oe-
-model = AutoModel.from_pretrained('Dingyun-Huang/oe-
+tokenizer = AutoTokenizer.from_pretrained('Dingyun-Huang/oe-sroberta-embedding')
+model = AutoModel.from_pretrained('Dingyun-Huang/oe-sroberta-embedding')

 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -78,18 +107,38 @@ print(sentence_embeddings)

 <!--- Describe how your model was evaluated -->

-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=Dingyun-Huang/oe-
+For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=Dingyun-Huang/oe-sroberta-embedding)



 ## Full Model Architecture
 ```
 SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model:
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )
 ```

 ## Citing & Authors

-<!--- Describe where people can find more information -->
+<!--- Describe where people can find more information -->
+**BibTeX:**
+```bibtex
+@article{doi:10.1021/acs.jcim.4c02029,
+author = {Huang, Dingyun and Cole, Jacqueline M.},
+title = {Cost-Efficient Domain-Adaptive Pretraining of Language Models for Optoelectronics Applications},
+journal = {Journal of Chemical Information and Modeling},
+volume = {65},
+number = {5},
+pages = {2476-2486},
+year = {2025},
+doi = {10.1021/acs.jcim.4c02029},
+note ={PMID: 39933074},
+URL = {
+https://doi.org/10.1021/acs.jcim.4c02029
+},
+eprint = {
+https://doi.org/10.1021/acs.jcim.4c02029
+}
+}
+```
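The updated card says the embeddings are intended for clustering and semantic search, but its snippets stop at printing raw embeddings. A minimal sketch of the semantic-search case, assuming the `sentence-transformers` package and its `util.cos_sim` helper; the query and corpus strings here are illustrative placeholders, not part of the model card:

```python
from sentence_transformers import SentenceTransformer, util

# Model id taken from the updated card above.
model = SentenceTransformer('Dingyun-Huang/oe-sroberta-embedding')

# Hypothetical corpus and query for illustration only.
corpus = [
    "Perovskite solar cells with improved power conversion efficiency",
    "Organic light-emitting diodes based on thermally activated delayed fluorescence",
    "A survey of convolutional neural networks for image classification",
]
query = "efficient OLED emitter materials"

# Encode corpus and query into 768-dimensional embeddings.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {sentence}")
```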