---
language: en
license: mit
library_name: transformers
tags:
- bert
- scientific-text
- embeddings
- fine-tuned
pipeline_tag: feature-extraction
---

# scibert-citation-model

This model is a fine-tuned version of SciBERT, optimized for generating embeddings from scientific papers.

## Model Details

- **Base Model**: SciBERT (Scientific BERT)
- **Fine-tuning Task**: Scientific paper understanding and embedding generation
- **Language**: English (scientific/academic text)
- **Vocabulary**: SciVocab, SciBERT's vocabulary built from scientific text

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/scibert-citation-model")
model = AutoModel.from_pretrained("your-username/scibert-citation-model")
model.eval()  # disable dropout for deterministic embeddings

# Generate embeddings
text = "Your scientific text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding

print(f"Embeddings shape: {embeddings.shape}")
```

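For retrieval-style use cases (e.g. finding related papers), embeddings like the ones above are typically compared with cosine similarity. The following is a minimal sketch in plain PyTorch; the random tensors are hypothetical stand-ins for [CLS] embeddings produced by the snippet above (768 is BERT-base's hidden size).

```python
import torch
import torch.nn.functional as F

def cosine_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Row-wise cosine similarity between two batches of embeddings."""
    # Normalize each embedding to unit length, then take the dot product.
    a = F.normalize(emb_a, p=2, dim=-1)
    b = F.normalize(emb_b, p=2, dim=-1)
    return (a * b).sum(dim=-1)

# Stand-ins for the [CLS] embeddings of two texts, shape (1, 768).
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

score = cosine_similarity(emb_a, emb_b)
print(score.item())  # a value in [-1, 1]; similar texts score closer to 1
```

Because the similarity is computed on unit-normalized vectors, scores are comparable across pairs of texts of different lengths.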
## Performance

Fine-tuned SciBERT; no benchmark results have been reported for this checkpoint.

## Training Details

- **Training Framework**: PyTorch / Hugging Face Transformers
- **Fine-tuning Objective**: Scientific text understanding

## Citation

If you use this model in your research, please cite appropriately.