---
license: apache-2.0
base_model:
- google-bert/bert-base-chinese
metrics:
- accuracy
language:
- en
- zh
pipeline_tag: fill-mask
---
# Herbert: Pretrained Bert Model for Herbal Medicine
**Herbert** is a pretrained model for herbal medicine research, built on the `bert-base-chinese` model and fine-tuned on domain-specific data from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks. It is designed to support a variety of TCM-related NLP tasks.
---
## Introduction
This model is optimized for TCM-related tasks, including but not limited to:
- Herbal formula encoding
- Domain-specific word embedding
- Classification, labeling, and sequence prediction tasks in TCM research

Herbert combines the strengths of modern pretraining techniques and domain knowledge, allowing it to excel in TCM-related text processing tasks.
---
## Model Config
```json
{
"hidden_size": 1024,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"torch_dtype": "float32",
"vocab_size": 21128
}
```

### Requirements
- `transformers == 4.45.1`
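As a rough sanity check, an approximate parameter count can be derived from the config above (a back-of-the-envelope sketch; exact totals depend on weight tying and any task heads):

```python
# Rough parameter-count estimate from the config values above
hidden, layers, vocab, max_pos = 1024, 24, 21128, 512

# Embeddings: token + position + 2 token-type vectors, plus one LayerNorm
embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden

# Per encoder layer: self-attention (4 projections with bias),
# feed-forward (hidden -> 4*hidden -> hidden), and two LayerNorms
attention = 4 * (hidden * hidden + hidden)
ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
layer = attention + ffn + 2 * (2 * hidden)

total = embeddings + layers * layer
print(f"~{total / 1e6:.0f}M parameters")  # roughly 324M
```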
### Quickstart
#### Using Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModel
# Replace "Chengfengke/herbert" with the Hugging Face model repository name
model_name = "Chengfengke/herbert"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Input text
text = "中医理论是我国传统文化的瑰宝。"
# Tokenize and prepare input (no padding for a single sentence, so the
# mean pooling below is not diluted by [PAD] tokens)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
# Get the model's outputs
with torch.no_grad():
outputs = model(**inputs)
# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
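When embedding a batch of padded sentences, a plain mean over all positions lets `[PAD]` tokens dilute the result. A mask-aware mean pools only real tokens; below is a minimal NumPy sketch, with dummy arrays standing in for the model's `last_hidden_state` and the tokenizer's `attention_mask`:

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average hidden states over real (non-padding) tokens only.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real tokens per sentence
    return summed / np.clip(counts, 1e-9, None)

# Dummy data: batch of 2, seq_len 4, hidden 3; second sentence has 2 pad tokens
hidden_states = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = masked_mean_pool(hidden_states, mask)
print(pooled.shape)  # (2, 3)
```

With real model outputs, pass `outputs.last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()` (or keep everything in PyTorch with the equivalent tensor ops).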
#### Local Model
```python
from transformers import BertTokenizer, BertForMaskedLM

# To run from a local checkpoint, point model_name at the directory
# holding the downloaded weights instead of the Hub ID
model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Fill-mask example: the model predicts the token behind [MASK]
inputs = tokenizer("This is an example text for herbal [MASK].", return_tensors="pt")
outputs = model(**inputs)
# outputs.logits holds a score for every vocabulary token at each position
```
## Citation
If you find our work helpful, please cite:
```bibtex
@misc{herbert-embedding,
  title  = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herbert-technical-report,
  title       = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angopro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```