|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- google-bert/bert-base-chinese |
|
|
metrics: |
|
|
- accuracy |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
pipeline_tag: fill-mask |
|
|
--- |
|
|
# Herbert: A Pretrained BERT Model for Herbal Medicine
|
|
|
|
|
**Herbert** is a pretrained model for herbal medicine research, built on the `bert-base-chinese` model. It has been fine-tuned on domain-specific data drawn from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks, and is designed to support a variety of TCM-related NLP tasks.
|
|
|
|
|
--- |
|
|
|
|
|
## Introduction |
|
|
|
|
|
This model is optimized for TCM-related tasks, including but not limited to: |
|
|
- Herbal formula encoding |
|
|
- Domain-specific word embedding |
|
|
- Classification, labeling, and sequence prediction tasks in TCM research |
|
|
|
|
|
Herbert combines the strengths of modern pretraining techniques and domain knowledge, allowing it to excel in TCM-related text processing tasks. |
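For a quick end-to-end check of the fill-mask head (see the `pipeline_tag` above), the standard `transformers` pipeline can be used directly. This is a minimal sketch; it assumes the `Chengfengke/herbert` repository name used in the Quickstart below.

```python
from transformers import pipeline

# Load the fill-mask pipeline with the Herbert checkpoint
fill_mask = pipeline("fill-mask", model="Chengfengke/herbert")

# Predict candidates for the masked character in a TCM sentence
# ("TCM theory is a treasure of our traditional culture.")
for prediction in fill_mask("中医理论是我国传统文化的[MASK]宝。"):
    print(prediction["token_str"], round(prediction["score"], 4))
```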
|
|
|
|
|
--- |
|
|
|
|
|
## Model Config |
|
|
|
|
|
```json |
|
|
{ |
|
|
"hidden_size": 1024, |
|
|
"max_position_embeddings": 512, |
|
|
"model_type": "bert", |
|
|
"num_attention_heads": 16, |
|
|
"num_hidden_layers": 24, |
|
|
"torch_dtype": "float32", |
|
|
"vocab_size": 21128 |
|
|
}
```
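To verify that a downloaded checkpoint matches this configuration, it can be inspected with the standard `AutoConfig` API. This is a minimal sketch; the expected values in the comments are copied from the config above.

```python
from transformers import AutoConfig

# Load the configuration and check it against the values listed above
config = AutoConfig.from_pretrained("Chengfengke/herbert")
print(config.model_type)           # bert
print(config.hidden_size)          # 1024
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 16
print(config.vocab_size)           # 21128
```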
|
|
### Requirements

The checkpoint was saved with `transformers` version 4.45.1.
|
|
|
|
|
## Quickstart
|
|
|
|
|
### Use the Hugging Face Hub
|
|
```python |
|
|
import torch
from transformers import AutoTokenizer, AutoModel
|
|
|
|
|
# Hugging Face model repository name (or a local path to the downloaded model)
|
|
model_name = "Chengfengke/herbert" |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModel.from_pretrained(model_name) |
|
|
|
|
|
# Input text |
|
|
text = "中医理论是我国传统文化的瑰宝。" |
|
|
|
|
|
# Tokenize and prepare input |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128) |
|
|
|
|
|
# Get the model's outputs |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
# Get the embedding (sentence-level average pooling) |
|
|
sentence_embedding = outputs.last_hidden_state.mean(dim=1) |
|
|
|
|
|
print("Embedding shape:", sentence_embedding.shape) |
|
|
print("Embedding vector:", sentence_embedding) |
|
|
``` |
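Note that the snippet above pads every input to `max_length=128`, so the plain mean also averages over padding positions. A common refinement, sketched below under the assumption that `inputs` and `outputs` come from the snippet above, is to weight the average by the attention mask so that padding tokens are ignored:

```python
# Mask-aware mean pooling: exclude padding positions from the average
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)                # (batch, 1)
sentence_embedding = summed / counts

print("Embedding shape:", sentence_embedding.shape)
```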
|
|
|
|
|
|
|
|
### Use a Local Model
|
|
```python |
|
|
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the tokenizer and masked-language-model head; the repo name can be
# replaced with a local directory containing the downloaded checkpoint
model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Predict the masked character in a TCM sentence
inputs = tokenizer("中医[MASK]论是我国传统文化的瑰宝。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction at the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
|
|
``` |
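As a sketch of the domain-specific embedding use case listed in the introduction, the pooled representations can be compared with cosine similarity. The two sentences below are illustrative examples chosen here, not drawn from the training corpus.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "Chengfengke/herbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode two herb-related sentences
# ("Ginseng tonifies qi." / "Astragalus tonifies qi.")
sentences = ["人参补气。", "黄芪补气。"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mask-aware mean pooling, then cosine similarity between the two embeddings
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print("Cosine similarity:", score.item())
```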
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please cite us.
|
|
|
|
|
```bibtex |
|
|
@misc{herbert-embedding, |
|
|
  title  = {Herbert: A Pretrained BERT Model for TCM Herbs and Downstream Tasks as Text Embedding Generation},
|
|
  author = {Yehan Yang and Xinhan Zheng},
|
|
month = {December}, |
|
|
year = {2024} |
|
|
} |
|
|
|
|
|
@article{herbert-technical-report, |
|
|
  title = {Herbert: A Pretrained BERT Model for TCM Herbs and Downstream Tasks as Text Embedding Generation},
|
|
  author = {Yehan Yang and Xinhan Zheng},
|
|
institution={Beijing Angopro Technology Co., Ltd.}, |
|
|
year={2024}, |
|
|
note={Presented at the 2024 Machine Learning Applications Conference (MLAC)} |
|
|
}
```
|
|
|