|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- google-bert/bert-base-chinese |
|
|
metrics: |
|
|
- accuracy |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
pipeline_tag: fill-mask |
|
|
--- |
|
|
# Herbert: A Pretrained BERT Model for Herbal Medicine
|
|
|
|
|
**Herbert** is a pretrained model for herbal medicine research, built on the `bert-base-chinese` model. It has been fine-tuned on domain-specific data drawn from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks, and is designed to support a variety of TCM-related NLP tasks.
|
|
|
|
|
--- |
|
|
|
|
|
## Introduction |
|
|
|
|
|
This model is optimized for TCM-related tasks, including but not limited to: |
|
|
- Herbal formula encoding |
|
|
- Domain-specific word embedding |
|
|
- Classification, labeling, and sequence prediction tasks in TCM research |
|
|
|
|
|
Herbert combines the strengths of modern pretraining techniques and domain knowledge, allowing it to excel in TCM-related text processing tasks. |
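For a quick end-to-end check of the fill-mask head (see the `pipeline_tag` above), the standard `transformers` pipeline can be used directly. This is a minimal sketch; it assumes the `Chengfengke/herbert` repository name used in the Quickstart below.

```python
from transformers import pipeline

# Load the fill-mask pipeline with the Herbert checkpoint
fill_mask = pipeline("fill-mask", model="Chengfengke/herbert")

# Predict candidates for the masked character in a TCM sentence
# ("TCM theory is a treasure of our traditional culture.")
for prediction in fill_mask("中医理论是我国传统文化的[MASK]宝。"):
    print(prediction["token_str"], round(prediction["score"], 4))
```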
|
|
|
|
|
--- |
|
|
|
|
|
## Model Config |
|
|
|
|
|
```json |
|
|
{ |
|
|
"hidden_size": 1024, |
|
|
"max_position_embeddings": 512, |
|
|
"model_type": "bert", |
|
|
"num_attention_heads": 16, |
|
|
"num_hidden_layers": 24, |
|
|
"torch_dtype": "float32", |
|
|
"vocab_size": 21128 |
|
|
}
```
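To verify that a downloaded checkpoint matches this configuration, it can be inspected with the standard `AutoConfig` API. This is a minimal sketch; the expected values in the comments are copied from the config above.

```python
from transformers import AutoConfig

# Load the configuration and check it against the values listed above
config = AutoConfig.from_pretrained("Chengfengke/herbert")
print(config.model_type)           # bert
print(config.hidden_size)          # 1024
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 16
print(config.vocab_size)           # 21128
```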
|
|
### Requirements

The checkpoint was saved with `transformers` version 4.45.1.
|
|
|
|
|
## Quickstart
|
|
|
|
|
### Use the Hugging Face Hub
|
|
```python |
|
|
import torch
from transformers import AutoTokenizer, AutoModel
|
|
|
|
|
# Hugging Face model repository name (or a local path to the downloaded model)
|
|
model_name = "Chengfengke/herbert" |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModel.from_pretrained(model_name) |
|
|
|
|
|
# Input text |
|
|
text = "中医理论是我国传统文化的瑰宝。" |
|
|
|
|
|
# Tokenize and prepare input |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128) |
|
|
|
|
|
# Get the model's outputs |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
# Get the embedding (sentence-level average pooling) |
|
|
sentence_embedding = outputs.last_hidden_state.mean(dim=1) |
|
|
|
|
|
print("Embedding shape:", sentence_embedding.shape) |
|
|
print("Embedding vector:", sentence_embedding) |
|
|
``` |
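Note that the snippet above pads every input to `max_length=128`, so the plain mean also averages over padding positions. A common refinement, sketched below under the assumption that `inputs` and `outputs` come from the snippet above, is to weight the average by the attention mask so that padding tokens are ignored:

```python
# Mask-aware mean pooling: exclude padding positions from the average
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)                # (batch, 1)
sentence_embedding = summed / counts

print("Embedding shape:", sentence_embedding.shape)
```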
|
|
|
|
|
|
|
|
### Use a Local Model
|
|
```python |
|
|
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the tokenizer and masked-language-model head; the repo name can be
# replaced with a local directory containing the downloaded checkpoint
model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Predict the masked character in a TCM sentence
inputs = tokenizer("中医[MASK]论是我国传统文化的瑰宝。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction at the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
|
|
``` |
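As a sketch of the domain-specific embedding use case listed in the introduction, the pooled representations can be compared with cosine similarity. The two sentences below are illustrative examples chosen here, not drawn from the training corpus.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "Chengfengke/herbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode two herb-related sentences
# ("Ginseng tonifies qi." / "Astragalus tonifies qi.")
sentences = ["人参补气。", "黄芪补气。"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mask-aware mean pooling, then cosine similarity between the two embeddings
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print("Cosine similarity:", score.item())
```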
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please cite us.
|
|
|
|
|
```bibtex |
|
|
@misc{herbert-embedding, |
|
|
  title  = {Herbert: A Pretrained BERT Model for TCM Herbs and Downstream Tasks as Text Embedding Generation},
|
|
  author = {Yehan Yang and Xinhan Zheng},
|
|
month = {December}, |
|
|
year = {2024} |
|
|
} |
|
|
|
|
|
@article{herbert-technical-report, |
|
|
  title = {Herbert: A Pretrained BERT Model for TCM Herbs and Downstream Tasks as Text Embedding Generation},
|
|
  author = {Yehan Yang and Xinhan Zheng},
|
|
institution={Beijing Angopro Technology Co., Ltd.}, |
|
|
year={2024}, |
|
|
note={Presented at the 2024 Machine Learning Applications Conference (MLAC)} |
|
|
}
```
|
|
|