Chengfengke committed
Commit 7a6cf8f · 2 Parent(s): 5490b49 0398188

Merge branch 'main' of https://huggingface.co/Chengfengke/herbert

Files changed (1):
  README.md (+78 -3)
README.md CHANGED
@@ -2,10 +2,16 @@
  license: apache-2.0
  base_model:
  - google-bert/bert-base-chinese
+ metrics:
+ - accuracy
+ language:
+ - en
+ - zh
+ pipeline_tag: fill-mask
  ---
- # Herberta: Pretrained Language Model for Herbal Medicine
+ # Herbert: Pretrained BERT Model for Herbal Medicine

- **Herberta** is a pretrained model for herbal medicine research, developed based on the `chinese-roberta-wwm-ext-large` model. The model has been fine-tuned on domain-specific data from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks. It is designed to support a variety of TCM-related NLP tasks.
+ **Herberta** is a pretrained model for herbal medicine research, developed based on the `bert-base-chinese` model. The model has been fine-tuned on domain-specific data from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks. It is designed to support a variety of TCM-related NLP tasks.

  ---

@@ -31,4 +37,73 @@ Herberta combines the strengths of modern pretraining techniques and domain know
  "num_hidden_layers": 24,
  "torch_dtype": "float32",
  "vocab_size": 21128
- }
+ }
+ ### Requirements
+ - `transformers` 4.45.1
+ ```bash
+ pip install herberta
+ ```
+
+ ### Quickstart
+
+ #### Use Hugging Face
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ # Replace "Chengfengke/herbert" with the Hugging Face model repository name
+ model_name = "Chengfengke/herbert"
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)
+
+ # Input text: "TCM theory is a gem of China's traditional culture."
+ text = "中医理论是我国传统文化的瑰宝。"
+
+ # Tokenize and prepare input
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
+
+ # Run the model without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Sentence-level embedding via mean pooling over the last hidden state
+ sentence_embedding = outputs.last_hidden_state.mean(dim=1)
+
+ print("Embedding shape:", sentence_embedding.shape)
+ print("Embedding vector:", sentence_embedding)
+ ```
+
+ #### Local Model
+ ```python
+ from transformers import BertTokenizer, BertForMaskedLM
+
+ # Load the model and tokenizer (model_name may also point to a local directory)
+ model_name = "Chengfengke/herbert"
+ tokenizer = BertTokenizer.from_pretrained(model_name)
+ model = BertForMaskedLM.from_pretrained(model_name)
+ inputs = tokenizer("This is an example text for herbal medicine.", return_tensors="pt")
+ outputs = model(**inputs)
+ ```
+
+ ## Citation
+
+ If you find our work helpful, please consider citing it:
+
+ ```bibtex
+ @misc{herberta-embedding,
+   title  = {Herberta: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
+   url    = {https://github.com/15392778677/herberta},
+   author = {Yehan Yang and Xinhan Zheng},
+   month  = {December},
+   year   = {2024}
+ }
+
+ @techreport{herbert-technical-report,
+   title       = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
+   author      = {Yehan Yang and Xinhan Zheng},
+   institution = {Beijing Angopro Technology Co., Ltd.},
+   year        = {2024},
+   note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
+ }
+ ```
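
The quickstart in the diff above stops at printing the pooled sentence vector. As an illustration of the embedding use case named in the citation titles, here is a minimal sketch that compares two sentences by cosine similarity; the `embed` helper and the sample sentences are ours, not part of the model card.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "Chengfengke/herbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states into one sentence vector,
    # mirroring the card's quickstart example.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Hypothetical sentences: "Ginseng tonifies qi." vs. "Astragalus tonifies qi."
a = embed("人参补气。")
b = embed("黄芪补气。")
print("cosine similarity:", F.cosine_similarity(a, b, dim=0).item())
```

Mean pooling simply mirrors the card's own example; CLS-token pooling or L2-normalized vectors are common alternatives for retrieval-style tasks.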
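The card sets `pipeline_tag: fill-mask` and the Local Model snippet loads `BertForMaskedLM`, but it stops at raw outputs without decoding a prediction. Below is a minimal mask-filling sketch, assuming the `Chengfengke/herbert` checkpoint loads as above; the sample sentence is hypothetical.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# "Ginseng greatly tonifies the original [MASK]." — a TCM term such as 气 (qi) is expected
text = f"人参大补元{tokenizer.mask_token}。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and decode the top-5 candidate tokens
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_positions[0]], k=5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```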