---
library_name: transformers
license: apache-2.0
language: ja
pipeline_tag: fill-mask
tags:
- japanese
- pharmaceutical
- bert
- continual-pretraining
---

# JpharmaBERT: A Japanese Language Model for Pharmaceutical NLP

[📚 Paper](https://huggingface.co/papers/2505.16661) - [💻 Code](https://github.com/EQUES-AI/JpharmaBERT)

This is the **JpharmaBERT (base)** model, presented in the paper [A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP](https://huggingface.co/papers/2505.16661). It is a continually pre-trained version of the BERT model [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3), further trained on pharmaceutical data.

## Example Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("EQUES/jpharma-bert-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EQUES/jpharma-bert-base")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

results = fill_mask("水は化学式で[MASK]2Oです。")

for result in results:
    print(result)
# {'score': 0.49609375, 'token': 55, 'token_str': 'H', 'sequence': '水は化学式でH2Oです。'}
# {'score': 0.11767578125, 'token': 29257, 'token_str': 'Na', 'sequence': '水は化学式でNa2Oです。'}
# {'score': 0.047607421875, 'token': 61, 'token_str': 'N', 'sequence': '水は化学式でN2Oです。'}
# {'score': 0.038330078125, 'token': 16966, 'token_str': 'CH', 'sequence': '水は化学式でCH2Oです。'}
# {'score': 0.0255126953125, 'token': 66, 'token_str': 'S', 'sequence': '水は化学式でS2Oです。'}
```
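
The same prediction can also be made without the `pipeline` helper. The following is a minimal sketch that reuses the `model` and `tokenizer` loaded above and reads the top logits at the `[MASK]` position:

```python
# Manual fill-mask: run a forward pass, then take the top-5 tokens at the [MASK] position.
inputs = tokenizer("水は化学式で[MASK]2Oです。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and pick the five highest-scoring vocabulary ids.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))  # expected to include 'H' as the top candidate
```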

## Training Details

### Training Data

We used the same dataset as [EQUES/JPharmatron-7B](https://huggingface.co/EQUES/JPharmatron-7B) to train JpharmaBERT. It consists of:

* Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts
* English data (8B tokens) obtained from PubMed abstracts
* Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset

After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.
(For details, please refer to the JPharmatron paper: [link](https://arxiv.org/abs/2505.16661).)
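
As an illustration only (the actual preprocessing pipeline is described in the paper), cross-source exact-duplicate removal can be sketched by hashing whitespace-normalized text. The hashing approach and the `docs` input below are assumptions made for the example, not the published procedure:

```python
import hashlib

def drop_exact_duplicates(docs):
    """Keep the first occurrence of each document, comparing hashes of
    whitespace-normalized text (illustrative only)."""
    seen, unique = set(), []
    for text in docs:
        key = hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

# Hypothetical example with documents pooled from the three sources above.
docs = ["アスピリンは解熱鎮痛薬である。", "アスピリンは解熱鎮痛薬である。", "Aspirin is an antipyretic analgesic."]
print(len(drop_exact_duplicates(docs)))  # -> 2
```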

#### Training Hyperparameters

The model was continually pre-trained with the following settings (a configuration sketch follows the list):

* Mask probability: 15%
* Maximum sequence length: 512 tokens
* Number of training epochs: 6
* Learning rate: 1e-4
* Warm-up steps: 10,000
* Per-device training batch size: 64
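
For reference, a minimal sketch of what this continual pre-training setup could look like with the Hugging Face `Trainer` is shown below. It is illustrative only: the two-sentence `train_dataset` is a placeholder for the tokenized pharmaceutical corpus (sequences of up to 512 tokens), and details of the actual run (hardware, distributed setup, scheduler) are not reproduced here.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continual pre-training starts from the base Japanese BERT checkpoint.
base = "tohoku-nlp/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder corpus: the real ~9B-token pharmaceutical corpus is described above.
texts = ["本剤は解熱鎮痛薬である。", "Aspirin is an antipyretic analgesic."]
train_dataset = [dict(tokenizer(t, truncation=True, max_length=512)) for t in texts]

# Dynamic masking at the 15% probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters taken from this card; everything else is left at library defaults.
args = TrainingArguments(
    output_dir="jpharma-bert-base",
    num_train_epochs=6,
    learning_rate=1e-4,
    warmup_steps=10_000,
    per_device_train_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```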

## Model Card Authors

Created by Takuro Fujii (tkr.fujii.ynu@gmail.com)