---
library_name: transformers
tags: []
---
# Model Card
<!-- Provide a quick summary of what the model is/does. -->
Our **JpharmaBERT (base)** is a continually pre-trained version of the BERT model ([tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)), further trained on pharmaceutical data — the same dataset used for [eques/jpharmatron](https://huggingface.co/EQUES/JPharmatron-7B).
# Example Usage
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
# Load the model in bfloat16 together with its tokenizer.
model = AutoModelForMaskedLM.from_pretrained("EQUES/jpharma-bert-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EQUES/jpharma-bert-base")

# Predict the most likely replacements for the [MASK] token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = fill_mask("水は化学式で[MASK]2Oです。")
for result in results:
    print(result)
# {'score': 0.49609375, 'token': 55, 'token_str': 'H', 'sequence': '水は化学式でH2Oです。'}
# {'score': 0.11767578125, 'token': 29257, 'token_str': 'Na', 'sequence': '水は化学式でNa2Oです。'}
# {'score': 0.047607421875, 'token': 61, 'token_str': 'N', 'sequence': '水は化学式でN2Oです。'}
# {'score': 0.038330078125, 'token': 16966, 'token_str': 'CH', 'sequence': '水は化学式でCH2Oです。'}
# {'score': 0.0255126953125, 'token': 66, 'token_str': 'S', 'sequence': '水は化学式でS2Oです。'}
```
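The `pipeline` helper can also be bypassed if you want the raw logits. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` loaded above and reads the predictions at the `[MASK]` position directly; the top-k size of 5 is an arbitrary choice.
```python
import torch

text = "水は化学式で[MASK]2Oです。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the [MASK] token and take its 5 highest-scoring replacements.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
# Expected to list tokens such as 'H', 'Na', 'N' (the same candidates as the pipeline output above).
```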
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
JpharmaBERT was continually pre-trained on the same dataset as [eques/jpharmatron](https://huggingface.co/EQUES/JPharmatron-7B), which consists of:
- Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts
- English data (8B tokens) obtained from PubMed abstracts
- Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset
After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.
(For details, please refer to our paper about Jpharmatron: [link](https://arxiv.org/abs/2505.16661))
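The deduplication step is detailed in the paper; purely as an illustration, exact-match deduplication across sources can be sketched with a content hash. The `corpora` iterable and the light normalization below are assumptions, not the actual preprocessing code.
```python
import hashlib

def deduplicate(corpora):
    """Yield each document once, keeping the first occurrence across all sources."""
    seen = set()
    for doc in corpora:
        # Hash a lightly normalized form of the document as the dedup key.
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```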
### Training Hyperparameters
The model was continually pre-trained with the following settings (see the sketch after this list for how they map onto a standard masked-LM training setup):
- Mask probability: 15%
- Maximum sequence length: 512 tokens
- Number of training epochs: 6
- Learning rate: 1e-4
- Warm-up steps: 10,000
- Per-device training batch size: 64
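
As a rough illustration of how these settings map onto a standard masked-LM setup, the sketch below uses the Hugging Face `Trainer` with `DataCollatorForLanguageModeling`. It is not the actual training script: the corpus, output directory, and single-device setup are placeholders.
```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "tohoku-nlp/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder corpus; the real run used the ~9B-token pharmaceutical dataset described above.
texts = ["水は化学式でH2Oです。"]
encodings = tokenizer(texts, truncation=True, max_length=512)
train_dataset = [
    {"input_ids": ids, "attention_mask": mask}
    for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
]

# 15% dynamic masking, as listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="jpharma-bert-base",  # placeholder path
    num_train_epochs=6,
    learning_rate=1e-4,
    warmup_steps=10_000,
    per_device_train_batch_size=64,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()
```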
## Model Card Authors
Created by Takuro Fujii (tkr.fujii.ynu@gmail.com)