---
library_name: transformers
tags: []
---

# Model Card

<!-- Provide a quick summary of what the model is/does. -->
**JpharmaBERT (base)** is a continually pre-trained version of the BERT model [tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3), further trained on pharmaceutical data. The training corpus is the same dataset used for [EQUES/JPharmatron-7B](https://huggingface.co/EQUES/JPharmatron-7B).

# Example Usage

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("EQUES/jpharma-bert-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EQUES/jpharma-bert-base")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

results = fill_mask("水は化学式で[MASK]2Oです。")

for result in results:
    print(result)
# {'score': 0.49609375, 'token': 55, 'token_str': 'H', 'sequence': '水は化学式でH2Oです。'}
# {'score': 0.11767578125, 'token': 29257, 'token_str': 'Na', 'sequence': '水は化学式でNa2Oです。'}
# {'score': 0.047607421875, 'token': 61, 'token_str': 'N', 'sequence': '水は化学式でN2Oです。'}
# {'score': 0.038330078125, 'token': 16966, 'token_str': 'CH', 'sequence': '水は化学式でCH2Oです。'}
# {'score': 0.0255126953125, 'token': 66, 'token_str': 'S', 'sequence': '水は化学式でS2Oです。'}
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We trained JpharmaBERT on the same dataset as [EQUES/JPharmatron-7B](https://huggingface.co/EQUES/JPharmatron-7B), which consists of:
- Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts  
- English data (8B tokens) obtained from PubMed abstracts  
- Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset  

After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.  
(For details, see our JPharmatron paper: [arXiv:2505.16661](https://arxiv.org/abs/2505.16661))

### Training Hyperparameters

The model was continually pre-trained with the following settings:

- Mask probability: 15%
- Maximum sequence length: 512 tokens
- Number of training epochs: 6
- Learning rate: 1e-4  
- Warm-up steps: 10,000 
- Per-device training batch size: 64
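
To make the mask probability and warm-up settings concrete, here is a minimal, dependency-free sketch. It is not the training code used for JpharmaBERT. The 80/10/10 replacement rule is the standard BERT recipe and is assumed here, since this card only states the 15% selection rate. The constant learning rate after warm-up is also an assumption, and `mask_tokens`, `warmup_lr`, the token ids, and the vocabulary size are hypothetical names and placeholder values chosen for illustration.

```python
import random

# Values taken from the hyperparameter list above.
MASK_PROBABILITY = 0.15
LEARNING_RATE = 1e-4
WARMUP_STEPS = 10_000


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset(), rng=None):
    """Return (inputs, labels) for masked language modeling.

    Assumption: selected tokens follow the standard BERT 80/10/10 rule
    (80% -> mask token, 10% -> random token, 10% -> left unchanged).
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        # Special tokens are never selected; other tokens are selected
        # with probability MASK_PROBABILITY.
        if tok in special_ids or rng.random() >= MASK_PROBABILITY:
            inputs.append(tok)
            labels.append(-100)  # conventional "ignore this position" label
            continue
        labels.append(tok)  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs.append(mask_id)
        elif roll < 0.9:
            inputs.append(rng.randrange(vocab_size))
        else:
            inputs.append(tok)
    return inputs, labels


def warmup_lr(step):
    """Linear warm-up to the peak rate, then held constant (assumed)."""
    return LEARNING_RATE * min(1.0, step / WARMUP_STEPS)


# Placeholder 512-token sequence (the maximum sequence length above).
rng = random.Random(0)
ids = list(range(100, 612))
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=32768, rng=rng)
n_selected = sum(label != -100 for label in labels)
print(f"{n_selected} of {len(ids)} tokens selected for prediction")
print(f"learning rate halfway through warm-up: {warmup_lr(5_000)}")
```

On average about 15% of the non-special tokens in each sequence contribute to the loss, and the learning rate rises linearly from 0 to 1e-4 over the first 10,000 steps.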

## Model Card Authors

Created by Takuro Fujii (tkr.fujii.ynu@gmail.com)