YukiTashiro committed · Commit d652c37 · verified · 1 Parent(s): 0cadee8

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,93 @@
+ ---
+ language:
+ - ja
+ tags:
+ - biomedical
+ - text
+ license: cc-by-4.0
+ datasets:
+ - JMED-DICT-mini
+ base_model: "xlm-roberta-base"
+ ---
+ # MedTXTNorm
+ **MedTXTNorm** is a model for normalizing Japanese medical terms. It is fine-tuned from [cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR](https://huggingface.co/cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR), which builds on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base), using a subset of JMED-DICT (approximately 30k term-concept pairs).
+
+ ## How to use
+
+ The following script converts a list of strings (entity names) into embeddings and performs a similarity search.
+
+ - `jmed_dict_mini_demo`: a small set of normalization candidates drawn from JMED-DICT-mini
+ - `questions`: surface forms (e.g., '脱水')
+ - `answers`: normalized forms (e.g., '脱水症')
+
+ ```python
+ import time
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ # 1. Setup
+ model_name = "sociocom/MedTXTNorm"
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name).to(device).eval()
+
+ # 2. Data
+ jmed_dict_mini_demo = ['脱水症', '高張性脱水症', '口渇症', '発汗障害', '羊水過少症', '破水', '水中毒', '両側水腎症', '下血', '溺水']
+ questions, answers = ['脱水'], ['脱水症']
+ top_k = 10
+
+ # 3. Inference (Embedding & Search)
+ def embed(texts):
+     with torch.no_grad():
+         inputs = tokenizer(texts, padding=True, truncation=True, max_length=25, return_tensors="pt").to(device)
+         # L2-normalize the [CLS] embeddings so that a dot product equals cosine similarity
+         return F.normalize(model(**inputs)[0][:, 0, :], p=2, dim=1)
+
+ if device == "cuda":
+     torch.cuda.synchronize()
+ start = time.time()
+
+ # Build the embeddings
+ query_embs = embed(questions)           # Shape: (Batch, Dim)
+ dict_embs = embed(jmed_dict_mini_demo)  # Shape: (N, Dim)
+
+ # Similarity matrix (matrix product of unit vectors = cosine similarity)
+ # (Batch, Dim) @ (Dim, N) -> (Batch, N)
+ similarity_matrix = torch.matmul(query_embs, dict_embs.T)
+
+ # Retrieve the top-k candidates per query
+ top_vals, top_idxs = torch.topk(similarity_matrix, k=top_k)
+
+ if device == "cuda":
+     torch.cuda.synchronize()
+ print(f"Time: {time.time() - start:.4f} sec")
+
+ # 4. Formatting
+ # Convert the GPU tensors to Python lists so the loop below stays fast
+ top_vals_list = top_vals.tolist()
+ top_idxs_list = top_idxs.tolist()
+
+ results = []
+ for i, (q, a) in enumerate(zip(questions, answers)):
+     candidates = []
+     for val, idx in zip(top_vals_list[i], top_idxs_list[i]):
+         name = jmed_dict_mini_demo[idx]
+         score = float(f"{val:.3g}")  # 3 significant digits
+         candidates.append((name, score))
+     results.append({"input": q, "answer": a, "candidates": candidates})
+
+ print(results)
+ # Time: 0.0303 sec
+ # [{'input': '脱水', 'answer': '脱水症', 'candidates': [('脱水症', 0.986), ('羊水過少症', 0.532), ('溺水', 0.491), ('口渇症', 0.49), ('水中毒', 0.482), ('発汗障害', 0.468), ('下血', 0.452), ('高張性脱水症', 0.447), ('両側水腎症', 0.442), ('破水', 0.409)]}]
+ ```
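
The script above scores every dictionary entry against every query by materializing the full similarity matrix, which is fine for this ten-entry demo but not for a full JMED-DICT-scale candidate list. A minimal sketch of the same search through a FAISS inner-product index (faiss-cpu is an assumption, not a dependency of this repository; `embed`, `jmed_dict_mini_demo`, `questions`, and `top_k` are reused from the script above):

```python
# Hypothetical extension: index-based search over a large candidate list.
import faiss  # pip install faiss-cpu (assumption: not part of this repository)

dict_embs = embed(jmed_dict_mini_demo).cpu().numpy()  # (N, Dim), already L2-normalized
index = faiss.IndexFlatIP(dict_embs.shape[1])         # inner product == cosine on unit vectors
index.add(dict_embs)

query_embs = embed(questions).cpu().numpy()
scores, idxs = index.search(query_embs, top_k)        # both shaped (Batch, top_k)
print([(jmed_dict_mini_demo[j], float(s)) for j, s in zip(idxs[0], scores[0])])
```

Because `embed` L2-normalizes its output, inner product and cosine similarity coincide, so `IndexFlatIP` returns the same scores as the matrix-multiply version.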
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
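
The architecture parameters above can be read back without downloading the weights, since `AutoConfig` fetches only this JSON file. A quick sketch (the printed values are taken from the config shown above):

```python
from transformers import AutoConfig

# Fetches only config.json, not the weight file
config = AutoConfig.from_pretrained("sociocom/MedTXTNorm")
print(config.model_type)         # xlm-roberta
print(config.hidden_size)        # 1024
print(config.num_hidden_layers)  # 24
print(config.vocab_size)         # 250002
```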
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0bdc447471e5af437f3d57038273b30445deb80619c1ca2a18823d4aec84822d
+ size 2239607176
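
What the repository stores is a Git LFS pointer: the `oid` and `size` fields identify the real weight file (about 2.2 GB), which is fetched on demand. `AutoModel.from_pretrained` does this resolution for you; if you only want the raw tensors, a minimal sketch using `huggingface_hub` and `safetensors` (both assumed installed):

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Resolves the LFS pointer and downloads the real ~2.2 GB file into the local cache
path = hf_hub_download(repo_id="sociocom/MedTXTNorm", filename="model.safetensors")
state_dict = load_file(path)
print(len(state_dict), "tensors")
```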
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
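
This file is the tokenizer's subword vocabulary, again stored as an LFS pointer. It can be inspected directly with the `sentencepiece` package (an assumption; going through `AutoTokenizer` is the normal route). A sketch, where the printed pieces are illustrative rather than verified output:

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="sociocom/MedTXTNorm", filename="sentencepiece.bpe.model")
sp = spm.SentencePieceProcessor(model_file=path)
print(sp.vocab_size())                   # SentencePiece vocab (the tokenizer adds specials on top)
print(sp.encode("脱水症", out_type=str))  # subword pieces for a medical term
```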
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
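
These mappings surface as attributes on the loaded tokenizer, which is how the README script gets its `<s>`/`</s>` wrapping and `<pad>` padding for free. A quick check (expected values are those listed in the map above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sociocom/MedTXTNorm")
print(tokenizer.cls_token, tokenizer.sep_token)   # <s> </s>
print(tokenizer.pad_token, tokenizer.mask_token)  # <pad> <mask>
```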
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
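
Note that `model_max_length` is 512, while the README script truncates at `max_length=25`, which is ample for dictionary terms. A minimal sketch showing how `XLMRobertaTokenizer` wraps and pads a batch (the exact subword split is illustrative, not verified):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sociocom/MedTXTNorm")
enc = tokenizer(["脱水", "高張性脱水症"], padding=True, truncation=True, max_length=25)
for ids in enc["input_ids"]:
    # Each entry is wrapped in <s> ... </s>; shorter entries are padded
    # with <pad> (id 1) up to the longest item in the batch
    print(tokenizer.convert_ids_to_tokens(ids))
```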