<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>MedInjection-FR • French Biomedical Instruction Dataset</title>
<link rel="stylesheet" href="style.css" />
<meta name="description" content="MedInjection-FR: a French biomedical instruction dataset with native, synthetic, and translated components, plus fine-tuned models." />
</head>
<body>
<header class="site-header">
<div class="wrap">
<h1>MedInjection-FR</h1>
<p class="subtitle">A French biomedical instruction dataset and model suite</p>
<p class="meta">Native • Synthetic • Translated | 571,436 instruction–response pairs</p>
<div class="cta-row">
<a class="btn primary" href="#download">Download</a>
<a class="btn" href="#models">Models</a>
<a class="btn" href="#citation">Cite</a>
<a class="btn" href="#contact">Contact</a>
</div>
</div>
</header>
<main class="wrap">
<!-- Overview -->
<section id="overview" class="card">
<h2>Overview</h2>
<p>
MedInjection-FR is a large-scale French biomedical instruction dataset designed to study
how the <strong>provenance of supervision</strong> (native, synthetic, translated) affects instruction-tuning of LLMs.
The corpus supports multiple-choice QA (single and multi-answer) and open-ended QA,
and is released together with a family of fine-tuned baseline models.
</p>
<ul class="pill-list">
<li>Native: <strong>77,247</strong></li>
<li>Synthetic: <strong>76,506</strong></li>
<li>Translated: <strong>417,674</strong></li>
<li>Total: <strong>571,436</strong></li>
</ul>
</section>
<!-- What’s inside -->
<section id="composition" class="card">
<h2>Composition & Tasks</h2>
<div class="grid-2">
<div>
<h3>Task types</h3>
<ul>
<li>MCQU: multiple-choice questions with a single correct answer</li>
<li>MCQ: multiple-choice questions with multiple correct answers</li>
<li>OEQ: open-ended questions</li>
</ul>
<p class="muted">Counts (all components): OEQ <strong>57,509</strong>, MCQ <strong>59,592</strong>, MCQU <strong>454,335</strong>.</p>
<h3>Splits</h3>
<table class="clean">
<thead>
<tr><th>Component</th><th>Train</th><th>Validation</th><th>Test</th><th>Total</th></tr>
</thead>
<tbody>
<tr><td>Native</td><td>57,563</td><td>5,055</td><td>14,629</td><td>77,247</td></tr>
<tr><td>Synthetic</td><td>76,506</td><td>—</td><td>—</td><td>76,506</td></tr>
<tr><td>Translated</td><td>366,370</td><td>38,011</td><td>13,293</td><td>417,674</td></tr>
<tr class="total"><td>Total</td><td>500,439</td><td>43,066</td><td>27,931</td><td>571,436</td></tr>
</tbody>
</table>
</div>
<div>
<h3>Translated quality (WMT24 biomedical parallel)</h3>
<table class="clean">
<thead><tr><th>Model</th><th>BLEU</th><th>COMET</th></tr></thead>
<tbody>
<tr><td>GPT-4o-mini</td><td>51.01</td><td>0.8751</td></tr>
<tr><td>Gemini 2.0 Flash</td><td>53.72</td><td>0.8783</td></tr>
<tr class="muted"><td>WMT’24 best (ref.)</td><td>53.54</td><td>0.8760</td></tr>
</tbody>
</table>
<p class="muted">Higher is better for both metrics. Gemini 2.0 Flash exceeds the WMT’24 best reference system on both BLEU and COMET, indicating strong translation fidelity for the translated subset.</p>
</div>
</div>
</section>
<!-- Downloads -->
<section id="download" class="card">
<h2>Download</h2>
<p>Each component is published separately. Use the links below or load via the 🤗 Datasets library.</p>
<div class="grid-3 tight">
<a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Native" target="_blank">
<h3>Native</h3>
<p>French medical exams, resources, curated QA.</p>
<code>MedInjection-FR/Native</code>
</a>
<a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Synthetic" target="_blank">
<h3>Synthetic</h3>
<p>GPT-4o generated from clinical cases and abstracts.</p>
<code>MedInjection-FR/Synthetic</code>
</a>
<a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Translated" target="_blank">
<h3>Translated</h3>
<p>FR translations of established EN biomedical sets.</p>
<code>MedInjection-FR/Translated</code>
</a>
</div>
<h3>Python (🤗 Datasets)</h3>
<pre><code class="code">
from datasets import load_dataset
ds = load_dataset("MedInjection-FR/Native") # or "Synthetic", "Translated"
print(ds)
</code></pre>
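<h3>Prompt formatting (illustrative)</h3>
<p class="muted">A minimal sketch of turning one multiple-choice record into a prompt. The field names (<code>question</code>, <code>choices</code>, <code>answer</code>) are assumptions for illustration, not the published schema; check each dataset card for the actual column names.</p>
<pre><code class="code">
# Sketch: build an MCQ prompt from one record.
# Field names ("question", "choices", "answer") are assumed, not the published schema.
def format_mcq(example):
    letters = "ABCDEFGHIJ"
    lines = [example["question"], "Choix de réponses :"]
    for letter, choice in zip(letters, example["choices"]):
        lines.append(f"({letter}) {choice}")
    return "\n".join(lines)

sample = {
    "question": "Quel est l'effet indésirable le plus probable ?",
    "choices": ["Pancréatite", "Alopécie"],
    "answer": "A",
}
print(format_mcq(sample))
</code></pre>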
</section>
<!-- Models -->
<section id="models" class="card">
<h2>Fine-tuned Models</h2>
<p>We release seven instruction-tuned baselines (Qwen-4B-Instruct backbone, DoRA adapters), trained on 30k samples per configuration:</p>
<ul class="pill-list">
<li>QWEN-4B-NAT</li>
<li>QWEN-4B-TRAD</li>
<li>QWEN-4B-SYN</li>
<li>QWEN-4B-NAT-TRAD</li>
<li>QWEN-4B-NAT-SYN</li>
<li>QWEN-4B-TRAD-SYN</li>
<li>QWEN-4B-ALL</li>
</ul>
<div class="grid-3 tight">
<a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT" target="_blank"><h3>NAT</h3><p>Best single-source (MCQ/MCQU).</p></a>
<a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT-TRAD" target="_blank"><h3>NAT-TRAD</h3><p>Top mixed configuration.</p></a>
<a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-ALL" target="_blank"><h3>ALL</h3><p>All sources combined.</p></a>
</div>
<h3>Quick inference (🤗 Transformers)</h3>
<pre><code class="code">
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "MedInjection-FR/QWEN-4B-NAT-TRAD"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = """Un professionnel de santé de 54 ans consulte un spécialiste des maladies infectieuses pour un suivi concernant un diagnostic récent d'hépatite C chronique.
Il s'est initialement présenté avec des symptômes tels que fatigue, malaise et enzymes hépatiques élevées et soupçonne d'avoir contracté l'infection à la suite
d'une piqûre d'aiguille il y a des années. Malgré le début du traitement, son titre viral reste élevé, ce qui incite le médecin à ajouter un nouveau médicament
qui inhibe la maturation virale en bloquant la synthèse des protéines. Quel est l'effet indésirable le plus probable de ce médicament ?
Choix de réponses :
(A) Uropathie cristalline obstructive
(B) Suppression de la moelle osseuse
(C) Insomnie et irritabilité
(D) Céphalées et photosensibilité
(E) Rêves lucides
(F) Hyperbilirubinémie
(G) Pancréatite
(H) Neuropathie périphérique
(I) Augmentation de la créatine kinase
(J) Alopécie"""
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# generate the answer; a single new token is enough for one option letter
generated_ids = model.generate(
**model_inputs,
max_new_tokens=1
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
</code></pre>
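<h3>Parsing the answer (illustrative)</h3>
<p class="muted">A small sketch for normalizing the single-letter output above (e.g. <code>"B"</code>, <code>"(B)"</code>, or <code>" b "</code>) into an option letter. This helper is an assumption for illustration, not part of the released models or tooling.</p>
<pre><code class="code">
import re

# Sketch: extract and normalize a single option letter from raw model output.
def parse_choice(raw, n_choices=10):
    match = re.search(r"[A-Ja-j]", raw)
    if match is None:
        return None  # no recognizable option letter
    letter = match.group(0).upper()
    # reject letters beyond the number of available choices
    return letter if ord(letter) - ord("A") &lt; n_choices else None

print(parse_choice("(B)"))  # B
</code></pre>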
</section>
<!-- Evaluation note -->
<section class="card">
<h2>Evaluation at a glance</h2>
<ul>
<li>MCQ/MCQU are reported with Exact-Match; the multiple-answer MCQ setting additionally uses the Hamming score.</li>
<li>OEQ uses BLEU/ROUGE/METEOR/BERTScore and an LLM-as-a-judge calibrated on <em>human annotations</em> (100 samples).</li>
<li>Mixed training (especially <strong>NAT-TRAD</strong>) provides complementary gains over single-source setups.</li>
</ul>
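<p class="muted">As a sketch, the multiple-choice metrics above can be computed as follows, assuming answers are represented as sets of option letters. The "Hamming score" here uses one common multi-label definition (intersection over union); the paper's exact formulation may differ.</p>
<pre><code class="code">
# Sketch of the multiple-choice metrics, assuming answers are sets of option letters.
def exact_match(pred, gold):
    # 1.0 only if the predicted set equals the gold set
    return 1.0 if set(pred) == set(gold) else 0.0

def hamming_score(pred, gold):
    # one common multi-label definition: |intersection| / |union|
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    return len(pred &amp; gold) / len(pred | gold)

print(exact_match({"A", "C"}, {"A", "C"}))  # 1.0
print(hamming_score({"A", "C"}, {"A"}))     # 0.5
</code></pre>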
</section>
<!-- Ethics & License -->
<section id="ethics" class="card">
<h2>Ethics, Intended Use & License</h2>
<p class="warning">This dataset and the released models are for <strong>research use only</strong>. They are <strong>not</strong> a substitute for professional medical advice, diagnosis, or treatment.</p>
<ul>
<li>No PHI included; sources compiled from public datasets and teaching material.</li>
<li>Evaluation includes human expert checks for a small sample; outputs may still contain errors.</li>
<li>Please review the <a href="./LICENSE" target="_blank">LICENSE</a> before use. If unsure, contact the maintainers.</li>
</ul>
</section>
<!-- Citation -->
<section id="citation" class="card">
<h2>Citation</h2>
<p>If you use MedInjection-FR or the models, please cite:</p>
<pre><code class="code">
</code></pre>
</section>
<!-- Contact -->
<section id="contact" class="card">
<h2>Contact</h2>
<p>Questions, feedback, or requests: open an issue on the repo or email <a href="mailto:you@example.com">you@example.com</a>.</p>
</section>
<footer class="site-footer">
<p>© 2025 MedInjection-FR. Built for research and reproducibility.</p>
</footer>
</main>
</body>
</html>