<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>MedInjection-FR • French Biomedical Instruction Dataset</title>
  <meta name="description" content="MedInjection-FR: a French biomedical instruction dataset with native, synthetic, and translated components, plus fine-tuned models." />
  <link rel="stylesheet" href="style.css" />
</head>
<body>
<header class="site-header">
  <div class="wrap">
    <h1>MedInjection-FR</h1>
    <p class="subtitle">A French biomedical instruction dataset and model suite</p>
    <p class="meta">Native • Synthetic • Translated • 571,436 instruction–response pairs</p>
    <div class="cta-row">
      <a class="btn primary" href="#download">Download</a>
      <a class="btn" href="#models">Models</a>
      <a class="btn" href="#citation">Cite</a>
      <a class="btn" href="#contact">Contact</a>
    </div>
  </div>
</header>
<main class="wrap">
  <!-- Overview -->
  <section id="overview" class="card">
    <h2>Overview</h2>
    <p>
      MedInjection-FR is a large-scale French biomedical instruction dataset designed to study
      how the <strong>provenance of supervision</strong> (native, synthetic, translated) affects the instruction-tuning of LLMs.
      The corpus supports multiple-choice QA (single- and multi-answer) and open-ended QA,
      and is released together with a family of fine-tuned baseline models.
    </p>
    <ul class="pill-list">
      <li>Native: <strong>77,247</strong></li>
      <li>Synthetic: <strong>76,506</strong></li>
      <li>Translated: <strong>417,674</strong></li>
      <li>Total: <strong>571,436</strong></li>
    </ul>
  </section>
  <!-- What's inside -->
  <section id="composition" class="card">
    <h2>Composition &amp; Tasks</h2>
    <div class="grid-2">
      <div>
        <h3>Task types</h3>
        <ul>
          <li>MCQU (multiple choice, single answer)</li>
          <li>MCQ (multiple choice, multiple answers)</li>
          <li>OEQ (open-ended QA)</li>
        </ul>
        <p class="muted">Counts (all components): OEQ <strong>57,509</strong>, MCQ <strong>59,592</strong>, MCQU <strong>454,335</strong>.</p>
        <h3>Splits</h3>
        <table class="clean">
          <thead>
            <tr><th>Component</th><th>Train</th><th>Validation</th><th>Test</th><th>Total</th></tr>
          </thead>
          <tbody>
            <tr><td>Native</td><td>57,563</td><td>5,055</td><td>14,629</td><td>77,247</td></tr>
            <tr><td>Synthetic</td><td>76,506</td><td>—</td><td>—</td><td>76,506</td></tr>
            <tr><td>Translated</td><td>366,370</td><td>38,011</td><td>13,293</td><td>417,674</td></tr>
            <tr class="total"><td>Total</td><td>500,439</td><td>43,066</td><td>27,931</td><td>571,436</td></tr>
          </tbody>
        </table>
      </div>
      <div>
        <h3>Translated quality (WMT'24 biomedical parallel)</h3>
        <table class="clean">
          <thead><tr><th>Model</th><th>BLEU</th><th>COMET</th></tr></thead>
          <tbody>
            <tr><td>GPT-4o-mini</td><td>51.01</td><td>0.8751</td></tr>
            <tr><td>Gemini 2.0 Flash</td><td>53.72</td><td>0.8783</td></tr>
            <tr class="muted"><td>WMT'24 best (ref.)</td><td>53.54</td><td>0.8760</td></tr>
          </tbody>
        </table>
        <p class="muted">Higher is better. Both translation models approach or exceed the WMT'24 best system, indicating strong translation fidelity for the translated subset.</p>
      </div>
    </div>
  </section>
  <!-- Downloads -->
  <section id="download" class="card">
    <h2>Download</h2>
    <p>Each component is published separately. Use the links below or load it via the 🤗 Datasets library.</p>
    <div class="grid-3 tight">
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Native" target="_blank" rel="noopener">
        <h3>Native</h3>
        <p>French medical exams, resources, and curated QA.</p>
        <code>MedInjection-FR/Native</code>
      </a>
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Synthetic" target="_blank" rel="noopener">
        <h3>Synthetic</h3>
        <p>GPT-4o generations from clinical cases and abstracts.</p>
        <code>MedInjection-FR/Synthetic</code>
      </a>
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Translated" target="_blank" rel="noopener">
        <h3>Translated</h3>
        <p>French translations of established English biomedical sets.</p>
        <code>MedInjection-FR/Translated</code>
      </a>
    </div>
    <h3>Python (🤗 Datasets)</h3>
    <pre><code class="code">
from datasets import load_dataset

# Load one component; use "Synthetic" or "Translated" for the others.
ds = load_dataset("MedInjection-FR/Native")
print(ds)  # shows the available splits and their sizes
</code></pre>
  </section>
  <!-- Models -->
  <section id="models" class="card">
    <h2>Fine-tuned Models</h2>
    <p>We release seven instruction-tuned baselines (Qwen-4B-Instruct backbone with DoRA adapters), each trained on 30k samples per configuration:</p>
    <ul class="pill-list">
      <li>QWEN-4B-NAT</li>
      <li>QWEN-4B-TRAD</li>
      <li>QWEN-4B-SYN</li>
      <li>QWEN-4B-NAT-TRAD</li>
      <li>QWEN-4B-NAT-SYN</li>
      <li>QWEN-4B-TRAD-SYN</li>
      <li>QWEN-4B-ALL</li>
    </ul>
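    <p class="muted">The naming scheme enumerates every non-empty combination of the three sources, with the full three-way mix shortened to <code>ALL</code>; a minimal sketch of how the seven names are derived:</p>
    <pre><code class="code">
from itertools import combinations

sources = ["NAT", "TRAD", "SYN"]

# One model per non-empty subset of sources; the 3-source mix is named "ALL".
names = [
    "QWEN-4B-" + ("ALL" if len(combo) == 3 else "-".join(combo))
    for r in (1, 2, 3)
    for combo in combinations(sources, r)
]
print(names)
</code></pre>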
    <div class="grid-3 tight">
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT" target="_blank" rel="noopener"><h3>NAT</h3><p>Best single-source model (MCQ/MCQU).</p></a>
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT-TRAD" target="_blank" rel="noopener"><h3>NAT-TRAD</h3><p>Top mixed configuration.</p></a>
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-ALL" target="_blank" rel="noopener"><h3>ALL</h3><p>All sources combined.</p></a>
    </div>
    <h3>Quick inference (🤗 Transformers)</h3>
    <pre><code class="code">
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MedInjection-FR/QWEN-4B-NAT-TRAD"

# Load the tokenizer and the model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prepare the model input: a single-answer MCQU question in French.
prompt = """Un professionnel de santé de 54 ans consulte un spécialiste des maladies infectieuses pour un suivi concernant un diagnostic récent d'hépatite C chronique.
Il s'est initialement présenté avec des symptômes tels que fatigue, malaise et enzymes hépatiques élevées et soupçonne d'avoir contracté l'infection à la suite
d'une piqûre d'aiguille il y a des années. Malgré le début du traitement, son titre viral reste élevé, ce qui incite le médecin à ajouter un nouveau médicament
qui inhibe la maturation virale en bloquant la synthèse des protéines. Quel est l'effet indésirable le plus probable de ce médicament ?
Choix de réponses :
(A) Uropathie cristalline obstructive
(B) Suppression de la moelle osseuse
(C) Insomnie et irritabilité
(D) Céphalées et photosensibilité
(E) Rêves lucides
(F) Hyperbilirubinémie
(G) Pancréatite
(H) Neuropathie périphérique
(I) Augmentation de la créatine kinase
(J) Alopécie"""

messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the answer; a few tokens suffice for an option such as "(B)".
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
</code></pre>
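    <p class="muted">When scoring many questions, the decoded text still has to be mapped back to an option letter. A small hypothetical helper (assuming the model echoes the option in the form <code>(B)</code>, as rendered in the prompt above):</p>
    <pre><code class="code">
import re

def extract_option(text):
    """Return the first option letter found, e.g. "B" from "(B) ...", else None."""
    m = re.search(r"\(([A-J])\)", text)
    return m.group(1) if m else None

print(extract_option("(B) Suppression de la moelle osseuse"))  # B
</code></pre>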
  </section>
  <!-- Evaluation note -->
  <section class="card">
    <h2>Evaluation at a glance</h2>
    <ul>
      <li>MCQ and MCQU are reported with Exact Match; multi-answer MCQ is additionally scored with the Hamming score.</li>
      <li>OEQ uses BLEU, ROUGE, METEOR, and BERTScore, plus an LLM-as-a-judge calibrated on <em>human annotations</em> (100 samples).</li>
      <li>Mixed training (especially <strong>NAT-TRAD</strong>) provides complementary gains over single-source setups.</li>
    </ul>
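    <p class="muted">For the multi-answer setting, one common definition of the Hamming score is intersection-over-union of the predicted and gold option sets; a minimal sketch under that assumption (the paper's exact formulation may differ):</p>
    <pre><code class="code">
def exact_match(gold, pred):
    """1.0 if the predicted option set equals the gold set, else 0.0."""
    return 1.0 if set(gold) == set(pred) else 0.0

def hamming_score(gold, pred):
    """Intersection-over-union of the gold and predicted option sets."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0
    return len(gold.intersection(pred)) / len(gold.union(pred))

print(exact_match({"A", "C"}, {"A"}))    # 0.0
print(hamming_score({"A", "C"}, {"A"}))  # 0.5
</code></pre>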
  </section>
  <!-- Ethics & License -->
  <section id="ethics" class="card">
    <h2>Ethics, Intended Use &amp; License</h2>
    <p class="warning">This dataset and the released models are for <strong>research use only</strong>. They are <strong>not</strong> a substitute for professional medical advice, diagnosis, or treatment.</p>
    <ul>
      <li>No protected health information (PHI) is included; sources were compiled from public datasets and teaching material.</li>
      <li>Evaluation includes human expert checks on a small sample; outputs may still contain errors.</li>
      <li>Please review the <a href="./LICENSE" target="_blank" rel="noopener">LICENSE</a> before use. If unsure, contact the maintainers.</li>
    </ul>
  </section>
  <!-- Citation -->
  <section id="citation" class="card">
    <h2>Citation</h2>
    <p>If you use MedInjection-FR or the models, please cite:</p>
    <pre><code class="code">
</code></pre>
  </section>
  <!-- Contact -->
  <section id="contact" class="card">
    <h2>Contact</h2>
    <p>Questions, feedback, or requests: open an issue on the repo or email <a href="mailto:you@example.com">you@example.com</a>.</p>
  </section>
  <footer class="site-footer">
    <p>© 2025 MedInjection-FR. Built for research and reproducibility.</p>
  </footer>
</main>
</body>
</html>