<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>MedInjection-FR • French Biomedical Instruction Dataset</title>
  <meta name="description" content="MedInjection-FR: a French biomedical instruction dataset with native, synthetic, and translated components, plus fine-tuned models." />
  <link rel="stylesheet" href="style.css" />
</head>
<body>
<header class="site-header">
  <div class="wrap">
    <h1>MedInjection-FR</h1>
    <p class="subtitle">A French biomedical instruction dataset and model suite</p>
    <p class="meta">Native • Synthetic • Translated • 571,436 instruction–response pairs</p>
    <div class="cta-row">
      <a class="btn primary" href="#download">Download</a>
      <a class="btn" href="#models">Models</a>
      <a class="btn" href="#citation">Cite</a>
      <a class="btn" href="#contact">Contact</a>
    </div>
  </div>
</header>
<main class="wrap">
  <!-- Overview -->
  <section id="overview" class="card">
    <h2>Overview</h2>
    <p>
      MedInjection-FR is a large-scale French biomedical instruction dataset designed to study
      how the <strong>provenance of supervision</strong> (native, synthetic, translated) affects the instruction-tuning of LLMs.
      The corpus supports multiple-choice QA (single- and multi-answer) and open-ended QA,
      and is released together with a family of fine-tuned baseline models.
    </p>
    <ul class="pill-list">
      <li>Native: <strong>77,247</strong></li>
      <li>Synthetic: <strong>76,506</strong></li>
      <li>Translated: <strong>417,674</strong></li>
      <li>Total: <strong>571,436</strong></li>
    </ul>
  </section>
  <!-- What's inside -->
  <section id="composition" class="card">
    <h2>Composition &amp; Tasks</h2>
    <div class="grid-2">
      <div>
        <h3>Task types</h3>
        <ul>
          <li>MCQU (multiple choice, single answer)</li>
          <li>MCQ (multiple choice, multiple answers)</li>
          <li>OEQ (open-ended QA)</li>
        </ul>
        <p class="muted">Counts (all components): OEQ <strong>57,509</strong>, MCQ <strong>59,592</strong>, MCQU <strong>454,335</strong>.</p>
        <h3>Splits</h3>
        <table class="clean">
          <thead>
            <tr><th>Component</th><th>Train</th><th>Validation</th><th>Test</th><th>Total</th></tr>
          </thead>
          <tbody>
            <tr><td>Native</td><td>57,563</td><td>5,055</td><td>14,629</td><td>77,247</td></tr>
            <tr><td>Synthetic</td><td>76,506</td><td>—</td><td>—</td><td>76,506</td></tr>
            <tr><td>Translated</td><td>366,370</td><td>38,011</td><td>13,293</td><td>417,674</td></tr>
            <tr class="total"><td>Total</td><td>500,439</td><td>43,066</td><td>27,931</td><td>571,436</td></tr>
          </tbody>
        </table>
      </div>
      <div>
        <h3>Translated quality (WMT'24 biomedical parallel)</h3>
        <table class="clean">
          <thead><tr><th>Model</th><th>BLEU</th><th>COMET</th></tr></thead>
          <tbody>
            <tr><td>GPT-4o-mini</td><td>51.01</td><td>0.8751</td></tr>
            <tr><td>Gemini 2.0 Flash</td><td>53.72</td><td>0.8783</td></tr>
            <tr class="muted"><td>WMT'24 best (ref.)</td><td>53.54</td><td>0.8760</td></tr>
          </tbody>
        </table>
        <p class="muted">Higher is better. Both translation models approach or exceed the WMT'24 best system, indicating strong translation fidelity for the translated subset.</p>
      </div>
    </div>
  </section>
  <!-- Downloads -->
  <section id="download" class="card">
    <h2>Download</h2>
    <p>Each component is published separately. Use the links below or load it via the 🤗 Datasets library.</p>
    <div class="grid-3 tight">
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Native" target="_blank" rel="noopener">
        <h3>Native</h3>
        <p>French medical exams, resources, and curated QA.</p>
        <code>MedInjection-FR/Native</code>
      </a>
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Synthetic" target="_blank" rel="noopener">
        <h3>Synthetic</h3>
        <p>GPT-4o generations from clinical cases and abstracts.</p>
        <code>MedInjection-FR/Synthetic</code>
      </a>
      <a class="tile" href="https://huggingface.co/datasets/MedInjection-FR/Translated" target="_blank" rel="noopener">
        <h3>Translated</h3>
        <p>French translations of established English biomedical sets.</p>
        <code>MedInjection-FR/Translated</code>
      </a>
    </div>
    <h3>Python (🤗 Datasets)</h3>
    <pre><code class="code">
from datasets import load_dataset

# Load one component; use "Synthetic" or "Translated" for the others.
ds = load_dataset("MedInjection-FR/Native")
print(ds)  # shows the available splits and their sizes
</code></pre>
  </section>
  <!-- Models -->
  <section id="models" class="card">
    <h2>Fine-tuned Models</h2>
    <p>We release seven instruction-tuned baselines (Qwen-4B-Instruct backbone with DoRA adapters), each trained on 30k samples per configuration:</p>
    <ul class="pill-list">
      <li>QWEN-4B-NAT</li>
      <li>QWEN-4B-TRAD</li>
      <li>QWEN-4B-SYN</li>
      <li>QWEN-4B-NAT-TRAD</li>
      <li>QWEN-4B-NAT-SYN</li>
      <li>QWEN-4B-TRAD-SYN</li>
      <li>QWEN-4B-ALL</li>
    </ul>
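    <p class="muted">The naming scheme enumerates every non-empty combination of the three sources, with the full three-way mix shortened to <code>ALL</code>; a minimal sketch of how the seven names are derived:</p>
    <pre><code class="code">
from itertools import combinations

sources = ["NAT", "TRAD", "SYN"]

# One model per non-empty subset of sources; the 3-source mix is named "ALL".
names = [
    "QWEN-4B-" + ("ALL" if len(combo) == 3 else "-".join(combo))
    for r in (1, 2, 3)
    for combo in combinations(sources, r)
]
print(names)
</code></pre>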
    <div class="grid-3 tight">
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT" target="_blank" rel="noopener"><h3>NAT</h3><p>Best single-source model (MCQ/MCQU).</p></a>
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-NAT-TRAD" target="_blank" rel="noopener"><h3>NAT-TRAD</h3><p>Top mixed configuration.</p></a>
      <a class="tile" href="https://huggingface.co/MedInjection-FR/QWEN-4B-ALL" target="_blank" rel="noopener"><h3>ALL</h3><p>All sources combined.</p></a>
    </div>
    <h3>Quick inference (🤗 Transformers)</h3>
    <pre><code class="code">
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MedInjection-FR/QWEN-4B-NAT-TRAD"

# Load the tokenizer and the model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prepare the model input: a single-answer MCQU question in French.
prompt = """Un professionnel de santé de 54 ans consulte un spécialiste des maladies infectieuses pour un suivi concernant un diagnostic récent d'hépatite C chronique.
Il s'est initialement présenté avec des symptômes tels que fatigue, malaise et enzymes hépatiques élevées et soupçonne d'avoir contracté l'infection à la suite
d'une piqûre d'aiguille il y a des années. Malgré le début du traitement, son titre viral reste élevé, ce qui incite le médecin à ajouter un nouveau médicament
qui inhibe la maturation virale en bloquant la synthèse des protéines. Quel est l'effet indésirable le plus probable de ce médicament ?
Choix de réponses :
(A) Uropathie cristalline obstructive
(B) Suppression de la moelle osseuse
(C) Insomnie et irritabilité
(D) Céphalées et photosensibilité
(E) Rêves lucides
(F) Hyperbilirubinémie
(G) Pancréatite
(H) Neuropathie périphérique
(I) Augmentation de la créatine kinase
(J) Alopécie"""

messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the answer; a few tokens suffice for an option such as "(B)".
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
</code></pre>
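    <p class="muted">When scoring many questions, the decoded text still has to be mapped back to an option letter. A small hypothetical helper (assuming the model echoes the option in the form <code>(B)</code>, as rendered in the prompt above):</p>
    <pre><code class="code">
import re

def extract_option(text):
    """Return the first option letter found, e.g. "B" from "(B) ...", else None."""
    m = re.search(r"\(([A-J])\)", text)
    return m.group(1) if m else None

print(extract_option("(B) Suppression de la moelle osseuse"))  # B
</code></pre>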
  </section>
  <!-- Evaluation note -->
  <section class="card">
    <h2>Evaluation at a glance</h2>
    <ul>
      <li>MCQ and MCQU are reported with Exact Match; multi-answer MCQ is additionally scored with the Hamming score.</li>
      <li>OEQ uses BLEU, ROUGE, METEOR, and BERTScore, plus an LLM-as-a-judge calibrated on <em>human annotations</em> (100 samples).</li>
      <li>Mixed training (especially <strong>NAT-TRAD</strong>) provides complementary gains over single-source setups.</li>
    </ul>
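    <p class="muted">For the multi-answer setting, one common definition of the Hamming score is intersection-over-union of the predicted and gold option sets; a minimal sketch under that assumption (the paper's exact formulation may differ):</p>
    <pre><code class="code">
def exact_match(gold, pred):
    """1.0 if the predicted option set equals the gold set, else 0.0."""
    return 1.0 if set(gold) == set(pred) else 0.0

def hamming_score(gold, pred):
    """Intersection-over-union of the gold and predicted option sets."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0
    return len(gold.intersection(pred)) / len(gold.union(pred))

print(exact_match({"A", "C"}, {"A"}))    # 0.0
print(hamming_score({"A", "C"}, {"A"}))  # 0.5
</code></pre>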
  </section>
  <!-- Ethics & License -->
  <section id="ethics" class="card">
    <h2>Ethics, Intended Use &amp; License</h2>
    <p class="warning">This dataset and the released models are for <strong>research use only</strong>. They are <strong>not</strong> a substitute for professional medical advice, diagnosis, or treatment.</p>
    <ul>
      <li>No protected health information (PHI) is included; sources were compiled from public datasets and teaching material.</li>
      <li>Evaluation includes human expert checks on a small sample; outputs may still contain errors.</li>
      <li>Please review the <a href="./LICENSE" target="_blank" rel="noopener">LICENSE</a> before use. If unsure, contact the maintainers.</li>
    </ul>
  </section>
  <!-- Citation -->
  <section id="citation" class="card">
    <h2>Citation</h2>
    <p>If you use MedInjection-FR or the models, please cite:</p>
    <pre><code class="code">
</code></pre>
  </section>
  <!-- Contact -->
  <section id="contact" class="card">
    <h2>Contact</h2>
    <p>Questions, feedback, or requests: open an issue on the repo or email <a href="mailto:you@example.com">you@example.com</a>.</p>
  </section>
  <footer class="site-footer">
    <p>© 2025 MedInjection-FR. Built for research and reproducibility.</p>
  </footer>
</main>
</body>
</html>