	---
	title: SkimLit NLP
	emoji: 📄
	colorFrom: red
	colorTo: blue
	sdk: docker
	app_port: 7860
	tags:
	- streamlit
	- nlp
	- medical
	- text-classification
	pinned: false
	short_description: Sentence classification in medical abstracts
	---

	# 📄 SkimLit: Sequential sentence classification in medical abstracts

	<p align="center">
	<a href="https://huggingface.co/spaces/BILALfym/Testing">
	<img src="https://img.shields.io/badge/🤗%20HuggingFace-Space-yellow" alt="HuggingFace Space"/>
	</a>
	<a href="https://huggingface.co/BILALfym/skimlit-model">
	<img src="https://img.shields.io/badge/🤗%20Modèle-skimlit--model-orange" alt="HuggingFace Model"/>
	</a>
	<a href="https://colab.research.google.com/github/bilalfym/SkimLit_NLP/blob/main/src/SkimLit_NLP.ipynb">
	<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
	</a>
	<img src="https://img.shields.io/badge/Python-3.10+-blue" alt="Python"/>
	<img src="https://img.shields.io/badge/TensorFlow-2.13+-FF6F00" alt="TensorFlow"/>
	</p>

	Replication of "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts" (Dernoncourt, Lee & Szolovits, 2017).

	> Given a medical abstract, automatically classify each sentence into one of five rhetorical categories:
	> Background · Objective · Methods · Results · Conclusions

	---

	## Demo

	The application is deployed on Hugging Face Spaces: paste a medical abstract and get a structured version back in a few seconds.

	🔗 [Try the demo](https://huggingface.co/spaces/BILALfym/Testing)
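
As a rough illustration of what a "structured version" could look like, the sketch below groups sentences by their predicted category. It is not taken from `streamlit_app.py`: the function name `structure_abstract`, the upper-case label spelling, and the sample sentences are all assumptions made for this example.

```python
# Category names as listed in this README; their upper-case spelling is an assumption.
LABELS = ["BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]


def structure_abstract(sentences, labels):
    """Group sentences under their predicted label, in the order of LABELS."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    grouped = {label: [] for label in LABELS}
    for sentence, label in zip(sentences, labels):
        grouped[label].append(sentence)  # raises KeyError on an unknown label
    # Keep only the sections that received at least one sentence.
    return {label: " ".join(parts) for label, parts in grouped.items() if parts}


# Hypothetical sentences and predictions, for illustration only.
sentences = [
    "We compared drug A with placebo.",
    "Patients were randomized into two groups.",
    "Pain scores were lower with drug A.",
    "Follow-up lasted 12 weeks.",
]
predicted = ["OBJECTIVE", "METHODS", "RESULTS", "METHODS"]

structured = structure_abstract(sentences, predicted)
for label, text in structured.items():
    print(f"{label.title()}: {text}")
```

Grouping by category rather than keeping the original sentence order is one possible design; a display that keeps the order and only prefixes each sentence with its label would work just as well.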

	---

	## Dataset

	[PubMed 20k RCT](https://github.com/Franck-Dernoncourt/pubmed-rct): abstracts of randomized controlled trials (RCTs) from PubMed, with each sentence annotated with its rhetorical role.

	| Split | Sentences |
	|-------|-----------|
	| Train | 180,040 |
	| Validation | 30,212 |
	| Test | 30,135 |
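
A minimal sketch of how the dataset files could be read, assuming the layout used in the `pubmed-rct` repository: each abstract starts with a `###<id>` line, followed by one `LABEL<TAB>sentence` line per sentence and a blank line at the end. The function name `parse_rct_lines` and the output keys are hypothetical, and whether `total_lines` stores the sentence count (as here) or the last index is a convention to pick.

```python
def parse_rct_lines(lines):
    """Turn raw PubMed RCT lines into one dict per sentence."""
    samples = []
    current = []  # (label, text) pairs of the abstract being read

    def flush():
        # Emit the buffered abstract, attaching positional information.
        for i, (label, text) in enumerate(current):
            samples.append({
                "target": label,
                "text": text,
                "line_number": i,              # 0-based position in the abstract
                "total_lines": len(current),   # number of sentences in the abstract
            })
        current.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("###") or not line.strip():
            flush()  # an ID line or a blank line closes the previous abstract
        else:
            label, text = line.split("\t", 1)
            current.append((label, text))
    flush()  # handle a file that does not end with a blank line
    return samples


# Tiny hand-written sample in the assumed format (not real dataset content).
demo = [
    "###1\n",
    "BACKGROUND\tPain is common after surgery.\n",
    "METHODS\tPatients were randomized.\n",
    "\n",
    "###2\n",
    "RESULTS\tScores improved.\n",
]
rows = parse_rct_lines(demo)
print(rows[1])
```

To read a real file, the same function can be called on an open file handle, for example `parse_rct_lines(open(path, encoding="utf-8"))`, where `path` points to a downloaded split.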

	---

	## Architecture progression

	The project explores six architectures of increasing complexity:

	| # | Model | Accuracy | F1 (weighted) |
	|---|-------|:--------:|:-------------:|
	| 0 | Naive Bayes + TF-IDF (baseline) | 72.18% | 0.699 |
	| 1 | Conv1D with token embeddings | ~85% | ~0.843 |
	| 2 | Transfer learning with USE (512 dims) | ~87% | ~0.871 |
	| 3 | Conv1D with character embeddings | ~73% | ~0.720 |
	| 4 | Hybrid Token + Char (BiLSTM) | ~88% | ~0.876 |
	| 5 | Tribrid: Token + Char + Positional | 85.86% | 0.855 |

	> Results for models 1–4 are taken from the original publication.
	> Model 5: measured result after 5 epochs (not converged; ~0.88–0.90 expected at convergence).
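
For reference, the two metrics in the table can be computed as in the sketch below. It is a plain-Python illustration, not the notebook's code: weighted F1 is the per-class F1 averaged with weights proportional to each class's number of true samples, which is the definition scikit-learn uses for `average="weighted"`. The function names and toy labels are assumptions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy labels, for illustration only.
y_true = ["METHODS", "METHODS", "RESULTS", "RESULTS", "BACKGROUND"]
y_pred = ["METHODS", "RESULTS", "RESULTS", "RESULTS", "BACKGROUND"]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

Because the weights follow class frequency, weighted F1 tends to stay close to accuracy when frequent classes such as Methods and Results are predicted well.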

	---

	## Final model architecture (Model 5, tribrid)

	The model combines four complementary input streams:

	<p align="center">
	<img src="Images/model_5.png" width="600" alt="Model 5 tribrid architecture"/>
	</p>

	| Stream | Input | Output |
	|--------|-------|--------|
	| Semantic | `all-MiniLM-L6-v2` embeddings (384 dims) | Dense → 128 |
	| Morphological | Text split character by character | Embedding(25) → BiLSTM(24) → Dense → 128 |
	| Positional | One-hot(line number, depth=15) | Dense → 32 |
	| Structural | One-hot(total number of lines, depth=20) | Dense → 32 |

	The semantic and morphological streams are merged (Dense 256 + Dropout 0.5), then concatenated with the positional and structural features before the final classification layer.
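
The non-embedding inputs from the table above can be prepared as in this minimal sketch. It is written in plain Python rather than TensorFlow, and the helper names (`one_hot`, `split_chars`, `build_inputs`) are hypothetical. One assumption deserves attention: positions at or beyond the one-hot depth are encoded as an all-zero vector here, and the notebook may handle long abstracts differently.

```python
LINE_NUMBER_DEPTH = 15  # depth of the positional one-hot (from the table above)
TOTAL_LINES_DEPTH = 20  # depth of the structural one-hot (from the table above)


def one_hot(index, depth):
    """One-hot encode `index`; indices outside [0, depth) give all zeros (assumption)."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def split_chars(text):
    """Put a space between characters so a tokenizer sees one token per character."""
    return " ".join(text)


def build_inputs(text, line_number, total_lines):
    """Assemble the character, positional and structural inputs for one sentence."""
    return {
        "chars": split_chars(text),
        "line_number": one_hot(line_number, LINE_NUMBER_DEPTH),
        "total_lines": one_hot(total_lines, TOTAL_LINES_DEPTH),
    }


features = build_inputs("Pain was reduced.", line_number=3, total_lines=11)
print(features["chars"])

# Width of the vector reaching the classifier, following the description above:
# merged semantic + morphological Dense(256), plus the two Dense(32) outputs.
classifier_input_width = 256 + 32 + 32
print(classifier_input_width)
```

The semantic stream is left out because it relies on precomputed `all-MiniLM-L6-v2` sentence embeddings, which need the model weights to produce.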

	### Intermediate hybrid model (Model 4)

	<p align="center">
	<img src="Images/téléchargement.png" width="500" alt="Model 4 hybrid architecture"/>
	</p>

	---

	## Tech stack

	```
	Notebook (Google Colab GPU)
	└── SentenceTransformer (all-MiniLM-L6-v2) # semantic encoding
	└── TensorFlow / Keras # model building & training
	└── scikit-learn # baseline + label encoding

	Application
	└── Streamlit # web interface
	└── Docker # containerization (port 7860)
	└── Hugging Face Spaces # hosting & CI
	└── Hugging Face Hub # model storage
	```

	---

	## Run locally

	```bash
	# Clone the repository
	git clone https://github.com/bilalfym/SkimLit_NLP.git
	cd SkimLit_NLP

	# Install the dependencies
	pip install -r requirements.txt

	# Start the Streamlit application
	streamlit run src/streamlit_app.py
	```

	Or with Docker:

	```bash
	docker build -t skimlit .
	docker run -p 7860:7860 skimlit
	```

	---

	## Project structure

	```
	SkimLit_NLP/
	├── src/
	│   ├── SkimLit_NLP.ipynb # full notebook (training + evaluation)
	│   └── streamlit_app.py # demo application
	├── Images/
	│   ├── model_5.png # tribrid architecture
	│   └── téléchargement.png # hybrid architecture (Model 4)
	├── .github/workflows/
	│   └── keep_alive.yml # automatic ping of the Hugging Face Space
	├── Dockerfile
	└── requirements.txt
	```

	---

	## References

	- Dernoncourt, F., & Lee, J. Y. (2017). PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. arXiv:1710.06071
	- Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. EACL 2017.
	- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084