---
language:
- en
license: mit
library_name: scikit-learn
tags:
- text-classification
- embeddings
- symptom-routing
- public-health
- information-retrieval
- logistic-regression
- linear-svm
- classifier-comparison
datasets:
- charlie0831/english-symptom-routing
metrics:
- accuracy
- f1
---
# Symptom Routing Embedding Classifier Comparison
## Model Description
This repository contains embedding-based classifiers for routing short English symptom descriptions into broad public health information categories. The project was created for an Information Retrieval assignment to test whether frozen text embeddings can support query routing before retrieval.
Four classifiers were trained and evaluated on the same train/test split:
- Logistic Regression
- Linear SVM
- KNN with cosine distance
- Random Forest

The deployed classifier is Logistic Regression, which tied with Linear SVM for the top score: **0.957 accuracy** and **0.952 macro F1** on the held-out test set.
The model predicts one of six broad routing categories:
- `respiratory`
- `gastrointestinal`
- `skin`
- `neurological`
- `musculoskeletal`
- `mental_health_sleep`

The labels are broad public health information categories, not diagnoses or clinical conditions.

**Important safety notice:** This model is for teaching information retrieval and text classification. It does not provide medical diagnosis, treatment advice, or emergency guidance.
## Intended Use
This model is intended for:
- teaching embedding-based text classification
- routing short symptom descriptions toward broad public health information categories
- demonstrating information retrieval query routing
- comparing lightweight classifiers trained on frozen embeddings
- supporting a Gradio demo for symptom information routing

Example use case:

```text
Input: I have a dry cough and sore throat.
Output: respiratory
```

The predicted category can be used as a retrieval signal for selecting broad public health information resources.
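
One way to use the predicted category as a retrieval signal is a plain lookup from label to a retrieval focus. The sketch below is illustrative only: the `RETRIEVAL_FOCUS` mapping, its wording, and the `route` helper are hypothetical and are not taken from the project code.

```python
# Hypothetical mapping from routing label to a retrieval focus string.
# The six keys are the labels listed in this model card; the values are
# invented for illustration.
RETRIEVAL_FOCUS = {
    "respiratory": "public health information on cough, sore throat, and breathing",
    "gastrointestinal": "public health information on stomach pain, nausea, and diarrhea",
    "skin": "public health information on rash, itching, and swelling",
    "neurological": "public health information on headache, dizziness, and numbness",
    "musculoskeletal": "public health information on back, joint, and muscle pain",
    "mental_health_sleep": "public health information on anxiety, low mood, and sleep",
}


def route(predicted_label: str) -> str:
    """Return the retrieval focus for a label, or a generic fallback."""
    return RETRIEVAL_FOCUS.get(predicted_label, "general public health information")


print(route("respiratory"))
```

Keeping the mapping separate from the classifier means the routing targets can change without retraining.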
## Out-of-Scope Use
This model must not be used for:
- medical diagnosis
- treatment recommendation
- medication advice
- emergency triage
- clinical decision-making
- replacing professional medical review
- making decisions about real patients

If someone has severe, worsening, or urgent symptoms, they should contact a qualified medical professional or emergency service.
## Research Question
Can text embeddings classify short symptom descriptions into broad health-information categories for routing users toward relevant public health resources?
## Training Data
The classifiers were trained on a custom English symptom routing dataset created for this assignment.

- Dataset file: `symptom_routing_expanded_dataset.csv`
- Dataset size: 90 examples
- Data split: stratified by label, test size 0.25, random seed 712
  - Training rows: 67
  - Test rows: 23

Each row contains:

| Column | Description |
|---|---|
| `text` | A short English symptom description |
| `label` | A broad public health information category |

The dataset contains six balanced categories:

| Label | Description |
|---|---|
| `respiratory` | cough, sore throat, breathing problems |
| `gastrointestinal` | stomach pain, nausea, diarrhea |
| `skin` | rash, itching, swelling |
| `neurological` | headache, dizziness, numbness |
| `musculoskeletal` | back pain, joint pain, muscle pain |
| `mental_health_sleep` | anxiety, insomnia, low mood, sleep problems |

Hugging Face Dataset: https://huggingface.co/datasets/charlie0831/english-symptom-routing
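
The split settings in this section (90 rows, test size 0.25, seed 712, stratified by label) can be reproduced in outline with scikit-learn's `train_test_split`. This is a minimal sketch, assuming 15 rows per label; the placeholder texts stand in for the real CSV, and the original training script may differ in detail.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

LABELS = [
    "respiratory", "gastrointestinal", "skin",
    "neurological", "musculoskeletal", "mental_health_sleep",
]

# Placeholder rows standing in for the 90-row CSV (assumed 15 per label).
texts = [f"example {i} for {label}" for label in LABELS for i in range(15)]
labels = [label for label in LABELS for _ in range(15)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,     # 25% of 90 rows, rounded up to 23
    random_state=712,   # seed reported in this model card
    stratify=labels,    # keep label proportions similar in both parts
)

print(len(X_train), len(X_test))  # 67 23
print(Counter(y_test))            # per-label counts in the test part
```

With 23 test rows spread over six labels, each label contributes only three or four test examples, so a single error moves the reported metrics noticeably.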
## Method
The project uses a frozen embedding model to convert symptom descriptions into vector representations. Downstream classifiers were then trained on these embedding vectors.
- Pipeline: `symptom text -> embedding model -> embedding vector -> classifier -> predicted category`
- Embedding model: `nicher92/saga-embed_v1`
- Embedding usage: frozen text encoder; only the downstream classifiers were trained.

Compared classifiers:

| Classifier | Description |
|---|---|
| Logistic Regression | Regularized linear classifier |
| Linear SVM | Linear support vector classifier |
| KNN cosine | k-nearest neighbors using cosine distance |
| Random Forest | Ensemble tree-based classifier |
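
The comparison can be sketched with scikit-learn as follows. This is not the project's training script: the random clustered vectors stand in for the frozen `nicher92/saga-embed_v1` embeddings, and every hyperparameter shown (`max_iter`, `n_neighbors`, `n_estimators`, the seeds) is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, per_class, dim = 6, 15, 32

# Stand-in for frozen embeddings: one noisy cluster per class.
centers = rng.normal(size=(n_classes, dim))
X = np.vstack([c + 0.3 * rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

# The four classifier families named in this model card (assumed settings).
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "KNN cosine": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

train_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X, y)  # the embeddings stay fixed; only the classifier is fitted
    train_accuracy[name] = clf.score(X, y)
    print(name, train_accuracy[name])
```

Because the encoder is frozen, the embeddings only need to be computed once and can be reused for all four classifiers.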
## Model Files
This repository contains:

| File | Description |
|---|---|
| `model_phase2.joblib` | Best deployed classifier, Logistic Regression |
| `all_classifiers_phase2.joblib` | All trained classifiers from the comparison experiment |
| `metrics.json` | Evaluation metrics for all classifiers |
| `predictions_phase2.csv` | Test set predictions from the selected model |
| `README.md` | Model card |
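
A saved classifier of this kind is typically loaded with `joblib.load` and applied to an embedding vector. The sketch below assumes `model_phase2.joblib` holds a bare fitted scikit-learn estimator, which this model card does not confirm. So that it runs without the real files, it saves and reloads a stand-in model trained on random vectors; the 16-dimensional vectors are placeholders, not the real embedding size.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = [
    "respiratory", "gastrointestinal", "skin",
    "neurological", "musculoskeletal", "mental_health_sleep",
]

rng = np.random.default_rng(0)

# Stand-in embeddings and classifier (assumption: not the released model).
X = rng.normal(size=(60, 16))
y = np.array(LABELS * 10)
stand_in = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_phase2.joblib")
    joblib.dump(stand_in, path)
    clf = joblib.load(path)  # same call would load the released file

# In the real pipeline this vector would come from the frozen encoder.
query_embedding = rng.normal(size=(1, 16))

predicted = clf.predict(query_embedding)[0]
probs = clf.predict_proba(query_embedding)[0]
scores = dict(zip(clf.classes_, probs))  # label -> probability

print(predicted)
print(scores)
```

The query must be embedded with the same encoder used at training time, since the classifier only sees vectors.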
## Evaluation
The classifiers were evaluated on the held-out test set.

| Classifier | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Logistic Regression | 0.957 | 0.952 | 0.957 |
| Linear SVM | 0.957 | 0.952 | 0.957 |
| KNN cosine | 0.783 | 0.763 | 0.764 |
| Random Forest | 0.913 | 0.903 | 0.909 |

Logistic Regression and Linear SVM achieved the same top score. Logistic Regression was selected for deployment because it supports probability scores through `predict_proba`, which makes the demo output more informative.

Macro F1 averages the per-category F1 scores with equal weight, so it shows whether the classifier performs consistently across all six categories rather than doing well on only some of them.

Detailed results are available in:
- `metrics.json`
- `predictions_phase2.csv`
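
With 23 test rows, an accuracy of 0.957 is consistent with 22 correct predictions (22/23 ≈ 0.957). The sketch below shows how accuracy, macro F1, and weighted F1 are computed with scikit-learn on a hypothetical error pattern: one `neurological` example predicted as `mental_health_sleep`, with assumed per-label test counts. This pattern happens to reproduce the reported top-row figures, but the actual predictions are in `predictions_phase2.csv` and may differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumed test-set counts per label (23 rows in total); which label has
# three rows instead of four is a guess made for illustration.
support = {
    "respiratory": 4,
    "gastrointestinal": 4,
    "skin": 4,
    "neurological": 4,
    "musculoskeletal": 4,
    "mental_health_sleep": 3,
}

y_true = [label for label, n in support.items() for _ in range(n)]
y_pred = list(y_true)

# Hypothetical single error: one neurological row routed to mental_health_sleep.
y_pred[y_true.index("neurological")] = "mental_health_sleep"

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(round(accuracy, 3), round(macro_f1, 3), round(weighted_f1, 3))
# 0.957 0.952 0.957
```

The two affected labels each score F1 = 6/7 while the other four score 1.0. Macro F1 gives every label the same weight, so it drops slightly further than accuracy does.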
## Example Predictions
| Example | Input | Predicted category |
|---|---|---|
| 1 | I have a fever, dry cough, and sore throat. | `respiratory` |
| 2 | My stomach hurts after eating and I feel nauseous. | `gastrointestinal` |
| 3 | I cannot sleep and I feel anxious most nights. | `mental_health_sleep` |
## Limitations
This model was trained on a small educational dataset with 90 manually created examples. It may not generalize well to real-world symptom descriptions, different writing styles, misspellings, slang, or complex multi-symptom cases.
Some symptom descriptions can reasonably belong to more than one category. For example, a sentence mentioning both dizziness and sleep problems may be difficult to classify because it contains signals for both `neurological` and `mental_health_sleep`.
The model should only be interpreted as a broad routing tool for information retrieval experiments. It should not be interpreted as a clinical or diagnostic system.
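
One way to surface inputs that sit between two categories is to compare the two highest probability scores. This is a sketch of that idea, not part of the released model or demo: the `top_two_margin` helper, the 0.15 threshold, and the probability values are all hypothetical.

```python
import numpy as np

LABELS = [
    "respiratory", "gastrointestinal", "skin",
    "neurological", "musculoskeletal", "mental_health_sleep",
]


def top_two_margin(probs, labels, min_margin=0.15):
    """Report the top label, the runner-up, and whether they are close."""
    order = np.argsort(probs)[::-1]  # indices sorted by descending probability
    best, runner_up = order[0], order[1]
    margin = float(probs[best] - probs[runner_up])
    return {
        "label": labels[best],
        "runner_up": labels[runner_up],
        "margin": margin,
        "ambiguous": margin < min_margin,  # hypothetical threshold
    }


# Made-up scores for a sentence mentioning both dizziness and poor sleep.
probs = np.array([0.05, 0.03, 0.02, 0.44, 0.04, 0.42])
result = top_two_margin(probs, LABELS)
print(result)
```

A flagged input could be routed to both categories rather than forced into one, although any threshold would need checking against real data.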
## Demo
A working Hugging Face Space demo is available here:
https://huggingface.co/spaces/charlie0831/symptom-routing-embedding-demo
The demo lets users enter a short symptom description and returns the predicted routing category, a retrieval focus, and category probability scores.
## Repository Links
| Resource | Link |
|---|---|
| GitHub Repository | https://github.com/aa9911220/english-symptom-routing-embeddings/tree/main |
| Hugging Face Dataset | https://huggingface.co/datasets/charlie0831/english-symptom-routing |
| Hugging Face Demo Space | https://huggingface.co/spaces/charlie0831/symptom-routing-embedding-demo |
## AI Tool Use
AI coding tools were used to support coding, documentation, dataset formatting, training script development, model card writing, and report drafting. The outputs were manually checked and edited, especially the medical safety statements, label definitions, evaluation results, and limitations.
Using AI tools helped speed up implementation, but the project still required manual understanding of the data, embedding pipeline, classifier comparison, and evaluation results.
## License
This model is released under the MIT License.