---
language:
- en
license: mit
library_name: scikit-learn
tags:
- text-classification
- embeddings
- symptom-routing
- public-health
- information-retrieval
- logistic-regression
- linear-svm
- classifier-comparison
datasets:
- https://huggingface.co/datasets/charlie0831/english-symptom-routing
metrics:
- accuracy
- f1
---

# Symptom Routing Embedding Classifier Comparison

## Model Description

This repository contains embedding-based classifiers for routing short English symptom descriptions into broad public health information categories. The project was created for an Information Retrieval assignment to test whether frozen text embeddings can support query routing before retrieval.

Four classifiers were trained and evaluated on the same train/test split:

- Logistic Regression
- Linear SVM
- KNN with cosine distance
- Random Forest

The deployed classifier is Logistic Regression. It achieved **0.957 accuracy** and **0.952 macro F1** on the held-out test set, tying with Linear SVM for the top score.

The model predicts one of six broad routing categories:

- `respiratory`
- `gastrointestinal`
- `skin`
- `neurological`
- `musculoskeletal`
- `mental_health_sleep`

The labels are broad public health information categories, not diagnoses or clinical conditions.

**Important safety notice:**  
This model is for teaching information retrieval and text classification. It does not provide medical diagnosis, treatment advice, or emergency guidance.

## Intended Use

This model is intended for:

- teaching embedding-based text classification
- routing short symptom descriptions toward broad public health information categories
- demonstrating information retrieval query routing
- comparing lightweight classifiers trained on frozen embeddings
- supporting a Gradio demo for symptom information routing

Example use case:

```text
Input: I have a dry cough and sore throat.
Output: respiratory
```

The predicted category can be used as a retrieval signal for selecting broad public health information resources.

## Out-of-Scope Use

This model must not be used for:

- medical diagnosis
- treatment recommendation
- medication advice
- emergency triage
- clinical decision-making
- replacing professional medical review
- making decisions about real patients

If someone has severe, worsening, or urgent symptoms, they should contact a qualified medical professional or emergency service.

## Research Question

Can text embeddings classify short symptom descriptions into broad health-information categories for routing users toward relevant public health resources?

## Training Data

The classifiers were trained on a custom English symptom routing dataset created for this assignment.

- Dataset file: `symptom_routing_expanded_dataset.csv`
- Dataset size: 90 examples

Data split:

- Training rows: 67
- Test rows: 23
- Test size: 0.25
- Random seed: 712
- Stratified by label
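
The split settings listed here match the parameters of scikit-learn's `train_test_split`. The following is a minimal sketch assuming that function was used. The placeholder texts are synthetic stand-ins for the CSV rows (15 per label, which follows from 90 examples across six balanced categories).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

LABELS = [
    "respiratory", "gastrointestinal", "skin",
    "neurological", "musculoskeletal", "mental_health_sleep",
]

# Placeholder rows (assumption): synthetic texts standing in for the real CSV.
texts = [f"example {i} for {label}" for label in LABELS for i in range(15)]
labels = [label for label in LABELS for _ in range(15)]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts,
    labels,
    test_size=0.25,    # 25% of 90 rows; scikit-learn rounds the test count up
    random_state=712,  # seed listed in the split settings
    stratify=labels,   # keep label proportions similar in both splits
)

print(len(train_texts), len(test_texts))  # 67 23
print(Counter(test_labels))               # 3 or 4 test rows per label
```

With 23 test rows spread over six labels, each label contributes only three or four test examples, so a single misclassification moves the reported scores noticeably.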

Each row contains:

| Column | Description |
|---|---|
| `text` | A short English symptom description |
| `label` | A broad public health information category |

The dataset contains six balanced categories:

| Label | Description |
|---|---|
| `respiratory` | cough, sore throat, breathing problems |
| `gastrointestinal` | stomach pain, nausea, diarrhea |
| `skin` | rash, itching, swelling |
| `neurological` | headache, dizziness, numbness |
| `musculoskeletal` | back pain, joint pain, muscle pain |
| `mental_health_sleep` | anxiety, insomnia, low mood, sleep problems |

Hugging Face Dataset:

https://huggingface.co/datasets/charlie0831/english-symptom-routing

## Method

The project uses a frozen embedding model to convert symptom descriptions into vector representations. Downstream classifiers were then trained on these embedding vectors.

Pipeline:

`symptom text -> embedding model -> embedding vector -> classifier -> predicted category`

- Embedding model: `nicher92/saga-embed_v1`
- Embedding usage: frozen text encoder; only downstream classifiers were trained.
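
The pipeline can be sketched as three small functions. This is an illustration under stated assumptions, not the project's training script. The helper names (`embed_texts`, `train_router`, `route`) are hypothetical, the encoder is assumed to expose an `encode` method as sentence-transformers models do, and the classifier hyperparameters are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def embed_texts(encoder, texts):
    """Turn texts into a 2-D float array using a frozen encoder.

    Assumption: `encoder.encode(list_of_str)` returns one vector per text.
    """
    return np.asarray(encoder.encode(list(texts)), dtype=np.float32)


def train_router(encoder, train_texts, train_labels):
    """Fit a classifier on embeddings; the encoder itself is never updated."""
    features = embed_texts(encoder, train_texts)
    classifier = LogisticRegression(max_iter=1000)  # hyperparameters are assumptions
    classifier.fit(features, train_labels)
    return classifier


def route(encoder, classifier, text):
    """Return the predicted routing category for one symptom description."""
    return classifier.predict(embed_texts(encoder, [text]))[0]


# Hypothetical usage; loading this model through sentence-transformers is an assumption:
# from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer("nicher92/saga-embed_v1")
# classifier = train_router(encoder, train_texts, train_labels)
# print(route(encoder, classifier, "I have a dry cough and sore throat."))
```

Because the encoder stays frozen, embeddings can be computed once and reused across every classifier in the comparison.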

Compared classifiers:

| Classifier | Description |
|---|---|
| Logistic Regression | Regularized linear classifier |
| Linear SVM | Linear support vector classifier |
| KNN cosine | k-nearest neighbors using cosine distance |
| Random Forest | Ensemble tree-based classifier |
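
A comparison of these four model families could look like the sketch below. The scikit-learn classes are real, but every hyperparameter (`max_iter`, `n_neighbors`, `n_estimators`, the reuse of seed 712) is an assumption, and `build_classifiers` and `compare_classifiers` are hypothetical helper names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC


def build_classifiers(seed=712):
    """Return the four model families; all hyperparameters are assumptions."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Linear SVM": LinearSVC(),
        "KNN cosine": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }


def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit each classifier on the same split and collect test metrics."""
    results = {}
    for name, model in build_classifiers().items():
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            "macro_f1": f1_score(y_test, predicted, average="macro"),
            "weighted_f1": f1_score(y_test, predicted, average="weighted"),
        }
    return results
```

Here `X_train` and `X_test` stand for the precomputed embedding matrices, so all four models see identical features.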

## Model Files

This repository contains:

| File | Description |
|---|---|
| `model_phase2.joblib` | Best deployed classifier, Logistic Regression |
| `all_classifiers_phase2.joblib` | All trained classifiers from the comparison experiment |
| `metrics.json` | Evaluation metrics for all classifiers |
| `predictions_phase2.csv` | Test set predictions from the selected model |
| `README.md` | Model card |
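
The `.joblib` files can be opened with `joblib.load`. The sketch below only inspects what a file contains, because the stored object types are not documented here. Treating `all_classifiers_phase2.joblib` as a name-to-model dictionary is an assumption, and `describe_artifact` is a hypothetical helper.

```python
import joblib


def describe_artifact(path):
    """Load a joblib file and summarize what it holds.

    joblib uses pickle internally, so only load files from a trusted source.
    """
    stored = joblib.load(path)
    if isinstance(stored, dict):
        # Assumed layout for a multi-model file: {name: fitted_estimator}.
        return {name: type(model).__name__ for name, model in stored.items()}
    return type(stored).__name__


# Hypothetical usage, run from a local copy of this repository:
# print(describe_artifact("model_phase2.joblib"))
# print(describe_artifact("all_classifiers_phase2.joblib"))
```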

## Evaluation

The classifiers were evaluated on the held-out test set.

| Classifier | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Logistic Regression | 0.957 | 0.952 | 0.957 |
| Linear SVM | 0.957 | 0.952 | 0.957 |
| KNN cosine | 0.783 | 0.763 | 0.764 |
| Random Forest | 0.913 | 0.903 | 0.909 |

Logistic Regression and Linear SVM achieved the same top score. Logistic Regression was selected for deployment because it supports probability scores through `predict_proba`, which makes the demo output more informative.
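
As an illustration of why `predict_proba` is useful for a demo, the sketch below ranks categories by probability for one embedding vector. `rank_categories` is a hypothetical helper, and the classifier is assumed to be any fitted scikit-learn model that exposes `predict_proba` and `classes_`.

```python
import numpy as np


def rank_categories(classifier, embedding, top_k=3):
    """Return the top_k (label, probability) pairs for one embedding vector."""
    vector = np.asarray(embedding, dtype=np.float32).reshape(1, -1)
    probabilities = classifier.predict_proba(vector)[0]
    # Sort class indices from most to least probable and keep the first top_k.
    order = np.argsort(probabilities)[::-1][:top_k]
    return [(str(classifier.classes_[i]), float(probabilities[i])) for i in order]
```

scikit-learn's `LinearSVC` provides `decision_function` but no `predict_proba`, so showing probabilities with it would need an extra calibration step.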

Macro F1 is important because it measures whether the classifier performs consistently across all categories rather than only performing well on the most common category.
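
The difference between macro and weighted F1 can be shown with a small hand-rolled computation. The toy labels below are made up for illustration and are not taken from the test set. In practice `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` computes the same quantities.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Compute F1 for each label that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy case: the classifier always predicts the majority label.
y_true = ["respiratory"] * 8 + ["skin"] * 2
y_pred = ["respiratory"] * 10
print(round(macro_f1(y_true, y_pred), 3))     # 0.444
print(round(weighted_f1(y_true, y_pred), 3))  # 0.711
```

In the toy case the ignored minority class pulls macro F1 down much further than weighted F1, which is the behavior the paragraph describes.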

Detailed results are available in:

- `metrics.json`
- `predictions_phase2.csv`

## Example Predictions

Example 1:

- Input: I have a fever, dry cough, and sore throat.
- Predicted category: `respiratory`

Example 2:

- Input: My stomach hurts after eating and I feel nauseous.
- Predicted category: `gastrointestinal`

Example 3:

- Input: I cannot sleep and I feel anxious most nights.
- Predicted category: `mental_health_sleep`

## Limitations

This model was trained on a small educational dataset with 90 manually created examples. It may not generalize well to real-world symptom descriptions, different writing styles, misspellings, slang, or complex multi-symptom cases.

Some symptom descriptions can reasonably belong to more than one category. For example, a sentence mentioning both dizziness and sleep problems may be difficult to classify because it contains signals for both `neurological` and `mental_health_sleep`.

The model should only be interpreted as a broad routing tool for information retrieval experiments. It should not be interpreted as a clinical or diagnostic system.

## Demo

A working Hugging Face Space demo is available here:

https://huggingface.co/spaces/charlie0831/symptom-routing-embedding-demo

The demo lets users enter a short symptom description and returns the predicted routing category, a retrieval focus, and category probability scores.

## Repository Links

| Resource | Link |
|---|---|
| GitHub Repository | https://github.com/aa9911220/english-symptom-routing-embeddings/tree/main |
| Hugging Face Dataset | https://huggingface.co/datasets/charlie0831/english-symptom-routing |
| Hugging Face Demo Space | https://huggingface.co/spaces/charlie0831/symptom-routing-embedding-demo |

## AI Tool Use

AI coding tools were used to support coding, documentation, dataset formatting, training script development, model card writing, and report drafting. The outputs were manually checked and edited, especially the medical safety statements, label definitions, evaluation results, and limitations.

Using AI tools helped speed up implementation, but the project still required manual understanding of the data, embedding pipeline, classifier comparison, and evaluation results.

## License

This model is released under the MIT License.
