---
library_name: transformers
license: apache-2.0
tags:
- healthcare
- column-normalization
- text-classification
- distilgpt2
model-index:
- name: tsilva/clinical-field-mapper-classification
  results:
  - task:
      name: Field Classification
      type: text-classification
    dataset:
      name: tsilva/clinical-field-mappings
      type: healthcare
    metrics:
    - name: train Accuracy
      type: accuracy
      value: 0.9471
    - name: validation Accuracy
      type: accuracy
      value: 0.9144
    - name: test Accuracy
      type: accuracy
      value: 0.9156
---

# Model Card for tsilva/clinical-field-mapper-classification

This model is a fine-tuned version of `distilbert/distilgpt2` on the [`tsilva/clinical-field-mappings`](https://huggingface.co/datasets/tsilva/clinical-field-mappings/tree/4d4cdba1b7e9b1eff2893c7014cfc08fe58a73bc) dataset. Its purpose is to normalize healthcare database column names to a standardized set of target column names.

## Task

This model is a sequence classification model that maps free-text field names to a set of standardized schema terms.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-classification")
model = AutoModelForSequenceClassification.from_pretrained("tsilva/clinical-field-mapper-classification")

def predict(input_text):
    # Tokenize a single column name; one input at a time needs no padding.
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model(**inputs)
    # Index of the highest-scoring class.
    pred = outputs.logits.argmax(-1).item()
    # Look up the label name; try integer keys first, then string keys.
    id2label = model.config.id2label
    label = id2label.get(pred, id2label.get(str(pred), pred))
    print(f"Predicted label: {label}")
    return label

# Example input from the original card, which hard-coded the output "family_history_reported".
predict('cardi@')
```

## Evaluation Results

- **train accuracy**: 94.71%
- **validation accuracy**: 91.44%
- **test accuracy**: 91.56%

## Training Details

- **Seed**: 42
- **Epochs scheduled**: 50
- **Epochs completed**: 34
- **Early stopping triggered**: Yes
- **Final training loss**: 1.0888
- **Final evaluation loss**: 0.9916
- **Optimizer**: adamw_bnb_8bit
- **Learning rate**: 0.0005
- **Batch size**: 1024
- **Precision**: fp16
- **DeepSpeed enabled**: True
- **Gradient accumulation steps**: 1

## License

Apache 2.0, as declared in the model card metadata (`license: apache-2.0`).

## Limitations and Bias

- The model was trained on a single clinical mapping dataset (`tsilva/clinical-field-mappings`).
- Performance may vary on out-of-distribution column names.
- Validate model outputs before relying on them in production environments.
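
One possible way to act on the validation advice above is to reject low-confidence predictions and route them to manual review. The sketch below is not part of the released model: the helper names `softmax` and `predict_with_threshold`, the `0.8` threshold, and all label names other than `family_history_reported` are assumptions made for illustration. It works on plain Python lists so it can be tried without loading the model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, id2label, threshold=0.8):
    """Return (label, confidence); label is None when confidence < threshold.

    `threshold` is a hypothetical cut-off; tune it on held-out data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = id2label[best] if confidence >= threshold else None
    return label, confidence

# Hypothetical three-class label map, for illustration only.
example_labels = {0: "family_history_reported", 1: "hypothetical_label_b", 2: "hypothetical_label_c"}

# A peaked distribution is accepted; a flat one is rejected (label is None).
accepted = predict_with_threshold([5.0, 0.0, 0.0], example_labels)
rejected = predict_with_threshold([1.0, 1.0, 1.0], example_labels)
```

With the model from the Usage section, `logits` would be `outputs.logits[0].tolist()` and `id2label` would be `model.config.id2label`. Rejected inputs (where the label is `None`) can then be reviewed by hand instead of being mapped automatically.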