---
language: en
license: mit
tags:
- text-classification
- survey-classification
- james-river
- bert
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: james-river-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: James River Survey Classification
    metrics:
    - type: accuracy
      value: 0.996
---

# James River Survey Classifier

This model classifies survey-related text messages into different job types for James River surveying services.

## Model Description

- **Model Type**: BERT-based text classification
- **Base Model**: bert-base-uncased
- **Language**: English
- **Task**: Multi-class text classification
- **Classes**: 6 survey job types

## Classes

The model can classify text into the following survey job types:

- **Boundary Survey** (ID: 0)
- **Construction Survey** (ID: 1)
- **Fence Staking** (ID: 2)
- **Other/General** (ID: 3)
- **Real Estate Survey** (ID: 4)
- **Subdivision Survey** (ID: 5)
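
For reference, the IDs above correspond to a mapping like the one below. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `id2label_as_json` are illustrative, and the idea that `label_mapping.json` stores these pairs under an `id2label` key with string keys is inferred from the Usage code, not confirmed from the file itself.

```python
# Class IDs as listed in this section. The variable names are illustrative.
ID2LABEL = {
    0: "Boundary Survey",
    1: "Construction Survey",
    2: "Fence Staking",
    3: "Other/General",
    4: "Real Estate Survey",
    5: "Subdivision Survey",
}

# Reverse lookup from label name to class ID.
LABEL2ID = {label: class_id for class_id, label in ID2LABEL.items()}

# JSON object keys are always strings, so a mapping loaded from a JSON file
# is indexed with str(class_id) rather than with the integer itself.
id2label_as_json = {str(class_id): label for class_id, label in ID2LABEL.items()}

print(LABEL2ID["Fence Staking"], id2label_as_json["3"])
```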

## Usage

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "ityndall/james-river-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # make inference mode explicit (disables dropout)

# Load label mapping
label_mapping_url = f"https://huggingface.co/{model_name}/resolve/main/label_mapping.json"
response = requests.get(label_mapping_url, timeout=30)
response.raise_for_status()  # fail early if the file could not be fetched
label_mapping = response.json()

def classify_text(text):
    """Classify a single message and return its label, confidence, and class ID."""
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax(dim=-1).item()
    confidence = predictions[0][predicted_class_id].item()

    # Get label (JSON keys are strings, hence str())
    predicted_label = label_mapping["id2label"][str(predicted_class_id)]

    return {
        "label": predicted_label,
        "confidence": confidence,
        "class_id": predicted_class_id
    }

# Example usage
text = "I need a boundary survey for my property"
result = classify_text(text)
print(f"Predicted: {result['label']} (confidence: {result['confidence']:.3f})")
```
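
The confidence value returned by `classify_text` is the softmax probability of the highest-scoring class. The following sketch reproduces that step in plain Python on made-up logits, so it runs without downloading the model. The `softmax` helper and the example logits are illustrative and are not model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the six classes (IDs 0 to 5), not real model output.
example_logits = [0.2, -1.0, 0.5, 3.1, 1.4, -0.7]

probabilities = softmax(example_logits)

# The predicted class is the index with the highest probability.
predicted_class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence = probabilities[predicted_class_id]

print(f"class_id={predicted_class_id}, confidence={confidence:.3f}")
```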

## Training Data

The model was trained on 1,000 survey-related text messages with the following distribution:

- **Other/General**: 919 samples (91.9%)
- **Real Estate Survey**: 49 samples (4.9%)
- **Fence Staking**: 21 samples (2.1%)
- **Subdivision Survey**: 4 samples (0.4%)
- **Boundary Survey**: 4 samples (0.4%)
- **Construction Survey**: 3 samples (0.3%)
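
One common way to counteract a distribution this skewed is to weight the loss by inverse class frequency. The sketch below computes such weights from the counts above with the formula `n_samples / (n_classes * count)`, which is the "balanced" heuristic used by scikit-learn. The source does not say whether the published model was trained with class weights, so treat this as an option for retraining, not a description of how the model was built.

```python
# Sample counts per class, taken from the distribution above.
class_counts = {
    "Other/General": 919,
    "Real Estate Survey": 49,
    "Fence Staking": 21,
    "Subdivision Survey": 4,
    "Boundary Survey": 4,
    "Construction Survey": 3,
}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" weights: the rarer the class, the larger its weight.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.2f}")
```

Weights like these could then be passed to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered by class ID.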

## Training Details

- **Training Framework**: Hugging Face Transformers
- **Base Model**: bert-base-uncased
- **Training Epochs**: 3
- **Batch Size**: 8
- **Learning Rate**: 5e-05
- **Optimizer**: AdamW
- **Training Loss**: 0.279
- **Training Time**: ~19.5 minutes
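
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the dataset size and the batch size. This sketch assumes that all 1,000 messages were used for training, on a single device, with no gradient accumulation. The source states none of these, and a held-out validation split would lower the numbers.

```python
import math

num_samples = 1000   # assumption: every message is in the training split
batch_size = 8
num_epochs = 3

# The last batch of an epoch may be smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```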

## Model Performance

The model achieved a training loss of 0.279 after 3 epochs. However, note that this is a highly imbalanced dataset, and performance on minority classes may vary.
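
Because 91.9% of the training messages are Other/General, accuracy alone says little about the minority classes. The sketch below illustrates this with a hypothetical baseline that always predicts Other/General, evaluated on data with the training distribution: it reaches high accuracy but a low macro-averaged F1. The baseline and the helper name `f1_for_class` are illustrative, and the numbers are not results for this model.

```python
# Class counts from the training distribution; the baseline is hypothetical.
class_counts = {
    "Other/General": 919,
    "Real Estate Survey": 49,
    "Fence Staking": 21,
    "Subdivision Survey": 4,
    "Boundary Survey": 4,
    "Construction Survey": 3,
}
majority_label = max(class_counts, key=class_counts.get)
total = sum(class_counts.values())

# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_label] / total

def f1_for_class(label):
    """F1 score of the always-majority baseline for one class."""
    if label != majority_label:
        # The class is never predicted, so it has no true positives.
        return 0.0
    true_positives = class_counts[label]
    precision = true_positives / total  # every sample is predicted as this class
    recall = 1.0                        # every true member of the class is found
    return 2 * precision * recall / (precision + recall)

# Macro F1 averages per-class F1 with equal weight for each class.
macro_f1 = sum(f1_for_class(label) for label in class_counts) / len(class_counts)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline macro F1: {macro_f1:.3f}")
```

Reporting per-class precision, recall, and F1 on a held-out set would give a clearer picture than a single aggregate figure.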

## Limitations

- The model was trained on a small, imbalanced dataset
- Performance on minority classes (Construction Survey, Boundary Survey, Subdivision Survey) may be limited due to few training examples
- The model may have a bias toward predicting "Other/General" due to class imbalance
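
Given these limitations, one cautious way to consume predictions is to send low-confidence results to manual review instead of acting on them automatically. The sketch below assumes a result dictionary shaped like the one returned by `classify_text` in the Usage section. The `needs_review` helper and the 0.80 threshold are hypothetical and would need tuning on labeled data. A threshold also cannot catch predictions that are confidently wrong.

```python
def needs_review(result, threshold=0.80):
    """Return True when a prediction should be checked by a person.

    `result` is assumed to have the keys "label" and "confidence".
    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    return result["confidence"] < threshold

# Hypothetical results, written by hand for illustration.
examples = [
    {"label": "Other/General", "confidence": 0.97, "class_id": 3},
    {"label": "Boundary Survey", "confidence": 0.55, "class_id": 0},
]

for result in examples:
    action = "manual review" if needs_review(result) else "auto-route"
    print(f"{result['label']}: {action}")
```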

## Intended Use

This model is specifically designed for classifying survey-related inquiries for James River surveying services. It should not be used for other domains without additional training.

## Files

- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `tokenizer.json`, `tokenizer_config.json`, `vocab.txt`: Tokenizer files
- `label_encoder.pkl`: Original scikit-learn label encoder
- `label_mapping.json`: Human-readable label mappings

## Citation

If you use this model, please cite:

```
@misc{james-river-classifier,
  title={James River Survey Classifier},
  author={James River Surveying},
  year={2025},
  url={https://huggingface.co/ityndall/james-river-classifier}
}
```