---
library_name: sklearn
tags:
- text-classification
- dependency-detection
- random-forest
- nlp
- query-dependency
- conversational-ai
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- precision
- recall
---
# Query Dependence Classifier
A Random Forest model that determines whether a second query depends on the context of a first query in conversational AI systems.
## Model Description
- **Model Type:** Random Forest Classifier (scikit-learn)
- **Task:** Binary text classification for query dependency detection
- **Features:** 45 engineered linguistic features
- **Classes:** Independent vs Dependent queries
## Intended Use
This model is designed for conversational AI systems to determine if a follow-up question requires context from a previous query.
**Examples:**
- Query 1: "What is machine learning?" Query 2: "Can you give me examples?" → **Dependent**
- Query 1: "What is AI?" Query 2: "What's the weather today?" → **Independent**
## Model Performance
- **Training Features:** 45 engineered features
- **Model Architecture:** Random Forest with 500 estimators
- **Validation:** Out-of-bag (OOB) scoring enabled (an internal estimate from the bootstrap samples, not k-fold cross-validation)
## Feature Engineering
The model uses 45 engineered features, including:
### Lexical Features
- Word overlap and Jaccard similarity
- N-gram overlap (bigrams, trigrams)
- Semantic similarity with stemming
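As a rough illustration of the lexical features listed above, the following sketch computes word overlap, Jaccard similarity, and n-gram overlap for a query pair. The function names, the tokenization, and the feature keys are assumptions made for this example; the model's actual feature code is not shown in this card, and stemming is omitted here.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_features(query1, query2):
    """Hypothetical subset of lexical features for a query pair."""
    t1, t2 = tokenize(query1), tokenize(query2)
    return {
        "word_overlap": len(set(t1) & set(t2)),
        "jaccard": jaccard(t1, t2),
        "bigram_jaccard": jaccard(ngrams(t1, 2), ngrams(t2, 2)),
        "trigram_jaccard": jaccard(ngrams(t1, 3), ngrams(t2, 3)),
    }


features = lexical_features("What is machine learning?", "Is machine learning hard?")
print(features)
```

High overlap between the two queries suggests a shared topic, but a dependent follow-up such as "Can you give me examples?" can have no overlap at all, which is why lexical features are combined with the linguistic and structural ones.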
### Linguistic Features
- Pronoun and reference patterns
- Question type classification
- Discourse markers and connectives
- Dependency phrases detection
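The linguistic features above can be sketched with simple word lists. Everything in this example is an assumption for illustration: the word lists, the feature names, and the choice to inspect only the second query are not taken from the model's source.

```python
import re

# Illustrative word lists; the lists used by the actual model are not documented here.
PRONOUNS = {"it", "its", "they", "them", "their", "this", "that", "these", "those"}
DISCOURSE_MARKERS = {"also", "and", "but", "so", "then", "however", "instead"}
DEPENDENCY_PHRASES = ("what about", "how about", "tell me more", "give me examples")
QUESTION_WORDS = ("what", "why", "how", "when", "where", "who", "which")


def linguistic_features(query2):
    """Hypothetical linguistic signals extracted from the follow-up query."""
    text = query2.lower()
    tokens = re.findall(r"[a-z]+", text)
    first = tokens[0] if tokens else ""
    return {
        # Pronouns often refer back to an entity named in the first query.
        "pronoun_count": sum(tok in PRONOUNS for tok in tokens),
        # A leading connective ("And ...", "But ...") signals continuation.
        "starts_with_marker": first in DISCOURSE_MARKERS,
        # Fixed phrases that rarely make sense without prior context.
        "has_dependency_phrase": any(p in text for p in DEPENDENCY_PHRASES),
        # Coarse question type based on the first word.
        "question_type": first if first in QUESTION_WORDS else "other",
    }


print(linguistic_features("And what about its limitations?"))
```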
### Structural Features
- Length ratios and differences
- Punctuation patterns
- Complexity measures (syllable density)
- Capitalization patterns
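A minimal sketch of the structural features listed above follows. The syllable counter is a crude vowel-group heuristic chosen for this example, and the feature names are hypothetical; the model's own definitions may differ.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def structural_features(query1, query2):
    """Hypothetical structural features for a query pair."""
    w1, w2 = query1.split(), query2.split()
    words2 = re.findall(r"[A-Za-z]+", query2)
    syllables = sum(count_syllables(w) for w in words2)
    return {
        "length_diff": len(w1) - len(w2),
        "length_ratio": len(w2) / len(w1) if w1 else 0.0,
        "ends_with_question_mark": query2.rstrip().endswith("?"),
        # Average syllables per word as a simple complexity measure.
        "syllable_density": syllables / len(words2) if words2 else 0.0,
        "capitalized_words": sum(w[0].isupper() for w in words2),
    }


print(structural_features("What is machine learning?", "Can you give me examples?"))
```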
## Usage
```python
# Install dependencies
# pip install scikit-learn pandas nltk huggingface-hub joblib
from huggingface_hub import hf_hub_download
import joblib
import json
# Download model files
model_path = hf_hub_download(repo_id="admin-4minds/QUERY-DEPENDENCE-MODEL", filename="model.joblib")
encoder_path = hf_hub_download(repo_id="admin-4minds/QUERY-DEPENDENCE-MODEL", filename="label_encoder.joblib")
config_path = hf_hub_download(repo_id="admin-4minds/QUERY-DEPENDENCE-MODEL", filename="config.json")
# Load model components
model = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)
with open(config_path, 'r') as f:
    config = json.load(f)
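# Initialize classifier
# NOTE: DependencyClassifier is the project's wrapper class. It is not provided
# by the packages installed above and must be defined or imported before this line.
classifier = DependencyClassifier()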
classifier.model = model
classifier.label_encoder = label_encoder
classifier.feature_names = config['feature_names']
# Make predictions
result = classifier.predict(
    "What is artificial intelligence?",
    "Can you give me some examples?"
)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.3f}")
print(f"Probabilities: {result['probabilities']}")
```
## Alternative Loading Method
```python
# Load directly using the class method (DependencyClassifier must already be defined or imported)
classifier = DependencyClassifier.load_from_huggingface_hub("admin-4minds/QUERY-DEPENDENCE-MODEL")
# Use for inference
result = classifier.predict("Query 1", "Query 2")
```
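One way a conversational system might act on the prediction is to attach the previous query only when the follow-up is confidently dependent. The sketch below assumes that `result['prediction']` holds the label string (`'dependent'` or `'independent'`) and that `result['confidence']` is a probability in [0, 1]; the helper name, the 0.7 threshold, and the example dictionary are hypothetical.

```python
def build_effective_query(previous_query, current_query, result, threshold=0.7):
    """Prepend the previous query when the follow-up is confidently dependent."""
    is_dependent = str(result["prediction"]).lower() == "dependent"
    if is_dependent and result["confidence"] >= threshold:
        return f"{previous_query} {current_query}"
    # Independent or low-confidence: treat the follow-up as standalone.
    return current_query


# Hypothetical result, shaped like the dictionary returned by classifier.predict.
example_result = {"prediction": "dependent", "confidence": 0.91}
print(build_effective_query(
    "What is machine learning?",
    "Can you give me examples?",
    example_result,
))
```

The threshold trades missed context against unnecessary context; a suitable value depends on the downstream system and should be tuned on held-out conversations.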
## Training Data Format
The model expects training data with columns:
- `query1`: First query/question
- `query2`: Second query/question
- `label`: 'independent' or 'dependent'
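For illustration, the column layout above could be checked before training with a small loader like the one below. Storing the data as CSV is an assumption of this sketch, as are the function and constant names; the two sample rows reuse the examples from the Intended Use section.

```python
import csv
import io

REQUIRED_COLUMNS = {"query1", "query2", "label"}
VALID_LABELS = {"independent", "dependent"}


def load_training_rows(csv_file):
    """Read rows from a CSV file object and validate the expected schema."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"Line {line_no}: unexpected label {row['label']!r}")
        rows.append(row)
    return rows


sample = io.StringIO(
    "query1,query2,label\n"
    "What is machine learning?,Can you give me examples?,dependent\n"
    "What is AI?,What's the weather today?,independent\n"
)
rows = load_training_rows(sample)
print(len(rows), rows[0]["label"])
```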
## Model Architecture
```python
RandomForestClassifier(
    n_estimators=500,
    max_depth=15,
    min_samples_split=7,
    min_samples_leaf=3,
    max_features='sqrt',
    class_weight='balanced',
    random_state=42
)
```
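Two of these settings can be worked out by hand. With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, and with `max_features='sqrt'` each split considers roughly the square root of the feature count. The sketch below reproduces both calculations in plain Python; the 75/25 label split is a made-up distribution for illustration, not the model's training data.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Class weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution, for illustration only.
labels = ["independent"] * 75 + ["dependent"] * 25
weights = balanced_class_weights(labels)
print(weights)

# Features considered per split for 45 input features (square root, truncated).
features_per_split = max(1, int(math.sqrt(45)))
print(features_per_split)
```

In this example the minority class receives three times the weight of the majority class, so errors on rarer labels count more during training.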
## Limitations
- Designed for English language queries
- Performance may vary on very short queries (< 3 words)
- Requires NLTK stopwords corpus for optimal performance
- Best suited for conversational question-answering scenarios
## Technical Details
- **Framework:** scikit-learn
- **Storage Format:** joblib (pickle-based, so only load model files from a source you trust)
- **Configuration:** JSON metadata
- **Reproducibility:** Fixed random seed (42)
## Citation
```bibtex
@misc{query_dependence_classifier_2025,
title={Query Dependence Classifier},
author={Admin-4minds},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/admin-4minds/QUERY-DEPENDENCE-MODEL}
}
```
## License
This model is released under the MIT License.
## Contact
For questions or issues, please contact the admin-4minds team.