---
license: apache-2.0
---

# Model Card for MatroidNN

## Model Details

### Model Description

**Model type:** Neural Network with Matroid-based Feature Selection (MatroidNN)

**Version:** 1.0

**Framework:** PyTorch

**Last updated:** February 27, 2025

### Overview

MatroidNN is a neural network architecture that uses matroid theory for feature selection. To reduce feature redundancy, it selects a maximal independent set of features, in the matroid sense, before the neural network is trained.
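
As a rough illustration of the idea, the sketch below greedily keeps a feature only if its absolute Pearson correlation with every feature already kept stays below a threshold. The function name `greedy_uncorrelated_subset` and the reading of the threshold as a correlation cutoff are assumptions made for this example; they are not taken from the `MatroidFeatureSelector` source. Pairwise-correlation independence is hereditary but does not in general satisfy the matroid exchange axiom, so this is a simplification rather than a true matroid basis computation.

```python
import numpy as np


def greedy_uncorrelated_subset(X, threshold=0.7):
    """Return column indices kept by a greedy pairwise-correlation filter.

    Hypothetical helper: a column is kept only if its absolute Pearson
    correlation with every previously kept column is below `threshold`.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in selected):
            selected.append(j)
    return selected


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Column 3 is a noisy copy of column 0, so it should be rejected.
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])
print(greedy_uncorrelated_subset(X))  # prints [0, 1, 2]
```

Here the fourth column nearly duplicates the first, so only the first three columns survive the filter.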

### Model Architecture

- **Feature Selection Component**: MatroidFeatureSelector using correlation-based dependency analysis
- **Neural Network**: 3-layer feedforward network with batch normalization and dropout
- **Input**: Varies based on the number of features selected by the matroid selector
- **Hidden Layers**: Configurable hidden layer sizes (default 64 → 32)
- **Output**: Multi-class classification (configurable number of classes)
- **Parameters**: ~5K-10K parameters (varies based on input/output dimensions)
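
One way the network described above could be laid out in PyTorch is sketched below. The class name `FeedForwardSketch`, the ReLU activations, the dropout rate of 0.3, and the placement of batch normalization before the activation are assumptions; only the 64 → 32 hidden sizes, batch normalization, and dropout come from the list above.

```python
import torch
from torch import nn


class FeedForwardSketch(nn.Module):
    """Hypothetical 3-layer classifier: input -> 64 -> 32 -> classes."""

    def __init__(self, input_size, hidden_size=64, output_size=3, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.BatchNorm1d(hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, output_size),  # raw logits
        )

    def forward(self, x):
        return self.net(x)


model = FeedForwardSketch(input_size=12)
# Total trainable parameters for this particular input width.
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.randn(8, 12))  # batch of 8 samples, 12 selected features
print(logits.shape, n_params)
```

The input width is set from the number of columns the selector keeps, so the parameter count changes with the selection result.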

## Uses

### Direct Use

MatroidNN is designed for classification tasks where feature redundancy is a potential issue. It's particularly useful for:

- High-dimensional datasets with correlated features
- Feature selection in biological/medical data
- Financial prediction with multicollinear variables
- Any classification task where feature independence is desired

### Out-of-Scope Use

This model is not intended for:

- Regression tasks (without modification)
- Time series prediction (without temporal adaptations)
- Raw image or text classification (without appropriate feature extraction)

## Training Data

The model was developed and tested using synthetic data with deliberate feature dependencies. For real-world applications, the model should be retrained on domain-specific data.

### Training Dataset

- **Type**: Synthetic data with controlled dependencies
- **Size**: 1000 samples (default), configurable
- **Features**: 20 initial features (default), configurable
- **Classes**: 3 classes (default), configurable
- **Distribution**: Equal class distribution in the synthetic data
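
Data of this kind could be generated along the following lines. This is a sketch under stated assumptions, not the generator used for the reported results: the function `make_dependent_data`, the choice of 8 independent base features, and the noise level are all hypothetical.

```python
import numpy as np


def make_dependent_data(n_samples=1000, n_features=20, n_classes=3,
                        n_base=8, noise=0.05, seed=0):
    """Hypothetical generator: n_base independent features plus noisy mixtures."""
    rng = np.random.default_rng(seed)
    # Balanced labels: 0, 1, 2, 0, 1, 2, ... in shuffled order.
    y = rng.permutation(np.arange(n_samples) % n_classes)
    # Independent base features whose mean depends on the class.
    centers = rng.normal(scale=2.0, size=(n_classes, n_base))
    base = centers[y] + rng.normal(size=(n_samples, n_base))
    # Remaining features are noisy linear mixtures of the base features.
    n_derived = n_features - n_base
    mix = rng.normal(size=(n_base, n_derived))
    derived = base @ mix + noise * rng.normal(size=(n_samples, n_derived))
    return np.hstack([base, derived]), y


X, y = make_dependent_data()
print(X.shape, np.bincount(y))  # prints (1000, 20) [334 333 333]
```

Because 1000 is not divisible by 3, the classes are balanced to within one sample.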

## Performance

### Metrics

On synthetic test data with 3 classes:

- **Accuracy**: 94.0%
- **Macro-average F1-score**: 0.93
- **Per-class metrics**:
  - Class 0: Precision 0.96, Recall 1.00, F1 0.98
  - Class 1: Precision 0.86, Recall 0.86, F1 0.86
  - Class 2: Precision 0.97, Recall 0.93, F1 0.95
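
The per-class F1 values and the macro average follow directly from the precision and recall figures listed above. The short check below recomputes them; the helper name `f1` is only for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the list above.
per_class = {0: (0.96, 1.00), 1: (0.86, 0.86), 2: (0.97, 0.93)}

f1_scores = {c: f1(p, r) for c, (p, r) in per_class.items()}
# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print({c: round(v, 2) for c, v in f1_scores.items()}, round(macro_f1, 2))
# prints {0: 0.98, 1: 0.86, 2: 0.95} 0.93
```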

### Factors

Performance may vary based on:

- Feature correlation structure in the dataset
- Number of initial features and their information content
- Class distribution balance
- Rank threshold parameter in the MatroidFeatureSelector

## Limitations

- The matroid-based feature selection uses correlation as a proxy for independence, which may not capture all forms of dependency
- The current implementation assumes numerical features and may require adaptation for categorical features
- Feature selection is performed once before training and does not adapt during training
- The rank threshold parameter requires careful tuning based on the dataset
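
The first limitation can be seen in a small example: a feature that is completely determined by another one can still have a Pearson correlation of essentially zero with it, so a correlation-based filter would treat the two as independent. The variables below are illustrative only.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # values symmetric around zero
y = x ** 2                       # fully determined by x, but not linearly

# Pearson correlation only measures linear association.
pearson = np.corrcoef(x, y)[0, 1]
print(abs(pearson) < 1e-8)  # prints True
```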

## Ethical Considerations

- Feature selection might unintentionally exclude features that are important for fairness considerations
- The model inherits any biases present in the training data
- Results should be interpreted with caution in high-stakes applications, with human oversight

## Technical Specifications

### Hardware Requirements

- Training: CUDA-capable GPU recommended for larger datasets
- Inference: CPU sufficient for most applications

### Software Requirements

- Python 3.8+
- PyTorch 1.8+
- NumPy 1.20+
- scikit-learn 0.24+

### Training Hyperparameters

- **Batch size**: 32 (default)
- **Learning rate**: 0.001 (default)
- **Optimizer**: Adam
- **Loss function**: Cross-Entropy Loss
- **Epochs**: Early stopping based on validation loss (patience=10)
- **Feature selection rank threshold**: 0.7 (default, configurable)
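
Early stopping with a patience of 10 can be implemented in a few lines. The class below is a minimal sketch: the name `EarlyStopping` and the `min_delta` parameter are assumptions for this example, not part of the `matroid_nn` package.

```python
class EarlyStopping:
    """Hypothetical helper: stop when validation loss stops improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=10)
# Simulated validation losses: three improvements, then a plateau.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # prints 12 0.7
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the weights from the best epoch would typically be saved for later use.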

## How to Use

```python
from matroid_nn import MatroidFeatureSelector, MatroidNN

# X_train, X_test (numeric feature arrays) and num_classes are assumed
# to be defined by the caller.

# Initialize feature selector
selector = MatroidFeatureSelector(rank_threshold=0.7)

# Apply feature selection (fit on the training split, reuse on the test split)
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)

# Create and train model
model = MatroidNN(
    input_size=X_train_selected.shape[1],
    hidden_size=64,
    output_size=num_classes
)