g17_cat_boost

g17_cat_boost is a gradient boosting model that classifies information elements within the Graceful17 project.

The classified elements can then be assigned automatically to the corresponding classes (grace:event, grace:object) of the GRACE ontology.

Training

The model was trained on information elements that were assigned to their classes manually. The research data on which the model is based can be found here (updated continuously).

The core codebase is available in a GitHub repository.

A current training data set is provided as a JSON file in the folder training data of this repository. The files are named with running numbers, which serve as version numbers.
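Picking the newest training file by its running number could be sketched as follows. This is a minimal illustration, not code from the repository: the helper names `latest_training_file` and `load_latest_training_data` are hypothetical, and it assumes that each file name consists only of digits plus the `.json` extension (for example `12.json`).

```python
import json
import re
from pathlib import Path


def latest_training_file(folder):
    """Return the JSON file with the highest running number, or None."""
    numbered = []
    for path in Path(folder).glob('*.json'):
        # Assumption: the file stem is the running number, e.g. "12.json".
        if re.fullmatch(r'\d+', path.stem):
            numbered.append((int(path.stem), path))
    if not numbered:
        return None
    # Compare numerically so that 10.json is newer than 2.json.
    return max(numbered, key=lambda item: item[0])[1]


def load_latest_training_data(folder):
    """Load the newest training data file found in the given folder."""
    path = latest_training_file(folder)
    if path is None:
        raise FileNotFoundError(f'no numbered JSON file in {folder}')
    with open(path, encoding='utf-8') as handle:
        return json.load(handle)
```

The comparison is numeric rather than alphabetical, because sorting the names as strings would rank `2.json` above `10.json`.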

A more detailed description of methodological background is forthcoming.

The core feature engineering is:

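from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# df is a pandas DataFrame with the columns entry_ID, entry, text, label, start and end.

# Express the element offsets relative to the length of the containing entry.
df['entry_length'] = df['entry'].apply(len)
df['start'] = df['start'] / df['entry_length']
df['end'] = df['end'] / df['entry_length']

# Context per entry: number of elements and their mean relative positions.
df['entity_count'] = df.groupby('entry_ID')['text'].transform('count')
df['avg_start_position'] = df.groupby('entry_ID')['start'].transform('mean')
df['avg_end_position'] = df.groupby('entry_ID')['end'].transform('mean')

# Texts and labels of all elements that share the same entry.
df['all_texts'] = df.groupby('entry_ID')['text'].transform(lambda x: ' '.join(x))
df['all_labels'] = df.groupby('entry_ID')['label'].transform(lambda x: ','.join(sorted(x)))
# List of labels per entry; counted by the CountVectorizer below.
df['all_labels_count'] = df['entry_ID'].map(df.groupby('entry_ID')['label'].agg(list))

preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), 'text'),
        ('label', OneHotEncoder(handle_unknown='ignore'), ['label']),
        ('all_texts', TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), 'all_texts'),
        ('all_labels', OneHotEncoder(handle_unknown='ignore'), ['all_labels']),
        # The column already holds label lists, so the tokenizer returns them unchanged.
        ('all_labels_count', CountVectorizer(token_pattern=None,
                tokenizer=lambda labels: labels, lowercase=False), 'all_labels_count'),
        ('start_end', 'passthrough', ['start', 'end']),
        ('context_features', 'passthrough', [
            'entry_length', 'entity_count', 'avg_start_position', 'avg_end_position'
        ])
    ])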
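To show how the feature engineering and the preprocessor above might feed a classifier, here is a minimal end-to-end sketch on invented toy data. Several parts are assumptions and not taken from the project: the example sentences, the label values `OBJ` and `ACT`, the column name `target`, and the helper names `add_features` and `identity_tokenizer`. scikit-learn's `GradientBoostingClassifier` is used only as a stand-in so that the sketch runs without CatBoost; it is not the model described here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def identity_tokenizer(labels):
    # The column already holds a list of labels, so no splitting is needed.
    return labels


def add_features(df):
    # Same steps as the feature engineering above, applied to a copy.
    df = df.copy()
    df['entry_length'] = df['entry'].apply(len)
    df['start'] = df['start'] / df['entry_length']
    df['end'] = df['end'] / df['entry_length']
    df['entity_count'] = df.groupby('entry_ID')['text'].transform('count')
    df['avg_start_position'] = df.groupby('entry_ID')['start'].transform('mean')
    df['avg_end_position'] = df.groupby('entry_ID')['end'].transform('mean')
    df['all_texts'] = df.groupby('entry_ID')['text'].transform(lambda x: ' '.join(x))
    df['all_labels'] = df.groupby('entry_ID')['label'].transform(lambda x: ','.join(sorted(x)))
    df['all_labels_count'] = df['entry_ID'].map(df.groupby('entry_ID')['label'].agg(list))
    return df


# Invented toy rows: (entry_ID, entry, element text, label, target class).
rows = [
    (1, 'The ship reached Hamburg in May.', 'ship', 'OBJ', 'grace:object'),
    (1, 'The ship reached Hamburg in May.', 'reached Hamburg', 'ACT', 'grace:event'),
    (2, 'A letter was sent to the council.', 'letter', 'OBJ', 'grace:object'),
    (2, 'A letter was sent to the council.', 'was sent', 'ACT', 'grace:event'),
    (3, 'The bell was cast and the fair opened.', 'bell', 'OBJ', 'grace:object'),
    (3, 'The bell was cast and the fair opened.', 'fair opened', 'ACT', 'grace:event'),
]
records = []
for entry_id, entry, text, label, target in rows:
    start = entry.find(text)  # character offset of the element in its entry
    records.append({'entry_ID': entry_id, 'entry': entry, 'text': text,
                    'label': label, 'start': start, 'end': start + len(text),
                    'target': target})
features = add_features(pd.DataFrame(records))

preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), 'text'),
        ('label', OneHotEncoder(handle_unknown='ignore'), ['label']),
        ('all_texts', TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), 'all_texts'),
        ('all_labels', OneHotEncoder(handle_unknown='ignore'), ['all_labels']),
        ('all_labels_count', CountVectorizer(token_pattern=None,
                tokenizer=identity_tokenizer, lowercase=False), 'all_labels_count'),
        ('start_end', 'passthrough', ['start', 'end']),
        ('context_features', 'passthrough', [
            'entry_length', 'entity_count', 'avg_start_position', 'avg_end_position'
        ])
    ])

# Stand-in classifier; the 'target' column is not selected by the preprocessor.
model = Pipeline([
    ('preprocess', preprocessor),
    ('classify', GradientBoostingClassifier(n_estimators=20, random_state=0)),
])
model.fit(features, features['target'])
predictions = model.predict(features)
```

A named function replaces the lambda tokenizer in this sketch because a fitted pipeline that contains a lambda cannot be saved with `pickle`.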

License

MIT License © 2025 Christoph Sander
