--- license: apache-2.0 language: - en library_name: pytorch tags: - time-series - multimodal - transformer - github - forecasting datasets: - custom metrics: - mse - mae - r_squared pipeline_tag: time-series-forecasting --- # GitPulse: Multimodal Time Series Prediction for GitHub Project Health GitPulse is a multimodal Transformer-based model that combines project text descriptions with historical activity data to predict GitHub project health metrics. ## Model Description GitPulse leverages both **textual metadata** (project descriptions, topics) and **historical time series** (commits, issues, stars, etc.) to forecast future project activity. The key innovation is the adaptive fusion mechanism that dynamically balances text and time-series features. ### Architecture - **Text Encoder**: DistilBERT-based encoder with attention pooling - **Time Series Encoder**: Transformer encoder with positional embeddings - **Adaptive Fusion**: Dynamic gating mechanism for multimodal fusion - **Prediction Head**: MLP for generating future predictions ### Model Parameters | Parameter | Value | |-----------|-------| | d_model | 128 | | n_heads | 4 | | n_layers | 2 | | hist_len | 128 | | pred_len | 32 | | n_vars | 16 | ## Performance Evaluated on 636 test samples from 4,232 GitHub projects: | Model | MSE ↓ | MAE ↓ | R² ↑ | DA ↑ | TA@0.2 ↑ | |-------|-------|-------|------|------|----------| | **GitPulse** | **0.0755** | **0.1094** | **0.7559** | **86.68%** | **81.60%** | | CondGRU+Text | 0.0915 | 0.1204 | 0.7043 | 84.05% | 80.14% | | Transformer | 0.1142 | 0.1342 | 0.6312 | 84.02% | 78.87% | | LSTM | 0.2142 | 0.1914 | 0.3800 | 56.00% | 75.00% | ### Text Contribution | Architecture | TS-Only R² | +Text R² | Improvement | |--------------|-----------|----------|-------------| | Transformer → GitPulse | 0.6312 | 0.7559 | **+19.8%** | | CondGRU → CondGRU+Text | 0.3328 | 0.7043 | **+111.6%** | ## Usage ### Installation ```bash pip install torch transformers ``` ### Quick Start ```python import torch from transformers import DistilBertTokenizer # Load model from model import GitPulseModel model = GitPulseModel.from_pretrained('./') # Prepare inputs tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased') text = "A Python library for machine learning" encoded = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='pt') # Time series: [batch, hist_len, n_vars] time_series = torch.randn(1, 128, 16) # Predict model.eval() with torch.no_grad(): predictions = model( time_series, input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'] ) # predictions shape: [1, 32, 16] ``` ### Inference API ```python # Simple prediction interface predictions = model.predict( time_series=history_data, # [batch, 128, 16] text="Project description...", tokenizer=tokenizer ) ``` ## Training Details - **Dataset**: GitHub project activity data (4,232 projects) - **Train/Val/Test Split**: 70% / 15% / 15% - **Optimizer**: AdamW (lr=1e-5, weight_decay=0.01) - **Fine-tuning Strategy**: Freeze encoder, train prediction head - **Hardware**: NVIDIA RTX GPU ## Input Features (16 variables) 1. Commits count 2. Issues opened 3. Issues closed 4. Pull requests opened 5. Pull requests merged 6. Stars gained 7. Forks count 8. Contributors count 9. Code additions 10. Code deletions 11. Comments count 12. Releases count 13. Wiki updates 14. Discussions count 15. Sponsors count 16. Watchers count ## Limitations - Trained on English project descriptions only - Best suited for projects with at least 128 months of history - Performance may vary for niche domains not well represented in training ## Citation ```bibtex @article{gitpulse2024, title={GitPulse: Multimodal Time Series Prediction for GitHub Project Health}, author={Anonymous}, journal={arXiv preprint}, year={2024} } ``` ## License Apache 2.0