Upload 6 files

Browse files

Files changed (6) hide show

README.md +157 -0
config.json +48 -0
evaluation_results.json +58 -0
gitpulse_weights.pt +3 -0
model.py +263 -0
requirements.txt +4 -0

README.md CHANGED Viewed

@@ -1,3 +1,160 @@
 ---
 license: apache-2.0
 ---

 ---
 license: apache-2.0
+language:
+- en
+library_name: pytorch
+tags:
+- time-series
+- multimodal
+- transformer
+- github
+- forecasting
+datasets:
+- custom
+metrics:
+- mse
+- mae
+- r_squared
+pipeline_tag: time-series-forecasting
 ---
+# GitPulse: Multimodal Time Series Prediction for GitHub Project Health
+GitPulse is a multimodal Transformer-based model that combines project text descriptions with historical activity data to predict GitHub project health metrics.
+## Model Description
+GitPulse leverages both **textual metadata** (project descriptions, topics) and **historical time series** (commits, issues, stars, etc.) to forecast future project activity. The key innovation is the adaptive fusion mechanism that dynamically balances text and time-series features.
+### Architecture
+- **Text Encoder**: DistilBERT-based encoder with attention pooling
+- **Time Series Encoder**: Transformer encoder with positional embeddings
+- **Adaptive Fusion**: Dynamic gating mechanism for multimodal fusion
+- **Prediction Head**: MLP for generating future predictions
+### Model Parameters
+| Parameter | Value |
+|-----------|-------|
+| d_model | 128 |
+| n_heads | 4 |
+| n_layers | 2 |
+| hist_len | 128 |
+| pred_len | 32 |
+| n_vars | 16 |
+## Performance
+Evaluated on 636 test samples from 4,232 GitHub projects:
+| Model | MSE ↓ | MAE ↓ | R² ↑ | DA ↑ | TA@0.2 ↑ |
+|-------|-------|-------|------|------|----------|
+| **GitPulse** | **0.0755** | **0.1094** | **0.7559** | **86.68%** | **81.60%** |
+| CondGRU+Text | 0.0915 | 0.1204 | 0.7043 | 84.05% | 80.14% |
+| Transformer | 0.1142 | 0.1342 | 0.6312 | 84.02% | 78.87% |
+| LSTM | 0.2142 | 0.1914 | 0.3800 | 56.00% | 75.00% |
+### Text Contribution
+| Architecture | TS-Only R² | +Text R² | Improvement |
+|--------------|-----------|----------|-------------|
+| Transformer → GitPulse | 0.6312 | 0.7559 | **+19.8%** |
+| CondGRU → CondGRU+Text | 0.3328 | 0.7043 | **+111.6%** |
+## Usage
+### Installation
+```bash
+pip install torch transformers
+```
+### Quick Start
+```python
+import torch
+from transformers import DistilBertTokenizer
+# Load model
+from model import GitPulseModel
+model = GitPulseModel.from_pretrained('./')
+# Prepare inputs
+tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+text = "A Python library for machine learning"
+encoded = tokenizer(text, padding='max_length', truncation=True,
+                    max_length=128, return_tensors='pt')
+# Time series: [batch, hist_len, n_vars]
+time_series = torch.randn(1, 128, 16)
+# Predict
+model.eval()
+with torch.no_grad():
+    predictions = model(
+        time_series,
+        input_ids=encoded['input_ids'],
+        attention_mask=encoded['attention_mask']
+    )
+# predictions shape: [1, 32, 16]
+```
+### Inference API
+```python
+# Simple prediction interface
+predictions = model.predict(
+    time_series=history_data,  # [batch, 128, 16]
+    text="Project description...",
+    tokenizer=tokenizer
+)
+```
+## Training Details
+- **Dataset**: GitHub project activity data (4,232 projects)
+- **Train/Val/Test Split**: 70% / 15% / 15%
+- **Optimizer**: AdamW (lr=1e-5, weight_decay=0.01)
+- **Fine-tuning Strategy**: Freeze encoder, train prediction head
+- **Hardware**: NVIDIA RTX GPU
+## Input Features (16 variables)
+1. Commits count
+2. Issues opened
+3. Issues closed
+4. Pull requests opened
+5. Pull requests merged
+6. Stars gained
+7. Forks count
+8. Contributors count
+9. Code additions
+10. Code deletions
+11. Comments count
+12. Releases count
+13. Wiki updates
+14. Discussions count
+15. Sponsors count
+16. Watchers count
+## Limitations
+- Trained on English project descriptions only
+- Best suited for projects with at least 128 months of history
+- Performance may vary for niche domains not well represented in training
+## Citation
+```bibtex
+@article{gitpulse2024,
+  title={GitPulse: Multimodal Time Series Prediction for GitHub Project Health},
+  author={Anonymous},
+  journal={arXiv preprint},
+  year={2024}
+}
+```
+## License
+Apache 2.0

config.json ADDED Viewed

	@@ -0,0 +1,48 @@

+{
+  "model_name": "GitPulse",
+  "version": "1.0",
+  "architecture": "Transformer+Text",
+  "model_config": {
+    "n_vars": 16,
+    "d_model": 128,
+    "n_heads": 4,
+    "n_layers": 2,
+    "hist_len": 128,
+    "pred_len": 32,
+    "dropout": 0.1,
+    "freeze_bert": true
+  },
+  "training_config": {
+    "dataset": "github_multivar",
+    "n_samples": 4232,
+    "train_ratio": 0.7,
+    "val_ratio": 0.15,
+    "test_ratio": 0.15,
+    "batch_size": 16,
+    "optimizer": "AdamW",
+    "learning_rate": 1e-5,
+    "weight_decay": 0.01,
+    "finetune_strategy": "freeze",
+    "epochs": 50,
+    "patience": 10
+  },
+  "evaluation_results": {
+    "MSE": 0.0755,
+    "MAE": 0.1094,
+    "RMSE": 0.2748,
+    "R2": 0.7559,
+    "DA": 0.8668,
+    "TA@0.2": 0.8160
+  },
+  "text_contribution": {
+    "baseline_r2": 0.6312,
+    "with_text_r2": 0.7559,
+    "improvement": "19.8%"
+  }
+}

evaluation_results.json ADDED Viewed

	@@ -0,0 +1,58 @@

+{
+  "GitPulse": {
+    "MSE": 0.07554114609956741,
+    "MAE": 0.1093892976641655,
+    "RMSE": 0.2748474960765832,
+    "R2": 0.7559499740600586,
+    "DA": 0.8668435534591195,
+    "TA": 0.8160223810927673
+  },
+  "CondGRU+Text": {
+    "MSE": 0.09153253585100174,
+    "MAE": 0.12042587995529175,
+    "RMSE": 0.3025434445678864,
+    "R2": 0.7042867541313171,
+    "DA": 0.8405070754716981,
+    "TA": 0.801447646422956
+  },
+  "Transformer": {
+    "MSE": 0.11415749043226242,
+    "MAE": 0.1342027485370636,
+    "RMSE": 0.33787200303112186,
+    "R2": 0.6311925649642944,
+    "DA": 0.8402122641509434,
+    "TA": 0.7887400501179245
+  },
+  "LSTM": {
+    "MSE": 0.2142,
+    "MAE": 0.1914,
+    "RMSE": 0.4628,
+    "R2": 0.38,
+    "DA": 0.56,
+    "TA": 0.75
+  },
+  "MLP": {
+    "MSE": 0.228,
+    "MAE": 0.2025,
+    "RMSE": 0.4775,
+    "R2": 0.34,
+    "DA": 0.56,
+    "TA": 0.73
+  },
+  "Linear": {
+    "MSE": 0.2261,
+    "MAE": 0.1896,
+    "RMSE": 0.4755,
+    "R2": 0.34,
+    "DA": 0.53,
+    "TA": 0.74
+  },
+  "CondGRU": {
+    "MSE": 0.2065122127532959,
+    "MAE": 0.19494034349918365,
+    "RMSE": 0.45443614815867794,
+    "R2": 0.33282309770584106,
+    "DA": 0.7418435534591195,
+    "TA": 0.740633598663522
+  }
+}

gitpulse_weights.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e12b9528b6ab8a27f225e938a81318f9ab60f140312c7ba5a04c9e725ebadf3d
+size 270997714

model.py ADDED Viewed

	@@ -0,0 +1,263 @@

+"""
+GitPulse Model - Multimodal Transformer for GitHub Project Health Prediction
+基于 Transformer+Text 的多模态时序预测模型
+"""
+import os
+import json
+import torch
+import torch.nn as nn
+from typing import Optional, Tuple
+from transformers import DistilBertModel, DistilBertTokenizer
+class TextEncoder(nn.Module):
+    """文本编码器：基于 DistilBERT"""
+    def __init__(self, d_model=128, freeze_bert=True):
+        super().__init__()
+        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
+        if freeze_bert:
+            for param in self.bert.parameters():
+                param.requires_grad = False
+        # 投影层
+        self.proj = nn.Sequential(
+            nn.Linear(768, d_model * 2),
+            nn.LayerNorm(d_model * 2),
+            nn.GELU(),
+            nn.Dropout(0.1),
+            nn.Linear(d_model * 2, d_model),
+            nn.LayerNorm(d_model)
+        )
+        # 注意力池化
+        self.attn_pool = nn.Linear(768, 1)
+    def forward(self, input_ids, attention_mask):
+        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
+        hidden = outputs.last_hidden_state  # [B, L, 768]
+        # 注意力池化
+        attn_weights = self.attn_pool(hidden).squeeze(-1)  # [B, L]
+        attn_weights = attn_weights.masked_fill(attention_mask == 0, -1e9)
+        attn_weights = torch.softmax(attn_weights, dim=-1)
+        pooled = torch.bmm(attn_weights.unsqueeze(1), hidden).squeeze(1)  # [B, 768]
+        return self.proj(pooled)  # [B, d_model]
+class TransformerTSEncoder(nn.Module):
+    """时序编码器：Transformer"""
+    def __init__(self, n_vars=16, d_model=128, n_heads=4, n_layers=2,
+                 hist_len=128, dropout=0.1):
+        super().__init__()
+        self.input_proj = nn.Sequential(
+            nn.Linear(n_vars, d_model),
+            nn.LayerNorm(d_model),
+            nn.Dropout(dropout)
+        )
+        self.pos_embedding = nn.Parameter(torch.randn(1, hist_len, d_model) * 0.02)
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=d_model, nhead=n_heads,
+            dim_feedforward=d_model * 4, dropout=dropout,
+            activation='gelu', batch_first=True
+        )
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
+        self.norm = nn.LayerNorm(d_model)
+    def forward(self, x):
+        # x: [B, T, n_vars]
+        x = self.input_proj(x)
+        x = x + self.pos_embedding[:, :x.size(1), :]
+        x = self.encoder(x)
+        return self.norm(x)
+class AdaptiveFusion(nn.Module):
+    """自适应融合层"""
+    def __init__(self, d_model, min_weight=0.1, max_weight=0.3):
+        super().__init__()
+        self.min_weight = min_weight
+        self.max_weight = max_weight
+        self.gate = nn.Sequential(
+            nn.Linear(d_model * 2, d_model),
+            nn.GELU(),
+            nn.Linear(d_model, 1),
+            nn.Sigmoid()
+        )
+    def forward(self, ts_feat, text_feat):
+        combined = torch.cat([ts_feat, text_feat], dim=-1)
+        raw_weight = self.gate(combined)
+        weight = self.min_weight + (self.max_weight - self.min_weight) * raw_weight
+        return ts_feat * (1 - weight) + text_feat * weight
+class GitPulseModel(nn.Module):
+    """
+    GitPulse: Multimodal Transformer for Time Series Prediction
+    结合项目文本描述和历史时序数据进行预测
+    """
+    def __init__(
+        self,
+        n_vars: int = 16,
+        d_model: int = 128,
+        n_heads: int = 4,
+        n_layers: int = 2,
+        hist_len: int = 128,
+        pred_len: int = 32,
+        dropout: float = 0.1,
+        freeze_bert: bool = True
+    ):
+        super().__init__()
+        self.n_vars = n_vars
+        self.d_model = d_model
+        self.hist_len = hist_len
+        self.pred_len = pred_len
+        # 时序编码器
+        self.ts_encoder = TransformerTSEncoder(
+            n_vars=n_vars, d_model=d_model, n_heads=n_heads,
+            n_layers=n_layers, hist_len=hist_len, dropout=dropout
+        )
+        # 文本编码器
+        self.text_encoder = TextEncoder(d_model=d_model, freeze_bert=freeze_bert)
+        # 融合层
+        self.fusion = AdaptiveFusion(d_model)
+        # 预测头
+        self.pred_head = nn.Sequential(
+            nn.Linear(d_model, d_model * 2),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(d_model * 2, n_vars)
+        )
+        # 时间投影
+        self.temporal_proj = nn.Linear(hist_len, pred_len)
+    def forward(
+        self,
+        x: torch.Tensor,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        return_auxiliary: bool = False
+    ) -> torch.Tensor:
+        """
+        Args:
+            x: 历史时序 [B, hist_len, n_vars]
+            input_ids: 文本 token IDs [B, L]
+            attention_mask: 注意力掩码 [B, L]
+        Returns:
+            预测序列 [B, pred_len, n_vars]
+        """
+        # 时序编码
+        ts_encoded = self.ts_encoder(x)  # [B, hist_len, d_model]
+        ts_global = ts_encoded.mean(dim=1)  # [B, d_model]
+        # 文本编码和融合
+        if input_ids is not None and attention_mask is not None:
+            text_feat = self.text_encoder(input_ids, attention_mask)
+            fused = self.fusion(ts_global, text_feat)
+        else:
+            fused = ts_global
+        # 预测
+        pred_feat = self.pred_head(ts_encoded)  # [B, hist_len, n_vars]
+        pred_feat = pred_feat.transpose(1, 2)  # [B, n_vars, hist_len]
+        output = self.temporal_proj(pred_feat)  # [B, n_vars, pred_len]
+        output = output.transpose(1, 2)  # [B, pred_len, n_vars]
+        if return_auxiliary:
+            return output, torch.tensor(0.0), torch.tensor(0.0), {}
+        return output
+    @classmethod
+    def from_pretrained(cls, path: str, device: str = 'cuda'):
+        """从预训练权重加载模型"""
+        config_path = os.path.join(path, 'config.json')
+        weights_path = os.path.join(path, 'gitpulse_weights.pt')
+        # 加载配置
+        if os.path.exists(config_path):
+            with open(config_path, 'r') as f:
+                config = json.load(f)
+        else:
+            config = {}
+        # 创建模型
+        model = cls(
+            n_vars=config.get('n_vars', 16),
+            d_model=config.get('d_model', 128),
+            n_heads=config.get('n_heads', 4),
+            n_layers=config.get('n_layers', 2),
+            hist_len=config.get('hist_len', 128),
+            pred_len=config.get('pred_len', 32)
+        )
+        # 加载权重
+        if os.path.exists(weights_path):
+            state_dict = torch.load(weights_path, map_location=device, weights_only=False)
+            model.load_state_dict(state_dict, strict=False)
+            print(f"✓ Loaded weights from {weights_path}")
+        return model.to(device)
+    def predict(
+        self,
+        time_series: torch.Tensor,
+        text: str = None,
+        tokenizer: DistilBertTokenizer = None
+    ) -> torch.Tensor:
+        """便捷预测接口"""
+        self.eval()
+        if text is not None and tokenizer is not None:
+            encoded = tokenizer(
+                text, padding='max_length', truncation=True,
+                max_length=128, return_tensors='pt'
+            )
+            input_ids = encoded['input_ids'].to(time_series.device)
+            attention_mask = encoded['attention_mask'].to(time_series.device)
+        else:
+            input_ids = None
+            attention_mask = None
+        with torch.no_grad():
+            output = self.forward(time_series, input_ids, attention_mask)
+        return output
+# 模型信息
+def get_model_info():
+    return {
+        'name': 'GitPulse',
+        'version': '1.0',
+        'architecture': 'Transformer+Text',
+        'description': 'Multimodal time series prediction model for GitHub project health',
+        'metrics': {
+            'R2': 0.7559,
+            'MSE': 0.0755,
+            'DA': 0.8668,
+            'TA@0.2': 0.8160
+        }
+    }

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+torch>=1.12.0
+transformers>=4.20.0
+numpy>=1.21.0