danielostrow committed
Commit e14e625 · verified · 1 Parent(s): 1174dde

Add training documentation

Files changed (1)
  1. README.md +128 -0
README.md CHANGED
@@ -284,6 +284,134 @@ Immediate detection for high-confidence C2 ports with matching behavioral patter
 
  ---
 
+ ## Training Your Own Model
+
+ C2Sentinel supports training custom weights on your own data. This is useful for:
+ - Fine-tuning on your network's specific traffic patterns
+ - Adding detection for new C2 frameworks
+ - Reducing false positives in your environment
+
+ ### Prerequisites
+
+ ```bash
+ pip install torch numpy safetensors tqdm packaging
+ ```
+
+ ### Using Pre-trained Weights
+
+ The released weights are trained on synthetic C2 beacon patterns covering 10+ framework types:
+
+ ```python
+ from c2sentinel import C2Sentinel
+
+ # Load pre-trained weights from HuggingFace
+ sentinel = C2Sentinel.from_pretrained('danielostrow/c2sentinel')
+
+ # Or load from local files
+ sentinel = C2Sentinel.load('c2_sentinel')
+ ```
+
+ ### Training From Scratch
+
+ Use the provided training script to train on synthetic data:
+
+ ```bash
+ # Basic training (20,000 samples, 100 epochs)
+ python train_model.py --epochs 100 --samples 20000
+
+ # Faster training with fewer samples
+ python train_model.py --epochs 50 --samples 10000
+
+ # Custom learning rate
+ python train_model.py --epochs 100 --samples 25000 --lr 0.0001
+ ```
+
+ ### Training on Custom Data
+
+ Create a custom dataset class that turns labeled connection records into normalized feature tensors:
+
+ ```python
+ import numpy as np
+ import torch
+ from torch.utils.data import Dataset
+ from c2sentinel import FeatureExtractor
+
+ class CustomC2Dataset(Dataset):
+     def __init__(self, labeled_connections):
+         # labeled_connections: iterable of (connections, is_c2) pairs
+         self.feature_extractor = FeatureExtractor()
+         self.samples = []
+         self.labels = []
+
+         for connections, is_c2 in labeled_connections:
+             features = self.feature_extractor.extract_features(connections)
+             self.samples.append(features)
+             self.labels.append(1 if is_c2 else 0)
+
+         # Normalize features (critical for training stability)
+         self.samples = np.array(self.samples, dtype=np.float32)
+         self.mean = np.mean(self.samples, axis=0)
+         self.std = np.std(self.samples, axis=0) + 1e-8
+         self.samples = (self.samples - self.mean) / self.std
+
+     def __len__(self):
+         return len(self.samples)
+
+     def __getitem__(self, idx):
+         return {
+             'features': torch.tensor(self.samples[idx]),
+             'label': torch.tensor(self.labels[idx], dtype=torch.float32)
+         }
+ ```
+
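At inference time, new feature vectors have to be normalized with the statistics computed on the training set. The following is a minimal sketch of that round trip, assuming the mean and std arrays were written with `np.savez`. The helper name `normalize_features`, the temporary file path, and the 4-feature toy matrix are illustrative and not part of C2Sentinel.

```python
import os
import tempfile

import numpy as np


def normalize_features(features, params_path):
    """Apply saved training-set mean/std to new features (hypothetical helper)."""
    params = np.load(params_path)
    features = np.asarray(features, dtype=np.float32)
    return (features - params["mean"]) / params["std"]


# Toy training matrix standing in for FeatureExtractor output (assumption).
train = np.array([[1.0, 10.0, 0.0, 5.0],
                  [3.0, 30.0, 0.0, 5.0]], dtype=np.float32)
mean = train.mean(axis=0)
std = train.std(axis=0) + 1e-8  # epsilon avoids division by zero for constant features

# Persist the statistics next to the model weights.
params_path = os.path.join(tempfile.mkdtemp(), "normalization_params.npz")
np.savez(params_path, mean=mean, std=std)

# Normalize an unseen sample with the training statistics, not its own.
new_sample = np.array([2.0, 20.0, 0.0, 5.0], dtype=np.float32)
normalized = normalize_features(new_sample, params_path)
```

Recomputing the mean and std on inference data instead would shift the inputs away from the distribution the model was trained on.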
+ ### Fine-tuning Pre-trained Weights
+
+ Start from pre-trained weights and fine-tune on your data:
+
+ ```python
+ from c2sentinel import LogBERTC2Sentinel, C2SentinelConfig
+ from safetensors.torch import load_file, save_file
+ import torch.optim as optim
+
+ # Load pre-trained model
+ config = C2SentinelConfig()
+ model = LogBERTC2Sentinel(config)
+ state_dict = load_file('c2_sentinel.safetensors')
+ model.load_state_dict(state_dict)
+
+ # Fine-tune with lower learning rate
+ optimizer = optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.01)
+
+ # Train on your data...
+
+ # Save fine-tuned weights
+ save_file(model.state_dict(), 'c2_sentinel_finetuned.safetensors')
+ ```
+
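The "Train on your data..." step is left open in the snippet above. One possible shape for it is sketched here under several assumptions: `TinyClassifier` is a stand-in model, because the forward signature of `LogBERTC2Sentinel` is not shown in this README; the random tensors stand in for a real dataset; and `BCEWithLogitsLoss` is an assumed loss for a single-logit binary classifier, not necessarily what `train_model.py` uses.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


# Stand-in classifier (assumption): swap in the loaded LogBERTC2Sentinel
# and adapt the forward call to its actual inputs and outputs.
class TinyClassifier(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 16), nn.ReLU(), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one logit per sample


# Random, already-normalized features and binary labels as placeholder data.
features = torch.randn(64, 8)
labels = (features[:, 0] > 0).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

model = TinyClassifier(num_features=8)
optimizer = optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.01)
criterion = nn.BCEWithLogitsLoss()

epoch_losses = []
for epoch in range(3):
    model.train()
    total = 0.0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_features), batch_labels)
        loss.backward()
        # Clip gradients before the optimizer step to limit exploding updates.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total += loss.item() * batch_features.size(0)
    # Mean loss per sample for this epoch.
    epoch_losses.append(total / len(loader.dataset))
```

In a real run, a held-out validation split would be evaluated after each epoch to decide when to stop.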
+ ### Training Tips
+
+ 1. **Feature Normalization**: Always normalize input features. Save the mean/std for inference:
+    ```python
+    np.savez('normalization_params.npz', mean=mean, std=std)
+    ```
+
+ 2. **Learning Rate**: Use 0.0001 for training from scratch, 0.00005 for fine-tuning
+
+ 3. **Gradient Clipping**: Prevent exploding gradients:
+    ```python
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+    ```
+
+ 4. **Early Stopping**: Monitor validation accuracy and stop when it plateaus
+
+ 5. **Balanced Data**: Use roughly equal C2 and benign samples
+
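The early-stopping tip can be made concrete with a small patience counter. This is a sketch, not code from the repository: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the accuracy values are all made up for illustration.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best = None
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy; return True to stop training."""
        if self.best is None or val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies that plateau after the third epoch.
history = [0.71, 0.80, 0.84, 0.84, 0.83, 0.84, 0.85]
stopper = EarlyStopping(patience=3, min_delta=0.001)
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
```

With these values the loop stops at epoch 6, after three epochs without an improvement over 0.84. Saving the weights whenever `best` is updated keeps the strongest checkpoint rather than the last one.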
+ ### Model Output Files
+
+ After training, you'll have:
+ - `c2_sentinel.safetensors` - Model weights
+ - `normalization_params.npz` - Feature normalization parameters
+ - `c2_sentinel.json` - Model configuration
+
+ ---
+
  ## Files
 
  ```