---
license: apache-2.0
tags:
- networking
- webrtc
- congestion-control
- edge-ai
- gguf
- llama.cpp
- qos
- qoe
---

# NICoLE-LLM

NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.

It **predicts**:
- ECN
- Current Profile (CP)
- Next Profile (NP)

from RTP packetization and queue telemetry using compact symbolic prompting.

**Optimized** for:
- low-latency inference
- edge deployment
- GGUF quantization
- deterministic structured outputs

**Applications**:
- WebRTC adaptive streaming
- congestion-aware real-time video encoding adaptation
- in-Network QoE Optimization
- edge AI networking

---

# Profiles

| Profile | Resolution | FPS | GoP |
|---|---|---|---|
| P0 | 3840×2160 (4K) | 30 / 60 / 90 / 120 | 2 s |
| P1 | 1920×1080 | 30 / 60 / 90 / 120 | 2 s |
| P2 | 1280×720 | 30 / 60 / 90 / 120 | 2 s |
| P3 | 640×360 | 30 / 60 / 90 / 120 | 2 s |

The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.

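
The profile table above maps naturally onto a small lookup structure. The following is a minimal sketch, assuming that the integer values the model emits for the current and next profile index the rows P0 to P3 in order (the card does not state this explicitly). The names `Profile`, `PROFILES`, and `describe_profile` are hypothetical and not part of the released model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One row of the Profiles table."""
    name: str
    width: int
    height: int
    fps_options: tuple = (30, 60, 90, 120)  # same for every profile in the table
    gop_seconds: int = 2                    # same for every profile in the table


# Rows copied from the Profiles table.
# ASSUMPTION: the integer keys match the profile indices the model outputs.
PROFILES = {
    0: Profile("P0", 3840, 2160),
    1: Profile("P1", 1920, 1080),
    2: Profile("P2", 1280, 720),
    3: Profile("P3", 640, 360),
}


def describe_profile(index: int) -> str:
    """Return a human-readable summary for a profile index."""
    p = PROFILES[index]
    return f"{p.name}: {p.width}x{p.height}, GoP {p.gop_seconds} s"


print(describe_profile(1))
```

An encoder-side integration could use such a table to translate a predicted profile index into concrete resolution settings; an unknown index raises `KeyError`, which makes out-of-range predictions easy to detect.
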

---

# Prompt Format

**Input order**:

```text
PS FS IFGS IFGR CQ LQ E
```

**Output order**:

```text
E C N
```

**Example**:

```text
I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:
```

Expected output:

```text
0,1,1
```

---

# Hugging Face Usage (Python code)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YOUR_USERNAME/NICoLE-LLM"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

# Compact symbolic prompt: input header, output header, telemetry values, answer marker.
prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False
)

# Prints the prompt followed by the generated answer.
print(tok.decode(out[0], skip_special_tokens=True))
```

---

# GGUF / llama.cpp Usage

```bash
./llama-cli \
  -no-cnv \
  -t 4 \
  -m nicole-q4.gguf \
  -p "I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:" \
  -n 6 \
  --temp 0 \
  --top-k 1
```

---

# Runtime Configuration

| Parameter | Value |
|---|---|
| Runtime | llama.cpp |
| Quantization | Q4_K_M |
| Model Size | 636 MB |
| Context Length | 4096 |
| Inference | Deterministic |
| Prompting | Compact Symbolic |

---

# CPU Core Benchmark

| Threads | Response (ms) | Decisions/sec | Tokens/sec |
|---|---|---|---|
| 1 | 1325 | 0.75 | 52.71 |
| 2 | 624 | 1.60 | 113.30 |
| **4** | **343** | **2.91** | **203.35** |
| 8 | 904 | 1.11 | 60.70 |
| 16 | 1043 | 0.96 | 132.46 |
| 32 | 1432 | 0.70 | 104.29 |

**Best CPU deployment:**
- 4 threads
- 343 ms response time
- 2.91 decisions/sec

## Compact Symbolic Prompting

Compared to verbose natural-language prompting, compact symbolic prompting significantly reduces:
- prompt tokens
- KV-cache usage
- inference latency
- deployment overhead

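
The compact prompt described under Prompt Format is simple enough to build and parse with a few lines of code. The following is a minimal sketch, not the authors' tooling: the helper names `build_prompt` and `parse_decision` are hypothetical, and the separator between the four prompt segments is an assumption (newline by default, configurable through `sep`).

```python
# Field order taken from the "Input order" and "Output order" blocks.
FIELDS = ("PS", "FS", "IFGS", "IFGR", "CQ", "LQ", "E")
OUTPUTS = ("E", "C", "N")


def build_prompt(values, sep="\n"):
    """Build the compact symbolic prompt from seven telemetry values.

    ASSUMPTION: the I:/O:/U:/A: segments are joined by `sep` (newline here).
    """
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    header_in = "I:" + " ".join(FIELDS)
    header_out = "O:" + " ".join(OUTPUTS)
    user = "U:" + ",".join(str(v) for v in values)
    return sep.join([header_in, header_out, user, "A:"])


def parse_decision(text):
    """Parse a completion such as '0,1,1' into a dict keyed by E, C, N."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty completion")
    # Only the first line is considered; anything after it is ignored.
    parts = [p.strip() for p in lines[0].split(",")]
    if len(parts) != len(OUTPUTS):
        raise ValueError(f"expected {len(OUTPUTS)} fields, got {len(parts)}")
    return dict(zip(OUTPUTS, (int(p) for p in parts)))


prompt = build_prompt([1400, 40, 34, 33, 2, 0, 0])
print(prompt)
print(parse_decision("0,1,1"))
```

The text generated after `A:` by either runtime can be passed to `parse_decision`. Rejecting malformed completions with `ValueError` lets a caller fall back to its previous decision instead of acting on a partial answer.
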

---

# Quantized Models

Available quantization:
- Q4_K_M (recommended)

**Runtime**:
- llama.cpp

**Designed** for:
- edge deployment
- CPU inference
- bounded symbolic control inference
- real-time congestion-aware adaptation

---

# Limitations

- Trained under a 40 Mbps bottleneck scenario
- Designed for bounded RTP/WebRTC streaming tasks
- Not intended for open-ended conversational generation

---

# Citation

If you use this model, please cite the NICoLE paper and repository.

- **Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, *"NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?"*, IEEE/IFIP Networking, Switzerland, 2026.**