PatrickRedStar commited on
Commit
29fdac9
·
1 Parent(s): b60b1ca
.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ __pycache__/*
README.md CHANGED
@@ -1,12 +1,68 @@
1
- ---
2
- title: Logreader
3
- emoji: 🐠
4
- colorFrom: yellow
5
- colorTo: indigo
6
- sdk: gradio
7
- sdk_version: 6.1.0
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Log Compiler App
2
+
3
+ Gradio demo that ingests raw logs/stacktraces, classifies the incident type, paraphrases to human language, retrieves local runbooks, and proposes checks. The pipeline uses multiple Hugging Face transformer models (zero-shot classifier, summarizer, sentence-embedding retriever; optional reranker and NLI verifier).
4
+
5
+ ## Setup
6
+
7
+ 1. Python 3.10+ recommended.
8
+ 2. Install deps (downloads models on first run):
9
+ ```bash
10
+ pip install -r requirements.txt
11
+ ```
12
+
13
+ ## Run
14
+
15
+ ```bash
16
+ python app.py
17
+ ```
18
+
19
+ Gradio UI will open in the browser. Models load once at startup.
20
+
21
+ Если localhost недоступен (WSL/прокси), приложение по умолчанию включает share-ссылку. Чтобы явно управлять:
22
+ ```bash
23
+ # форсировать публичный линк
24
+ GRADIO_SHARE=1 python app.py
25
+ # отключить share (если localhost доступен)
26
+ GRADIO_SHARE=0 python app.py
27
+ ```
28
+ Запуск уже включает `server_name=0.0.0.0`.
29
+
30
+ ## Запуск на Hugging Face Spaces
31
+
32
+ - Выложите содержимое репозитория в новый Space (Gradio).
33
+ - `app.py` автоматически отключит `share` в окружении Spaces (`SPACE_ID`/`HF_SPACE`), так что дополнительная настройка не нужна.
34
+ - Заводываются зависимости из `requirements.txt` автоматически. При необходимости можно добавить `runtime.txt` с версией Python (например, `python-3.10`).
35
+
36
+ ## How to use
37
+
38
+ - Paste logs/stacktrace in the left text box.
39
+ - Pick a source (auto/python/java/node/k8s).
40
+ - Toggle:
41
+ - `Use retrieval (local KB)` to search `kb/` runbooks.
42
+ - `Verify hypothesis (NLI)` to check entailment against the logs.
43
+ - Adjust verbosity (0-2) for explanation detail.
44
+ - Click **Analyze**. Use **Generate ticket template** to get a pre-filled ticket, and **Export JSON** to download results.
45
+
46
+ ## Samples
47
+
48
+ Try the provided snippets:
49
+
50
+ - `samples/sample_python.txt` – HTTP timeout stacktrace.
51
+ - `samples/sample_k8s.txt` – CrashLoop/OOMKilled pod.
52
+ - `samples/sample_java.txt` – NullPointerException auth failure.
53
+
54
+ ## Files
55
+
56
+ - `app.py` – Gradio UI wiring.
57
+ - `pipeline.py` – ML pipeline (classification, summarization, retrieval, NLI).
58
+ - `preprocess.py` – Masking, signature detection, safe truncation.
59
+ - `retrieval.py` – Embedding search over `kb/` markdown runbooks.
60
+ - `kb/` – Local runbooks (edit/add your own).
61
+ - `samples/` – Example logs to paste.
62
+
63
+ ## Notes
64
+
65
+ - First run downloads models (`facebook/bart-large-mnli`, `sshleifer/distilbart-cnn-12-6`, `sentence-transformers/all-MiniLM-L6-v2`, optional reranker `cross-encoder/ms-marco-MiniLM-L-6-v2`, and NLI `typeform/distilbert-base-uncased-mnli`).
66
+ - Handles empty/short input with a clear error.
67
+ - Output tabs: Incident Type, Human Explanation, Likely Cause + Checks, Retrieved Runbooks, Verification, Ticket Template.
68
+ - Pin `numpy<2.0` to avoid ABI conflicts with PyTorch/Transformers wheels. If you already installed numpy 2.x, run `pip install 'numpy<2' --upgrade`.
app.py ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os
3
+ import tempfile
4
+ from typing import Any, Dict, List, Optional
5
+
6
+ import gradio as gr
7
+
8
+ from pipeline import IncidentPipeline, IncidentResult, serialize_result
9
+ from preprocess import truncate_logs
10
+
11
+
12
# Built once at import time so the transformer models load a single time per process.
pipeline = IncidentPipeline()
13
+
14
+
15
def format_incident_section(result: IncidentResult) -> str:
    """Render the incident classification as a Markdown summary block."""
    alternatives = [f"{alt['label']} ({alt['score']:.2f})" for alt in result.incident_alternatives]
    alt_text = ", ".join(alternatives)
    signature_text = ", ".join(result.signatures) if result.signatures else "none"
    parts = [
        f"**Incident:** {result.incident_label} (confidence {result.incident_score:.2f})",
        f"**Top alternatives:** {alt_text if alt_text else 'n/a'}",
        f"**Detected signatures:** {signature_text}",
    ]
    return "\n\n".join(parts)
23
+
24
+
25
def format_cause_section(result: IncidentResult) -> str:
    """Render the likely cause plus its check list as Markdown."""
    bullets = "\n".join(f"- {check}" for check in result.checks)
    return f"**Likely cause:** {result.likely_cause}\n\n**Checks / next steps:**\n{bullets}"
28
+
29
+
30
def analyze_logs(logs: str, source: str, use_retrieval: bool, use_nli: bool, verbosity: int):
    """Run the incident pipeline and map the result onto the UI outputs.

    Returns a 7-tuple: (incident markdown, explanation, cause markdown,
    retrieval rows, verification rows, state object, status message).
    On failure the markdown slots carry an error message and state is None.
    """
    try:
        result = pipeline.process(
            logs,
            source=source,
            use_retrieval=use_retrieval,
            use_nli=use_nli,
            verbosity=verbosity,
        )
    except Exception as exc:  # surface pipeline failures in the UI instead of crashing
        no_rows: List[List[Any]] = []
        return (f"Error: {exc}", "", "", no_rows, no_rows, None, f"Failed: {exc}")

    kb_rows = [
        [hit["title"], round(hit["score"], 3), hit["path"], hit["excerpt"]]
        for hit in result.retrieved
    ]
    nli_rows = [
        [item["hypothesis"], item["label"], round(item["score"], 3)]
        for item in result.verification
    ]
    return (
        format_incident_section(result),
        result.explanation,
        format_cause_section(result),
        kb_rows,
        nli_rows,
        result,
        "Analysis completed.",
    )
68
+
69
+
70
def ticket_template(state: Optional[IncidentResult], logs: str) -> str:
    """Build a pre-filled ticket body from the last analysis result.

    Returns a prompt to run the analysis when no result is cached yet.
    """
    if state is None:
        return "Run analysis first."
    snippet = truncate_logs(logs, head_lines=30, tail_lines=10, max_lines=60)
    check_bullets = "\n".join(f"- {check}" for check in state.checks)
    summary_line = f"{state.incident_label} — {state.explanation[:180]}"
    sections = [
        f"Summary:\n{summary_line}",
        "Steps to reproduce:\n- Describe sequence leading to error (fill in).\n- Attach failing request/sample data.",
        "Expected:\n- Service handles request successfully.",
        f"Actual:\n- {state.likely_cause}",
        f"Checks performed / next steps:\n{check_bullets}",
        f"Logs snippet:\n{snippet}\n",
    ]
    return "\n\n".join(sections)
85
+
86
+
87
def export_json(state: Optional[IncidentResult]):
    """Serialize the cached result to a temporary ``.json`` file.

    Returns the file path for the Gradio File component, or None when no
    analysis has run yet. The file is deliberately not auto-deleted
    (``delete=False``) so Gradio can serve it after this function returns.
    """
    if state is None:
        return None
    payload = serialize_result(state)
    # Context manager guarantees the handle is flushed and closed even if the
    # write raises, instead of leaking an open temp file.
    with tempfile.NamedTemporaryFile(
        "w", delete=False, suffix=".json", encoding="utf-8"
    ) as handle:
        handle.write(payload)
        return handle.name
96
+
97
+
98
# UI layout and event wiring. Built at import time; launched from __main__.
with gr.Blocks(title="Log Compiler App") as demo:
    gr.Markdown("# Log Compiler App\nPaste logs/stacktrace to get incident classification, explanations, and runbook suggestions.")
    # Holds the last IncidentResult so the ticket/export callbacks can reuse it.
    state = gr.State()

    with gr.Row():
        # Left column: inputs, options, and action buttons.
        with gr.Column(scale=1):
            logs_input = gr.Textbox(lines=20, label="Logs / Stacktrace", placeholder="Paste logs here...")
            source_dropdown = gr.Dropdown(
                ["auto", "python", "java", "node", "k8s"],
                value="auto",
                label="Source",
            )
            use_retrieval = gr.Checkbox(value=True, label="Use retrieval (local KB)")
            use_nli = gr.Checkbox(value=False, label="Verify hypothesis (NLI)")
            verbosity_slider = gr.Slider(0, 2, value=1, step=1, label="Verbosity")
            analyze_btn = gr.Button("Analyze")
            ticket_btn = gr.Button("Generate ticket template")
            export_btn = gr.Button("Export JSON")
            json_output = gr.File(label="JSON export")
            status = gr.Markdown("Ready.")
        # Right column: tabbed result views.
        # NOTE(review): Gradio documents `scale` as an integer; 1.2 may be
        # coerced or emit a warning depending on the pinned version — confirm.
        with gr.Column(scale=1.2):
            with gr.Tab("Incident Type"):
                incident_md = gr.Markdown()
            with gr.Tab("Human Explanation"):
                explanation_md = gr.Markdown()
            with gr.Tab("Likely Cause + Checks"):
                cause_md = gr.Markdown()
            with gr.Tab("Retrieved Runbooks"):
                retrieval_df = gr.Dataframe(
                    headers=["Title", "Score", "Path", "Excerpt"],
                    datatype=["str", "number", "str", "str"],
                    interactive=False,
                )
            with gr.Tab("Verification"):
                verification_df = gr.Dataframe(
                    headers=["Hypothesis", "Label", "Score"],
                    datatype=["str", "str", "number"],
                    interactive=False,
                )
            with gr.Tab("Ticket Template"):
                ticket_md = gr.Markdown()

    # Analyze fills every tab plus the shared state and status line.
    analyze_btn.click(
        fn=analyze_logs,
        inputs=[logs_input, source_dropdown, use_retrieval, use_nli, verbosity_slider],
        outputs=[incident_md, explanation_md, cause_md, retrieval_df, verification_df, state, status],
    )

    # Ticket generation reuses the cached state plus the raw log text.
    ticket_btn.click(
        fn=ticket_template,
        inputs=[state, logs_input],
        outputs=ticket_md,
    )

    # Export writes the cached state to a temp JSON file served by gr.File.
    export_btn.click(
        fn=export_json,
        inputs=state,
        outputs=json_output,
    )
157
+
158
+
159
if __name__ == "__main__":
    """Entry point: decide on the Gradio share link, then launch."""
    share_env = os.getenv("GRADIO_SHARE")
    in_hf_space = bool(os.getenv("SPACE_ID") or os.getenv("HF_SPACE"))
    # Spaces serves the app itself, so no share link is needed there; locally we
    # default to share=True to work around unreachable-localhost setups (WSL/proxies).
    if in_hf_space:
        share_flag = False
    elif share_env is None:
        share_flag = True
    else:
        share_flag = share_env.lower() in ("1", "true", "yes")
    demo.launch(server_name="0.0.0.0", share=share_flag)
kb/auth_401_403.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Auth 401/403
2
+
3
+ ## Symptoms
4
+ - API returns 401/403 for valid requests
5
+ - `Invalid token`, `permission denied`, or `signature mismatch` in logs
6
+ - Clock skew errors in authentication service
7
+
8
+ ## Checks
9
+ - Validate access token expiration and issuer
10
+ - Confirm user/service account scopes/roles
11
+ - Check client/server clock skew (NTP)
12
+ - Review recent secret/credential rotations
13
+ - Inspect identity provider availability and rate limits
14
+
15
+ ## Fix
16
+ - Refresh/rotate tokens or credentials
17
+ - Grant correct roles/scopes to caller
18
+ - Align clocks and retry
19
+ - Apply retry/backoff if IDP is throttling
kb/db_connection_pool.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Database Connection Pool Exhaustion
2
+
3
+ ## Symptoms
4
+ - `too many connections`, `connection refused`, or pool timeout errors
5
+ - Spikes in DB connections and wait time
6
+ - Slow queries or lock contention observed
7
+
8
+ ## Checks
9
+ - Inspect application pool size vs database max connections
10
+ - Review slow queries and transaction lengths
11
+ - Check connection leak metrics and proper closing
12
+ - Validate DB host/port/DNS and TLS settings
13
+ - Examine recent traffic/load changes
14
+
15
+ ## Fix
16
+ - Tune pool size and DB max connections
17
+ - Fix connection leaks and ensure pooling is enabled
18
+ - Optimize slow queries; add indexes where needed
19
+ - Scale database or replicas to handle load
kb/dns_failure.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # DNS Resolution Failure
2
+
3
+ ## Symptoms
4
+ - `getaddrinfo ENOTFOUND` or `NameResolutionFailure`
5
+ - Transient errors resolving service hostnames
6
+ - Works from one namespace/host but not another
7
+
8
+ ## Checks
9
+ - Resolve target host from pod/host (`nslookup`, `dig`)
10
+ - Inspect `/etc/resolv.conf` search domains and ndots
11
+ - Verify CoreDNS logs for SERVFAIL/REFUSED
12
+ - Check recent DNS changes or missing A/CNAME records
13
+ - Validate network policies allowing DNS traffic
14
+
15
+ ## Fix
16
+ - Correct service/record names and search domains
17
+ - Restart CoreDNS or propagate zone updates
18
+ - Add caching and lower ndots if needed
19
+ - Update network policies to allow UDP/TCP 53
kb/java_null_pointer.md ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Java NullPointerException
2
+
3
+ ## Symptoms
4
+ - Stacktrace contains `NullPointerException`
5
+ - Fails during specific code paths or after deploy
6
+ - Sometimes triggered by missing config or feature flags
7
+
8
+ ## Checks
9
+ - Inspect stacktrace frames to find source file and line
10
+ - Validate configuration/feature flag defaults are present
11
+ - Add guards for optional fields and log offending values
12
+ - Review recent code changes around the failing area
13
+ - Add unit tests covering null/missing data inputs
14
+
15
+ ## Fix
16
+ - Add null checks and default values
17
+ - Ensure config/flags are loaded before use
18
+ - Deploy fix and monitor for recurrence
kb/k8s_crashloop.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # K8s CrashLoopBackOff
2
+
3
+ ## Symptoms
4
+ - Pod restarts repeatedly with `CrashLoopBackOff`
5
+ - Health checks failing or process exits quickly
6
+ - Logs end abruptly after startup
7
+
8
+ ## Checks
9
+ - Inspect last container logs before restart (`kubectl logs -p`)
10
+ - Validate readiness/liveness probes and command/args
11
+ - Confirm config/secret mounts paths and permissions
12
+ - Check for missing env vars or failing dependency endpoints
13
+ - Review recent image/config deploys
14
+
15
+ ## Fix
16
+ - Correct probes or increase initial delays
17
+ - Fix missing configuration or credentials
18
+ - Add retries/backoff around external dependencies
19
+ - Roll back to last known good release if needed
kb/oom_killed.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # OOM / Memory Pressure
2
+
3
+ ## Symptoms
4
+ - Pod terminated with `OOMKilled` or JVM `OutOfMemoryError`
5
+ - Memory usage climbs until eviction
6
+ - Core dump or heap dump generated
7
+
8
+ ## Checks
9
+ - Compare pod/container memory limits vs peak usage
10
+ - Inspect heap/thread dumps for leaks or unbounded caches
11
+ - Validate GC settings and memory flags
12
+ - Check for large payloads or unbounded batching
13
+ - Review sidecar/agent memory consumption
14
+
15
+ ## Fix
16
+ - Right-size memory requests/limits
17
+ - Fix leaks or cap caches/buffers
18
+ - Split large batches/payloads
19
+ - Adjust GC or runtime memory options
kb/timeout_slow_service.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Timeout / Slow Dependency
2
+
3
+ ## Symptoms
4
+ - Requests fail with `timeout` or `deadline exceeded`
5
+ - Upstream latency spikes in metrics
6
+ - Retries triggered with backoff
7
+
8
+ ## Checks
9
+ - Measure RTT and latency between caller and dependency
10
+ - Inspect dependency saturation (CPU, connections, queue depth)
11
+ - Check long-running queries or GC pauses
12
+ - Validate timeout/retry configuration and circuit breakers
13
+ - Review recent deploys or infrastructure changes impacting latency
14
+
15
+ ## Fix
16
+ - Tune timeouts/retries to realistic values
17
+ - Optimize slow queries or handlers
18
+ - Add caching/batching if appropriate
19
+ - Scale dependency horizontally or vertically
kb/tls_handshake.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # TLS Handshake Issues
2
+
3
+ ## Symptoms
4
+ - `SSLHandshakeException`, `certificate verify failed`, or `unknown_ca`
5
+ - Works with curl -k but fails with client defaults
6
+ - Errors after certificate rotation
7
+
8
+ ## Checks
9
+ - Validate certificate chain, expiry, and SAN/hostname match
10
+ - Confirm protocol/cipher compatibility between client and server
11
+ - Check ALPN/SNI configuration for proxies or ingress
12
+ - Inspect system trust store and custom CA bundles
13
+ - Review mTLS settings and key/cert presence
14
+
15
+ ## Fix
16
+ - Install correct CA bundle and full certificate chain
17
+ - Align TLS versions/ciphers or disable legacy protocols
18
+ - Configure SNI/ALPN correctly on clients and proxies
19
+ - Rotate certificates/keys and restart workloads
pipeline.py ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os
3
+ from dataclasses import asdict, dataclass
4
+ from typing import Dict, List, Optional
5
+
6
+ # Force CPU usage to avoid CUDA capability issues in WSL/GPU-mismatch environments.
7
+ os.environ.setdefault("CUDA_VISIBLE_DEVICES", "")
8
+
9
+ from transformers import pipeline
10
+
11
+ from preprocess import PreprocessResult, preprocess_logs
12
+ from retrieval import RunbookRetriever
13
+
14
+
15
# Candidate incident categories for the zero-shot classifier. These strings
# double as keys into the cause/check tables in
# IncidentPipeline.generate_cause_and_checks, so renaming one must be done in
# both places.
CANDIDATE_LABELS = [
    "oom",
    "timeout",
    "auth_failure",
    "db_connection",
    "dns_resolution",
    "tls_handshake",
    "crashloop",
    "null_pointer",
    "resource_exhaustion",
    "network_partition",
]
27
+
28
+
29
@dataclass
class IncidentResult:
    """Aggregated output of IncidentPipeline.process for one log submission."""

    # Top predicted incident category and its classifier confidence.
    incident_label: str
    incident_score: float
    # Runner-up labels as [{"label": ..., "score": ...}, ...] (up to three).
    incident_alternatives: List[Dict]
    # Plain-language summary of the logs produced by the summarizer.
    explanation: str
    # One-line likely-cause statement derived from the label.
    likely_cause: str
    # Suggested diagnostic steps (padded to at least 5, capped at 10).
    checks: List[str]
    # Runbook hits: [{"title", "score", "path", "excerpt"}, ...].
    retrieved: List[Dict]
    # NLI verdicts: [{"hypothesis", "label", "score"}, ...]; empty unless use_nli.
    verification: List[Dict]
    # Signature tags detected during preprocessing (e.g. "stacktrace", "oom").
    signatures: List[str]
40
+
41
+
42
class ModelStore:
    """Container that eagerly loads all transformer pipelines at construction.

    Every pipeline is pinned to CPU (``device=-1``), matching the
    CUDA_VISIBLE_DEVICES guard set at module import.
    """

    def __init__(self):
        # Zero-shot classifier used for incident-type prediction.
        self.classifier = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
            device=-1,
        )
        # Abstractive summarizer used for the human-readable explanation.
        self.summarizer = pipeline(
            "summarization",
            model="sshleifer/distilbart-cnn-12-6",
            device=-1,
        )
        # NLI model used for optional hypothesis verification.
        self.nli = pipeline(
            "text-classification",
            model="typeform/distilbert-base-uncased-mnli",
            device=-1,
        )
59
+
60
+
61
class IncidentPipeline:
    """End-to-end incident analysis over raw log text.

    Combines zero-shot classification, summarization, local runbook
    retrieval, and optional NLI verification. ``process`` is the single
    entry point used by the UI layer.
    """

    def __init__(self):
        # Models and KB embeddings load eagerly so the first request is fast.
        self.models = ModelStore()
        self.retriever = RunbookRetriever()

    def classify(self, text: str, source: str) -> Dict:
        """Zero-shot classify *text* into an incident category.

        Returns ``{"label", "score", "alternatives"}`` where alternatives
        holds up to three runner-up labels with their scores.
        """
        labels = list(CANDIDATE_LABELS)
        if source and source != "auto":
            # Offer the classifier a source-flavored catch-all label too.
            labels.append(f"{source}_specific")
        res = self.models.classifier(text, candidate_labels=labels, multi_label=False)
        label = res["labels"][0]
        score = float(res["scores"][0])
        alternatives = [
            {"label": res["labels"][i], "score": float(res["scores"][i])}
            for i in range(1, min(4, len(res["labels"])))
        ]
        return {"label": label, "score": score, "alternatives": alternatives}

    def explain(self, text: str, verbosity: int = 1) -> str:
        """Summarize *text* in plain language; higher verbosity → longer summary."""
        max_len = 180 + 60 * verbosity
        min_len = 40 + 20 * verbosity
        summary = self.models.summarizer(
            text,
            max_length=max_len,
            min_length=min_len,
            truncation=True,
        )[0]["summary_text"]
        return summary

    def generate_cause_and_checks(
        self, result: PreprocessResult, label: str, retrieved: List[Dict]
    ) -> tuple[str, List[str]]:
        """Map the predicted *label* to a likely-cause line plus 5-10 checks.

        Checks start with generic triage steps, then label-specific steps,
        then a pointer to the best retrieved runbook (if any).
        """
        cause_map = {
            "oom": "Service likely exhausted memory and was terminated.",
            "crashloop": "Container keeps restarting due to repeated failures or failed health checks.",
            "timeout": "Upstream or dependency timed out handling the request.",
            "auth_failure": "Authentication/authorization failed (expired token, missing permissions, or misconfiguration).",
            "db_connection": "Database connection pool exhausted or connection refused.",
            "dns_resolution": "DNS resolution failed for upstream host.",
            "tls_handshake": "TLS handshake failed (bad cert, protocol mismatch).",
            "null_pointer": "Application hit null/None reference and crashed.",
            "resource_exhaustion": "System resources (CPU/file descriptors) exhausted.",
            "network_partition": "Network partition or connectivity issue between components.",
        }
        # Label-specific diagnostic steps (OOM is handled separately below
        # because it can also be triggered by a detected signature).
        label_checks = {
            "timeout": [
                "Measure latency between service and dependencies.",
                "Verify retry/backoff settings and circuit breakers.",
                "Check for slow queries or downstream saturation.",
            ],
            "auth_failure": [
                "Verify tokens/credentials validity and scopes.",
                "Check clock skew between services.",
                "Review authentication provider health and rate limits.",
            ],
            "db_connection": [
                "Check DB connection pool size vs load.",
                "Inspect database for locks or slow queries.",
                "Verify database host/port/DNS correctness.",
            ],
            "dns_resolution": [
                "Resolve target host from pod/host manually.",
                "Check DNS server health and recent DNS changes.",
                "Verify search domains and /etc/resolv.conf inside pod/container.",
            ],
            "tls_handshake": [
                "Validate certificates (expiry, SANs, chain).",
                "Check protocol/cipher compatibility between client and server.",
                "Inspect ALPN/SNI configuration.",
            ],
            "crashloop": [
                "Inspect startup probes/health checks and command overrides.",
                "Review last logs before restart for root cause.",
                "Confirm config/secret mounts exist and permissions are correct.",
            ],
        }
        cause = cause_map.get(label, f"Most likely incident category: {label}.")
        checks: List[str] = [
            "Confirm timeframe of failure in logs and recent deploys.",
            "Check service and pod/resource metrics (CPU, memory, restarts) around the incident window.",
            "Inspect recent configuration or secrets changes.",
        ]
        # OOM advice applies when the memory signature was detected during
        # preprocessing even if the classifier picked another label.
        if label == "oom" or "oom" in result.signatures:
            checks += [
                "Inspect container memory limits/requests and current usage.",
                "Review heap/thread dumps if available.",
                "Check for memory leaks or unbounded caches.",
                "Ensure JVM/Runtime memory flags are configured correctly.",
            ]
        checks += label_checks.get(label, [])
        if retrieved:
            checks.append(f"Consult runbook: {retrieved[0]['title']} (score {retrieved[0]['score']:.2f}).")
        # Pad to at least five checks so the UI list never looks empty.
        while len(checks) < 5:
            checks.append("Add extra diagnostic step: capture more logs and metrics.")
        return cause, checks[:10]

    def verify_hypotheses(self, premise: str, hypotheses: List[str]) -> List[Dict]:
        """Score each hypothesis against the log text with the NLI model."""
        results = []
        for hyp in hypotheses:
            pred = self.models.nli({"text": premise, "text_pair": hyp})
            # Bug fix: the text-classification pipeline only wraps *str* inputs
            # in a list; a single dict input returns a plain dict, so the old
            # ``[...][0]`` indexing raised KeyError. Accept both shapes.
            if isinstance(pred, list):
                pred = pred[0]
            results.append(
                {"hypothesis": hyp, "label": pred["label"], "score": float(pred["score"])}
            )
        return results

    def process(
        self,
        raw_text: str,
        source: str = "auto",
        use_retrieval: bool = True,
        use_nli: bool = False,
        verbosity: int = 1,
    ) -> IncidentResult:
        """Run the full analysis pipeline over *raw_text*.

        Raises:
            ValueError: if *raw_text* is empty or whitespace-only.
        """
        if not raw_text or not raw_text.strip():
            raise ValueError("Logs input is empty. Please provide logs or stacktrace text.")
        pre = preprocess_logs(raw_text)
        cls = self.classify(pre.cleaned_text, source)
        explanation = self.explain(pre.cleaned_text, verbosity=verbosity)
        retrieved = self.retriever.search(pre.cleaned_text, top_k=3) if use_retrieval else []
        cause, checks = self.generate_cause_and_checks(pre, cls["label"], retrieved)
        verification = []
        if use_nli:
            # Verify the cause statement plus every runbook match against the logs.
            hypotheses = [cause] + [f"Runbook match: {r['title']}" for r in retrieved]
            verification = self.verify_hypotheses(pre.cleaned_text, hypotheses)
        return IncidentResult(
            incident_label=cls["label"],
            incident_score=cls["score"],
            incident_alternatives=cls["alternatives"],
            explanation=explanation,
            likely_cause=cause,
            checks=checks,
            retrieved=retrieved,
            verification=verification,
            signatures=pre.signatures,
        )
198
+
199
+
200
def serialize_result(result: IncidentResult) -> str:
    """Render an IncidentResult as pretty-printed, non-ASCII-safe JSON."""
    payload = asdict(result)
    return json.dumps(payload, ensure_ascii=False, indent=2)
preprocess.py ADDED
@@ -0,0 +1,67 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ from dataclasses import dataclass
3
+ from typing import List, Tuple
4
+
5
+
6
# Masking patterns: environment-specific or personally identifiable tokens are
# replaced with placeholders before the text is sent to the models.
UUID_RE = re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b")  # RFC 4122 UUIDs, versions 1-5
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # dotted-quad IPv4 (octets not range-checked)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")  # common email shapes
PATH_RE = re.compile(r"(?:[A-Za-z]:)?(?:/|\\)[\w\-/\\\.]+")  # absolute POSIX/Windows filesystem paths
TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?\b")  # ISO-like timestamps; used for signature detection only, never masked
11
+
12
+
13
@dataclass
class PreprocessResult:
    """Outcome of log preprocessing: masked text plus detected metadata."""

    # Truncated log text with sensitive tokens replaced by placeholders.
    cleaned_text: str
    # Coarse feature tags found in the text (e.g. "stacktrace", "k8s", "oom").
    signatures: List[str]
    # Audit trail of masked values, each formatted "<PLACEHOLDER>:original".
    masked: List[str]
18
+
19
+
20
def detect_signatures(text: str) -> List[str]:
    """Return coarse feature tags found in *text*, in a fixed order.

    Possible tags: "stacktrace", "timestamps", "log_levels", "k8s",
    "oom", "timeout".
    """
    found: List[str] = []

    def _hit(pattern: str, flags: int = 0) -> bool:
        # Small helper so each signature check reads as one line.
        return re.search(pattern, text, flags) is not None

    if _hit(r"Traceback|Exception|Error:|Caused by:", re.IGNORECASE):
        found.append("stacktrace")
    if TIMESTAMP_RE.search(text) is not None:
        found.append("timestamps")
    if _hit(r"\bINFO\b|\bWARN\b|\bERROR\b|\bDEBUG\b|\bTRACE\b"):
        found.append("log_levels")
    if _hit(r"CrashLoopBackOff|OOMKilled|Back-off restarting", re.IGNORECASE):
        found.append("k8s")
    if _hit(r"OutOfMemoryError|Java heap space", re.IGNORECASE):
        found.append("oom")
    if _hit(r"timeout|timed out|Connection timed out", re.IGNORECASE):
        found.append("timeout")
    return found
35
+
36
+
37
def mask_sensitive(text: str) -> Tuple[str, List[str]]:
    """Replace UUIDs, IPs, emails, and filesystem paths with placeholders.

    Returns the masked text and an audit list of "placeholder:value"
    entries for every replaced match. Patterns are applied in a fixed
    order, each over the output of the previous substitution.
    """
    audit: List[str] = []
    replacements = (
        (UUID_RE, "<UUID>"),
        (IP_RE, "<IP>"),
        (EMAIL_RE, "<EMAIL>"),
        (PATH_RE, "<PATH>"),
    )
    for pattern, placeholder in replacements:
        hits = pattern.findall(text)
        if hits:
            audit.extend(f"{placeholder}:{hit}" for hit in hits)
        text = pattern.sub(placeholder, text)
    return text, audit
51
+
52
+
53
def truncate_logs(text: str, head_lines: int = 120, tail_lines: int = 80, max_lines: int = 400) -> str:
    """Shorten *text* to its first *head_lines* and last *tail_lines* lines.

    Returns *text* unchanged when it has at most *max_lines* lines, or when
    keeping head + tail would not actually shrink it. The second guard also
    prevents the head and tail slices from overlapping and duplicating lines
    when *max_lines* is configured smaller than ``head_lines + tail_lines``.
    The removed middle section is replaced by a literal "..." line.
    """
    lines = text.splitlines()
    # No-op for short inputs and for degenerate head/tail settings.
    if len(lines) <= max_lines or len(lines) <= head_lines + tail_lines:
        return text
    head = "\n".join(lines[:head_lines])
    tail = "\n".join(lines[-tail_lines:])
    return head + "\n...\n" + tail
60
+
61
+
62
def preprocess_logs(raw_text: str) -> PreprocessResult:
    """Normalize, truncate, mask, and fingerprint raw log text.

    Signatures are detected on the masked text so placeholder substitution
    happens first.
    """
    trimmed = truncate_logs(raw_text.strip())
    cleaned, audit = mask_sensitive(trimmed)
    return PreprocessResult(
        cleaned_text=cleaned,
        signatures=detect_signatures(cleaned),
        masked=list(audit),
    )
requirements.txt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ gradio==4.44.0
2
+ transformers==4.38.2
3
+ torch>=2.2.0,<3.0
4
+ sentence-transformers==2.5.1
5
+ numpy>=1.26.0,<2.0.0
retrieval.py ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from dataclasses import dataclass
3
+ from typing import List, Optional
4
+
5
+ import numpy as np
6
+ import torch
7
+ from sentence_transformers import CrossEncoder, SentenceTransformer, util
8
+
9
+
10
@dataclass
class RunbookDoc:
    """One markdown runbook loaded from the knowledge-base directory."""

    # Filesystem path the document was read from.
    path: str
    # First heading line of the file (filename used as a fallback).
    title: str
    # Full markdown body (also the text that gets embedded).
    content: str
15
+
16
+
17
class RunbookRetriever:
    """Embedding-based search over local markdown runbooks in *kb_dir*.

    Documents are embedded once at startup with a SentenceTransformer;
    queries are matched by cosine similarity, with an optional CrossEncoder
    reranking pass over a widened candidate pool.
    """

    def __init__(
        self,
        kb_dir: str = "kb",
        embed_model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        reranker_name: Optional[str] = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.kb_dir = kb_dir
        # Force CPU to avoid CUDA capability mismatches in WSL/GPUs.
        self.device = torch.device("cpu")
        self.embed_model = SentenceTransformer(embed_model_name, device=self.device)
        self.reranker: Optional[CrossEncoder] = None
        if reranker_name:
            try:
                # Reranker is best-effort; fall back to pure embedding scores
                # when the model cannot be downloaded/loaded.
                self.reranker = CrossEncoder(reranker_name, device=self.device)
            except Exception:
                self.reranker = None
        self.docs = self._load_docs()
        if self.docs:
            self.doc_embeddings = self.embed_model.encode(
                [doc.content for doc in self.docs],
                convert_to_tensor=True,
                device=self.device,
            )
        else:
            # No KB present: search() short-circuits to an empty result.
            self.doc_embeddings = None

    def _load_docs(self) -> List[RunbookDoc]:
        """Read all ``*.md`` files under kb_dir into RunbookDoc records.

        Files are loaded in sorted name order so document indices (and
        score tie-breaking in search) are deterministic across platforms;
        ``os.listdir`` alone returns an arbitrary order.
        """
        docs: List[RunbookDoc] = []
        if not os.path.isdir(self.kb_dir):
            return docs
        for fname in sorted(os.listdir(self.kb_dir)):
            if not fname.endswith(".md"):
                continue
            path = os.path.join(self.kb_dir, fname)
            with open(path, "r", encoding="utf-8") as f:
                content = f.read()
            # Title = first non-empty line minus markdown heading markers;
            # previously a leading blank line produced an empty title.
            title = fname
            for line in content.splitlines():
                stripped = line.strip()
                if stripped:
                    title = stripped.lstrip("# ").strip()
                    break
            docs.append(RunbookDoc(path=path, title=title, content=content))
        return docs

    def search(self, query: str, top_k: int = 3):
        """Return up to *top_k* runbooks as dicts with title/score/path/excerpt."""
        if not self.docs or self.doc_embeddings is None:
            return []
        query_emb = self.embed_model.encode(query, convert_to_tensor=True, device=self.device)
        scores = util.cos_sim(query_emb, self.doc_embeddings)[0]
        # Widen the candidate pool beyond top_k so the reranker has room to reorder.
        top_results = np.argsort(-scores.cpu().numpy())[: top_k * 4]
        candidates = [
            {"doc": self.docs[idx], "score": float(scores[idx])} for idx in top_results
        ]
        if self.reranker:
            pairs = [[query, c["doc"].content] for c in candidates]
            rerank_scores = self.reranker.predict(pairs)
            for cand, rscore in zip(candidates, rerank_scores):
                cand["rerank_score"] = float(rscore)
            candidates.sort(key=lambda x: x.get("rerank_score", x["score"]), reverse=True)
        else:
            candidates.sort(key=lambda x: x["score"], reverse=True)
        return [
            {
                "title": cand["doc"].title,
                "score": cand.get("rerank_score", cand["score"]),
                "path": cand["doc"].path,
                "excerpt": cand["doc"].content[:500],
            }
            for cand in candidates[:top_k]
        ]
samples/sample_java.txt ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ 2024-11-04 08:15:10,042 ERROR c.example.auth.AuthFilter - Failed to authorize request
2
+ java.lang.NullPointerException: Cannot invoke "String.length()" because "token" is null
3
+ at com.example.auth.AuthFilter.validate(AuthFilter.java:64)
4
+ at com.example.auth.AuthFilter.doFilter(AuthFilter.java:45)
5
+ at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
6
+ Caused by: java.lang.IllegalStateException: clock skew too high
7
+ at com.example.auth.TokenVerifier.verify(TokenVerifier.java:32)
8
+
9
+ User received 403 on /api/orders
samples/sample_k8s.txt ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 2024-11-04T09:05:11Z kubelet Warning BackOff Back-off restarting failed container
2
+ 2024-11-04T09:05:11Z kubelet Normal Pulled Successfully pulled image "myapp:v2"
3
+ 2024-11-04T09:05:12Z kubelet Warning BackOff Back-off restarting failed container
4
+
5
+ kubectl describe pod myapp-6c8f6f4b4f-nlg8v:
6
+ Last State: Terminated
7
+ Reason: Error
8
+ Exit Code: 137
9
+ Started: Mon, 04 Nov 2024 09:05:05 +0000
10
+ Finished: Mon, 04 Nov 2024 09:05:10 +0000
11
+ Events:
12
+ Warning BackOff 2m (x6 over 3m) kubelet Back-off restarting failed container
13
+ Memory:
14
+ Limits: 128Mi
15
+ Usage: 190Mi
samples/sample_python.txt ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 2024-11-04 10:22:18,012 ERROR service.worker failed processing job abc123
2
+ Traceback (most recent call last):
3
+ File "/app/service/worker.py", line 42, in handle
4
+ process(payload)
5
+ File "/app/service/handler.py", line 88, in process
6
+ resp = requests.post(url, json=data, timeout=2)
7
+ File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 116, in post
8
+ return request('post', url, data=data, json=json, **kwargs)
9
+ File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 60, in request
10
+ return session.request(method=method, url=url, **kwargs)
11
+ requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.internal', port=443): Read timed out. (read timeout=2)
12
+
13
+ Retrying with backoff...
14
+ 2024-11-04 10:22:20,013 WARN service.worker retry attempt 1 failed