--- license: mit tags: - vicidial - call-center - asterisk - otel - observability --- # Asterisk Observability with OpenTelemetry and Grafana **How to actually see what's happening inside your Asterisk servers. OpenTelemetry as the collection layer, Prometheus for storage, Grafana for dashboards, and distributed tracing to follow a call from SIP INVITE to agent headset. Built from production VICIdial clusters pushing 200K+ daily calls.** --- I've been running Asterisk in production since the 1.4 days. For most of that time, "monitoring" meant SSH into the box, run `asterisk -rx "core show channels"`, squint at the output, and hope that the number of active channels looked about right. Maybe check `/var/log/asterisk/full` when something broke. Maybe not. That stopped being acceptable around the time we crossed 50,000 daily calls across a 4-server cluster. When a SIP trunk goes down at 2 PM on a Tuesday and 300 agents go idle, you need to know in seconds, not whenever someone notices the real-time report looks weird and pings you on Slack. This guide covers... ## Overview **How to actually see what's happening inside your Asterisk servers. OpenTelemetry as the collection layer, Prometheus for storage, Grafana for dashboards, and distributed tracing to follow a call from SIP INVITE to agent headset. Built from production VICIdial clusters pushing 200K+ daily calls.** --- I've been running Asterisk in production since the 1.4 days. For most of that time, "monitoring" meant SSH into the box, run `asterisk -rx "core show channels"`, squint at the output, and hope that the number of active channels looked about right. Maybe check `/var/log/asterisk/full` when something broke. Maybe not. That stopped being acceptable around the time we crossed 50,000 daily calls across a 4-server cluster. When a SIP trunk goes down at 2 PM on a Tuesday and 300 agents go idle, you need to know in seconds, not whenever someone notices the real-time report looks weird and pings you on Slack. This guide covers the full observability stack for Asterisk: metrics collection with OpenTelemetry, storage in Prometheus, visualization in Grafana, and distributed tracing for individual call flows. If you're running VICIdial, everything here applies — VICIdial's call processing is just Asterisk dialplan execution under the hood, and all the telemetry surfaces the same way. --- ## Why OpenTelemetry Instead of Just Prometheus You could skip OpenTelemetry entirely. Install `prometheus-node-exporter` on your Asterisk box, write a script that scrapes `asterisk -rx` output into Prometheus metrics, and call it done. I've done exactly that. It works. It's also fragile, custom, and doesn't scale. OpenTelemetry (OTel) gives you three things that roll-your-own monitoring doesn't: **Vendor-neutral collection.** The OTel Collector speaks StatsD, Prometheus, OTLP, syslog, and dozens of other formats. Asterisk's built-in `res_statsd` module pushes metrics via StatsD. AMI events can be forwarded as structured logs. You don't have to write custom parsers — you configure receivers. **Processing pipelines.** OTel lets you filter, transform, aggregate, and route telemetry data before it hits your backend. Want to drop debug-level events but keep warnings? Want to add a `cluster_name` attribute to every metric? Want to sample 10% of traces for non-error calls? All configurable in the collector. **Multi-backend export.** Send metrics to Prometheus, traces to Jaeger or Tempo, and logs to Loki — from one collector instance. If you ever want to switch from Prometheus to Mimir or from Jaeger to Tempo, you change one exporter config. Nothing on the Asterisk side changes. That said, if you have a single Asterisk box running 5,000 calls a day, a Prometheus scraper script is probably fine. OTel shines when you have multiple servers, multiple signal types (metrics + traces + logs), or when you're tired of maintaining custom scripts. --- ## Architecture Overview Here's what we're building: ``` ┌─────────────────────────────────────────────────────┐ │ Asterisk Server │ │ │ │ res_statsd ──→ OTel Collector ──→ Prometheus │ │ (sidecar) │ │ AMI Events ──→ ami-otel-bridge ──→ OTel Collector │ │ │ │ │ CDR/CEL ──→ MySQL ──→ mysqld_exporter ──→ Prometheus│ │ │ └───────────────────────────────────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌──────────┐ ┌────────────────┐ ┌──────────┐ │Prometheus│ │ Jaeger / Tempo │ │ Loki │ └────┬─────┘ └───────┬────────┘ └────┬─────┘ │ │ │ └──────────┬───────┘──────────────────┘ │ ┌────▼─────┐ │ Grafana │ └──────────┘ ``` Components: - **res_statsd** — Asterisk's built-in StatsD module. Emits metrics on channel counts, endpoint status, bridge operations, and more. - **OTel Collector** — Runs as a sidecar process on the Asterisk server. Receives StatsD from res_statsd, processes it, exports to Prometheus. - **ami-otel-bridge** — A small script that reads [Asterisk Manager Interface](/blog/asterisk-manager-interface-guide/) (AMI) events and converts them to OTel spans/logs. - **mysqld_exporter** — Exports MySQL metrics for CDR/CEL table monitoring. - **Prometheus** — Time-series database. Stores all metrics. - **Grafana** — Dashboards and alerting. - **Jaeger/Tempo** — Distributed tracing backend for call flow traces. Let's build each layer. --- ## Step 1: Install the OpenTelemetry Collector The OTel Collector runs on each Asterisk server as a systemd service. ```bash # Download the latest stable release (check https://github.com/open-telemetry/opentelemetry-collector-releases) OTEL_VERSION="0.96.0" curl -L "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v${OTEL_VERSION}/otelcol-contrib_${OTEL_VERSION}_linux_amd64.tar.gz" \ -o /tmp/otelcol.tar.gz tar xzf /tmp/otelcol.tar.gz -C /usr/local/bin/ otelcol-contrib chmod +x /usr/local/bin/otelcol-contrib # Verify otelcol-contrib --version ``` Create the systemd unit: ```ini # /etc/systemd/system/otelcol.service [Unit] Description=OpenTelemetry Collector After=network.target [Service] Type=simple User=otelcol Group=otelcol ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/config.yaml Restart=always RestartSec=5 LimitNOFILE=65536 [Install] WantedBy=multi-user.target ``` Create the user and directories: ```bash useradd --system --no-create-home --shell /usr/sbin/nologin otelcol mkdir -p /etc/otelcol chown otelcol:otelcol /etc/otelcol ``` --- ## Step 2: Configure Asterisk's StatsD Module Asterisk has had built-in StatsD support since version 13 through `res_statsd`. It's compiled in by default on most distributions but not loaded by default. Enable it: ```ini # /etc/asterisk/statsd.conf [general] enabled = yes server = 127.0.0.1:8125 ; OTel Collector's StatsD receiver prefix = asterisk ; All metrics will be prefixed with "asterisk." add_newline = no ``` Load the module: ```bash asterisk -rx "module load res_statsd.so" # Verify it's loaded asterisk -rx "module show like statsd" ``` Output should show: ``` Module Description Use Count Status res_statsd.so StatsD client support 0 Running ``` ### What Metrics Does res_statsd Emit? Once loaded, Asterisk pushes the following metrics as StatsD gauges and counters: | Metric | Type | Description | |--------|------|-------------| | `asterisk.channels.count` | gauge | Current active channel count | | `asterisk.channels.by_type.SIP` | gauge | Active SIP channels | | `asterisk.channels.by_type.PJSIP` | gauge | Active PJSIP channels | | `asterisk.channels.by_type.Local` | gauge | Active Local channels | | `asterisk.endpoints.count` | gauge | Registered endpoints | | `asterisk.endpoints.state.online` | gauge | Endpoints in online state | | `asterisk.endpoints.state.offline` | gauge | Endpoints in offline state | | `asterisk.bridges.count` | gauge | Active bridges | | `asterisk.bridges.channels` | gauge | Channels in bridges | These are updated every 10 seconds by default. For a busy system, that's fine. If you need sub-second resolution (you probably don't), you can adjust the interval in `statsd.conf`. --- ## Step 3: OTel Collector Configuration Here's the collector config that receives StatsD from Asterisk and exports to Prometheus: ```yaml # /etc/otelcol/config.yaml receivers: # Receive StatsD metrics from res_statsd statsd: endpoint: "0.0.0.0:8125" aggregation_interval: 10s timer_histogram_mapping: - statsd_type: "timer" observer_type: "histogram" histogram: explicit: - 10 - 25 - 50 - 100 - 250 - 500 - 1000 - 5000 - 10000 # Scrape host metrics (CPU, memory, disk, network) hostmetrics: collection_interval: 15s scrapers: cpu: metrics: system.cpu.utilization: enabled: true memory: metrics: system.memory.utilization: enabled: true disk: {} network: {} load: {} # Receive OTLP from custom instrumentation (ami-otel-bridge) otlp: protocols: grpc: endpoint: "0.0.0.0:4317" http: endpoint: "0.0.0.0:4318" processors: # Add resource attributes to every metric resource: attributes: - key: service.name value: "asterisk" action: upsert - key: host.name from_attribute: "" action: upsert - key: cluster.name value: "vicidial-prod" action: upsert - key: server.role value: "dialer" action: upsert # Batch metrics to reduce export overhead batch: timeout: 10s send_batch_size: 1000 # Memory limiter to prevent OOM memory_limiter: check_interval: 5s limit_mib: 256 spike_limit_mib: 64 exporters: # Export metrics to Prometheus prometheus: endpoint: "0.0.0.0:8889" namespace: "asterisk" resource_to_telemetry_conversion: enabled: true # Export traces to Jaeger (or Tempo) otlp/jaeger: endpoint: "jaeger.monitoring.local:4317" tls: insecure: true # Export logs to Loki loki: endpoint: "http://loki.monitoring.local:3100/loki/api/v1/push" labels: attributes: service.name: "service_name" host.name: "hostname" # Debug output (disable in production) # debug: # verbosity: detailed service: pipelines: metrics: receivers: [statsd, hostmetrics] processors: [memory_limiter, resource, batch] exporters: [prometheus] traces: receivers: [otlp] processors: [memory_limiter, resource, batch] exporters: [otlp/jaeger] logs: receivers: [otlp] processors: [memory_limiter, resource, batch] exporters: [loki] telemetry: logs: level: "warn" metrics: address: ":8888" ``` Start the collector: ```bash systemctl daemon-reload systemctl enable otelcol systemctl start otelcol # Verify it's running and receiving StatsD curl -s http://localhost:8889/metrics | grep asterisk_channels ``` You should see Prometheus-formatted metrics: ``` # HELP asterisk_channels_count Current active channel count # TYPE asterisk_channels_count gauge asterisk_channels_count{cluster_name="vicidial-prod",host_name="dialer01",server_role="dialer"} 47 ``` --- ## Step 4: Custom Metrics via AMI StatsD gives you the basics — channel counts, endpoint status, bridge counts. But for VICIdial-specific observability, you need more. The [Asterisk Manager Interface](/blog/vicidial-custom-mysql-reports/) (AMI) emits events for everything: new channels, hangups, DTMF, queue joins, agent status changes, you name it. Here's a Python script that connects to AMI, listens for events, and pushes them to the OTel Collector as metrics and traces: ```python #!/usr/bin/env python3 """ ami_otel_bridge.py — Bridge AMI events to OpenTelemetry Runs as a daemon alongside Asterisk. """ import socket import time import re import os from opentelemetry import metrics, trace from opentelemetry.sdk.metrics import MeterProvider from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter # OTel setup metric_exporter = OTLPMetricExporter(endpoint="localhost:4317", insecure=True) metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=10000) metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader])) meter = metrics.get_meter("ami-bridge") trace_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True) trace.set_tracer_provider(TracerProvider()) trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(trace_exporter)) tracer = trace.get_tracer("ami-bridge") # Metrics calls_total = meter.create_counter("asterisk.calls.total", description="Total calls") calls_active = meter.create_up_down_counter("asterisk.calls.active", description="Active calls") calls_by_disposition = meter.create_counter("asterisk.calls.by_disposition", description="Calls by disposition") sip_registrations = meter.create_up_down_counter("asterisk.sip.registrations", description="SIP registration events") queue_callers = meter.create_up_down_counter("asterisk.queue.callers", description="Callers waiting in queue") call_duration = meter.create_histogram("asterisk.call.duration_ms", description="Call duration in milliseconds") # Track active call spans for distributed tracing active_spans = {} AMI_HOST = os.environ.get("AMI_HOST", "127.0.0.1") AMI_PORT = int(os.environ.get("AMI_PORT", "5038")) AMI_USER = os.environ.get("AMI_USER", "admin") AMI_SECRET = os.environ.get("AMI_SECRET", "amp111") def connect_ami(): """Connect to AMI and authenticate.""" sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) sock.settimeout(30) sock.connect((AMI_HOST, AMI_PORT)) # Read banner sock.recv(1024) # Login login = ( f"Action: Login\r\n" f"Username: {AMI_USER}\r\n" f"Secret: {AMI_SECRET}\r\n" f"Events: call,agent,cdr\r\n" f"\r\n" ) sock.sendall(login.encode()) response = sock.recv(4096).decode() if "Success" not in response: raise ConnectionError(f"AMI login failed: {response}") print(f"[ami-otel] Connected to AMI at {AMI_HOST}:{AMI_PORT}") return sock def parse_event(raw): """Parse an AMI event into a dict.""" event = {} for line in raw.strip().split("\r\n"): if ": " in line: key, value = line.split(": ", 1) event[key.strip()] = value.strip() return event def handle_event(event): """Process an AMI event and emit OTel signals.""" event_type = event.get("Event", "") if event_type == "Newchannel": channel = event.get("Channel", "unknown") calls_total.add(1, {"channel_type": channel.split("/")[0]}) calls_active.add(1) # Start a trace span for this call uniqueid = event.get("Uniqueid", "") if uniqueid: span = tracer.start_span( "asterisk.call", attributes={ "asterisk.channel": channel, "asterisk.uniqueid": uniqueid, "asterisk.caller_id": event.get("CallerIDNum", ""), "asterisk.context": event.get("Context", ""), "asterisk.exten": event.get("Exten", ""), } ) active_spans[uniqueid] = { "span": span, "start_time": time.time(), } elif event_type == "Hangup": calls_active.add(-1) uniqueid = event.get("Uniqueid", "") cause = event.get("Cause-txt", "Unknown") # End the trace span if uniqueid in active_spans: span_data = active_spans.pop(uniqueid) duration_ms = (time.time() - span_data["start_time"]) * 1000 span_data["span"].set_attribute("asterisk.hangup_cause", cause) span_data["span"].set_attribute("asterisk.duration_ms", duration_ms) span_data["span"].end() call_duration.record(duration_ms, {"cause": cause}) elif event_type == "AgentComplete": dispo = event.get("Reason", "unknown") calls_by_disposition.add(1, {"disposition": dispo}) elif event_type == "PeerStatus": peer = event.get("Peer", "") status = event.get("PeerStatus", "") if status == "Registered": sip_registrations.add(1, {"peer": peer}) elif status == "Unregistered": sip_registrations.add(-1, {"peer": peer}) elif event_type == "Join": queue_callers.add(1, {"queue": event.get("Queue", "unknown")}) elif event_type == "Leave": queue_callers.add(-1, {"queue": event.get("Queue", "unknown")}) def main(): while True: try: sock = connect_ami() buffer = "" while True: data = sock.recv(4096).decode("utf-8", errors="replace") if not data: raise ConnectionError("AMI connection lost") buffer += data # AMI events are separated by \r\n\r\n while "\r\n\r\n" in buffer: raw_event, buffer = buffer.split("\r\n\r\n", 1) if raw_event.strip(): event = parse_event(raw_event) if "Event" in event: handle_event(event) except Exception as e: print(f"[ami-otel] Error: {e}, reconnecting in 5s...") time.sleep(5) if __name__ == "__main__": main() ``` Install dependencies and run as a service: ```bash pip3 install opentelemetry-api opentelemetry-sdk \ opentelemetry-exporter-otlp-proto-grpc # Create systemd unit cat > /etc/systemd/system/ami-otel-bridge.service << 'EOF' [Unit] Description=AMI to OpenTelemetry Bridge After=asterisk.service otelcol.service [Service] Type=simple User=asterisk Environment=AMI_HOST=127.0.0.1 Environment=AMI_PORT=5038 Environment=AMI_USER=admin Environment=AMI_SECRET=your_ami_password_here ExecStart=/usr/bin/python3 /usr/local/bin/ami_otel_bridge.py Restart=always RestartSec=5 [Install] WantedBy=multi-user.target EOF systemctl daemon-reload systemctl enable ami-otel-bridge systemctl start ami-otel-bridge ``` Now you have two metric sources feeding the OTel Collector: `res_statsd` for Asterisk internals, and the AMI bridge for call-level events and distributed traces. --- ## Step 5: Prometheus Configuration Prometheus needs to scrape the OTel Collector's Prometheus exporter endpoint: ```yaml # /etc/prometheus/prometheus.yml (add to scrape_configs) scrape_configs: - job_name: 'asterisk-otel' scrape_interval: 10s static_configs: - targets: - 'dialer01.internal:8889' - 'dialer02.internal:8889' - 'dialer03.internal:8889' labels: environment: 'production' # Also scrape the OTel Collector's own health metrics - job_name: 'otel-collector' scrape_interval: 30s static_configs: - targets: - 'dialer01.internal:8888' - 'dialer02.internal:8888' - 'dialer03.internal:8888' ``` ### Recording Rules for Call Center KPIs Raw metrics are useful, but derived metrics are where the value lives. Set up recording rules in Prometheus: ```yaml # /etc/prometheus/rules/asterisk.yml groups: - name: asterisk_kpis interval: 30s rules: # Calls per minute (cluster-wide) - record: asterisk:calls_per_minute expr: sum(rate(asterisk_calls_total[5m])) * 60 # Average call duration (5-minute window) - record: asterisk:avg_call_duration_sec expr: | histogram_quantile(0.5, rate(asterisk_call_duration_ms_bucket[5m]) ) / 1000 # 95th percentile call duration - record: asterisk:p95_call_duration_sec expr: | histogram_quantile(0.95, rate(asterisk_call_duration_ms_bucket[5m]) ) / 1000 # Channel utilization per server (active / max) - record: asterisk:channel_utilization expr: | asterisk_channels_count / (asterisk_endpoints_state_online * 2) # SIP registration churn rate - record: asterisk:registration_churn_rate expr: | abs(rate(asterisk_sip_registrations[5m])) # Queue wait callers (cluster total) - record: asterisk:queue_callers_total expr: sum(asterisk_queue_callers) ``` Reload Prometheus: ```bash curl -X POST http://localhost:9090/-/reload ``` --- ## Step 6: Grafana Dashboards Now the fun part. Here's where you actually see things. If you haven't set up Grafana yet, our [VICIdial Grafana dashboard guide](https://vicistack.com/blog/vicidial-grafana-dashboards/) covers the basic installation. ### Dashboard 1: Cluster Overview The single-pane-of-glass dashboard. This should be on a TV on the wall. **Panel 1: Active Channels (Stat)** ``` Query: sum(asterisk_channels_count) Thresholds: 0-100 green, 100-200 yellow, 200+ red ``` **Panel 2: Calls Per Minute (Time Series)** ``` Query: asterisk:calls_per_minute Legend: {{host_name}} ``` **Panel 3: Channel Utilization by Server (Gauge)** ``` Query: asterisk:channel_utilization * 100 Legend: {{host_name}} Min: 0, Max: 100 Thresholds: 0-70 green, 70-85 yellow, 85-100 red ``` **Panel 4: SIP Registrations (Stat + Sparkline)** ``` Query: sum(asterisk_endpoints_state_online) ``` **Panel 5: Call Duration Distribution (Heatmap)** ``` Query: sum(rate(asterisk_call_duration_ms_bucket[5m])) by (le) Format: Heatmap ``` **Panel 6: Queue Callers Waiting (Time Series)** ``` Query: sum(asterisk_queue_callers) by (queue) Legend: {{queue}} Alert: if > 10 for 2 minutes ``` ### Dashboard 2: SIP Health This dashboard tells you when trunks are dying before your agents notice. **Panel 1: Registration Status by Peer (Table)** ``` Query: asterisk_endpoints_state_online Transform: Labels to fields Columns: host_name, peer, value Value mappings: 1 = "Online" (green), 0 = "Offline" (red) ``` **Panel 2: Registration Events Rate (Time Series)** ``` Query: rate(asterisk_sip_registrations[5m]) Legend: {{peer}} ``` A spike in registration events means phones are flapping — registering, dropping, re-registering. This usually indicates a network issue between the phone and Asterisk, or a DNS problem with the SIP registrar. **Panel 3: Active Channels by Type (Pie Chart)** ``` Query: asterisk_channels_by_type Legend: {{channel_type}} ``` In a healthy VICIdial system, you should see mostly PJSIP (or SIP) channels for agent phones and trunk calls, with some Local channels for internal routing. If you see IAX2 channels spiking, that's inter-server traffic in a cluster — normal during peak, but worth watching. ### Dashboard 3: Per-Server Deep Dive For when you suspect a specific server is misbehaving: ``` Variables: - server: label_values(asterisk_channels_count, host_name) Panels: 1. CPU Usage: system_cpu_utilization{host_name="$server"} 2. Memory Usage: system_memory_utilization{host_name="$server"} 3. Channels: asterisk_channels_count{host_name="$server"} 4. Load Average: system_cpu_load_average_5m{host_name="$server"} 5. Network I/O: rate(system_network_io_bytes_total{host_name="$server"}[5m]) 6. Disk I/O: rate(system_disk_io_bytes_total{host_name="$server"}[5m]) ``` The magic correlation: if CPU spikes at the same time channels spike, that's expected. If CPU spikes without a channel increase, something else is eating resources — check for runaway AGI scripts, a MySQL query from hell, or a cron job that shouldn't be running during peak hours. --- ## Step 7: Distributed Tracing for Call Flows This is the part most Asterisk monitoring setups miss entirely. Metrics tell you *that* something happened. Traces tell you *why*. A distributed trace follows a single call from the moment the SIP INVITE arrives through every dialplan step, AGI execution, queue wait, agent delivery, and hangup. When a caller reports "I waited 3 minutes and then got disconnected," you can pull up that exact call's trace and see every hop. The AMI bridge script above creates spans for each call. To get the full picture, you need child spans for key events within a call: ```python # Enhanced event handling with nested spans def handle_dial_begin(event): """Track outbound dial attempts within a call.""" uniqueid = event.get("Uniqueid", "") if uniqueid in active_spans: parent_span = active_spans[uniqueid]["span"] ctx = trace.set_span_in_context(parent_span) child = tracer.start_span( "asterisk.dial", context=ctx, attributes={ "asterisk.dial.destination": event.get("DestChannel", ""), "asterisk.dial.dialstring": event.get("Dialstring", ""), } ) active_spans[uniqueid]["dial_span"] = child def handle_dial_end(event): """Complete the dial span with the result.""" uniqueid = event.get("Uniqueid", "") if uniqueid in active_spans and "dial_span" in active_spans[uniqueid]: dial_span = active_spans[uniqueid].pop("dial_span") dial_span.set_attribute("asterisk.dial.status", event.get("DialStatus", "")) dial_span.end() def handle_queue_join(event): """Track time spent in queue.""" uniqueid = event.get("Uniqueid", "") if uniqueid in active_spans: parent_span = active_spans[uniqueid]["span"] ctx = trace.set_span_in_context(parent_span) child = tracer.start_span( "asterisk.queue.wait", context=ctx, attributes={ "asterisk.queue.name": event.get("Queue", ""), "asterisk.queue.position": event.get("Position", ""), "asterisk.queue.count": event.get("Count", ""), } ) active_spans[uniqueid]["queue_span"] = child def handle_queue_leave(event): """End queue wait span.""" uniqueid = event.get("Uniqueid", "") if uniqueid in active_spans and "queue_span" in active_spans[uniqueid]: queue_span = active_spans[uniqueid].pop("queue_span") queue_span.end() ``` With this instrumentation, a typical inbound call trace in Jaeger looks like: ``` [asterisk.call] ─── 145.2s total ├── [asterisk.dial] ─── 0.8s (to queue) ├── [asterisk.queue.wait] ─── 12.4s (INBOUND_SALES queue) ├── [asterisk.dial] ─── 1.2s (to agent SIP/agent42) └── [asterisk.call] ends ─── hangup cause: Normal Clearing ``` You can immediately see: 12.4 seconds in queue. Was that normal? What was the queue depth? Was the agent available or did we have to wait for one to go READY? The trace answers all of it. --- ## Step 8: Alerting Dashboards are for humans staring at screens. Alerts are for 3 AM. ### Prometheus Alerting Rules ```yaml # /etc/prometheus/rules/asterisk_alerts.yml groups: - name: asterisk_alerts rules: # Trunk down — no channels for 2 minutes - alert: AsteriskTrunkDown expr: | asterisk_channels_count == 0 and on(host_name) (time() - asterisk_channels_count offset 5m) > 300 for: 2m labels: severity: critical annotations: summary: "Asterisk server {{ $labels.host_name }} has zero active channels for 2+ minutes" description: "All trunks may be down. Check SIP registrations and carrier connectivity." # Channel exhaustion warning - alert: AsteriskChannelExhaustion expr: asterisk:channel_utilization > 0.85 for: 5m labels: severity: warning annotations: summary: "Channel utilization above 85% on {{ $labels.host_name }}" description: "Server is approaching channel capacity. Current utilization: {{ $value | humanizePercentage }}" # Registration storm — phones flapping - alert: AsteriskRegistrationStorm expr: asterisk:registration_churn_rate > 5 for: 3m labels: severity: warning annotations: summary: "High SIP registration churn on {{ $labels.host_name }}" description: "Phones are registering/unregistering rapidly. Possible network instability." # Queue backup — callers waiting too long - alert: AsteriskQueueBackup expr: sum(asterisk_queue_callers) by (queue) > 15 for: 2m labels: severity: warning annotations: summary: "Queue {{ $labels.queue }} has {{ $value }} callers waiting" description: "More than 15 callers in queue for 2+ minutes. Check agent availability." # No calls for 10 minutes during business hours - alert: AsteriskNoCalls expr: | asterisk:calls_per_minute == 0 and hour() >= 9 and hour() <= 17 and day_of_week() >= 1 and day_of_week() <= 5 for: 10m labels: severity: critical annotations: summary: "Zero calls per minute during business hours on {{ $labels.host_name }}" # AMI bridge disconnected - alert: AMIBridgeDown expr: up{job="asterisk-otel"} == 0 for: 2m labels: severity: warning annotations: summary: "OTel Collector not reachable for {{ $labels.instance }}" ``` Wire these into Alertmanager to send to Slack, PagerDuty, email, or whatever your on-call system uses. --- ## Step 9: CDR-Based Metrics The AMI bridge captures real-time events, but CDR (Call Detail Records) give you the complete picture after a call ends. VICIdial stores CDRs in MySQL, and Prometheus can query MySQL through `mysqld_exporter` — but honestly, for [CDR analysis](/blog/vicidial-asterisk-cdr-analysis/), you're better off with a dedicated query: ```sql -- Calls per hour with quality metrics (run from a cron-based exporter) SELECT DATE_FORMAT(calldate, '%Y-%m-%d %H:00:00') AS hour_bucket, COUNT(*) AS total_calls, AVG(duration) AS avg_duration, AVG(billsec) AS avg_billsec, SUM(CASE WHEN disposition = 'ANSWERED' THEN 1 ELSE 0 END) AS answered, SUM(CASE WHEN disposition = 'NO ANSWER' THEN 1 ELSE 0 END) AS no_answer, SUM(CASE WHEN disposition = 'BUSY' THEN 1 ELSE 0 END) AS busy, SUM(CASE WHEN disposition = 'FAILED' THEN 1 ELSE 0 END) AS failed FROM cdr WHERE calldate >= DATE_SUB(NOW(), INTERVAL 24 HOUR) GROUP BY hour_bucket ORDER BY hour_bucket; ``` For a Prometheus-native approach, write a small exporter script that queries CDR data and exposes it as Prometheus metrics: ```python #!/usr/bin/env python3 """ cdr_exporter.py — Export Asterisk CDR metrics to Prometheus Run as a service, scrape on :9101/metrics """ import time import pymysql from prometheus_client import start_http_server, Gauge, Counter, Histogram # Metrics cdr_calls_total = Counter('asterisk_cdr_calls_total', 'Total CDR records', ['disposition']) cdr_duration = Histogram('asterisk_cdr_duration_seconds', 'Call duration from CDR', buckets=[5, 10, 30, 60, 120, 300, 600, 1800]) cdr_asr = Gauge('asterisk_cdr_asr', 'Answer-seizure ratio (15 min window)') def collect_cdr_metrics(): conn = pymysql.connect( host='127.0.0.1', user='cdr_readonly', password='readonly_password', db='asterisk', charset='utf8mb4' ) try: with conn.cursor() as cursor: # Recent calls by disposition cursor.execute(""" SELECT disposition, COUNT(*), AVG(billsec) FROM cdr WHERE calldate >= DATE_SUB(NOW(), INTERVAL 15 MINUTE) GROUP BY disposition """) total_calls = 0 answered_calls = 0 for row in cursor.fetchall(): disposition, count, avg_bill = row cdr_calls_total.labels(disposition=disposition).inc(count) total_calls += count if disposition == 'ANSWERED': answered_calls += count if total_calls > 0: cdr_asr.set(answered_calls / total_calls) finally: conn.close() if __name__ == '__main__': start_http_server(9101) while True: collect_cdr_metrics() time.sleep(60) ``` --- ## Putting It All Together: The 3 AM Scenario It's 3:17 AM. PagerDuty wakes you up: **AsteriskTrunkDown** on dialer02. Before this setup, here's what you'd do: SSH into dialer02, run `asterisk -rx "sip show peers"`, stare at the output, grep through logs, try to figure out when it broke and why, call the carrier, wait on hold. With the observability stack, here's what you do: 1. **Open Grafana on your phone.** The Cluster Overview dashboard shows dialer02 with zero channels. Dialer01 and dialer03 are healthy. 2. **Check the SIP Health dashboard.** You see that dialer02's SIP trunk to Carrier A went offline at 3:04 AM. Registration events show a rapid unregister/register cycle starting at 3:01 AM — three minutes of flapping before it gave up. 3. **Open the Prometheus alert timeline.** RegistrationStorm fired at 3:03 AM. TrunkDown fired at 3:06 AM. The flapping started before the trunk died — likely a network issue, not a carrier issue. 4. **Check host metrics.** dialer02's network I/O dropped to zero at 3:04 AM on eth1 (the trunk interface). CPU and memory are fine. It's a network link failure, not a server issue. 5. **Call the NOC**, not the carrier. The network link to the trunk VLAN is down. They fix it. Trunk comes back. Total incident resolution: 12 minutes instead of 45. That's observability. Not dashboards for the sake of dashboards. Dashboards that tell you where to look and what to do. --- ## Performance Impact The question everyone asks: does all this monitoring slow down Asterisk? Short answer: no. Not measurably. Longer answer: `res_statsd` adds roughly 0.1% CPU overhead on a server handling 100 concurrent channels. The AMI bridge script uses about 15MB of RAM and negligible CPU — it's event-driven, not polling. The OTel Collector uses 50-100MB of RAM depending on pipeline complexity. On a production VICIdial cluster processing 200,000+ daily calls, the total overhead of the observability stack is less than what a single poorly-written AGI script adds per call. If you're worried about performance, profile your AGI scripts before cutting monitoring. One exception: if you enable very high-cardinality labels (like including the full caller ID number as a metric label), Prometheus will eat memory proportionally. Keep labels to low-cardinality values — server name, channel type, queue name, disposition status. Not phone numbers, not channel IDs, not caller names. --- ## What We Covered You now have: - **Metrics** from `res_statsd` and AMI events flowing through OTel Collector to Prometheus - **Dashboards** in Grafana for cluster health, SIP status, and per-server deep dives - **Distributed traces** following individual calls through the Asterisk dialplan - **Alerts** for trunk failures, channel exhaustion, registration storms, and queue backups - **CDR-based analytics** exported as Prometheus metrics The monitoring itself is the easy part. The hard part is building the habit of looking at dashboards [before they](/blog/tcpa-compliance-2026/) page you, reviewing traces for calls that went wrong, and treating your observability stack as infrastructure that needs maintenance — not a one-time setup. If you want to start smaller, skip the tracing. Get metrics into Prometheus, build the Cluster Overview dashboard, and set up the TrunkDown alert. That alone will save you from the next 3 AM "nobody noticed the trunk died" incident. Add tracing later when you need to debug specific [call quality](/blog/vicidial-carrier-selection/) issues. For more on the VICIdial monitoring side — real-time agent dashboards, campaign performance panels, and dialer [efficiency metrics](/blog/vicidial-agent-efficiency-metrics/) — see our [VICIdial Grafana real-time dashboard guide](https://vicistack.com/blog/vicidial-grafana-realtime-dashboard/). And if your Asterisk configuration needs attention before you start monitoring it, our [Asterisk configuration guide](https://vicistack.com/blog/vicidial-asterisk-configuration/) covers the SIP, codec, and NAT settings that affect [call quality](/blog/voip-mos-score-guide/). --- *Running a VICIdial cluster and want help setting up production observability? [Contact ViciStack](https://vicistack.com/contact/) — we've instrumented Asterisk environments from single-server shops to 10-server clusters processing a million calls a day.* ## Resources - [Read the full article](https://vicistack.com/blog/asterisk-otel-observability/) on ViciStack - [ViciStack](https://vicistack.com) - VICIdial hosting and optimization - [Free VICIdial Audit](https://vicistack.com/free-audit/)