---
language: en
tags:
- multiple-instance-learning
- fraud-detection
- risk-assessment
- anomaly-detection
- graph-transformer
- interpretability
license: apache-2.0
library_name: pytorch
---

# AC-MIL

AC-MIL (Action-Aware Capsule Multiple Instance Learning for Live-Streaming Room Risk Assessment) is a weakly supervised model for **room-level risk assessment** on live-streaming platforms. It is designed for scenarios where only **binary room-level labels** are available, while risk evidence is often **sparse, localized, and manifested through coordinated behaviors** across users and time.

AC-MIL formulates each live room as a **Multiple Instance Learning (MIL)** bag, where each instance is a **user–timeslot capsule**: a short action subsequence performed by a particular user within a fixed time window. The model produces:

- a **room-level risk score**, indicating the probability that the room is risky, and
- **capsule-level attributions**, providing interpretable evidence by highlighting the suspicious user–time segments that contribute most to the prediction.

---

## Key idea

Given a room’s action stream, we construct a 2D grid of capsules over **users × timeslots**. Each capsule summarizes localized behavioral patterns within a specific user–time window. AC-MIL then models:

- **temporal dynamics**: how users’ behaviors evolve over time,
- **cross-user dependencies**: interactions between viewers and the streamer, as well as coordination patterns among viewers,
- **multi-level signals**: evidence captured at the action, capsule, user, and timeslot levels,

and fuses these signals to produce robust room-level risk predictions.

---

## Architecture overview

AC-MIL follows a hierarchical serial–parallel design:

1. **Action Field Encoder**
   - Encodes the full action sequence with a Transformer to produce contextualized action embeddings.
   - Produces an action-level room representation via a learnable `[CLS]` token.
2. **Capsule Constructor**
   - Partitions actions into **user–timeslot capsules**.
   - Encodes each capsule with an LSTM (final hidden state as the capsule embedding).
3. **Relational Capsule Reasoner**
   - Builds an **adaptive relation-aware graph** over capsules using semantic similarity and relation masks.
   - Runs a **graph-aware Transformer** to refine capsule embeddings.
   - Provides **capsule-level interpretability** via `[CLS] → capsule` attention.
4. **Dual-View Integrator**
   - **User-view**: GRU over each user’s capsule sequence, followed by attention pooling across users.
   - **Timeslot-view**: attention pooling within each timeslot, followed by a GRU across timeslots.
5. **Cross-Level Risk Decoder**
   - Learns gates over the multi-level room representations.
   - Produces the final room embedding and risk score.

---

## Input / output specification

### Input (conceptual)

Each dataset sample corresponds to a live room `room_id` with a room-level label `room_label ∈ {0, 1, 2, 3}`, where any label `> 0` marks the room as risky. A room is represented as an **action sequence** `patch_list = {α_i}` ordered by the tuple (user, time), where each action follows the paper’s definition `α = (u, t, a, x)` (user, timestamp, action type, and an optional textual/multimodal feature).

In our May/June datasets, each action record is stored with the following fields:

- `u_idx` (int): user index within the room, used to build the `users × timeslots` grid (e.g., 0 = streamer, 1..U = selected viewers).
- `global_user_idx` (int/str): global user identifier across the whole dataset (before remapping to `u_idx`).
- `timestamp` (int/float): the action timestamp `t`. In the formulation, timestamps lie within a window `[0, T]` after the room starts.
- `t` (int): timeslot index derived by discretizing `timestamp` into fixed-length windows. This is the column index when constructing the `users × timeslots` capsule grid.
- `l` (int): role indicator (recommended convention: `0 = viewer`, `1 = streamer`).
- `action_id` (int): the action type id `a` (e.g., enter, comment, like, gift, share; streamer-side actions may include stream start, ASR text, OCR text, etc.).
- `action_desc` (str / null): raw textual content associated with the action (e.g., comment text, ASR transcript, OCR text).
- `action_vec` (numpy array): pre-encoded feature vector for `action_desc`.

Example (JSONL-like; each `patch_list` entry is the tuple `(u_idx, t, l, action_id, action_vec, timestamp, action_desc, user_id)`):

```json
{
  "room_id": "1",
  "room_label": "2",
  "patch_list": [
    (0, 1, 0, 5, [0.0, 0.3, ...], 4, "Streamer voice-over: ...", 5415431),
    ...
  ]
}
```

---

## Intended use

**Primary use cases**

- Early detection of risky rooms (fraud, collusion, policy-violating coordinated behaviors)
- Evidence-based moderation: highlighting localized suspicious segments (user–time capsules)

**Out of scope**

- Identifying or tracking specific individuals
- Any use that violates privacy laws, platform policies, or user consent requirements

---