---
language: en
tags:
- multiple-instance-learning
- fraud-detection
- risk-assessment
- anomaly-detection
- graph-transformer
- interpretability
license: apache-2.0
library_name: pytorch
---

# AC-MIL

AC-MIL (Action-Aware Capsule Multiple Instance Learning for Live-Streaming Room Risk Assessment) is a weakly supervised model for **room-level risk assessment** on live-streaming platforms. It is designed for scenarios where only **binary room-level labels** are available, while risk evidence is often **sparse, localized, and manifested through coordinated behaviors** across users and time.

AC-MIL formulates each live room as a **Multiple Instance Learning (MIL)** bag in which each instance is a **user–timeslot capsule**: a short action subsequence performed by one user within a fixed time window. The model produces:
- a **room-level risk score**, the predicted probability that the room is risky, and
- **capsule-level attributions**, which provide interpretable evidence by highlighting the suspicious user–time segments that contribute most to the prediction.

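As an illustration of how capsule-level attributions might be consumed downstream, the sketch below ranks user–timeslot capsules by attribution weight. The `attributions` grid layout, its values, and the helper name `top_capsules` are hypothetical; this card does not specify an output API.

```python
import heapq


def top_capsules(attributions, k=3):
    """Return the k highest-weighted (user, timeslot, weight) triples.

    `attributions[u][t]` is assumed to hold the attribution weight of the
    capsule for user index u and timeslot index t (hypothetical layout).
    """
    flat = (
        (weight, u, t)
        for u, row in enumerate(attributions)
        for t, weight in enumerate(row)
    )
    best = heapq.nlargest(k, flat)
    return [(u, t, weight) for weight, u, t in best]


# Toy 3-user x 4-timeslot grid of made-up attribution weights.
attributions = [
    [0.01, 0.02, 0.30, 0.05],
    [0.00, 0.25, 0.03, 0.02],
    [0.04, 0.01, 0.02, 0.20],
]
for u, t, w in top_capsules(attributions, k=2):
    print(f"user {u}, timeslot {t}: weight {w:.2f}")
```

A moderation tool could surface the returned segments as the evidence behind a room-level score.
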

---

## Key idea

Given a room’s action stream, we construct a 2D grid of capsules over **users × timeslots**. Each capsule summarizes localized behavioral patterns within a specific user–time window. AC-MIL then models:
- **temporal dynamics**: how users’ behaviors evolve over time,
- **cross-user dependencies**: interactions between viewers and the streamer, as well as coordination patterns among viewers,
- **multi-level signals**: evidence captured at the action, capsule, user, and timeslot levels,

and fuses these signals to produce robust room-level risk predictions.

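The grid construction described above can be sketched in plain Python. This is a minimal illustration, assuming each action has already been reduced to a `(u_idx, t, action_id)` triple with `t` a timeslot index; the function name `build_capsule_grid` and the toy data are hypothetical.

```python
from collections import defaultdict


def build_capsule_grid(actions, num_users, num_slots):
    """Group actions into a users x timeslots grid of capsules.

    Each action is assumed to be a (u_idx, t, action_id) triple, where t is
    already a timeslot index. Cells with no actions stay as empty lists.
    """
    cells = defaultdict(list)
    for u_idx, t, action_id in actions:
        # Ignore actions that fall outside the chosen grid.
        if 0 <= u_idx < num_users and 0 <= t < num_slots:
            cells[(u_idx, t)].append(action_id)
    return [
        [cells.get((u, t), []) for t in range(num_slots)]
        for u in range(num_users)
    ]


# Toy action stream: user 1 performs two actions in timeslot 0.
actions = [(0, 0, 5), (1, 0, 2), (1, 0, 3), (2, 1, 2), (1, 2, 7)]
grid = build_capsule_grid(actions, num_users=3, num_slots=3)
for u, row in enumerate(grid):
    print(u, row)
```

Each non-empty cell is one capsule, that is, one MIL instance; most cells are typically empty because risk evidence is sparse.
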

---

## Architecture overview

AC-MIL follows a hierarchical serial–parallel design:

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/671b2ede3fd1d03dc687c641/Vry7GWMLjGKLsCBqq5B6g.png" width="80%"/>
</p>

1. **Action Field Encoder**
   - Encodes the full action sequence with a Transformer to produce contextualized action embeddings.
   - Produces an action-level room representation via a learnable `[CLS]` token.

2. **Capsule Constructor**
   - Partitions actions into **user–timeslot capsules**.
   - Encodes each capsule with an LSTM (final hidden state as capsule embedding).

3. **Relational Capsule Reasoner**
   - Builds an **adaptive relation-aware graph** over capsules using semantic similarity and relation masks.
   - Runs a **graph-aware Transformer** to refine capsule embeddings.
   - Provides **capsule-level interpretability** via `[CLS] → capsule` attention.

4. **Dual-View Integrator**
   - **User-view**: GRU over each user’s capsule sequence and attention pooling across users.
   - **Timeslot-view**: attention pooling within each timeslot and GRU across timeslots.

5. **Cross-Level Risk Decoder**
   - Learns gates over multi-level room representations.
   - Produces the final room embedding and risk score.

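To make the graph-building step of the Relational Capsule Reasoner more concrete, here is a framework-free sketch that combines a relation mask with embedding similarity. The mask rule (same user or same timeslot), the cosine threshold, and all names are assumptions made for illustration; the actual model works on learned embeddings and may define relations differently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_relation_graph(capsules, threshold=0.5):
    """Return a boolean adjacency matrix over capsules.

    capsules: list of (u_idx, t, embedding). An edge i-j is kept when the
    relation mask allows it (same user or same timeslot; an assumption) and
    the cosine similarity of the two embeddings reaches `threshold`.
    """
    n = len(capsules)
    adj = [[False] * n for _ in range(n)]
    for i, (ui, ti, ei) in enumerate(capsules):
        for j, (uj, tj, ej) in enumerate(capsules):
            if i == j:
                adj[i][j] = True  # keep self-loops
                continue
            related = ui == uj or ti == tj
            adj[i][j] = related and cosine(ei, ej) >= threshold
    return adj


capsules = [
    (0, 0, [1.0, 0.0]),
    (1, 0, [0.9, 0.1]),  # same timeslot as capsule 0, similar embedding
    (1, 1, [0.0, 1.0]),  # same user as capsule 1, dissimilar embedding
    (2, 2, [1.0, 0.0]),  # similar to capsule 0 but excluded by the mask
]
adj = build_relation_graph(capsules)
for row in adj:
    print(row)
```

An adjacency matrix of this kind could serve as the attention mask of a graph-aware Transformer layer, so that capsules only attend to their neighbors.
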

---

## Input / output specification

### Input (conceptual)

Each dataset sample corresponds to a live room `room_id` with a room-level label `room_label ∈ {0,1,2,3}`, which is treated as binary: `0` is non-risky and any value `> 0` is risky.
A room is represented as an **action sequence** `patch_list = {α_i}` ordered by the tuple `(user, time)`, where each action follows the paper’s definition:
`α = (u, t, a, x)` (user, timestamp, action type, and optional textual/multimodal feature).

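The conceptual input can be written down as a small data structure. The sketch below is only an illustration: the class name `Action` and the helpers `is_risky` and `order_actions` are hypothetical, and the label rule simply encodes "any value above 0 is risky".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    u: int                    # user
    t: float                  # timestamp
    a: int                    # action type
    x: Optional[str] = None   # optional textual/multimodal feature


def is_risky(room_label):
    """Binarize a room label in {0, 1, 2, 3}; accepts ints or numeric strings."""
    return int(room_label) > 0


def order_actions(actions):
    """Order actions by the tuple (user, time)."""
    return sorted(actions, key=lambda act: (act.u, act.t))


actions = [Action(1, 7.0, 2), Action(0, 4.0, 5, "hello"), Action(1, 3.5, 1)]
ordered = order_actions(actions)
print([(act.u, act.t) for act in ordered])
print(is_risky("2"), is_risky(0))
```
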

In our May/June datasets, each action record is stored with the following fields:

- `u_idx` (int): user index within the room, used to build the `users × timeslots` grid (e.g., 0 = streamer, 1..U = selected viewers).
- `global_user_idx` (int/str): global user identifier across the whole dataset (before remapping to `u_idx`).
- `timestamp` (int/float): the action timestamp (the `t` in `α`). In the formulation, timestamps lie within a window `[0, T]` after the room starts.
- `t` (int): timeslot index derived by discretizing `timestamp` into fixed-length windows. This is the column index when constructing the `users × timeslots` capsule grid. It is not the raw timestamp `t` of the formulation, which is stored as `timestamp`.
- `l` (int): role indicator (recommended convention: `0 = viewer`, `1 = streamer`).
- `action_id` (int): the action type id `a` (e.g., enter, comment, like, gift, share; streamer-side actions may include stream start, ASR text, OCR text, etc.).
- `action_desc` (str / null): raw textual content associated with the action (e.g., comment text, ASR transcript, OCR text).
- `action_vec` (`numpy.ndarray`): pre-encoded feature vector for `action_desc`.

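The two derived fields, `t` and `u_idx`, could be computed along the following lines. This is a sketch under stated assumptions: 0-based timeslots of a fixed `slot_seconds` length, timestamps in seconds since room start, and first-seen ordering for viewers. The helper names, the slot length, and the viewer ids are made up; the datasets' actual preprocessing may differ.

```python
def to_timeslot(timestamp, slot_seconds):
    """Discretize a timestamp (seconds since room start) into a 0-based slot."""
    return int(timestamp // slot_seconds)


def remap_users(global_ids, streamer_id):
    """Map global user ids to per-room indices, reserving 0 for the streamer."""
    mapping = {streamer_id: 0}
    for gid in global_ids:
        if gid not in mapping:
            mapping[gid] = len(mapping)  # next free index, in first-seen order
    return mapping


# Toy records; only the fields needed for the derivation are shown.
records = [
    {"global_user_idx": 5415431, "timestamp": 4},
    {"global_user_idx": 901, "timestamp": 65},
    {"global_user_idx": 902, "timestamp": 61},
    {"global_user_idx": 901, "timestamp": 130},
]
mapping = remap_users((r["global_user_idx"] for r in records), streamer_id=5415431)
for r in records:
    r["u_idx"] = mapping[r["global_user_idx"]]
    r["t"] = to_timeslot(r["timestamp"], slot_seconds=60)
    print(r)
```
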

Example (JSONL-like, written here as a Python literal):
```python
{
    "room_id": "1",
    "room_label": "2",
    "patch_list": [
        # (u_idx, t, l, action_id, action_vec, timestamp, action_desc, user_id)
        (0, 1, 0, 5, [0.0, 0.3, ...], 4, "主播口播:...", 5415431),  # action_desc: "Streamer speech: ..."
        ...
    ]
}
```

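A record in this layout can be unpacked into named fields as sketched below. The class `ActionRecord` and the function `parse_room` are hypothetical helpers, the field order follows the header row of the example, and the two-dimensional `action_vec` and the description string are placeholders.

```python
from typing import NamedTuple, Optional, Sequence


class ActionRecord(NamedTuple):
    u_idx: int
    t: int
    l: int
    action_id: int
    action_vec: Sequence[float]
    timestamp: float
    action_desc: Optional[str]
    user_id: int


def parse_room(sample):
    """Return (room_id, binary_label, list of ActionRecord) for one sample."""
    records = [ActionRecord(*row) for row in sample["patch_list"]]
    binary_label = int(int(sample["room_label"]) > 0)  # any label > 0 is risky
    return sample["room_id"], binary_label, records


sample = {
    "room_id": "1",
    "room_label": "2",
    "patch_list": [
        (0, 1, 0, 5, [0.0, 0.3], 4, "streamer speech (placeholder)", 5415431),
    ],
}
room_id, label, records = parse_room(sample)
print(room_id, label, records[0].action_id, records[0].timestamp)
```

Named access such as `records[0].timestamp` avoids relying on positional indices when the tuples are passed on to grid construction.
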

---

## Intended use

**Primary use cases**
- Early detection of risky rooms (fraud, collusion, policy-violating coordinated behaviors)
- Evidence-based moderation: highlight localized suspicious segments (user–time capsules)

**Out of scope**
- Identifying or tracking specific individuals
- Any use that violates privacy laws, platform policies, or user consent requirements

---