Update README.md (#1), opened by qwer1219

---
language: en
tags:
- multiple-instance-learning
- fraud-detection
- risk-assessment
- anomaly-detection
- graph-transformer
- interpretability
license: apache-2.0
library_name: pytorch
---

# AC-MIL

AC-MIL (Action-Aware Capsule Multiple Instance Learning for Live-Streaming Room Risk Assessment) is a weakly supervised model for **room-level risk assessment** on live-streaming platforms. It is designed for scenarios where only **binary room-level labels** are available, while risk evidence is often **sparse, localized, and manifested through coordinated behaviors** across users and time.

AC-MIL formulates each live room as a **Multiple Instance Learning (MIL)** bag, where each instance is a **user–timeslot capsule**: a short action subsequence performed by a particular user within a fixed time window. The model produces:
- a **room-level risk score**, indicating the probability that the room is risky, and
- **capsule-level attributions**, providing interpretable evidence by highlighting the suspicious user–time segments that contribute most to the prediction.

---

## Key idea

Given a room’s action stream, we construct a 2D grid of capsules over **users × timeslots**. Each capsule summarizes localized behavioral patterns within a specific user–time window. AC-MIL then models:
- **temporal dynamics**: how users’ behaviors evolve over time,
- **cross-user dependencies**: interactions between viewers and the streamer, as well as coordination patterns among viewers,
- **multi-level signals**: evidence captured at the action, capsule, user, and timeslot levels,

and fuses these signals to produce robust room-level risk predictions.
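The grid construction itself is straightforward to sketch in plain Python. `build_capsule_grid` below is a hypothetical helper (not part of the released code) and assumes each action already carries its row index `u_idx` and column index `t`:

```python
from collections import defaultdict

def build_capsule_grid(patch_list):
    """Group a room's actions into user-timeslot capsules.

    Each action dict must carry its row index `u_idx` and column index `t`;
    a capsule is the ordered list of actions sharing one (u_idx, t) cell.
    """
    grid = defaultdict(list)
    for action in patch_list:
        grid[(action["u_idx"], action["t"])].append(action)
    return dict(grid)

actions = [
    {"u_idx": 0, "t": 0, "action_id": 5},  # streamer acts in slot 0
    {"u_idx": 1, "t": 0, "action_id": 2},  # viewer 1 acts in slot 0
    {"u_idx": 1, "t": 1, "action_id": 3},  # viewer 1 acts in slot 1
]
grid = build_capsule_grid(actions)  # 3 occupied cells; empty cells are simply absent
```

Note that the grid is sparse: most `(user, timeslot)` cells of a real room are empty, so only occupied cells are materialized.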

---

## Architecture overview

AC-MIL follows a hierarchical serial–parallel design:

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/671b2ede3fd1d03dc687c641/Vry7GWMLjGKLsCBqq5B6g.png" width="80%"/>
</p>

1. **Action Field Encoder**
   - Encodes the full action sequence with a Transformer to produce contextualized action embeddings.
   - Produces an action-level room representation via a learnable `[CLS]` token.

2. **Capsule Constructor**
   - Partitions actions into **user–timeslot capsules**.
   - Encodes each capsule with an LSTM (final hidden state as the capsule embedding).

3. **Relational Capsule Reasoner**
   - Builds an **adaptive relation-aware graph** over capsules using semantic similarity and relation masks.
   - Runs a **graph-aware Transformer** to refine capsule embeddings.
   - Provides **capsule-level interpretability** via `[CLS] → capsule` attention.

4. **Dual-View Integrator**
   - **User-view**: a GRU over each user’s capsule sequence, followed by attention pooling across users.
   - **Timeslot-view**: attention pooling within each timeslot, followed by a GRU across timeslots.

5. **Cross-Level Risk Decoder**
   - Learns gates over the multi-level room representations.
   - Produces the final room embedding and risk score.
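The attention pooling that recurs in the stages above (and that yields the capsule-level attributions) can be illustrated with a dependency-free sketch. `attention_pool` is an illustrative stand-in, not the repository's API, and the fixed `scores` replace the model's learned query–key logits:

```python
import math

def attention_pool(vectors, scores):
    """Softmax the relevance scores and return the weighted sum plus weights.

    `scores` stands in for the learned attention logits; the resulting
    weights double as per-item attributions.
    """
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights

# Two capsule embeddings with equal logits pool to their average.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
# → pooled == [0.5, 0.5], weights == [0.5, 0.5]
```

Capsules with higher logits receive higher weights, which is exactly what makes the pooled representation inspectable: sorting capsules by weight surfaces the most suspicious user–time segments.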

---

## Input / output specification

### Input (conceptual)

Each dataset sample corresponds to a live room `room_id` with a room-level label `room_label ∈ {0,1,2,3}`; any value greater than 0 marks the room as risky, so supervision is effectively binary.
A room is represented as an **action sequence** `patch_list = {α_i}` ordered by the tuple (user, time), where each action follows the paper’s definition:
`α = (u, t, a, x)` (user, timestamp, action type, and an optional textual/multimodal feature).

In our May/June datasets, each action record is stored with the following fields:

- `u_idx` (int): user index within the room, used to build the `users × timeslots` grid (e.g., 0 = streamer, 1..U = selected viewers).
- `global_user_idx` (int/str): global user identifier across the whole dataset (before remapping to `u_idx`).
- `timestamp` (int/float): the action timestamp `t`. In the formulation, timestamps fall within a window `[0, T]` after the room starts.
- `t` (int): timeslot index derived by discretizing `timestamp` into fixed-length windows. This is the column index when constructing the `users × timeslots` capsule grid.
- `l` (int): role indicator (recommended convention: `0 = viewer`, `1 = streamer`).
- `action_id` (int): the action type id `a` (e.g., enter, comment, like, gift, share; streamer-side actions may include stream start, ASR text, OCR text, etc.).
- `action_desc` (str / null): raw textual content associated with the action (e.g., comment text, ASR transcript, OCR text).
- `action_vec` (numpy array): pre-encoded feature vector for `action_desc`.
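The relationship between `timestamp` and the derived timeslot index `t` can be made concrete. `to_timeslot` below is an illustrative helper; the actual window length and slot count are dataset-specific choices not fixed by this card:

```python
def to_timeslot(timestamp, window_len, num_slots):
    """Map a raw timestamp in [0, T] to a timeslot column index.

    `window_len` is the fixed window size (roughly T / num_slots); the clamp
    keeps an action at exactly t = T inside the last slot.
    """
    return min(int(timestamp // window_len), num_slots - 1)

# With 6 windows of 10 time units each:
t0 = to_timeslot(4, 10, 6)     # early action falls in slot 0
t5 = to_timeslot(60, 10, 6)    # boundary action clamped into slot 5
```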

Example (JSONL-like):
```json
{
  "room_id": "1",
  "room_label": "2",
  "patch_list": [
    (u_idx, t, l, action_id, action_vec, timestamp, action_desc, user_id),
    (0, 1, 0, 5, [0.0, 0.3, ...], 4, "Streamer voice-over: ...", 5415431),
    ...
  ]
}
```
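A record in this layout can be unpacked into named actions plus the binary MIL target. `parse_record` and `FIELDS` are hypothetical names, derived only from the tuple order shown in the example above:

```python
# Field order of each patch tuple, per the example record above.
FIELDS = ("u_idx", "t", "l", "action_id", "action_vec",
          "timestamp", "action_desc", "user_id")

def parse_record(record):
    """Turn one JSONL-like room record into named actions and a binary target."""
    actions = [dict(zip(FIELDS, patch)) for patch in record["patch_list"]]
    risky = int(record["room_label"]) > 0  # labels > 0 collapse to "risky"
    return actions, risky

record = {
    "room_id": "1",
    "room_label": "2",
    "patch_list": [
        (0, 1, 0, 5, [0.0, 0.3], 4, "Streamer voice-over: ...", 5415431),
    ],
}
actions, risky = parse_record(record)  # risky is True since room_label > 0
```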

---

## Intended use

**Primary use cases**
- Early detection of risky rooms (fraud, collusion, policy-violating coordinated behaviors)
- Evidence-based moderation: highlight localized suspicious segments (user–time capsules)

**Out of scope**
- Identifying or tracking specific individuals
- Any use that violates privacy laws, platform policies, or user consent requirements

---