---
language: en
tags:
- multiple-instance-learning
- fraud-detection
- risk-assessment
- anomaly-detection
- graph-transformer
- interpretability
license: apache-2.0
library_name: pytorch
---
# AC-MIL
AC-MIL (Action-Aware Capsule Multiple Instance Learning for Live-Streaming Room Risk Assessment) is a weakly supervised model for **room-level risk assessment** in live-streaming platforms. It is designed for scenarios where only **binary room-level labels** are available, while risk evidence is often **sparse, localized, and manifested through coordinated behaviors** across users and time.
AC-MIL formulates each live room as a **Multiple Instance Learning (MIL)** bag, where each instance is a **user–timeslot capsule**—a short action subsequence performed by a particular user within a fixed time window. The model produces:
- a **room-level risk score**, indicating the probability that the room is risky, and
- **capsule-level attributions**, providing interpretable evidence by highlighting suspicious user–time segments that contribute most to the prediction.
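Because training uses only room-level labels, capsule-level attributions come from attention weights rather than direct supervision. A minimal sketch, assuming the `[CLS] → capsule` attention has already been extracted as a `(users, timeslots)` weight matrix (the function name and `k` default are illustrative):

```python
import torch

def top_k_capsules(attn, k=3):
    """Rank capsules by their [CLS] -> capsule attention weight.

    `attn` is assumed to be a (U, T) tensor of attention weights from
    the room-level [CLS] token to each user-timeslot capsule; returns
    the k most suspicious (user, timeslot) index pairs.
    """
    U, T = attn.shape
    idx = torch.topk(attn.flatten(), k).indices  # top-k over the flat grid
    return [(int(i) // T, int(i) % T) for i in idx]
```

For example, with a 2-user × 3-slot attention map whose mass sits on user 1, slot 2, `top_k_capsules(attn, k=1)` returns `[(1, 2)]`.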
---
## Key idea
Given a room’s action stream, we construct a 2D grid of capsules over **users × timeslots**. Each capsule summarizes localized behavioral patterns within a specific user–time window. AC-MIL then models:
- **temporal dynamics**: how users’ behaviors evolve over time,
- **cross-user dependencies**: interactions between viewers and the streamer, as well as coordination patterns among viewers,
- **multi-level signals**: evidence captured at the action, capsule, user, and timeslot levels,
and fuses these signals to produce robust room-level risk predictions.
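As a concrete sketch, the `users × timeslots` grid can be built by discretizing timestamps into fixed-length windows and grouping actions by `(u_idx, timeslot)`; the dict-based action records and the 60-second `slot_len` here are illustrative, not the paper's setting:

```python
from collections import defaultdict

def build_capsule_grid(patch_list, slot_len=60):
    """Group a room's actions into user-timeslot capsules.

    Each action is assumed to be a dict carrying `u_idx` (user index)
    and `timestamp` (seconds since the room started); `slot_len` is an
    illustrative window length.
    """
    grid = defaultdict(list)  # (u_idx, timeslot) -> ordered action list
    for action in patch_list:
        t = int(action["timestamp"]) // slot_len  # discretize time
        grid[(action["u_idx"], t)].append(action)
    return dict(grid)

# Toy room: the streamer acts in slot 0, one viewer acts twice in slot 1
actions = [
    {"u_idx": 0, "timestamp": 4},
    {"u_idx": 1, "timestamp": 70},
    {"u_idx": 1, "timestamp": 95},
]
grid = build_capsule_grid(actions)
# -> capsules (0, 0) and (1, 1); capsule (1, 1) holds two actions
```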
---
## Architecture overview
AC-MIL follows a hierarchical serial–parallel design:
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/671b2ede3fd1d03dc687c641/Vry7GWMLjGKLsCBqq5B6g.png" width="80%"/>
</p>
1. **Action Field Encoder**
- Encodes the full action sequence with a Transformer to produce contextualized action embeddings.
- Produces an action-level room representation via a learnable `[CLS]` token.
2. **Capsule Constructor**
- Partitions actions into **user–timeslot capsules**.
- Encodes each capsule with an LSTM (final hidden state as capsule embedding).
3. **Relational Capsule Reasoner**
   - Builds an **adaptive relation-aware graph** over capsules using semantic similarity and relation masks.
- Runs a **graph-aware Transformer** to refine capsule embeddings.
- Provides **capsule-level interpretability** via `[CLS] → capsule` attention.
4. **Dual-View Integrator**
- **User-view**: GRU over each user’s capsule sequence and attention pooling across users.
- **Timeslot-view**: attention pooling within each timeslot and GRU across timeslots.
5. **Cross-Level Risk Decoder**
- Learns gates over multi-level room representations.
- Produces the final room embedding and risk score.
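The five stages can be wired together roughly as follows. This is an illustrative skeleton only: dimensions and layer counts are toy values, and the relation-aware graph masks of stage 3 are omitted, so it is not the released configuration.

```python
import torch
import torch.nn as nn

class ACMILSketch(nn.Module):
    """Illustrative skeleton of the AC-MIL hierarchy (toy dimensions;
    stage 3's relation-aware graph masks are omitted for brevity)."""

    def __init__(self, d=32, n_heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.action_enc = nn.TransformerEncoder(layer(), num_layers=1)  # stage 1
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.capsule_lstm = nn.LSTM(d, d, batch_first=True)             # stage 2
        self.reasoner = nn.TransformerEncoder(layer(), num_layers=1)    # stage 3
        self.user_gru = nn.GRU(d, d, batch_first=True)                  # stage 4
        self.slot_gru = nn.GRU(d, d, batch_first=True)
        self.attn = nn.Linear(d, 1)       # shared attention-pooling scorer
        self.gate = nn.Linear(4 * d, 4)   # stage 5: gates over room views
        self.head = nn.Linear(d, 1)

    def pool(self, x, dim):
        # attention pooling along `dim`
        w = torch.softmax(self.attn(x), dim=dim)
        return (w * x).sum(dim)

    def forward(self, actions, capsules):
        # actions: (B, L, d) action embeddings
        # capsules: (B, U, T, S, d) per-capsule action sub-sequences
        B, U, T, S, d = capsules.shape
        h = self.action_enc(torch.cat([self.cls.expand(B, -1, -1), actions], 1))
        room_action = h[:, 0]                                 # [CLS] room view
        _, (h_n, _) = self.capsule_lstm(capsules.reshape(B * U * T, S, d))
        caps = self.reasoner(h_n[-1].reshape(B, U * T, d))    # refined capsules
        caps = caps.reshape(B, U, T, d)
        user_h, _ = self.user_gru(caps.reshape(B * U, T, d))  # per-user over time
        room_user = self.pool(user_h[:, -1].reshape(B, U, d), dim=1)
        slot_h, _ = self.slot_gru(self.pool(caps, dim=1))     # users pooled per slot
        room_slot = slot_h[:, -1]
        room_caps = self.pool(caps.reshape(B, U * T, d), dim=1)
        views = torch.stack([room_action, room_caps, room_user, room_slot], 1)
        g = torch.softmax(self.gate(views.reshape(B, -1)), -1).unsqueeze(-1)
        return torch.sigmoid(self.head((g * views).sum(1))).squeeze(-1)
```

Calling the sketch with random tensors, e.g. `ACMILSketch()(torch.randn(2, 6, 32), torch.randn(2, 3, 4, 2, 32))`, yields one risk score per room in `(0, 1)`.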
---
## Input / output specification
### Input (conceptual)
Each dataset sample corresponds to a live room `room_id` with a room-level label `room_label ∈ {0,1,2,3}`, where any value > 0 is treated as risky (yielding the binary MIL target).
A room is represented as an **action sequence** `patch_list = {α_i}` ordered by the (user, time) tuple, where each action follows the paper’s definition:
`α = (u, t, a, x)` (user, timestamp, action type, and optional textual/multimodal feature).
In our May/June datasets, each action record is stored with the following fields:
- `u_idx` (int): user index within the room, used to build the `users × timeslots` grid (e.g., 0 = streamer, 1..U = selected viewers).
- `global_user_idx` (int/str): global user identifier across the whole dataset (before remapping to `u_idx`).
- `timestamp` (int/float): the action timestamp `t`. In the formulation, timestamps fall within a window `[0, T]` after the room starts.
- `t` (int): timeslot index derived by discretizing `timestamp` into fixed-length windows. This is the column index when constructing the `users × timeslots` capsule grid.
- `l` (int): role indicator (recommended convention: `0 = viewer`, `1 = streamer`).
- `action_id` (int): the action type id `a` (e.g., enter, comment, like, gift, share; streamer-side actions may include stream start, ASR text, OCR text, etc.).
- `action_desc` (str / null): raw textual content associated with the action (e.g., comment text, ASR transcript, OCR text).
- `action_vec` (numpy): pre-encoded feature vector for `action_desc`.
Example (JSONL-like; each `patch_list` entry is ordered as `[u_idx, t, l, action_id, action_vec, timestamp, action_desc, global_user_idx]`):
```json
{
  "room_id": "1",
  "room_label": "2",
  "patch_list": [
    [0, 1, 0, 5, [0.0, 0.3, ...], 4, "Streamer voice-over: ...", 5415431],
    ...
  ]
}
```
---
## Intended use
**Primary use cases**
- Early detection of risky rooms (fraud, collusion, policy-violating coordinated behaviors)
- Evidence-based moderation: highlight localized suspicious segments (user–time capsules)
**Out of scope**
- Identifying or tracking specific individuals
- Any use that violates privacy laws, platform policies, or user consent requirements
---