---
license: cc-by-4.0
task_categories:
- video-classification
- reinforcement-learning
- other
language:
- en
tags:
- opencs2
- counter-strike-2
- torchcodec
- video
- audio
- parquet
pretty_name: "OpenCS2 - POV Renders"
configs:
- config_name: pov_rounds
data_files:
- split: train
path: index/pov_rounds.parquet
default: true
- config_name: matches
data_files:
- split: train
path: index/matches.parquet
- config_name: rounds
data_files:
- split: train
path: index/rounds.parquet
- config_name: kills
data_files:
- split: train
path: events/kills.parquet
- config_name: duels
data_files:
- split: train
path: events/duels.parquet
- config_name: clip_events
data_files:
- split: train
path: events/clip_events.parquet
- config_name: round_player
data_files:
- split: train
path: events/round_player.parquet
- config_name: enums
data_files:
- split: train
path: metadata/enums.parquet
---
# OpenCS2 - POV Renders
![OpenCS2](https://huggingface.co/datasets/blanchon/opencs2_dataset/resolve/main/static/header.webp)
> Browse with the [OpenCS2 Viewer](https://huggingface.co/spaces/blanchon/counter-strike-2-dataset-viewer) - every match, map and round, with all 10 player POVs synced on one timeline.
Tick-aligned Counter-Strike 2 POV training clips, rendered from
[`blanchon/cs2_dataset_demo`](https://huggingface.co/datasets/blanchon/cs2_dataset_demo). Each row
in the main table is one player's perspective for one round; ten POVs per round share the same tick
clock.
Per POV round:
- **Video** - 1280x720 @ 32 fps, near-lossless H.264, faststart, muxed with audio.
- **Audio** - per-player stereo, mixed from that player's position and orientation.
- **Inputs** - every tick: keys, mouse delta, view angles, fire/jump/use, weapon switches.
- **World state** - every tick: player position, velocity, view, health, armor, weapon, alive flag.
This is the simple loose-file layout: every POV round has its own directory containing `video.mp4`,
`video.preview.mp4`, and `ticks.parquet`. For large-scale training with fewer Hub files, use the
WebDataset packaging: [`blanchon/opencs2_dataset_wds`](https://huggingface.co/datasets/blanchon/opencs2_dataset_wds).
The same loose-file dataset is also mirrored as a Hugging Face Storage Bucket:
[`hf://buckets/blanchon/opencs2_dataset`](https://huggingface.co/buckets/blanchon/opencs2_dataset).
Current build: `165,270` POV rounds (`2974.2` POV video hours, `528.0` synced
round-timeline hours), `16,527` rounds, `794` match/maps, `111,715` kills.
## Usage
The default config is the POV-round index. Use the event/index configs to filter first, then stream
only the MP4 and tick sidecar you selected.
| Config | Row | Use |
| --- | --- | --- |
| `pov_rounds` (default) | one `(match_id, map_name, round, player_slot)` with path-only `video.mp4`, `video.preview.mp4`, and `ticks.parquet` | training, media lookup, download-size estimates |
| `matches` | one per `(match_id, map_name)` with team/event metadata | match/map filtering |
| `rounds` | one per `(match_id, map_name, round)` with tick boundaries and round outcome | round filtering |
| `kills` | one per kill | filtering by weapon, side, headshot, smoke, wallbang, clutch, 1v1 |
| `duels` | one per kill normalized as winner/loser | duel mining, winner POV selection |
| `clip_events` | generic clip-mining event rows | simple event filters for clip extraction |
| `round_player` | one per player per round with compact stats | per-player round filters |
| `enums` | enum lookup table | mapping compact `*_id` columns back to labels |
## Structure
```text
rounds/
match_id=<id>/map_name=<map>/round=<round>/player=<slot>/
video.mp4
video.preview.mp4
ticks.parquet
index/
matches.parquet
rounds.parquet
pov_rounds.parquet
events/
kills.parquet
duels.parquet
clip_events.parquet
round_player.parquet
metadata/
enums.parquet
```
Media columns are path-only Hugging Face structs:
```python
{"bytes": None, "path": "hf://datasets/blanchon/opencs2_dataset@main/rounds/.../video.mp4"}
```
This keeps the Dataset Viewer preview working without embedding MP4 bytes into parquet. Use
`media_bytes` and `preview_video_bytes` to estimate exact download size after filtering.
## Parquet Tables
String-like filter columns are dictionary encoded where useful, and most have a matching `*_id`
column for fast integer joins or enum-based modeling. Player identity is always `player_slot`
(`0..9`), not Steam ID or username.
| File | Rows | Purpose |
| --- | ---: | --- |
| `index/pov_rounds.parquet` | 165,270 | one row per player POV round; includes side/weapon summary, capture ticks, survival/death, media paths, byte sizes, and tick sidecar path |
| `index/matches.parquet` | 794 | one row per match/map with HLTV link, event, teams, score, winner, date, and rounds played |
| `index/rounds.parquet` | 16,527 | one row per round with tick boundaries, duration, winner/reason/bomb site, kill counts, opening kill summary, 1v1/clutch flags |
| `events/kills.parquet` | 111,715 | attacker/victim slots and sides, event time, weapon/class, hit details, alive counts before/after, trade/1v1/clutch/opening flags |
| `events/duels.parquet` | 111,715 | kill events normalized as winner/loser duels, useful for selecting the winner POV |
| `events/clip_events.parquet` | 111,715 | generic clip-mining table with event time, target/other player slots, weapon/class, and boolean flags |
| `events/round_player.parquet` | 168,294 | player round stats: side, kills, deaths, assists, headshots, KAST |
| `metadata/enums.parquet` | 115 | enum lookup table: `enum_name`, `enum_id`, `value` |
| `rounds/**/ticks.parquet` | per POV | tick/input/world-state rows: `tick`, `t`, button lists, view angles, weapon, health/armor, position, velocity |
The tick column `t` is the timestamp within the POV video. Event tables store `event_seconds` on the same POV-video timeline, so you can seek media directly with
`event_video_seconds = event_seconds`, or join an event's `tick` against the selected POV's
`ticks.parquet` and use its `t` column.
## Stream One Clip
```python
import re
from datasets import Video, load_dataset
from huggingface_hub import hf_hub_url
from torchcodec.decoders import AudioDecoder, VideoDecoder
def hf_path_to_url(path):
repo_id, revision, filename = re.match(r"hf://datasets/([^@]+)@([^/]+)/(.+)", path).groups()
return hf_hub_url(repo_id=repo_id, repo_type="dataset", revision=revision, filename=filename)
ds = load_dataset("blanchon/opencs2_dataset", "pov_rounds", split="train")
ds = ds.cast_column("video", Video(decode=False))
row = ds[0]
url = hf_path_to_url(row["video"]["path"])
video = VideoDecoder(url, seek_mode="approximate")
clip = video.get_frames_played_in_range(20.0, 30.0)
audio = AudioDecoder(url)
samples = audio.get_samples_played_in_range(20.0, 30.0)
```
Use `seek_mode="approximate"` for streaming so TorchCodec does not scan the entire MP4 during
initialization.
## Filter Before Media Access
```python
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
awp_1v1 = con.sql("""
SELECT
d.match_id,
d.map_name,
d.round,
d.winner_player_slot AS player_slot,
d.event_seconds AS event_table_seconds,
d.event_seconds AS event_video_seconds,
p.video,
p.ticks_parquet_path,
p.media_bytes
FROM 'hf://datasets/blanchon/opencs2_dataset/events/duels.parquet' AS d
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' AS p
ON d.match_id = p.match_id
AND d.map_name = p.map_name
AND d.round = p.round
AND d.winner_player_slot = p.player_slot
WHERE d.weapon = 'awp'
AND d.is_1v1_before
""").df()
print(awp_1v1.head())
print("estimated MP4 bytes:", int(awp_1v1["media_bytes"].sum()))
```
To extract the 10 seconds around the duel, convert `video.path` with `hf_hub_url()` as above and
decode `[max(0, event_video_seconds - 5), event_video_seconds + 5]`.
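That clamping is just a max/min against the video bounds; a small hypothetical helper mirroring the bounds used by the `save_clip` recipes later in this card:

```python
def clip_window(event_video_seconds, duration_s, before=5.0, after=5.0):
    """Clamp a [before, after] window around an event to the POV video bounds."""
    start = max(0.0, event_video_seconds - before)
    stop = min(duration_s, event_video_seconds + after)
    return start, stop

print(clip_window(3.0, 60.0))   # (0.0, 8.0): start clamped at the video head
print(clip_window(58.0, 60.0))  # (53.0, 60.0): stop clamped at the video tail
```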
## Verified Clip Recipes
> [!TIP]
> These recipes were verified by exporting 10 local examples each. For kill-derived examples, center clips on
> `event_video_seconds = event_seconds`.
Common Python setup:
This helper writes video-only MP4 clips through TorchCodec. It decodes the selected range as a
PyTorch `uint8` tensor, then encodes it back to H.264 MP4.
```python
import json
import re
from pathlib import Path
import duckdb
from huggingface_hub import hf_hub_url
from PIL import Image
from torchcodec.decoders import VideoDecoder
from torchcodec.encoders import VideoEncoder
OUT = Path("opencs2_examples")
FPS = 32.0
def hf_path_to_url(path):
repo_id, revision, filename = re.match(r"hf://datasets/([^@]+)@([^/]+)/(.+)", path).groups()
return hf_hub_url(repo_id=repo_id, repo_type="dataset", revision=revision, filename=filename)
def open_mp4(row):
return hf_path_to_url(row["video_path"])
def save_clip(row, name, before=5.0, after=5.0):
center = float(row["event_video_seconds"])
start = max(0.0, center - before)
stop = min(float(row["duration_s"]), center + after)
out = OUT / name / f"{row['event_id']}.mp4"
out.parent.mkdir(parents=True, exist_ok=True)
frames = VideoDecoder(
open_mp4(row),
seek_mode="approximate",
dimension_order="NCHW",
).get_frames_played_in_range(start_seconds=start, stop_seconds=stop, fps=FPS)
VideoEncoder(frames.data, frame_rate=FPS).to_file(
out,
codec="libx264",
pixel_format="yuv420p",
crf=20,
preset="veryfast",
extra_options={"x264-params": "keyint=32:min-keyint=1:scenecut=0:open-gop=0"},
)
return out
def save_png(frame_hwc, path):
Image.fromarray(frame_hwc.cpu().numpy()).save(path)
def save_frame_pair(row, name):
out = OUT / name / f"{row['media_id']}-{int(row['tick'])}"
out.mkdir(parents=True, exist_ok=True)
frames = VideoDecoder(
open_mp4(row),
seek_mode="approximate",
dimension_order="NHWC",
).get_frames_played_at(seconds=[float(row["t"]), float(row["next_t"])])
frame_t = frames.data[0]
frame_t1 = frames.data[1]
save_png(frame_t, out / "frame_t.png")
save_png(frame_t1, out / "frame_t_plus_1.png")
tick_t = {k: v for k, v in row.items() if not k.startswith("next_") and k != "video_path"}
tick_t_plus_1 = {**tick_t, "tick": int(row["next_tick"]), "t": float(row["next_t"])}
(out / "tick_t.json").write_text(json.dumps(tick_t, indent=2, default=str) + "\n")
(out / "tick_t_plus_1.json").write_text(json.dumps(tick_t_plus_1, indent=2, default=str) + "\n")
return out
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
```
<details>
<summary><strong>AWP 1v1 Duel</strong></summary>
Winner POV for AWP kills where the duel table says the fight was a 1v1 before the kill.
```python
rows = con.sql("""
SELECT d.duel_id AS event_id, d.event_seconds AS event_video_seconds,
d.weapon, d.distance, d.headshot, p.duration_s,
struct_extract(p.video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/events/duels.parquet' d
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON d.match_id=p.match_id AND d.map_name=p.map_name AND d.round=p.round
AND d.winner_player_slot=p.player_slot
WHERE d.weapon='awp' AND d.is_1v1_before
AND p.duration_s >= d.event_seconds + 5.0
ORDER BY d.event_seconds
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "awp_1v1_duel")
```
</details>
<details>
<summary><strong>Kill Through Smoke</strong></summary>
Attacker POV, with the kill centered five seconds into the exported clip.
```python
rows = con.sql("""
SELECT k.kill_id AS event_id, k.event_seconds AS event_video_seconds,
k.weapon, k.distance, k.headshot, p.duration_s,
struct_extract(p.video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/events/kills.parquet' k
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON k.match_id=p.match_id AND k.map_name=p.map_name AND k.round=p.round
AND k.attacker_player_slot=p.player_slot
WHERE k.through_smoke
AND p.duration_s >= k.event_seconds + 5.0
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "kill_through_smoke")
```
</details>
<details>
<summary><strong>Noscope / Wallbang Highlight</strong></summary>
Attacker POV for kills flagged as noscope, wallbang, or penetration.
```python
rows = con.sql("""
SELECT k.kill_id AS event_id, k.event_seconds AS event_video_seconds,
k.weapon, k.noscope, k.wallbang, k.penetrated, p.duration_s,
struct_extract(p.video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/events/kills.parquet' k
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON k.match_id=p.match_id AND k.map_name=p.map_name AND k.round=p.round
AND k.attacker_player_slot=p.player_slot
WHERE (k.noscope OR k.wallbang OR k.penetrated > 0)
AND p.duration_s >= k.event_seconds + 5.0
ORDER BY k.noscope DESC, k.wallbang DESC, k.penetrated DESC
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "noscope_wallbang")
```
</details>
<details>
<summary><strong>Knife Kill</strong></summary>
Attacker POV for actual knife kills, not just rounds where the player holds a knife.
```python
rows = con.sql("""
SELECT k.kill_id AS event_id, k.event_seconds AS event_video_seconds,
k.weapon, p.duration_s, struct_extract(p.video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/events/kills.parquet' k
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON k.match_id=p.match_id AND k.map_name=p.map_name AND k.round=p.round
AND k.attacker_player_slot=p.player_slot
WHERE (lower(k.weapon_class)='knife' OR lower(k.weapon) LIKE '%knife%'
OR lower(k.weapon) LIKE '%bayonet%' OR lower(k.weapon) LIKE '%karambit%')
AND p.duration_s >= k.event_seconds + 5.0
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "knife_kill")
```
</details>
<details>
<summary><strong>Five Kills Under 10 Seconds</strong></summary>
Groups kills by player and round, then exports from the first kill through the end of the streak.
```python
rows = con.sql("""
WITH streaks AS (
SELECT match_id, map_name, round, attacker_player_slot AS player_slot,
COUNT(*) AS n_kills,
MIN(event_seconds) AS first_kill_video_seconds,
MAX(event_seconds) AS last_kill_video_seconds
FROM 'hf://datasets/blanchon/opencs2_dataset/events/kills.parquet'
GROUP BY match_id, map_name, round, attacker_player_slot
HAVING COUNT(*) >= 5 AND MAX(event_seconds) - MIN(event_seconds) < 10.0
)
SELECT concat('streak-', s.match_id, '-', s.map_name, '-r', s.round, '-p', s.player_slot) AS event_id,
s.first_kill_video_seconds AS event_video_seconds,
s.last_kill_video_seconds, s.n_kills, p.duration_s,
struct_extract(p.video, 'path') AS video_path
FROM streaks s
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON s.match_id=p.match_id AND s.map_name=p.map_name AND s.round=p.round
AND s.player_slot=p.player_slot
ORDER BY s.last_kill_video_seconds - s.first_kill_video_seconds
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "five_kills_under_10s", before=2.0, after=row["last_kill_video_seconds"] - row["event_video_seconds"] + 2.0)
```
</details>
<details>
<summary><strong>Very Long Distance Kill</strong></summary>
Attacker POV for the longest kills by event-table distance.
```python
rows = con.sql("""
SELECT k.kill_id AS event_id, k.event_seconds AS event_video_seconds,
k.weapon, k.distance, p.duration_s, struct_extract(p.video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/events/kills.parquet' k
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON k.match_id=p.match_id AND k.map_name=p.map_name AND k.round=p.round
AND k.attacker_player_slot=p.player_slot
WHERE k.distance IS NOT NULL
AND p.duration_s >= k.event_seconds + 5.0
ORDER BY k.distance DESC
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_clip(row, "long_distance_kill")
```
</details>
<details>
<summary><strong>Position-Based Clip</strong></summary>
For global position scans, use the consolidated WDS tick index, then export the matching POV from
this repo.
```python
rows = con.sql("""
WITH ticks AS (
SELECT media_id, match_id, map_name, round, player_slot, tick, t, x, y, z
FROM 'hf://datasets/blanchon/opencs2_dataset_wds/ticks/match_id=2391545/map_name=de_anubis/ticks.parquet'
WHERE is_alive AND t > 5.0
),
anchors AS (
SELECT * FROM ticks
WHERE tick % 64 = 0
AND x BETWEEN -875 AND -625 AND y BETWEEN 125 AND 375
),
pairs AS (
SELECT DISTINCT ON (a.media_id) a.*, b.tick AS next_tick, b.t AS next_t
FROM anchors a JOIN ticks b ON a.media_id=b.media_id AND a.tick + 2 = b.tick
ORDER BY a.media_id, a.t
)
SELECT * FROM pairs LIMIT 10
""").df()
for row in rows.to_dict("records"):
pov = con.execute("""
SELECT duration_s, struct_extract(video, 'path') AS video_path
FROM 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet'
WHERE media_id = ?
""", [row["media_id"]]).df().iloc[0].to_dict()
save_clip({**row, **pov, "event_id": f"pos-{row['media_id']}-{row['tick']}", "event_video_seconds": row["t"]}, "position_based_clip")
```
</details>
<details>
<summary><strong>Boosting, Higher Player POV</strong></summary>
Uses tick positions to find one player standing on top of a nearby lower player for multiple
consecutive ticks. This is a heuristic, so visually inspect the results.
Recipe id: `boosting_top_player`.
```sql
xy_distance < 36
z_delta BETWEEN 45 AND 90
abs(top.velocity_z) < 12
abs(lower.velocity_z) < 12
support_ticks >= 16
```
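The per-tick conditions above can be restated in Python as a hypothetical sketch (the real recipe evaluates them in SQL over the consolidated tick index; `x`, `y`, `z`, `velocity_z` are the tick columns documented earlier):

```python
import math

# Hypothetical restatement of the boosting conditions for one pair of players
# at a single tick; field names follow the ticks.parquet columns.
def is_boost_pair(top, lower):
    xy_distance = math.hypot(top["x"] - lower["x"], top["y"] - lower["y"])
    z_delta = top["z"] - lower["z"]
    return (
        xy_distance < 36
        and 45 <= z_delta <= 90
        and abs(top["velocity_z"]) < 12
        and abs(lower["velocity_z"]) < 12
    )

top = {"x": 100.0, "y": 200.0, "z": 128.0, "velocity_z": 0.0}
lower = {"x": 110.0, "y": 205.0, "z": 64.0, "velocity_z": 0.0}
print(is_boost_pair(top, lower))  # True
```

A candidate only qualifies when this predicate holds for at least 16 consecutive ticks (`support_ticks >= 16`).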
</details>
<details>
<summary><strong>Frame Pair Dataset Preview</strong></summary>
Selects `(frame_t, tick_t, frame_t+1, tick_t+1)` at a specific map position. At 32 fps, adjacent
video frames are usually two 64 Hz demo ticks apart.
Recipe id: `frame_pair_preview`.
```python
rows = con.sql("""
WITH ticks AS (
SELECT media_id, match_id, map_name, round, player_slot, tick, t, x, y, z
FROM 'hf://datasets/blanchon/opencs2_dataset_wds/ticks/match_id=2391545/map_name=de_anubis/ticks.parquet'
WHERE is_alive AND t > 5.0
),
anchors AS (
SELECT * FROM ticks
WHERE tick % 64 = 0
AND x BETWEEN -875 AND -625 AND y BETWEEN 125 AND 375
),
pairs AS (
SELECT DISTINCT ON (a.media_id) a.*, b.tick AS next_tick, b.t AS next_t
FROM anchors a JOIN ticks b ON a.media_id=b.media_id AND a.tick + 2 = b.tick
ORDER BY a.media_id, a.t
)
SELECT pairs.*, struct_extract(p.video, 'path') AS video_path
FROM pairs
JOIN 'hf://datasets/blanchon/opencs2_dataset/index/pov_rounds.parquet' p
ON pairs.media_id=p.media_id
LIMIT 10
""").df()
for row in rows.to_dict("records"):
save_frame_pair(row, "frame_pair_preview")
```
</details>
## Ticks And Frame Pairs
Each `ticks.parquet` row is one demo tick for one POV. The render is 32 fps while demo ticks are
64 Hz, so adjacent video frames are usually two tick rows apart. For `(frame_t, tick_t,
frame_t+1, tick_t+1)`, load the selected tick sidecar, choose timestamps, and decode both frames in
one TorchCodec call:
```python
import pyarrow.parquet as pq
from torchcodec.decoders import VideoDecoder

# `url` is the MP4 URL for this POV round (see `hf_path_to_url` above). For
# local/cache-first workflows, download the tick parquet first, or query its
# hf:// URL directly with DuckDB.
ticks = pq.read_table("ticks.parquet").to_pandas()
t0 = 12.0
t1 = t0 + 1.0 / 32.0
tick0 = ticks.iloc[(ticks["t"] - t0).abs().argmin()]
tick1 = ticks.iloc[(ticks["t"] - t1).abs().argmin()]
frames = VideoDecoder(url, seek_mode="approximate").get_frames_played_at(
seconds=[float(tick0["t"]), float(tick1["t"])]
)
```
For high-throughput frame-pair training, prefer the WDS repo, group many timestamps by `media_id`,
decode them in batches, then shuffle emitted pairs.
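The grouping step can be sketched as follows (a hypothetical helper; `group_by_media` is not part of the dataset tooling):

```python
from collections import defaultdict

# Collect requested (media_id, timestamp) pairs so each MP4 is opened once
# and all of its timestamps go into a single batched decode call.
def group_by_media(requests):
    grouped = defaultdict(list)
    for media_id, t in requests:
        grouped[media_id].append(t)
    return {media_id: sorted(ts) for media_id, ts in grouped.items()}

batches = group_by_media([("a", 3.0), ("b", 1.0), ("a", 1.5)])
print(batches)  # {'a': [1.5, 3.0], 'b': [1.0]}
```

Each group then becomes one `VideoDecoder(url, seek_mode="approximate").get_frames_played_at(seconds=ts)` call, and the emitted frame pairs are shuffled afterwards.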
## Downloading
```bash
# Metadata only.
hf download blanchon/opencs2_dataset --repo-type dataset \
--include "index/*.parquet" \
--include "events/*.parquet" \
--include "metadata/*.parquet"
# One full POV round.
hf download blanchon/opencs2_dataset --repo-type dataset \
--include "rounds/match_id=2391545/map_name=de_anubis/round=01/player=00/**"
```
## Creation
Built with a headless CS2 recorder from HLTV `.dem` files. The recorder replays each demo, captures
all 10 player POVs, validates tick/frame boundaries, streams frames through FFmpeg, muxes per-player
audio into the MP4, and writes typed parquet sidecars.
## Licensing
`.dem` source data is mirrored from HLTV; downstream use is bound by the original tournament terms.
Renders and metadata are released as **CC-BY-4.0**.
## Citation
```bibtex
@misc{blanchon2026opencs2,
author = {Julien Blanchon},
title = {OpenCS2 Dataset},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://github.com/julien-blanchon/opencs2-dataset}}
}
```
