Buckets:

MisterAI
/

LocalAI_Demo_data

Files

xet

MisterAI/LocalAI_Demo_data / skills /apache-arrow /SKILL.md

MisterAI

about 1 month ago

preview code

download

raw

8.54 kB

	---
	name: apache-arrow
	description: Expert guidance for Apache Arrow, the cross-language columnar memory format for analytics workloads. Helps developers use Arrow for high-performance data interchange between systems, zero-copy reads, and efficient columnar processing in Python (PyArrow) and JavaScript (Arrow JS).
	license: Apache-2.0
	compatibility: No special requirements
	metadata:
	author: terminal-skills
	version: 1.0.0
	category: data-ai
	tags:
	- data-format
	- columnar
	- interop
	- python
	- javascript
	---

	# Apache Arrow — Columnar Data Format


	## Overview


	Apache Arrow, the cross-language columnar memory format for analytics workloads. Helps developers use Arrow for high-performance data interchange between systems, zero-copy reads, and efficient columnar processing in Python (PyArrow) and JavaScript (Arrow JS).


	## Instructions

	### PyArrow — Python Interface

	```python
	# src/data/arrow_ops.py — High-performance data operations with PyArrow
	import pyarrow as pa
	import pyarrow.parquet as pq
	import pyarrow.compute as pc
	import pyarrow.csv as pcsv

	# Create Arrow tables from Python data
	table = pa.table({
	"user_id": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
	"name": pa.array(["Alice", "Bob", "Charlie", "Diana", "Eve"]),
	"revenue": pa.array([150.0, 320.5, 89.0, 1200.0, 45.5], type=pa.float64()),
	"signup_date": pa.array([
	"2026-01-15", "2026-01-20", "2026-02-01", "2026-02-10", "2026-03-01"
	]).cast(pa.date32()),
	"is_active": pa.array([True, True, False, True, False]),
	})

	# Compute operations (vectorized, no Python loops)
	high_value = pc.filter(table, pc.greater(table["revenue"], 100))
	total_revenue = pc.sum(table["revenue"]).as_py() # 1805.0
	avg_revenue = pc.mean(table["revenue"]).as_py() # 361.0
	sorted_table = pc.sort_indices(table, sort_keys=[("revenue", "descending")])

	# Read/write Parquet files (the standard format for Arrow data)
	pq.write_table(table, "users.parquet", compression="zstd")
	loaded = pq.read_table("users.parquet")

	# Read with column selection and row filtering (pushdown to file)
	subset = pq.read_table(
	"users.parquet",
	columns=["user_id", "revenue"], # Only read these columns
	filters=[("revenue", ">", 100)], # Predicate pushdown
	)

	# Read CSV with type inference
	csv_table = pcsv.read_csv("data.csv", convert_options=pcsv.ConvertOptions(
	column_types={"amount": pa.float64(), "count": pa.int32()},
	))

	# Streaming reads for large files (process in batches)
	parquet_file = pq.ParquetFile("large_dataset.parquet")
	for batch in parquet_file.iter_batches(batch_size=10_000):
	# Process each batch (RecordBatch) without loading the full file
	filtered = pc.filter(batch, pc.greater(batch["amount"], 0))
	process_batch(filtered)
	```

	### Zero-Copy Interop

	```python
	# Arrow enables zero-copy conversion between libraries
	import pyarrow as pa
	import pandas as pd
	import polars as pl

	# Arrow → Pandas (zero-copy when possible)
	arrow_table = pa.table({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
	pandas_df = arrow_table.to_pandas() # Near-instant for compatible types

	# Pandas → Arrow
	arrow_from_pandas = pa.Table.from_pandas(pandas_df)

	# Arrow → Polars (zero-copy)
	polars_df = pl.from_arrow(arrow_table)

	# Polars → Arrow (zero-copy)
	arrow_from_polars = polars_df.to_arrow()

	# Arrow enables data exchange between:
	# Python ↔ R (via reticulate)
	# Python ↔ DuckDB (zero-copy)
	# Python ↔ Spark (via PySpark)
	# JavaScript ↔ WASM modules
	```

	### Partitioned Datasets

	```python
	# Work with partitioned datasets on disk or cloud storage
	import pyarrow.dataset as ds

	# Read a partitioned Parquet dataset (Hive-style partitioning)
	# data/
	# year=2025/month=01/part-0.parquet
	# year=2025/month=02/part-0.parquet
	# year=2026/month=01/part-0.parquet

	dataset = ds.dataset(
	"s3://my-bucket/events/",
	format="parquet",
	partitioning=ds.partitioning(
	pa.schema([
	("year", pa.int32()),
	("month", pa.int32()),
	]),
	flavor="hive",
	),
	)

	# Scan with partition pruning (only reads relevant files)
	scanner = dataset.scanner(
	columns=["event_type", "user_id", "timestamp"],
	filter=(ds.field("year") == 2026) & (ds.field("month") >= 1),
	)
	table = scanner.to_table()

	# Write partitioned dataset
	ds.write_dataset(
	table,
	"output/events/",
	format="parquet",
	partitioning=ds.partitioning(
	pa.schema([("year", pa.int32()), ("month", pa.int32())]),
	flavor="hive",
	),
	existing_data_behavior="overwrite_or_ignore",
	)
	```

	### Arrow IPC (Inter-Process Communication)

	```python
	# Share data between processes without serialization overhead
	import pyarrow as pa
	import pyarrow.ipc as ipc

	# Write Arrow IPC format (for streaming between processes)
	table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

	# File format (random access)
	with pa.OSFile("data.arrow", "wb") as f:
	writer = ipc.new_file(f, table.schema)
	writer.write_table(table)
	writer.close()

	# Stream format (append-only, lower overhead)
	sink = pa.BufferOutputStream()
	writer = ipc.new_stream(sink, table.schema)
	writer.write_table(table)
	writer.close()
	buffer = sink.getvalue() # bytes that can be sent over network/pipe

	# Read back
	reader = ipc.open_file("data.arrow")
	loaded = reader.read_all()
	```

	### JavaScript (Arrow JS)

	```typescript
	// src/data/arrow-client.ts — Read Arrow data in the browser
	import { tableFromIPC, tableToIPC } from "apache-arrow";

	// Fetch Arrow IPC data from an API
	async function fetchArrowData(url: string) {
	const response = await fetch(url);
	const buffer = await response.arrayBuffer();

	// Parse Arrow IPC format (zero-copy in WASM-backed implementations)
	const table = tableFromIPC(new Uint8Array(buffer));

	console.log(`Loaded ${table.numRows} rows, ${table.numCols} columns`);
	console.log("Schema:", table.schema.fields.map((f) => `${f.name}: ${f.type}`));

	// Access columns
	const ids = table.getChild("id");
	const values = table.getChild("value");

	// Iterate rows
	for (const row of table) {
	console.log(row.toJSON()); // { id: 1, value: 10.0 }
	}

	return table;
	}

	// Send Arrow data to a server
	async function sendArrowData(url: string, table: any) {
	const buffer = tableToIPC(table);
	await fetch(url, {
	method: "POST",
	headers: { "Content-Type": "application/vnd.apache.arrow.stream" },
	body: buffer,
	});
	}
	```

	## Installation

	```bash
	# Python
	pip install pyarrow

	# JavaScript
	npm install apache-arrow

	# With DuckDB (Arrow-native)
	pip install duckdb # DuckDB uses Arrow internally
	```


	## Examples


	### Example 1: Integrating Apache Arrow into an existing application

	User request:

	```
	Add Apache Arrow to my Next.js app for the AI chat feature. I want streaming responses.
	```

	The agent installs the SDK, creates an API route that initializes the Apache Arrow client, configures streaming, selects an appropriate model, and wires up the frontend to consume the stream. It handles error cases and sets up proper environment variable management for the API key.

	### Example 2: Optimizing zero-copy interop performance

	User request:

	```
	My Apache Arrow calls are slow and expensive. Help me optimize the setup.
	```

	The agent reviews the current implementation, identifies issues (wrong model selection, missing caching, inefficient prompting, no batching), and applies optimizations specific to Apache Arrow's capabilities — adjusting model parameters, adding response caching, and implementing retry logic with exponential backoff.


	## Guidelines

	1. Parquet for storage, Arrow for compute — Write Parquet to disk/S3; use Arrow in-memory for processing
	2. Column pruning — Always specify `columns=` when reading Parquet; reading all columns wastes I/O and memory
	3. Predicate pushdown — Use `filters=` in Parquet reads; the reader skips row groups that don't match
	4. Zero-copy when possible — Use `to_pandas(self_destruct=True)` for large tables; Arrow can transfer memory ownership
	5. Batch processing for large files — Use `iter_batches()` instead of reading entire files into memory
	6. IPC for microservices — Arrow IPC is faster than JSON/CSV for data exchange between services
	7. Partitioned datasets for scale — Partition by date/category; queries only scan relevant partitions
	8. DuckDB for Arrow queries — DuckDB can query Arrow tables directly with zero copy: `duckdb.arrow(table).query("SELECT ...")`

Xet Storage Details

Size:: 8.54 kB
Xet hash:: 0d798f1fc95592d0f4694ff940c6d9da347142392ffc322501642715e82b6097

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.