	# LFM-Coder training bucket

	This bucket contains training artifacts from the fine-tuned model [rparkr/LFM2.5-1.2B-Instruct-Coding](https://huggingface.co/rparkr/LFM2.5-1.2B-Instruct-Coding).

	For an interactive view of training metrics, see the [Trackio space for this training run](https://huggingface.co/spaces/rparkr/lfm-coder-training).

	# Contents

	## [completions](https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/tree/completions)
	This directory contains every group of model completions generated during training. The model was trained on 1,000 examples for 3 epochs, so there are 3,000 groups (files) in total, and each group holds 8 completions of the same prompt.

	Each completions group is a Parquet file with these columns:
	- `step`: The training step number.
	- `prompt`: The prompt given to the model.
	- `completion`: The model's completion.
	- `coding_accuracy_reward`: The percentage of test cases the completion passed, or 0 or 1 when a binary reward is used (1 if all test cases passed, 0 otherwise).
	- `advantage`: The advantage value used for updating the LoRA weights through backpropagation, based on the relative `coding_accuracy_reward` compared to other completions in the group.
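
To make the `advantage` column concrete, here is a minimal sketch of a group-relative advantage for one group of rewards. It assumes a GRPO-style normalization (reward minus the group mean, divided by the group's population standard deviation plus a small epsilon). The exact formula used by the training code is not stated in this README and may differ. `group_advantages` is a hypothetical helper written only for illustration.

```bash
# Hypothetical helper: group-relative advantages for one group of rewards.
# ASSUMPTION: advantage = (reward - group mean) / (group std + 0.0001).
# The real training code may use a different estimator or no scaling at all.
group_advantages() {
  printf '%s\n' "$@" | awk '
    { r[NR] = $1; sum += $1 }          # collect rewards and their sum
    END {
      mean = sum / NR
      for (i = 1; i <= NR; i++) sq += (r[i] - mean) ^ 2
      std = sqrt(sq / NR)              # population standard deviation
      for (i = 1; i <= NR; i++) printf "%.3f\n", (r[i] - mean) / (std + 0.0001)
    }'
}

# One group of 8 binary rewards: two completions passed all tests
group_advantages 1 0 0 1 0 0 0 0
```

Under this assumed formula, the two passing completions get about 1.732 and the other six about -0.577. When every completion in a group earns the same reward, all advantages come out as 0, so that group contributes no learning signal.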

	You can explore the data with a tool such as [DuckDB](https://duckdb.org/install/?environment=cli):

	```bash
	# Select any file from completions_00001.parquet to completions_03000.parquet
	COMPLETIONS_FILE="completions_00001.parquet"
	duckdb -c "SELECT
	*
	FROM
	read_parquet('https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/completions/${COMPLETIONS_FILE}?download=true')
	;"
	```
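
The file names in the comment above follow a zero-padded, five-digit pattern, so the URL for any group can be built from its number. A minimal sketch: `completions_url` is a hypothetical helper (not part of the bucket or of any CLI), and it assumes the files are numbered 1 through 3,000 as described above.

```bash
# Hypothetical helper: build the download URL for one completions file.
# ASSUMPTION: files are numbered 00001..03000, as in the comment above.
completions_url() {
  # %05d zero-pads the number to five digits (pass it without leading zeros)
  printf 'https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/completions/completions_%05d.parquet?download=true\n' "$1"
}

# 1,000 examples x 3 epochs = 3,000 files; 8 completions per file
echo "$((1000 * 3)) files, $((1000 * 3 * 8)) completions"

# Print the URL for file number 42
completions_url 42
```

The printed URL can be pasted into the `read_parquet(...)` call shown above.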

	Alternatively, you can mount the bucket with [`hf-mount`](https://github.com/huggingface/hf-mount) and read all the data at once. The "Mount this bucket" button on the [directory page](https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/tree/completions) shows the instructions:

	```bash
	# Install hf-mount
	brew install hf-mount

	# Mount this bucket as a local folder
	hf-mount start bucket rparkr/lfm-coder-training-bucket ./local

	# Query all files
	duckdb -c "SELECT
	*
	FROM
	read_parquet('./local/completions/*.parquet')
	LIMIT 1000;"
	```
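
While the bucket is mounted, an aggregate query is often more useful than `SELECT *`. The sketch below averages `coding_accuracy_reward` per training step, using only the column names listed above. It assumes the mount point is `./local` as in the previous block, and it skips the query when `duckdb` or the mounted folder is missing.

```bash
# Mean reward per training step across all completions files.
# ASSUMPTION: the bucket is mounted at ./local, as in the block above.
QUERY="SELECT
  step,
  count(*) AS completions,
  avg(coding_accuracy_reward) AS mean_reward
FROM read_parquet('./local/completions/*.parquet')
GROUP BY step
ORDER BY step;"

# Run only if duckdb is installed and the mounted folder is present
if command -v duckdb >/dev/null 2>&1 && [ -d ./local/completions ]; then
  duckdb -c "$QUERY"
fi
```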

	```bash
	# Unmount when done
	hf-mount stop ./local
	```

	## [eval_results](https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/tree/eval_results)
	These JSON Lines files contain the model's results on the evaluation benchmarks, recorded every 1,000 training steps (i.e., at steps 1,000, 2,000, and 3,000).

	As with the completions directory, you can explore the data with DuckDB:

	```bash
	# The three files are named based on the timestamp of when evaluation began.
	# Step 1,000: eval_results/LFM2.5-1.2B-Instruct-grpo_2026-04-28T02-52-22Z.jsonl
	# Step 2,000: eval_results/LFM2.5-1.2B-Instruct-grpo_2026-04-30T01-00-08Z.jsonl
	# Step 3,000: eval_results/LFM2.5-1.2B-Instruct-grpo_2026-05-01T05-54-59Z.jsonl
	EVAL_RESULTS_FILE="eval_results/LFM2.5-1.2B-Instruct-grpo_2026-04-28T02-52-22Z.jsonl"

	duckdb -c "SELECT
	*
	FROM
	read_json('https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/${EVAL_RESULTS_FILE}?download=true')
	LIMIT 10
	;"
	```
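
Because the evaluation files are named by timestamp rather than by step, a small lookup can save some copy-and-paste. This is a sketch only: `eval_results_file` is a hypothetical helper, and the step-to-file mapping is taken directly from the comments in the block above.

```bash
# Hypothetical helper: map a training step to its evaluation file,
# using the three file names listed in the block above.
eval_results_file() {
  case "$1" in
    1000) echo "eval_results/LFM2.5-1.2B-Instruct-grpo_2026-04-28T02-52-22Z.jsonl" ;;
    2000) echo "eval_results/LFM2.5-1.2B-Instruct-grpo_2026-04-30T01-00-08Z.jsonl" ;;
    3000) echo "eval_results/LFM2.5-1.2B-Instruct-grpo_2026-05-01T05-54-59Z.jsonl" ;;
    *) echo "no evaluation recorded at step $1" >&2; return 1 ;;
  esac
}

# Pick the step 2,000 results, then reuse the variable in the query above
EVAL_RESULTS_FILE="$(eval_results_file 2000)"
echo "$EVAL_RESULTS_FILE"
```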

	You can also mount the bucket to read all the data at once. See the [completions](#completions) section above for instructions.

	## [trackio](https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/tree/trackio)

	This is the Trackio database that stores metrics from the training run. You can view the [Trackio space](https://huggingface.co/spaces/rparkr/lfm-coder-training), or explore the SQLite database with DuckDB:

	```bash
	# Download the SQLite database and journal file.
	# -o names the local file explicitly, so "?download=true" cannot end up in the filename.
	curl -L -o huggingface.db "https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/trackio/huggingface.db?download=true"
	curl -L -o huggingface.db-journal "https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/trackio/huggingface.db-journal?download=true"

	# Each `duckdb -c` call starts a fresh in-memory session,
	# so the database must be attached again in every command.
	ATTACH_SQL="ATTACH './huggingface.db' AS trackio (TYPE sqlite);"

	# List all tables in the database
	duckdb -c "$ATTACH_SQL SHOW TABLES FROM trackio;"

	# Query the metrics table (e.g., loss, coding_accuracy_reward)
	duckdb -c "$ATTACH_SQL SELECT * FROM trackio.metrics;"

	# Query the system metrics table (e.g., GPU utilization)
	duckdb -c "$ATTACH_SQL SELECT * FROM trackio.system_metrics;"
	```

	## [training_logs](https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/tree/training_logs)
	A JSON Lines file with logs written by the [training codebase](https://github.com/rparkr/lfm-coder) during the training run.

	You can explore these logs with DuckDB in the same way:

	```bash
	duckdb -c "SELECT
	*
	FROM
	read_json('https://huggingface.co/buckets/rparkr/lfm-coder-training-bucket/resolve/training_logs/training-log_2026-04-28.jsonl?download=true')
	;"
	```

	Here's a screenshot of using DuckDB for log analysis (launched with `duckdb -ui` to use the notebook-based web UI):
	![DuckDB log analysis](./images/duckdb-log-analysis.png)
