---
license: bsd-3-clause
tags:
- test-fixtures
- sklearn
- tabular
---

# ferrotorch / ml-sklearn-parity-v1

scikit-learn reference outputs for ferrotorch-ml's tabular
operations, generated by running the 5-config matrix on a fixed
deterministic dataset and snapshotting the inputs + outputs as
`.bin` (multi-tensor f32) and `.json` (integer indices) files.

Phase D.3 of real-artifact-driven development (#1159). Companion to:

* `scripts/pin_pretrained_ml_fixtures.py` (this pin)
* `scripts/verify_ml_inference.py` (the harness)
* `ferrotorch-ml/examples/ml_op_dump.rs`
* `ferrotorch-ml/tests/conformance_sklearn_parity.rs`

sklearn version: 1.5.2.

## Configurations

* `pca_n4`: sklearn.decomposition.PCA(n_components=4).fit_transform (equality_mode=COSINE_SIM_PER_PC)
* `standard_scaler`: sklearn.preprocessing.StandardScaler().fit_transform (equality_mode=MAX_ABS)
* `one_hot_encoder`: sklearn.preprocessing.OneHotEncoder(sparse_output=False).fit_transform (equality_mode=EXACT)
* `kfold_5`: sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=42).split(arange(50)) (equality_mode=SET)
* `train_test_split_80_20`: sklearn.model_selection.train_test_split(X[100,10], y[100], test_size=0.2, random_state=42) (equality_mode=SET)

## Layout

One subfolder per configuration:

```
<config_name>/
  meta.json
  input_*.bin        # one or more input tensors (f32 LE multi-tensor)
  output_*.bin       # sklearn reference output(s) (f32 LE multi-tensor)
  fold_indices.json  # kfold_5 only — integer fold index lists
  split_indices.json # train_test_split_80_20 only — split indices
```
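As a rough illustration of the layout above, the following sketch groups the files of one configuration folder by role. The function name `list_config_artifacts` is hypothetical and not part of the repository's scripts, and nothing is assumed about the keys inside `meta.json`; it is simply parsed as JSON.

```python
import json
from pathlib import Path


def list_config_artifacts(config_dir):
    """Group the files in one configuration folder by role.

    Hypothetical helper for illustration; it only relies on the file
    naming shown in the layout listing.
    """
    config_dir = Path(config_dir)
    meta_path = config_dir / "meta.json"
    return {
        # meta.json is parsed as-is; its schema is not assumed here.
        "meta": json.loads(meta_path.read_text()) if meta_path.exists() else None,
        "inputs": sorted(p.name for p in config_dir.glob("input_*.bin")),
        "outputs": sorted(p.name for p in config_dir.glob("output_*.bin")),
        # fold_indices.json / split_indices.json exist only for the two
        # index-based configurations.
        "index_files": sorted(p.name for p in config_dir.glob("*_indices.json")),
    }
```

For the tensor-based configurations, `index_files` would be expected to come back empty.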

## Binary format

Each `.bin` file is a little-endian multi-tensor dump (same as
ferrotorch/dataloader-batches-v1 and ferrotorch/optimizer-trajectories-v1):

```
[u32 num_tensors]
per-tensor:
  [u32 ndim] [u32 × ndim shape] [f32 × prod(shape)]
```
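The format above can be parsed with the standard library alone. The sketch below is a minimal reader plus a matching writer for round-trip checks; the function names are hypothetical and this is not the code used by the pin script or the harness. Values are returned as a flat list per tensor, because the element order within a tensor (presumably row-major) is not stated above.

```python
import struct
from math import prod


def write_multi_tensor(tensors):
    """Serialize [(shape, flat_values), ...] into the little-endian layout."""
    out = bytearray(struct.pack("<I", len(tensors)))        # u32 num_tensors
    for shape, values in tensors:
        if len(values) != prod(shape):
            raise ValueError("value count does not match shape")
        out += struct.pack("<I", len(shape))                # u32 ndim
        out += struct.pack(f"<{len(shape)}I", *shape)       # u32 x ndim shape
        out += struct.pack(f"<{len(values)}f", *values)     # f32 x prod(shape)
    return bytes(out)


def read_multi_tensor(buf):
    """Parse a multi-tensor dump into [(shape, flat_values), ...]."""
    offset = 0
    (num_tensors,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    tensors = []
    for _ in range(num_tensors):
        (ndim,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        shape = struct.unpack_from(f"<{ndim}I", buf, offset)
        offset += 4 * ndim
        count = prod(shape)
        values = struct.unpack_from(f"<{count}f", buf, offset)
        offset += 4 * count
        tensors.append((tuple(shape), list(values)))
    if offset != len(buf):
        raise ValueError("trailing bytes after last tensor")
    return tensors


# Round trip with values that are exactly representable in f32.
blob = write_multi_tensor([
    ((2, 3), [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]),
    ((2,), [0.5, -1.5]),
])
roundtrip = read_multi_tensor(blob)
```

To read a real fixture, pass the file contents, for example `read_multi_tensor(Path("pca_n4/output_0.bin").read_bytes())`; the file name here is only an example of the `output_*.bin` pattern.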

## Equality semantics

* `pca_n4`: cosine_sim ≥ 0.9999 per principal component. PCs may
  flip sign across implementations, so the harness aligns each PC's
  sign before computing max_abs.
* `standard_scaler`: max_abs ≤ 1e-6 (essentially exact f32
  arithmetic; sklearn and ferrolearn both use the biased variance,
  dividing by n).
* `one_hot_encoder`: exact integer equality.
* `kfold_5`: SET-equality. Rust's `rand` crate (SmallRng) and
  NumPy's PRNG cannot byte-match the shuffle permutation, so only
  set-level invariants are checked: each test fold must have
  exactly 10 elements, and the union of all test folds must equal
  [0, 50).
* `train_test_split_80_20`: SET-equality. Sizes must be exactly
  80/20; the union of train and test indices must equal [0, 100);
  test labels must match test X rows (label consistency invariant).
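The checks listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `scripts/verify_ml_inference.py`: all function names are hypothetical, the PCA check takes the absolute cosine similarity instead of explicitly aligning signs (equivalent for the cosine test), and the label-consistency check is one reading of the invariant, namely that each test label equals the label of the row its index points to.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pca_columns_match(reference, candidate, threshold=0.9999):
    """Compare each principal component (column) up to a sign flip."""
    for k in range(len(reference[0])):
        ref_col = [row[k] for row in reference]
        cand_col = [row[k] for row in candidate]
        # abs() absorbs the per-component sign ambiguity (assumption:
        # stands in for the harness's explicit sign alignment).
        if abs(cosine_similarity(ref_col, cand_col)) < threshold:
            return False
    return True


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat lists."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def kfold_sets_ok(test_folds, n_samples=50, n_splits=5):
    """Every test fold has n_samples / n_splits items; folds cover [0, n_samples)."""
    if len(test_folds) != n_splits:
        return False
    if any(len(fold) != n_samples // n_splits for fold in test_folds):
        return False
    seen = sorted(i for fold in test_folds for i in fold)
    return seen == list(range(n_samples))


def split_sets_ok(train_idx, test_idx, y_all, y_test, n_samples=100, n_test=20):
    """Sizes are exact, indices partition [0, n_samples), labels follow rows."""
    if len(test_idx) != n_test or len(train_idx) != n_samples - n_test:
        return False
    if sorted(list(train_idx) + list(test_idx)) != list(range(n_samples)):
        return False
    # Label consistency: the j-th test label belongs to the j-th test row.
    return all(y_all[i] == y for i, y in zip(test_idx, y_test))
```

Because only set-level properties are compared for `kfold_5` and `train_test_split_80_20`, any valid permutation passes, which is what allows the Rust and NumPy shuffles to differ.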

## License

BSD-3-Clause. scikit-learn is itself licensed under BSD-3-Clause,
and the reference outputs are deterministic projections of
public-domain NumPy random state, so the BSD-3-Clause notice
carries through.
