Phase 1: one GPU thread per 1024-byte chunk (parallel). Phase 2: stack tree-merge → root. Correctness vs the CPU oracle, plus throughput on a big block (the size of the 145 MB embed).