google-labs-jules[bot] and kastnerp committed
Commit b4316dc · Parent(s): da6c0e0

Optimized inference memory conversions and API bytes manipulation.

- Swapped `np.add` and `np.multiply` inside `_run_inference_from_array` for an equivalent but faster combination without extra math steps.
- Swapped the `astype` and `transpose` order to produce a C-contiguous layout immediately on the type cast.
- Passed `memoryview(wind_speeds_arr)` to `gzip.compress` instead of allocating a `wind_speeds_arr.tobytes()` copy in the API response logic.
- Allocated a flat 1D `<u4` array in `_color_to_windspeed` with `np.empty` and an overlaid `uint8` view instead of `np.zeros`.
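The `np.add`/`np.multiply` swap in the first bullet is arithmetically equivalent: `x * 127.5 + 127.5 == (x + 1.0) * 127.5`. A minimal sketch (the sample values are illustrative, not the model's real output):

```python
import numpy as np

# Illustrative model output in [-1, 1], as the inference code would produce.
raw = np.array([-1.0, 0.0, 1.0], dtype=np.float32)

old_style = raw * 127.5 + 127.5   # multiply, then add
new_style = (raw + 1.0) * 127.5   # add, then multiply: same mapping to [0, 255]

assert np.array_equal(old_style, new_style)
```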
Co-authored-by: kastnerp <1919773+kastnerp@users.noreply.github.com>
- .jules/bolt.md +14 -0
- api.py +15 -11
.jules/bolt.md CHANGED

@@ -9,9 +9,11 @@

## 2025-03-03 - NumPy to Python List Conversion Overhead

**Learning:** Converting a large NumPy array (e.g., 262,144 elements) to a Python `list[float]` and back to a `np.float32` array using `.tolist()` and `np.array()` takes ~2.3 seconds per request. This overhead is incredibly high and unnecessary when the final format needed is bytes. Returning the NumPy array directly from processing functions completely bypasses this performance bottleneck, reducing conversion time to mere milliseconds.

**Action:** Never convert large NumPy arrays to Python lists if the data ultimately stays in a numerical format or gets converted back to an array/bytes. Always maintain the data as a NumPy array and use array operations directly (e.g., `.astype()`, `.tobytes()`).
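The round trip this entry warns about can be sketched as follows (the array size mirrors the entry's example; the data itself is illustrative):

```python
import numpy as np

arr = np.random.rand(262_144).astype(np.float32)

# Slow path from this entry: array -> Python list -> array again.
round_tripped = np.array(arr.tolist(), dtype=np.float32)

# Fast path: never leave NumPy; the cast is a no-op for a matching dtype.
direct = arr.astype(np.float32, copy=False)

assert np.array_equal(round_tripped, arr)
assert direct is arr  # no allocation or copy happened
```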
## 2025-03-04 - gzip.compress default level is incredibly slow

**Learning:** Python's `gzip.compress` defaults to `compresslevel=9` (maximum compression). For reasonably sized binary arrays (e.g., 1MB float32 array converted to bytes), compression level 9 is incredibly slow (~500ms) but only yields a marginally smaller output compared to compression level 1 (~25ms). This results in massive API response latency with almost no bandwidth saving.

**Action:** When compressing API payloads dynamically (especially numpy array bytes), always explicitly specify `compresslevel=1` to optimize for speed over marginal compression size differences.
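A quick sketch of the two levels this entry compares (the 1 MiB zero payload is a stand-in, not the API's real data, so the timings and ratios here are not representative):

```python
import gzip

payload = bytes(1 << 20)  # 1 MiB stand-in for an array's raw bytes

fast = gzip.compress(payload, compresslevel=1)   # speed-oriented level
small = gzip.compress(payload, compresslevel=9)  # the slow default

# Both levels are lossless; they differ only in speed and output size.
assert gzip.decompress(fast) == payload
assert gzip.decompress(small) == payload
```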
## 2025-03-05 - Array Type Conversion Overhead

**Learning:** Expanding dimensions on an array (e.g. `np.expand_dims(data, axis=0).astype(np.float32)`) implicitly allocates a new array if `copy=False` isn't specified, even when the original array is already of type `np.float32`. This creates significant memory overhead and allocation time (~0.3ms vs ~0.003ms for a 3x512x512 array).

**Action:** When casting types for inputs that may already be of the target type, always use `copy=False` or check the dtype explicitly before casting.
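The allocation difference can be observed directly with `np.shares_memory`, using the 3x512x512 shape from the entry:

```python
import numpy as np

data = np.ones((3, 512, 512), dtype=np.float32)

# Default astype always allocates, even when the dtype already matches.
copied = np.expand_dims(data, axis=0).astype(np.float32)

# copy=False makes the cast a no-op here, so only a view is created.
viewed = np.expand_dims(data, axis=0).astype(np.float32, copy=False)

assert np.shares_memory(viewed, data)       # still backed by data's buffer
assert not np.shares_memory(copied, data)   # a fresh allocation
```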
@@ -35,3 +37,15 @@

## 2025-03-05 - Parallelizing NumPy Matmul Startup

**Learning:** `np.matmul` natively releases the Python Global Interpreter Lock (GIL). When computing many independent matrix multiplications iteratively on a single thread (like generating a large 1D lookup table for colormaps), using `concurrent.futures.ThreadPoolExecutor` allows full utilization of multi-core CPUs.

**Action:** When performing heavy independent matrix math in a loop that blocks server startup, divide the work into chunks and use a standard Python ThreadPoolExecutor to speed it up. Be mindful of temporary array memory consumption per thread and cap the number of threads (e.g. `min(8, os.cpu_count())`).
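One way the chunked thread-pool pattern could look, with a hypothetical stand-in workload (the matrix sizes and count are illustrative, not the real colormap LUT build):

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a LUT build: many independent matmuls.
mats = [np.random.rand(64, 64) for _ in range(32)]

def chunk_matmul(chunk):
    # np.matmul releases the GIL, so worker threads truly run in parallel.
    return [m @ m for m in chunk]

# Cap the thread count, as the entry advises.
n = min(8, os.cpu_count() or 1)
size = (len(mats) + n - 1) // n
chunks = [mats[i:i + size] for i in range(0, len(mats), size)]

with ThreadPoolExecutor(max_workers=n) as ex:
    # ex.map preserves chunk order, so results stay in input order.
    results = [r for part in ex.map(chunk_matmul, chunks) for r in part]

assert len(results) == len(mats)
```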
+
+## 2025-03-05 - Avoid .tobytes(): Zero-Copy Buffers for Compression
+
+**Learning:** Using `.tobytes()` on a numpy array creates a full copy of the array's data as a bytes object. For a 1MB float32 array, this takes up extra memory and time. `gzip.compress` supports buffer protocol objects, so we can pass `memoryview(arr)` directly to compress the array data without copying it into a Python `bytes` object first.
+
+**Action:** When feeding numpy arrays to functions accepting buffers (like `gzip.compress`), always wrap them in a `memoryview` instead of calling `.tobytes()` to eliminate memory copies and improve latency.
+
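A sketch of the two paths this entry contrasts (the array here is a random stand-in, not the API's real wind-speed payload):

```python
import gzip
import numpy as np

wind_speeds = np.random.rand(262_144).astype(np.float32)

# Copying path: .tobytes() materializes the whole array as a bytes object first.
with_copy = gzip.compress(wind_speeds.tobytes(), compresslevel=1)

# Zero-copy path: gzip.compress accepts any C-contiguous buffer directly.
zero_copy = gzip.compress(memoryview(wind_speeds), compresslevel=1)

# Both paths compress the same underlying bytes.
assert gzip.decompress(zero_copy) == wind_speeds.tobytes()
```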
+## 2025-03-05 - NumPy Transpose and Astype Order
+
+**Learning:** Calling `.astype(np.uint8).transpose(...)` creates an intermediate uint8 array in the original layout, then returns a non-contiguous transposed view. By calling `.transpose(...).astype(np.uint8)`, numpy iterates over the transposed memory view and constructs a brand new, memory-contiguous output array directly.
+
+**Action:** When a contiguous output array is needed (like for PIL conversion or memory operations), transpose the array *before* casting its type to force contiguous memory layout generation.
+
+## 2025-03-05 - Avoid np.zeros for Padded Structs
+
+**Learning:** Allocating an array with `np.zeros` and immediately overwriting most of it is inefficient. Using `np.empty` and then explicitly zeroing only the required padding parts (like the alpha channel in an RGBA padding structure) skips a full memory-zeroing pass.
+
+**Action:** Use `np.empty` and explicitly initialize required values instead of `np.zeros` when building a temporary data structure that gets fully overwritten, especially in hot loops or large arrays.
+
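A small sketch of the RGBA-padding pattern this entry describes, mirroring the `_color_to_windspeed` change below but on a tiny illustrative image:

```python
import numpy as np

h, w = 32, 32
denorm = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)  # RGB pixels

# np.empty skips the full zeroing pass of np.zeros; only the padding byte
# (the unused alpha channel) is zeroed explicitly afterwards.
idx = np.empty(h * w, dtype='<u4')
view = idx.view(np.uint8).reshape(h, w, 4)
view[:, :, 0] = denorm[:, :, 2]  # B (lowest byte of the little-endian u4)
view[:, :, 1] = denorm[:, :, 1]  # G
view[:, :, 2] = denorm[:, :, 0]  # R
view[:, :, 3] = 0                # A padding: the only zero-initialized part

# Each packed uint32 is 0x00RRGGBB.
assert (idx >> 24 == 0).all()
```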
api.py CHANGED

@@ -247,14 +247,15 @@ def _color_to_windspeed(denorm: np.ndarray) -> np.ndarray:
     # ⚡ Bolt Optimization: Zero-pad the RGB array to RGBA (4 bytes) and view it
     # as a 32-bit integer array. This allows for lightning-fast 1D lookup
     # and completely avoids multi-dimensional array slicing and index math.
-
-
-
-
-
-
-
-
+    # We allocate an empty 1D `<u4` array directly and fill the channels via an
+    # overlaid `uint8` view to avoid the cost of `np.zeros` and `view('<u4').ravel()`.
+    idx = np.empty(denorm.shape[0] * denorm.shape[1], dtype='<u4')
+    view = idx.view(np.uint8).reshape(denorm.shape[0], denorm.shape[1], 4)
+    view[:, :, 0] = denorm[:, :, 2]  # B
+    view[:, :, 1] = denorm[:, :, 1]  # G
+    view[:, :, 2] = denorm[:, :, 0]  # R
+    view[:, :, 3] = 0  # A (Padding)
+
     return COLORMAP_LUT_1D_FLOAT[idx]


@@ -343,11 +344,11 @@ def _run_inference_from_array(data: np.ndarray) -> np.ndarray:
     # directly to avoid allocating any new float32 arrays in memory during
     # math operations. The ONNX runtime returns a standard mutable numpy array.
     # The final .transpose() returns a memory view.
+    np.add(raw, 1.0, out=raw)
     np.multiply(raw, 127.5, out=raw)
-    np.add(raw, 127.5, out=raw)
     np.clip(raw, 0, 255, out=raw)

-    return raw.
+    return raw.transpose((1, 2, 0)).astype(np.uint8)


     None

@@ -417,7 +418,10 @@ def predict(request: Request, body: PredictRequest):

     wind_speeds_arr = _color_to_windspeed(denorm)

-
+    # ⚡ Bolt Optimization: Use memoryview instead of .tobytes() to avoid allocating
+    # and copying a new bytes object for the full array before compression.
+    # gzip.compress accepts buffer protocol objects directly.
+    wind_speeds_bytes = memoryview(wind_speeds_arr)

     # Optimization: use compresslevel=1. The default is 9 which is extremely slow
     # for a 1MB payload (~500ms -> ~25ms) but only ~2% worse compression.