google-labs-jules[bot] and kastnerp committed
Commit b4316dc · 1 parent: da6c0e0

Optimized inference memory conversions and API byte handling.


- Reordered the in-place `np.add`/`np.multiply` denormalization in `_run_inference_from_array` to the equivalent `(raw + 1.0) * 127.5` form.
- Reordered `transpose` and `astype` so the type cast itself produces a C-contiguous output.
- Passed `memoryview(wind_speeds_arr)` to `gzip.compress` in the API response logic instead of allocating an intermediate `bytes` copy via `wind_speeds_arr.tobytes()`.
- Allocated a flat 1D `<u4` array with `np.empty` in `_color_to_windspeed` and filled it through an overlaid `uint8` view, bypassing the `np.zeros` zero-fill pass.

Co-authored-by: kastnerp <1919773+kastnerp@users.noreply.github.com>
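The buffer-protocol change above can be sketched as follows; the array contents and size are illustrative stand-ins for `wind_speeds_arr`:

```python
import gzip

import numpy as np

# Illustrative stand-in for wind_speeds_arr (262,144 float32 wind speeds).
arr = np.linspace(0.0, 30.0, 262_144, dtype=np.float32)

# gzip.compress accepts any C-contiguous buffer-protocol object, so a
# memoryview over the array's data avoids the intermediate bytes copy
# that arr.tobytes() would allocate. compresslevel=1 trades a few percent
# of compression ratio for a large speedup over the default level 9.
payload = gzip.compress(memoryview(arr), compresslevel=1)

# Decompressing yields exactly the array's raw float32 buffer.
assert gzip.decompress(payload) == arr.tobytes()
```

Note the zero-copy path requires the array to be C-contiguous; `memoryview` on a non-contiguous array would raise when the compressor requests a simple buffer.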

Files changed (2)
  1. .jules/bolt.md +14 -0
  2. api.py +15 -11
.jules/bolt.md CHANGED
@@ -9,9 +9,11 @@
 ## 2025-03-03 - NumPy to Python List Conversion Overhead
 **Learning:** Converting a large NumPy array (e.g., 262,144 elements) to a Python `list[float]` and back to a `np.float32` array using `.tolist()` and `np.array()` takes ~2.3 seconds per request. This overhead is extremely high and unnecessary when the final format needed is bytes. Returning the NumPy array directly from processing functions bypasses this bottleneck entirely, reducing conversion time to milliseconds.
 **Action:** Never convert large NumPy arrays to Python lists if the data ultimately stays in a numerical format or gets converted back to an array/bytes. Keep the data as a NumPy array and use array operations directly (e.g., `.astype()`, `.tobytes()`).
+
 ## 2025-03-04 - gzip.compress default level is extremely slow
 **Learning:** Python's `gzip.compress` defaults to `compresslevel=9` (maximum compression). For reasonably sized binary payloads (e.g., a 1MB float32 array converted to bytes), level 9 is very slow (~500ms) yet only marginally smaller than level 1 (~25ms). The result is large API response latency with almost no bandwidth saving.
 **Action:** When compressing API payloads dynamically (especially NumPy array bytes), explicitly pass `compresslevel=1` to favor speed over a marginal size difference.
+
 ## 2025-03-05 - Array Type Conversion Overhead
 **Learning:** Expanding dimensions on an array (e.g. `np.expand_dims(data, axis=0).astype(np.float32)`) implicitly allocates a new array unless `copy=False` is specified, even when the original array is already `np.float32`. This adds significant memory overhead and allocation time (~0.3ms vs ~0.003ms for a 3x512x512 array).
 **Action:** When casting types for inputs that may already be of the target type, pass `copy=False` or check the dtype explicitly before casting.
@@ -35,3 +37,15 @@
 ## 2025-03-05 - Parallelizing NumPy Matmul Startup
 **Learning:** `np.matmul` natively releases the Python Global Interpreter Lock (GIL). When computing many independent matrix multiplications iteratively on a single thread (like generating a large 1D lookup table for colormaps), `concurrent.futures.ThreadPoolExecutor` allows full utilization of multi-core CPUs.
 **Action:** When heavy independent matrix math in a loop blocks server startup, divide the work into chunks and run them on a standard Python ThreadPoolExecutor. Be mindful of per-thread temporary array memory and cap the thread count (e.g. `min(8, os.cpu_count())`).
+
+## 2025-03-05 - Avoid .tobytes(): Zero-Copy Buffers for Compression
+**Learning:** Calling `.tobytes()` on a NumPy array copies the array's entire data into a new bytes object. For a 1MB float32 array, that costs extra memory and time. `gzip.compress` supports buffer-protocol objects, so passing `memoryview(arr)` compresses the array data without first copying it into a Python `bytes` object.
+**Action:** When feeding NumPy arrays to functions that accept buffers (like `gzip.compress`), wrap them in a `memoryview` instead of calling `.tobytes()` to eliminate the copy and reduce latency.
+
+## 2025-03-05 - NumPy Transpose and Astype Order
+**Learning:** Calling `.astype(np.uint8).transpose(...)` creates an intermediate uint8 array in the original layout, then returns a non-contiguous transposed view. Calling `.transpose(...).astype(np.uint8, order='C')` instead iterates the transposed view and writes a brand new C-contiguous output array directly. Note that `astype` defaults to `order='K'` (match the source layout), so the explicit `order='C'` is what guarantees contiguity.
+**Action:** When a contiguous output array is needed (e.g. for PIL conversion or byte-level operations), transpose the array *before* casting its type, and pass `order='C'` to the cast.
+
+## 2025-03-05 - Avoid np.zeros for Padded Structs
+**Learning:** Allocating an array with `np.zeros` and immediately overwriting most of it is wasteful. Using `np.empty` and explicitly zeroing only the required padding (like the alpha channel in an RGBA padding structure) skips a full memory-zeroing pass.
+**Action:** Prefer `np.empty` plus explicit initialization of the required values over `np.zeros` when building a temporary structure that gets fully overwritten, especially in hot loops or for large arrays.
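The padded-struct learning above can be sanity-checked with a small sketch (toy 4x4 image, names illustrative): the `np.empty` route with an overlaid `uint8` view must produce byte-for-byte the same `<u4` indices as the `np.zeros` baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
denorm = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # toy RGB image

# Baseline: zero-filled RGBA buffer viewed as little-endian uint32.
padded = np.zeros((denorm.shape[0], denorm.shape[1], 4), dtype=np.uint8)
padded[:, :, 0] = denorm[:, :, 2]  # B
padded[:, :, 1] = denorm[:, :, 1]  # G
padded[:, :, 2] = denorm[:, :, 0]  # R
baseline = padded.view('<u4').ravel()

# Optimized: allocate the flat <u4 array uninitialized and fill all four
# channel bytes through an overlaid uint8 view, skipping the zero-fill pass.
idx = np.empty(denorm.shape[0] * denorm.shape[1], dtype='<u4')
view = idx.view(np.uint8).reshape(denorm.shape[0], denorm.shape[1], 4)
view[:, :, 0] = denorm[:, :, 2]  # B
view[:, :, 1] = denorm[:, :, 1]  # G
view[:, :, 2] = denorm[:, :, 0]  # R
view[:, :, 3] = 0                # A: the padding byte must still be zeroed

assert np.array_equal(idx, baseline)
```

The key detail is that every byte of the `<u4` struct is written explicitly, including the padding channel; with `np.empty`, any byte left unwritten would contain garbage.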
api.py CHANGED
@@ -247,14 +247,15 @@ def _color_to_windspeed(denorm: np.ndarray) -> np.ndarray:
     # ⚡ Bolt Optimization: Zero-pad the RGB array to RGBA (4 bytes) and view it
     # as a 32-bit integer array. This allows for lightning-fast 1D lookup
     # and completely avoids multi-dimensional array slicing and index math.
-    padded = np.zeros((denorm.shape[0], denorm.shape[1], 4), dtype=np.uint8)
-    padded[:, :, 0] = denorm[:, :, 2]  # B
-    padded[:, :, 1] = denorm[:, :, 1]  # G
-    padded[:, :, 2] = denorm[:, :, 0]  # R
-
-    # Use '<u4' to explicitly view as little-endian 32-bit unsigned integers
-    # ensuring cross-platform safety.
-    idx = padded.view('<u4').ravel()
+    # We allocate an empty 1D `<u4` array directly and fill the channels via an
+    # overlaid `uint8` view to avoid the cost of `np.zeros` and `view('<u4').ravel()`.
+    idx = np.empty(denorm.shape[0] * denorm.shape[1], dtype='<u4')
+    view = idx.view(np.uint8).reshape(denorm.shape[0], denorm.shape[1], 4)
+    view[:, :, 0] = denorm[:, :, 2]  # B
+    view[:, :, 1] = denorm[:, :, 1]  # G
+    view[:, :, 2] = denorm[:, :, 0]  # R
+    view[:, :, 3] = 0  # A (Padding)
+
     return COLORMAP_LUT_1D_FLOAT[idx]
@@ -343,11 +344,11 @@ def _run_inference_from_array(data: np.ndarray) -> np.ndarray:
     # directly to avoid allocating any new float32 arrays in memory during
     # math operations. The ONNX runtime returns a standard mutable numpy array.
     # The final .transpose() returns a memory view.
+    np.add(raw, 1.0, out=raw)
     np.multiply(raw, 127.5, out=raw)
-    np.add(raw, 127.5, out=raw)
     np.clip(raw, 0, 255, out=raw)

-    return raw.astype(np.uint8).transpose((1, 2, 0))
+    return raw.transpose((1, 2, 0)).astype(np.uint8)
@@ -417,7 +418,10 @@ def predict(request: Request, body: PredictRequest):

     wind_speeds_arr = _color_to_windspeed(denorm)

-    wind_speeds_bytes = wind_speeds_arr.tobytes()
+    # Bolt Optimization: Use memoryview instead of .tobytes() to avoid allocating
+    # and copying a new bytes object for the full array before compression.
+    # gzip.compress accepts buffer protocol objects directly.
+    wind_speeds_bytes = memoryview(wind_speeds_arr)

     # Optimization: use compresslevel=1. The default is 9 which is extremely slow
     # for a 1MB payload (~500ms -> ~25ms) but only ~2% worse compression.
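The transpose/astype ordering above can be sketched with toy shapes (names hypothetical). One subtlety: `astype` defaults to `order='K'`, which mimics the source's memory layout, so an explicit `order='C'` is what actually guarantees the C-contiguous result:

```python
import numpy as np

# Toy CHW float activations standing in for the ONNX output (shape illustrative).
raw = np.random.default_rng(1).standard_normal((3, 8, 8)).astype(np.float32)
scaled = np.clip((raw + 1.0) * 127.5, 0, 255)

# Cast-then-transpose: returns a non-contiguous view over the uint8 copy.
a = scaled.astype(np.uint8).transpose((1, 2, 0))

# Transpose-then-cast with order='C': astype iterates the transposed view
# and writes a brand new C-contiguous HWC array in a single pass.
b = scaled.transpose((1, 2, 0)).astype(np.uint8, order='C')

assert np.array_equal(a, b)          # identical pixel values either way
assert not a.flags['C_CONTIGUOUS']   # transposed view of the cast copy
assert b.flags['C_CONTIGUOUS']       # fresh contiguous output
```

Downstream consumers that require contiguous memory (e.g. `PIL.Image.fromarray` or `.tobytes()`) would otherwise trigger an extra copy on the non-contiguous variant.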