# CHANGELOG

## [1.0.0-beta.6] - 2024-01-10

- Do not create CPU copy of grad array when calling `array.numpy()`
- Fix `assert_np_equal()` bug
- Support Linux AArch64 platforms, including Jetson/Tegra devices
- Add parallel testing runner (invoke with `python -m warp.tests`, use `warp/tests/unittest_serial.py` for serial testing)
- Fix support for function calls in `range()`
- `matmul` adjoints now accumulate
- Expand available operators (e.g. vector @ matrix, scalar as dividend) and improve support for calling native built-ins
- Fix multi-gpu synchronization issue in `sparse.py`
- Add depth rendering to `OpenGLRenderer`, document `warp.render`
- Make `atomic_min`, `atomic_max` differentiable
- Fix error reporting using the exact source segment
- Add user-friendly mesh query overloads, returning a struct instead of overwriting parameters
- Address multiple differentiability issues
- Fix backpropagation for returning array element references
- Support passing the return value to adjoints
- Add point basis space and explicit point-based quadrature for `warp.fem`
- Support overriding the LLVM project source directory path using `build_lib.py --build_llvm --llvm_source_path=`
- Fix the error message for accessing non-existing attributes
- Flatten faces array for Mesh constructor in URDF parser

## [1.0.0-beta.5] - 2023-11-22

- Fix for kernel caching when function argument types change
- Fix code-gen ordering of dependent structs
- Fix for `wp.Mesh` build on MGPU systems
- Fix for name clash bug with adjoint code: https://github.com/NVIDIA/warp/issues/154
- Add `wp.frac()` for returning the fractional part of a floating point value
- Add support for custom native CUDA snippets using `@wp.func_native` decorator
- Add support for batched matmul with batch size > 2^16-1
- Add support for transposed CUTLASS `wp.matmul()` and additional error checking
- Add support for quad and hex meshes in `wp.fem`
- Detect and warn when C++ runtime doesn't match compiler during build, e.g.: ``libstdc++.so.6: version `GLIBCXX_3.4.30' not found``
- Documentation update for `wp.BVH`
- Documentation and simplified API for runtime kernel specialization `wp.Kernel`

## [1.0.0-beta.4] - 2023-11-01

- Add `wp.cbrt()` for cube root calculation
- Add `wp.mesh_furthest_point_no_sign()` to compute furthest point on a surface from a query point
- Add support for GPU BVH builds, 10-100x faster than CPU builds for large meshes
- Add support for chained comparisons, i.e.: `0 < x < 2`
- Add support for running `warp.fem` examples headless
- Fix for unit test determinism
- Fix for possible GC collection of array during graph capture
- Fix for `wp.utils.array_sum()` output initialization when used with vector types
- Coverage and documentation updates

## [1.0.0-beta.3] - 2023-10-19

- Add support for code coverage scans (test_coverage.py), coverage at 85% in omni.warp.core
- Add support for named component access for vector types, e.g.: `a = v.x`
- Add support for lvalue expressions, e.g.: `array[i] += b`
- Add casting constructors for matrix and vector types
- Add support for `type()` operator that can be used to return type inside kernels
- Add support for grid-stride kernels to support kernels with > 2^31-1 thread blocks
- Fix for multi-process initialization warnings
- Fix alignment issues with empty `wp.struct`
- Fix for return statement warning with tuple-returning functions
- Fix for `wp.batched_matmul()` registering the wrong function in the Tape
- Fix and document for `wp.sim` forward + inverse kinematics
- Fix for `wp.func` to return a default value if function does not return on all control paths
- Refactor `wp.fem` support for new basis functions, decoupled function spaces
- Optimizations for `wp.noise` functions, up to 10x faster in most cases
- Optimizations for `type_size_in_bytes()` used in array construction'

### Breaking Changes

- To support grid-stride kernels, `wp.tid()` can no longer be called inside `wp.func` functions.

## [1.0.0-beta.2] - 2023-09-01

- Fix for passing bool into `wp.func` functions
- Fix for deprecation warnings appearing on `stderr`, now redirected to `stdout`
- Fix for using `for i in wp.hash_grid_query(..)` syntax

## [1.0.0-beta.1] - 2023-08-29

- Fix for `wp.float16` being passed as kernel arguments
- Fix for compile errors with kernels using structs in backward pass
- Fix for `wp.Mesh.refit()` not being CUDA graph capturable due to synchronous temp. allocs
- Fix for dynamic texture example flickering / MGPU crashes demo in Kit by reusing `ui.DynamicImageProvider` instances
- Fix for a regression that disabled bundle change tracking in samples
- Fix for incorrect surface velocities when meshes are deforming in `OgnClothSimulate`
- Fix for incorrect lower-case when setting USD stage "up_axis" in examples
- Fix for incompatible gradient types when wrapping PyTorch tensor as a vector or matrix type
- Fix for adding open edges when building cloth constraints from meshes in `wp.sim.ModelBuilder.add_cloth_mesh()`
- Add support for `wp.fabricarray` to directly access Fabric data from Warp kernels, see https://omniverse.gitlab-master-pages.nvidia.com/usdrt/docs/usdrt_prim_selection.html for examples
- Add support for user defined gradient functions, see `@wp.func_replay`, and `@wp.func_grad` decorators
- Add support for more OG attribute types in `omni.warp.from_omni_graph()`
- Add support for creating NanoVDB `wp.Volume` objects from dense NumPy arrays
- Add support for `wp.volume_sample_grad_f()` which returns the value + gradient efficiently from an NVDB volume
- Add support for LLVM fp16 intrinsics for half-precision arithmetic
- Add implementation of stochastic gradient descent, see `wp.optim.SGD`
- Add `warp.fem` framework for solving weak-form PDE problems (see https://nvidia.github.io/warp/_build/html/modules/fem.html)
- Optimizations for `omni.warp` extension load time (2.2s to 625ms cold start)
- Make all `omni.ui` dependencies optional so that Warp unit tests can run headless
- Deprecation of `wp.tid()` outside of kernel functions, users should pass `tid()` values to `wp.func` functions explicitly
- Deprecation of `wp.sim.Model.flatten()` for returning all contained tensors from the model
- Add support for clamping particle max velocity in `wp.sim.Model.particle_max_velocity`
- Remove dependency on `urdfpy` package, improve MJCF parser handling of default values

## [0.10.1] - 2023-07-25

- Fix for large multidimensional kernel launches (> 2^32 threads)
- Fix for module hashing with generics
- Fix for unrolling loops with break or continue statements (will skip unrolling)
- Fix for passing boolean arguments to build_lib.py (previously ignored)
- Fix build warnings on Linux
- Fix for creating array of structs from NumPy structured array
- Fix for regression on kernel load times in Kit when using warp.sim
- Update `warp.array.reshape()` to handle `-1` dimensions
- Update margin used by for mesh queries when using `wp.sim.create_soft_body_contacts()`
- Improvements to gradient handling with `warp.from_torch()`, `warp.to_torch()` plus documentation

## [0.10.0] - 2023-07-05

- Add support for macOS universal binaries (x86 + aarch64) for M1+ support
- Add additional methods for SDF generation please see the following new methods:
  - `wp.mesh_query_point_nosign()` - closest point query with no sign determination
  - `wp.mesh_query_point_sign_normal()` - closest point query with sign from angle-weighted normal
  - `wp.mesh_query_point_sign_winding_number()` - closest point query with fast winding number sign determination
- Add CSR/BSR sparse matrix support, see `warp.sparse` module:
  - `wp.sparse.BsrMatrix`
  - `wp.sparse.bsr_zeros()`, `wp.sparse.bsr_set_from_triplets()` for construction
  - `wp.sparse.bsr_mm()`, `wp.sparse_bsr_mv()` for matrix-matrix and matrix-vector products respectively
- Add array-wide utilities:
  - `wp.utils.array_scan()` - prefix sum (inclusive or exclusive)
  - `wp.utils.array_sum()` - sum across array
  - `wp.utils.radix_sort_pairs()` - in-place radix sort (key,value) pairs
- Add support for calling `@wp.func` functions from Python (outside of kernel scope)
- Add support for recording kernel launches using a `wp.Launch` object that can be replayed with low overhead, use `wp.launch(..., record_cmd=True)` to generate a command object
- Optimizations for `wp.struct` kernel arguments, up to 20x faster launches for kernels with large structs or number of params
- Refresh USD samples to use bundle based workflow + change tracking
- Add Python API for manipulating mesh and point bundle data in OmniGraph, see `omni.warp.nodes` module
  - See `omni.warp.nodes.mesh_create_bundle()`, `omni.warp.nodes.mesh_get_points()`, etc.
- Improvements to `wp.array`:
  - Fix a number of array methods misbehaving with empty arrays
  - Fix a number of bugs and memory leaks related to gradient arrays
  - Fix array construction when creating arrays in pinned memory from a data source in pageable memory
  - `wp.empty()` no longer zeroes-out memory and returns an uninitialized array, as intended
  - `array.zero_()` and `array.fill_()` work with non-contiguous arrays
  - Support wrapping non-contiguous NumPy arrays without a copy
  - Support preserving the outer dimensions of NumPy arrays when wrapping them as Warp arrays of vector or matrix types
  - Improve PyTorch and DLPack interop with Warp arrays of arbitrary vectors and matrices
  - `array.fill_()` can now take lists or other sequences when filling arrays of vectors or matrices, e.g. `arr.fill_([[1, 2], [3, 4]])`
  - `array.fill_()` now works with arrays of structs (pass a struct instance)
  - `wp.copy()` gracefully handles copying between non-contiguous arrays on different devices
  - Add `wp.full()` and `wp.full_like()`, e.g., `a = wp.full(shape, value)`
  - Add optional `device` argument to `wp.empty_like()`, `wp.zeros_like()`, `wp.full_like()`, and `wp.clone()`
  - Add `indexedarray` methods `.zero_()`, `.fill_()`, and `.assign()`
  - Fix `indexedarray` methods `.numpy()` and `.list()`
  - Fix `array.list()` to work with arrays of any Warp data type
  - Fix `array.list()` synchronization issue with CUDA arrays
  - `array.numpy()` called on an array of structs returns a structured NumPy array with named fields
  - Improve the performance of creating arrays
- Fix for `Error: No module named 'omni.warp.core'` when running some Kit configurations (e.g.: stubgen)
- Fix for `wp.struct` instance address being included in module content hash
- Fix codegen with overridden function names
- Fix for kernel hashing so it occurs after code generation and before loading to fix a bug with stale kernel cache
- Fix for `wp.BVH.refit()` when executed on the CPU
- Fix adjoint of `wp.struct` constructor
- Fix element accessors for `wp.float16` vectors and matrices in Python
- Fix `wp.float16` members in structs
- Remove deprecated `wp.ScopedCudaGuard()`, please use `wp.ScopedDevice()` instead

## [0.9.0] - 2023-06-01

- Add support for in-place modifications to vector, matrix, and struct types inside kernels (will warn during backward pass with `wp.verbose` if using gradients)
- Add support for step-through VSCode debugging of kernel code with standalone LLVM compiler, see `wp.breakpoint()`, and `walkthrough_debug.py`
- Add support for default values on built-in functions
- Add support for multi-valued `@wp.func` functions
- Add support for `pass`, `continue`, and `break` statements
- Add missing `__sincos_stret` symbol for macOS
- Add support for gradient propagation through `wp.Mesh.points`, and other cases where arrays are passed to native functions
- Add support for Python `@` operator as an alias for `wp.matmul()`
- Add XPBD support for particle-particle collision
- Add support for individual particle radii: `ModelBuilder.add_particle` has a new `radius` argument, `Model.particle_radius` is now a Warp array
- Add per-particle flags as a `Model.particle_flags` Warp array, introduce `PARTICLE_FLAG_ACTIVE` to define whether a particle is being simulated and participates in contact dynamics
- Add support for Python bitwise operators `&`, `|`, `~`, `<<`, `>>`
- Switch to using standalone LLVM compiler by default for `cpu` devices
- Split `omni.warp` into `omni.warp.core` for Omniverse applications that want to use the Warp Python module with minimal additional dependencies
- Disable kernel gradient generation by default inside Omniverse for improved compile times
- Fix for bounds checking on element access of vector/matrix types
- Fix for stream initialization when a custom (non-primary) external CUDA context has been set on the calling thread
- Fix for duplicate `@wp.struct` registration during hot reload
- Fix for array `unot()` operator so kernel writers can use `if not array:` syntax
- Fix for case where dynamic loops are nested within unrolled loops
- Change `wp.hash_grid_point_id()` now returns -1 if the `wp.HashGrid` has not been reserved before
- Deprecate `wp.Model.soft_contact_distance` which is now replaced by `wp.Model.particle_radius`
- Deprecate single scalar particle radius (should be a per-particle array)

## [0.8.2] - 2023-04-21

- Add `ModelBuilder.soft_contact_max` to control the maximum number of soft contacts that can be registered. Use `Model.allocate_soft_contacts(new_count)` to change count on existing `Model` objects.
- Add support for `bool` parameters 
- Add support for logical boolean operators with `int` types
- Fix for `wp.quat()` default constructor
- Fix conditional reassignments
- Add sign determination using angle weighted normal version of `wp.mesh_query_point()` as `wp.mesh_query_sign_normal()`
- Add sign determination using winding number of `wp.mesh_query_point()` as `wp.mesh_query_sign_winding_number()`
- Add query point without sign determination `wp.mesh_query_no_sign()`

## [0.8.1] - 2023-04-13

- Fix for regression when passing flattened numeric lists as matrix arguments to kernels
- Fix for regressions when passing `wp.struct` types with uninitialized (`None`) member attributes

## [0.8.0] - 2023-04-05

- Add `Texture Write` node for updating dynamic RTX textures from Warp kernels / nodes
- Add multi-dimensional kernel support to Warp Kernel Node
- Add `wp.load_module()` to pre-load specific modules (pass `recursive=True` to load recursively)
- Add `wp.poisson()` for sampling Poisson distributions
- Add support for UsdPhysics schema see `warp.sim.parse_usd()`
- Add XPBD rigid body implementation plus diff. simulation examples
- Add support for standalone CPU compilation (no host-compiler) with LLVM backed, enable with `--standalone` build option
- Add support for per-timer color in `wp.ScopedTimer()`
- Add support for row-based construction of matrix types outside of kernels
- Add support for setting and getting row vectors for Python matrices, see `matrix.get_row()`, `matrix.set_row()`
- Add support for instantiating `wp.struct` types within kernels
- Add support for indexed arrays, `slice = array[indices]` will now generate a sparse slice of array data
- Add support for generic kernel params, use `def compute(param: Any):`
- Add support for `with wp.ScopedDevice("cuda") as device:` syntax (same for `wp.ScopedStream()`, `wp.Tape()`)
- Add support for creating custom length vector/matrices inside kernels, see `wp.vector()`, and `wp.matrix()`
- Add support for creating identity matrices in kernels with, e.g.: `I = wp.identity(n=3, dtype=float)`
- Add support for unary plus operator (`wp.pos()`)
- Add support for `wp.constant` variables to be used directly in Python without having to use `.val` member
- Add support for nested `wp.struct` types
- Add support for returning `wp.struct` from functions
- Add `--quick` build for faster local dev. iteration (uses a reduced set of SASS arches)
- Add optional `requires_grad` parameter to `wp.from_torch()` to override gradient allocation
- Add type hints for generic vector / matrix types in Python stubs
- Add support for custom user function recording in `wp.Tape()`
- Add support for registering CUTLASS `wp.matmul()` with tape backward pass
- Add support for grids with > 2^31 threads (each dimension may be up to INT_MAX in length)
- Add CPU fallback for `wp.matmul()`
- Optimizations for `wp.launch()`, up to 3x faster launches in common cases
- Fix `wp.randf()` conversion to float to reduce bias for uniform sampling
- Fix capture of `wp.func` and `wp.constant` types from inside Python closures
- Fix for CUDA on WSL
- Fix for matrices in structs
- Fix for transpose indexing for some non-square matrices
- Enable Python faulthandler by default
- Update to VS2019

### Breaking Changes

- `wp.constant` variables can now be treated as their true type, accessing the underlying value through `constant.val` is no longer supported
- `wp.sim.model.ground_plane` is now a `wp.array` to support gradient, users should call `builder.set_ground_plane()` to create the ground 
- `wp.sim` capsule, cones, and cylinders are now aligned with the default USD up-axis

## [0.7.2] - 2023-02-15

- Reduce test time for vec/math types
- Clean-up CUDA disabled build pipeline
- Remove extension.gen.toml to make Kit packages Python version independent
- Handle additional cases for array indexing inside Python

## [0.7.1] - 2023-02-14

- Disabling some slow tests for Kit
- Make unit tests run on first GPU only by default

## [0.7.0] - 2023-02-13

- Add support for arbitrary length / type vector and matrices e.g.: `wp.vec(length=7, dtype=wp.float16)`, see `wp.vec()`, and `wp.mat()`
- Add support for `array.flatten()`, `array.reshape()`, and `array.view()` with NumPy semantics
- Add support for slicing `wp.array` types in Python
- Add `wp.from_ptr()` helper to construct arrays from an existing allocation
- Add support for `break` statements in ranged-for and while loops (backward pass support currently not implemented)
- Add built-in mathematic constants, see `wp.pi`, `wp.e`, `wp.log2e`, etc.
- Add built-in conversion between degrees and radians, see `wp.degrees()`, `wp.radians()`
- Add security pop-up for Kernel Node
- Improve error handling for kernel return values

## [0.6.3] - 2023-01-31

- Add DLPack utilities, see `wp.from_dlpack()`, `wp.to_dlpack()`
- Add Jax utilities, see `wp.from_jax()`, `wp.to_jax()`, `wp.device_from_jax()`, `wp.device_to_jax()`
- Fix for Linux Kit extensions OM-80132, OM-80133

## [0.6.2] - 2023-01-19

- Updated `wp.from_torch()` to support more data types
- Updated `wp.from_torch()` to automatically determine the target Warp data type if not specified
- Updated `wp.from_torch()` to support non-contiguous tensors with arbitrary strides
- Add CUTLASS integration for dense GEMMs, see `wp.matmul()` and `wp.matmul_batched()` 
- Add QR and Eigen decompositions for `mat33` types, see `wp.qr3()`, and `wp.eig3()`
- Add default (zero) constructors for matrix types
- Add a flag to suppress all output except errors and warnings (set `wp.config.quiet = True`)
- Skip recompilation when Kernel Node attributes are edited
- Allow optional attributes for Kernel Node
- Allow disabling backward pass code-gen on a per-kernel basis, use `@wp.kernel(enable_backward=False)`
- Replace Python `imp` package with `importlib`
- Fix for quaternion slerp gradients (`wp.quat_slerp()`)

## [0.6.1] - 2022-12-05

- Fix for non-CUDA builds
- Fix strides computation in array_t constructor, fixes a bug with accessing mesh indices through mesh.indices[]
- Disable backward pass code generation for kernel node (4-6x faster compilation)
- Switch to linbuild for universal Linux binaries (affects TeamCity builds only)

## [0.6.0] - 2022-11-28

- Add support for CUDA streams, see `wp.Stream`, `wp.get_stream()`, `wp.set_stream()`, `wp.synchronize_stream()`, `wp.ScopedStream`
- Add support for CUDA events, see `wp.Event`, `wp.record_event()`, `wp.wait_event()`, `wp.wait_stream()`, `wp.Stream.record_event()`, `wp.Stream.wait_event()`, `wp.Stream.wait_stream()`
- Add support for PyTorch stream interop, see `wp.stream_from_torch()`, `wp.stream_to_torch()`
- Add support for allocating host arrays in pinned memory for asynchronous data transfers, use `wp.array(..., pinned=True)` (default is non-pinned)
- Add support for direct conversions between all scalar types, e.g.: `x = wp.uint8(wp.float64(3.0))`
- Add per-module option to enable fast math, use `wp.set_module_options({"fast_math": True})`, fast math is now *disabled* by default
- Add support for generating CUBIN kernels instead of PTX on systems with older drivers
- Add user preference options for CUDA kernel output ("ptx" or "cubin", e.g.: `wp.config.cuda_output = "ptx"` or per-module `wp.set_module_options({"cuda_output": "ptx"})`)
- Add kernel node for OmniGraph
- Add `wp.quat_slerp()`, `wp.quat_to_axis_angle()`, `wp.rotate_rodriquez()` and adjoints for all remaining quaternion operations
- Add support for unrolling for-loops when range is a `wp.constant`
- Add support for arithmetic operators on built-in vector / matrix types outside of `wp.kernel`
- Add support for multiple solution variables in `wp.optim` Adam optimization
- Add nested attribute support for `wp.struct` attributes
- Add missing adjoint implementations for spatial math types, and document all functions with missing adjoints
- Add support for retrieving NanoVDB tiles and voxel size, see `wp.Volume.get_tiles()`, and `wp.Volume.get_voxel_size()`
- Add support for store operations on integer NanoVDB volumes, see `wp.volume_store_i()`
- Expose `wp.Mesh` points, indices, as arrays inside kernels, see `wp.mesh_get()`
- Optimizations for `wp.array` construction, 2-3x faster on average
- Optimizations for URDF import
- Fix various deployment issues by statically linking with all CUDA libs
- Update warp.so/warp.dll to CUDA Toolkit 11.5

## [0.5.1] - 2022-11-01

- Fix for unit tests in Kit

## [0.5.0] - 2022-10-31

- Add smoothed particle hydrodynamics (SPH) example, see `example_sph.py`
- Add support for accessing `array.shape` inside kernels, e.g.: `width = arr.shape[0]`
- Add dependency tracking to hot-reload modules if dependencies were modified
- Add lazy acquisition of CUDA kernel contexts (save ~300Mb of GPU memory in MGPU environments)
- Add BVH object, see `wp.Bvh` and `bvh_query_ray()`, `bvh_query_aabb()` functions
- Add component index operations for `spatial_vector`, `spatial_matrix` types
- Add `wp.lerp()` and `wp.smoothstep()` builtins
- Add `wp.optim` module with implementation of the Adam optimizer for float and vector types
- Add support for transient Python modules (fix for Houdini integration)
- Add `wp.length_sq()`, `wp.trace()` for vector / matrix types respectively
- Add missing adjoints for `wp.quat_rpy()`, `wp.determinant()`
- Add `wp.atomic_min()`, `wp.atomic_max()` operators
- Add vectorized version of `warp.sim.model.add_cloth_mesh()` 
- Add NVDB volume allocation API, see `wp.Volume.allocate()`, and `wp.Volume.allocate_by_tiles()`
- Add NVDB volume write methods, see `wp.volume_store_i()`, `wp.volume_store_f()`, `wp.volume_store_v()`
- Add MGPU documentation
- Add example showing how to compute Jacobian of multiple environments in parallel, see `example_jacobian_ik.py`
- Add `wp.Tape.zero()` support for `wp.struct` types
- Make SampleBrowser an optional dependency for Kit extension
- Make `wp.Mesh` object accept both 1d and 2d arrays of face vertex indices
- Fix for reloading of class member kernel / function definitions using `importlib.reload()`
- Fix for hashing of `wp.constants()` not invalidating kernels
- Fix for reload when multiple `.ptx` versions are present
- Improved error reporting during code-gen

## [0.4.3] - 2022-09-20

- Update all samples to use GPU interop path by default
- Fix for arrays > 2GB in length
- Add support for per-vertex USD mesh colors with warp.render class

## [0.4.2] - 2022-09-07

- Register Warp samples to the sample browser in Kit
- Add NDEBUG flag to release mode kernel builds
- Fix for particle solver node when using a large number of particles
- Fix for broken cameras in Warp sample scenes

## [0.4.1] - 2022-08-30

- Add geometry sampling methods, see `wp.sample_unit_cube()`, `wp.sample_unit_disk()`, etc
- Add `wp.lower_bound()` for searching sorted arrays
- Add an option for disabling code-gen of backward pass to improve compilation times, see `wp.set_module_options({"enable_backward": False})`, True by default
- Fix for using Warp from Script Editor or when module does not have a `__file__` attribute
- Fix for hot reload of modules containing `wp.func()` definitions
- Fix for debug flags not being set correctly on CUDA when `wp.config.mode == "debug"`, this enables bounds checking on CUDA kernels in debug mode
- Fix for code gen of functions that do not return a value

## [0.4.0] - 2022-08-09

- Fix for FP16 conversions on GPUs without hardware support
- Fix for `runtime = None` errors when reloading the Warp module
- Fix for PTX architecture version when running with older drivers, see `wp.config.ptx_target_arch`
- Fix for USD imports from `__init__.py`, defer them to individual functions that need them
- Fix for robustness issues with sign determination for `wp.mesh_query_point()`
- Fix for `wp.HashGrid` memory leak when creating/destroying grids
- Add CUDA version checks for toolkit and driver
- Add support for cross-module `@wp.struct` references
- Support running even if CUDA initialization failed, use `wp.is_cuda_available()` to check availability
- Statically linking with the CUDA runtime library to avoid deployment issues

### Breaking Changes

- Removed `wp.runtime` reference from the top-level module, as it should be considered private

## [0.3.2] - 2022-07-19

- Remove Torch import from `__init__.py`, defer import to `wp.from_torch()`, `wp.to_torch()`

## [0.3.1] - 2022-07-12

- Fix for marching cubes reallocation after initialization
- Add support for closest point between line segment tests, see `wp.closest_point_edge_edge()` builtin
- Add support for per-triangle elasticity coefficients in simulation, see `wp.sim.ModelBuilder.add_cloth_mesh()`
- Add support for specifying default device, see `wp.set_device()`, `wp.get_device()`, `wp.ScopedDevice`
- Add support for multiple GPUs (e.g., `"cuda:0"`, `"cuda:1"`), see `wp.get_cuda_devices()`, `wp.get_cuda_device_count()`, `wp.get_cuda_device()`
- Add support for explicitly targeting the current CUDA context using device alias `"cuda"`
- Add support for using arbitrary external CUDA contexts, see `wp.map_cuda_device()`, `wp.unmap_cuda_device()`
- Add PyTorch device aliasing functions, see `wp.device_from_torch()`, `wp.device_to_torch()`

### Breaking Changes

- A CUDA device is used by default, if available (aligned with `wp.get_preferred_device()`)
- `wp.ScopedCudaGuard` is deprecated, use `wp.ScopedDevice` instead
- `wp.synchronize()` now synchronizes all devices; for finer-grained control, use `wp.synchronize_device()`
- Device alias `"cuda"` now refers to the current CUDA context, rather than a specific device like `"cuda:0"` or `"cuda:1"`

## [0.3.0] - 2022-07-08

- Add support for FP16 storage type, see `wp.float16`
- Add support for per-dimension byte strides, see `wp.array.strides`
- Add support for passing Python classes as kernel arguments, see `@wp.struct` decorator
- Add additional bounds checks for builtin matrix types
- Add additional floating point checks, see `wp.config.verify_fp`
- Add interleaved user source with generated code to aid debugging
- Add generalized GPU marching cubes implementation, see `wp.MarchingCubes` class 
- Add additional scalar*matrix vector operators
- Add support for retrieving a single row from builtin types, e.g.: `r = m33[i]`
- Add  `wp.log2()` and `wp.log10()` builtins
- Add support for quickly instancing `wp.sim.ModelBuilder` objects to improve env. creation performance for RL
- Remove custom CUB version and improve compatibility with CUDA 11.7
- Fix to preserve external user-gradients when calling `wp.Tape.zero()`
- Fix to only allocate gradient of a Torch tensor if `requires_grad=True`
- Fix for missing `wp.mat22` constructor adjoint
- Fix for ray-cast precision in edge case on GPU (watertightness issue)
- Fix for kernel hot-reload when definition changes
- Fix for NVCC warnings on Linux
- Fix for generated function names when kernels are defined as class functions 
- Fix for reload of generated CPU kernel code on Linux
- Fix for example scripts to output USD at 60 timecodes per-second (better Kit compatibility)

## [0.2.3] - 2022-06-13

- Fix for incorrect 4d array bounds checking
- Fix for `wp.constant` changes not updating module hash
- Fix for stale CUDA kernel cache when CPU kernels launched first
- Array gradients are now allocated along with the arrays and accessible as `wp.array.grad`, users should take care to always call `wp.Tape.zero()` to clear gradients between different invocations of `wp.Tape.backward()`
- Added `wp.array.fill_()` to set all entries to a scalar value (4-byte values only currently)

### Breaking Changes

- Tape `capture` option has been removed, users can now capture tapes inside existing CUDA graphs (e.g.: inside Torch)
- Scalar loss arrays should now explicitly set `requires_grad=True` at creation time

## [0.2.2] - 2022-05-30

- Fix for `from import *` inside Warp initialization
- Fix for body space velocity when using deforming Mesh objects with scale
- Fix for noise gradient discontinuities affecting `wp.curlnoise()`
- Fix for `wp.from_torch()` to correctly preserve shape
- Fix for URDF parser incorrectly passing density to scale parameter
- Optimizations for startup time from 3s -> 0.3s
- Add support for custom kernel cache location, Warp will now store generated binaries in the user's application directory
- Add support for cross-module function references, e.g.: call another modules @wp.func functions
- Add support for overloading `@wp.func` functions based on argument type
- Add support for calling built-in functions directly from Python interpreter outside kernels (experimental)
- Add support for auto-complete and docstring lookup for builtins in IDEs like VSCode, PyCharm, etc
- Add support for doing partial array copies, see `wp.copy()` for details
- Add support for accessing mesh data directly in kernels, see `wp.mesh_get_point()`, `wp.mesh_get_index()`, `wp.mesh_eval_face_normal()`
- Change to only compile for targets where kernel is launched (e.g.: will not compile CPU unless explicitly requested)

### Breaking Changes

- Builtin methods such as `wp.quat_identity()` now call the Warp native implementation directly and will return a `wp.quat` object instead of NumPy array
- NumPy implementations of many builtin methods have been moved to `warp.utils` and will be deprecated
- Local `@wp.func` functions should not be namespaced when called, e.g.: previously `wp.myfunc()` would work even if `myfunc()` was not a builtin
- Removed `wp.rpy2quat()`, please use `wp.quat_rpy()` instead

## [0.2.1] - 2022-05-11

- Fix for unit tests in Kit

## [0.2.0] - 2022-05-02

### Warp Core

- Fix for unrolling loops with negative bounds
- Fix for unresolved symbol `hash_grid_build_device()` not found when lib is compiled without CUDA support
- Fix for failure to load nvrtc-builtins64_113.dll when user has a newer CUDA toolkit installed on their machine
- Fix for conversion of Torch tensors to wp.arrays() with a vector dtype (incorrect row count)
- Fix for `warp.dll` not found on some Windows installations
- Fix for macOS builds on Clang 13.x
- Fix for step-through debugging of kernels on Linux
- Add argument type checking for user defined `@wp.func` functions
- Add support for custom iterable types, supports ranges, hash grid, and mesh query objects
- Add support for multi-dimensional arrays, for example use `x = array[i,j,k]` syntax to address a 3-dimensional array 
- Add support for multi-dimensional kernel launches, use `launch(kernel, dim=(i,j,k), ...` and `i,j,k = wp.tid()` to obtain thread indices
- Add support for bounds-checking array memory accesses in debug mode, use `wp.config.mode = "debug"` to enable
- Add support for differentiating through dynamic and nested for-loops
- Add support for evaluating MLP neural network layers inside kernels with custom activation functions, see `wp.mlp()`
- Add additional NVDB sampling methods and adjoints, see `wp.volume_sample_i()`, `wp.volume_sample_f()`, and `wp.volume_sample_vec()`
- Add support for loading zlib compressed NVDB volumes, see `wp.Volume.load_from_nvdb()`
- Add support for triangle intersection testing, see `wp.intersect_tri_tri()`
- Add support for NVTX profile zones in `wp.ScopedTimer()`
- Add support for additional transform and quaternion math operations, see `wp.inverse()`, `wp.quat_to_matrix()`, `wp.quat_from_matrix()`
- Add fast math (`--fast-math`) to kernel compilation by default
- Add `warp.torch` import by default (if PyTorch is installed)
  
### Warp Kit

- Add Kit menu for browsing Warp documentation and example scenes under 'Window->Warp'
- Fix for OgnParticleSolver.py example when collider is coming from Read Prim into Bundle node

### Warp Sim

- Fix for joint attachment forces
- Fix for URDF importer and floating base support
- Add examples showing how to use differentiable forward kinematics to solve inverse kinematics
- Add examples for URDF cartpole and quadruped simulation

### Breaking Changes

- `wp.volume_sample_world()` is now replaced by `wp.volume_sample_f/i/vec()` which operate in index (local) space. Users should use `wp.volume_world_to_index()` to transform points from world space to index space before sampling.
- `wp.mlp()` expects multi-dimensional arrays instead of one-dimensional arrays for inference, all other semantics remain the same as earlier versions of this API.
- `wp.array.length` member has been removed, please use `wp.array.shape` to access array dimensions, or use `wp.array.size` to get total element count
- Marking `dense_gemm()`, `dense_chol()`, etc methods as experimental until we revisit them

## [0.1.25] - 2022-03-20

- Add support for class methods to be Warp kernels
- Add HashGrid reserve() so it can be used with CUDA graphs
- Add support for CUDA graph capture of tape forward/backward passes
- Add support for Python 3.8.x and 3.9.x
- Add hyperbolic trigonometric functions, see wp.tanh(), wp.sinh(), wp.cosh()
- Add support for floored division on integer types
- Move tests into core library so they can be run in Kit environment

## [0.1.24] - 2022-03-03

### Warp Core

- Add NanoVDB support, see wp.volume_sample*() methods
- Add support for reading compile-time constants in kernels, see wp.constant()
- Add support for __cuda_array_interface__ protocol for zero-copy interop with PyTorch, see wp.torch.to_torch()
- Add support for additional numeric types, i8, u8, i16, u16, etc
- Add better checks for device strings during allocation / launch
- Add support for sampling random numbers with a normal distribution, see wp.randn()
- Upgrade to CUDA 11.3
- Update example scenes to Kit 103.1
- Deduce array dtype from np.array when one is not provided
- Fix for ranged for loops with negative step sizes
- Fix for 3d and 4d spherical gradient distributions

## [0.1.23] - 2022-02-17

### Warp Core

- Fix for generated code folder being removed during Showroom installation
- Fix for macOS support
- Fix for dynamic for-loop code gen edge case
- Add procedural noise primitives, see noise(), pnoise(), curlnoise()
- Move simulation helpers our of test into warp.sim module

## [0.1.22] - 2022-02-14

### Warp Core

- Fix for .so reloading on Linux
- Fix for while loop code-gen in some edge cases
- Add rounding functions round(), rint(), trunc(), floor(), ceil() 
- Add support for printing strings and formatted strings from kernels
- Add MSVC compiler version detection and require minimum

### Warp Sim

- Add support for universal and compound joint types

## [0.1.21] - 2022-01-19

### Warp Core

- Fix for exception on shutdown in empty wp.array objects
- Fix for hot reload of CPU kernels in Kit
- Add hash grid primitive for point-based spatial queries, see hash_grid_query(), hash_grid_query_next()
- Add new PRNG methods using PCG-based generators, see rand_init(), randf(), randi()
- Add support for AABB mesh queries, see mesh_query_aabb(), mesh_query_aabb_next()
- Add support for all Python range() loop variants
- Add builtin vec2 type and additional math operators, pow(), tan(), atan(), atan2()
- Remove dependency on CUDA driver library at build time
- Remove unused NVRTC binary dependencies (50mb smaller Linux distribution)

### Warp Sim

- Bundle import of multiple shapes for simulation nodes
- New OgnParticleVolume node for sampling shapes -> particles
- New OgnParticleSolver node for DEM style granular materials

## [0.1.20] - 2021-11-02

- Updates to the ripple solver for GTC (support for multiple colliders, buoyancy, etc)

## [0.1.19] - 2021-10-15

- Publish from 2021.3 to avoid omni.graph database incompatibilities

## [0.1.18] - 2021-10-08

- Enable Linux support (tested on 20.04)

## [0.1.17] - 2021-09-30

- Fix for 3x3 SVD adjoint
- Fix for A6000 GPU (bump compute model to sm_52 minimum)
- Fix for .dll unload on rebuild
- Fix for possible array destruction warnings on shutdown
- Rename spatial_transform -> transform
- Documentation update

## [0.1.16] - 2021-09-06

- Fix for case where simple assignments (a = b) incorrectly generated reference rather than value copy
- Handle passing zero-length (empty) arrays to kernels

## [0.1.15] - 2021-09-03

- Add additional math library functions (asin, etc)
- Add builtin 3x3 SVD support
- Add support for named constants (True, False, None)
- Add support for if/else statements (differentiable)
- Add custom memset kernel to avoid CPU overhead of cudaMemset()
- Add rigid body joint model to warp.sim (based on Brax)
- Add Linux, MacOS support in core library
- Fix for incorrectly treating pure assignment as reference instead of value copy 
- Removes the need to transfer array to CPU before numpy conversion (will be done implicitly)
- Update the example OgnRipple wave equation solver to use bundles

## [0.1.14] - 2021-08-09

- Fix for out-of-bounds memory access in CUDA BVH
- Better error checking after kernel launches (use warp.config.verify_cuda=True)
- Fix for vec3 normalize adjoint code

## [0.1.13] - 2021-07-29

- Remove OgnShrinkWrap.py test node

## [0.1.12] - 2021-07-29

- Switch to Woop et al.'s watertight ray-tri intersection test
- Disable --fast-math in CUDA compilation step for improved precision

## [0.1.11] - 2021-07-28

- Fix for mesh_query_ray() returning incorrect t-value

## [0.1.10] - 2021-07-28

- Fix for OV extension fwatcher filters to avoid hot-reload loop due to OGN regeneration

## [0.1.9] - 2021-07-21

- Fix for loading sibling DLL paths
- Better type checking for built-in function arguments
- Added runtime docs, can now list all builtins using wp.print_builtins()

## [0.1.8] - 2021-07-14

- Fix for hot-reload of CUDA kernels
- Add Tape object for replaying differentiable kernels
- Add helpers for Torch interop (convert torch.Tensor to wp.Array)

## [0.1.7] - 2021-07-05

- Switch to NVRTC for CUDA runtime
- Allow running without host compiler
- Disable asserts in kernel release mode (small perf. improvement)

## [0.1.6] - 2021-06-14

- Look for CUDA toolchain in target-deps

## [0.1.5] - 2021-06-14

- Rename OgLang -> Warp
- Improve CUDA environment error checking
- Clean-up some logging, add verbose mode (warp.config.verbose)

## [0.1.4] - 2021-06-10

- Add support for mesh raycast

## [0.1.3] - 2021-06-09

- Add support for unary negation operator
- Add support for mutating variables during dynamic loops (non-differentiable)
- Add support for in-place operators
- Improve kernel cache start up times (avoids adjointing before cache check)
- Update README.md with requirements / examples

## [0.1.2] - 2021-06-03

- Add support for querying mesh velocities
- Add CUDA graph support, see warp.capture_begin(), warp.capture_end(), warp.capture_launch()
- Add explicit initialization phase, warp.init()
- Add variational Euler solver (sim)
- Add contact caching, switch to nonlinear friction model (sim)

- Fix for Linux/macOS support

## [0.1.1] - 2021-05-18

- Fix bug with conflicting CUDA contexts

## [0.1.0] - 2021-05-17

- Initial publish for alpha testing