# CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)

## Summary

CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release
and the CUDA Toolkit 11.1 release.

## Bug Fixes

- #1217: Move static local in `cub::DeviceCount` to a separate host-only
  function because NVC++ doesn't support static locals in host-device
  functions.
# CUB 1.9.10 (NVIDIA HPC SDK 20.5)

## Summary

CUB 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release.
It adds CMake `find_package` support.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
entirely.
## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
  in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
  version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
  deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
  deprecation warnings).
  Suppression is only a short term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
  near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake `find_package` support.
  Just point CMake at the `cmake` folder in your CUB include directory
  (ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you
  can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.
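A minimal consuming `CMakeLists.txt` might look like the following sketch. The project and target names here are illustrative; `CUB::CUB` is the imported target name exported by CUB's package configuration in recent releases, but verify against the `cmake` folder shipped with your copy of CUB.

```cmake
cmake_minimum_required(VERSION 3.15)
project(cub_consumer CXX CUDA)

# Locate CUB's package configuration. CUB_DIR can be passed on the
# command line, e.g. -DCUB_DIR=/usr/local/cuda/include/cub/cmake/
find_package(CUB REQUIRED CONFIG)

add_executable(app main.cu)

# CUB is header-only; linking the imported target just propagates
# its include directories (and any usage requirements) to `app`.
target_link_libraries(app PRIVATE CUB::CUB)
```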
# CUB 1.9.9 (CUDA 11.0)

## Summary

CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release.
It introduces CMake support, version macros, platform detection machinery,
and support for NVC++, which uses Thrust (and thus CUB) to implement
GPU-accelerated C++17 Parallel Algorithms.
Additionally, the scan dispatch layer was refactored and modernized.
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
Starting with the upcoming 1.10.0 release, C++03 support will be dropped
entirely.
## Breaking Changes

- Thrust now checks that it is compatible with the version of CUB found
  in your include path, generating an error if it is not.
  If you are using your own version of CUB, it may be too old.
  It is recommended to simply delete your own version of CUB and use the
  version of CUB that comes with Thrust.
- C++03 and C++11 are deprecated.
  Using these dialects will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
  deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
  deprecation warnings).
  Suppression is only a short term solution.
  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
  near future.
- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
  Using these compilers will generate a compile-time warning.
  These warnings can be suppressed by defining
  `CUB_IGNORE_DEPRECATED_COMPILER`.
  Suppression is only a short term solution.
  We will be dropping support for these compilers in the near future.

## New Features

- CMake support.
  Thanks to Francis Lemaire for this contribution.
- Refactored and modernized scan dispatch layer.
  Thanks to Francis Lemaire for this contribution.
- Policy hooks for device-wide reduce, scan, and radix sort facilities
  to simplify tuning and allow users to provide custom policies.
  Thanks to Francis Lemaire for this contribution.
- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`,
  `CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER`.
- Platform detection machinery:
  - `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
  - `<cub/util_compiler.cuh>`: Host and device compiler detection.
  - `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
  - `<cub/config.cuh>`: Includes `<cub/util_arch.cuh>`,
    `<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`,
    `<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`, and
    `<cub/util_namespace.cuh>`.
- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for
  `cudaGetDeviceCount`.
## Other Enhancements

- Lazily initialize the per-device CUDA attribute caches, because CUDA context
  creation is expensive and adds up with large CUDA binaries on machines with
  many GPUs.
  Thanks to the NVIDIA PyTorch team for bringing this to our attention.
- Make `cub::SwitchDevice` avoid setting/resetting the device if the current
  device is the same as the target device.
## Bug Fixes

- Add explicit failure parameter to CAS in the CUB attribute cache to work
  around a GCC 4.8 bug.
- Revert a change in reductions that changed the signedness of the `lane_id`
  variable to suppress a warning, as this introduces a bug in optimized device
  code.
- Fix initialization in `cub::ExclusiveSum`.
  Thanks to Conor Hoekstra for this contribution.
- Fix initialization of the `std::array` in the CUB attribute cache.
- Fix `-Wsign-compare` warnings.
  Thanks to Elias Stehle for this contribution.
- Fix `test_block_reduce.cu` to build without parameters.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `grid_even_share.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `thread_search.cuh`.
  Thanks to Francis Lemaire for this contribution.
- Add missing includes to `cub.cuh`.
  Thanks to Felix Kallenborn for this contribution.
# CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)

## Summary

CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release.
It contains modifications necessary to serve as the implementation of NVC++'s
GPU-accelerated C++17 Parallel Algorithms.
# CUB 1.9.8 (CUDA 11.0 Early Access)

## Summary

CUB 1.9.8 is the first release of CUB to be officially supported and included
in the CUDA Toolkit.
When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query
APIs, which improves performance of these queries by 20x to 50x when they
are called concurrently by multiple host threads.

## Enhancements

- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and
  `cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`.
  These CUDA APIs acquire locks on the CUDA driver/runtime mutex and perform
  poorly under contention; with the caching, they are 20x to 50x faster when
  called concurrently.
  Thanks to Bilge Acun for bringing this issue to our attention.
- `DispatchReduce` now takes an `OutputT` template parameter so that users can
  specify the intermediate type explicitly.
- Radix sort tuning policy updates to fix performance issues for element
  types smaller than 4 bytes.

## Bug Fixes

- Change initialization style from copy initialization to direct initialization
  (which is more permissive) in `AgentReduce` to allow a wider range of types
  to be used with it.
- Fix bad signed/unsigned comparisons in `WarpReduce`.
- Fix computation of valid lanes in the warp-level reduction primitive to
  correctly handle the case where there are 0 input items per warp.
# CUB 1.8.0

## Summary

CUB 1.8.0 introduces changes to the `cub::Shuffle*` interfaces.

## Breaking Changes

- The interfaces of `cub::ShuffleIndex`, `cub::ShuffleUp`, and
  `cub::ShuffleDown` have been changed to allow for better computation of the
  PTX SHFL control constant for logical warps smaller than 32 threads.

## Bug Fixes

- #112: Fix `cub::WarpScan`'s broadcast of warp-wide aggregate for logical
  warps smaller than 32 threads.
# CUB 1.7.5

## Summary

CUB 1.7.5 adds support for radix sorting `__half` keys and improves sorting
performance for 1-byte keys.
It was incorporated into Thrust 1.9.2.

## Enhancements

- Radix sort support for `__half` keys.
- Radix sort tuning policy updates to improve 1-byte key performance.

## Bug Fixes

- Syntax tweaks to mollify Clang.
- #127: `cub::DeviceRunLengthEncode::Encode` returns incorrect results.
- #128: 7-bit sorting passes fail for SM61 with large values.
# CUB 1.7.4

## Summary

CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

## Bug Fixes

- #114: Can't pair non-trivially-constructible values in radix sort.
- #115: `cub::WarpReduce` segmented reduction is broken in CUDA 9 for logical
  warp sizes smaller than 32.

# CUB 1.7.3

## Summary

CUB 1.7.3 is a minor release.

## Bug Fixes

- #110: `cub::DeviceHistogram` null-pointer exception bug for iterator inputs.
# CUB 1.7.2

## Summary

CUB 1.7.2 is a minor release.

## Bug Fixes

- #104: Device-wide reduction is now "run-to-run" deterministic for
  pseudo-associative reduction operators (like floating point addition).
# CUB 1.7.1

## Summary

CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a
number of bug fixes.

## Enhancements

- Radix sort tuning policies updated for SM7x (Volta).

## Bug Fixes

- #104: `uint64_t` `cub::WarpReduce` broken for CUB 1.7.0 on CUDA 8 and older.
- #103: Can't mix Thrust from CUDA 9.0 and CUB.
- #102: CUB pulls in `windows.h`, which defines `min`/`max` macros that
  conflict with `std::min`/`std::max`.
- #99: Radix sorting crashes NVCC on Windows 10 for SM52.
- #98: cuda-memcheck: `--tool initcheck` failed with lineOfSight.
- #94: Git clone size.
- #93: Accept iterators for segment offsets.
- #87: CUB uses anonymous unions, which is not valid C++.
- #44: Check for C++11 is incorrect for Visual Studio 2013.
# CUB 1.7.0

## Summary

CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
It is compatible with independent thread scheduling.
It was incorporated into Thrust 1.9.0-5.

## Breaking Changes

- Remove `cub::WarpAll` and `cub::WarpAny`.
  These functions served to emulate `__all` and `__any` functionality for
  SM1x devices, which did not have those operations.
  However, SM1x devices are now deprecated in CUDA, and the interfaces of
  these two functions lack the lane-mask needed for collectives to run on
  SM7x and newer GPUs, which have independent thread scheduling.

## Other Enhancements

- Remove any assumptions of implicit warp synchronization to be compatible with
  SM7x's (Volta) independent thread scheduling.

## Bug Fixes

- #86: Incorrect results with reduce-by-key.
# CUB 1.6.4

## Summary

CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x
(Pascal) GPUs.

## Enhancements

- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal):
  3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.

## Bug Fixes

- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: `cub::DeviceSegmentedRadixSort` should allow inputs to have
  pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove out-dated MSVC project files.
# CUB 1.6.3

## Summary

CUB 1.6.3 improves support for Windows, changes the
`cub::BlockLoad`/`cub::BlockStore` interface to take the local data type,
and enhances radix sort performance for SM6x (Pascal) GPUs.

## Breaking Changes

- `cub::BlockLoad` and `cub::BlockStore` are now templated by the local data
  type, instead of the `Iterator` type.
  This allows output iterators to have `void` as their `value_type` (e.g.
  discard iterators).
## Other Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs: 6.2B 4-byte
  keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc.).

## Bug Fixes

- #74: `cub::WarpReduce` executes the reduction operator for out-of-bounds
  items.
- #72: `cub::InequalityWrapper::operator` should be non-const.
- #71: `cub::KeyValuePair` won't work if `Key` has a non-trivial constructor.
- #69: `cub::BlockStore::Store` doesn't compile if `OutputIteratorT::value_type`
  isn't `T`.
- #68: `cub::TilePrefixCallbackOp::WarpReduce` doesn't permit PTX arch
  specialization.
# CUB 1.6.2 (previously 1.5.5)

## Summary

CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal)
GPUs.

## Enhancements

- Radix sort tuning policies updated for SM6x (Pascal) GPUs.

## Bug Fixes

- Fix AArch64 compilation of `cub::CachingDeviceAllocator`.

# CUB 1.6.1 (previously 1.5.4)

## Summary

CUB 1.6.1 (previously 1.5.4) is a minor release.

## Bug Fixes

- Fix radix sorting bug introduced by scan refactorization.

# CUB 1.6.0 (previously 1.5.3)

## Summary

CUB 1.6.0 changes the scan and reduce interfaces.
Exclusive scans now accept an "initial value" instead of an "identity value".
Scans and reductions now support differing input and output sequence types.
Additionally, many bugs have been fixed.

## Breaking Changes

- Device/block/warp-wide exclusive scans have been revised to now accept an
  "initial value" (instead of an "identity value") for seeding the computation
  with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are
  different from output sequence types (as long as they are convertible).
## Other Enhancements

- Reduce repository size by moving the doxygen binary to the doc repository.
- Minor reduction in `cub::BlockScan` instruction counts.

## Bug Fixes

- #55: Warning in `cub/device/dispatch/dispatch_reduce_by_key.cuh`.
- #59: `cub::DeviceScan::ExclusiveSum` can't prefix sum of float into
  double.
- #58: Infinite loop in `cub::CachingDeviceAllocator::NearestPowerOf`.
- #47: `cub::CachingDeviceAllocator` needs to clean up CUDA global error
  state upon successful retry.
- #46: Very high amount of needed memory from
  `cub::DeviceHistogram::HistogramEven`.
- #45: `cub::CachingDeviceAllocator` fails with debug output enabled.
# CUB 1.5.2

## Summary

CUB 1.5.2 enhances `cub::CachingDeviceAllocator` and improves scan performance
for SM5x (Maxwell).

## Enhancements

- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored `cub::CachingDeviceAllocator`:
  - Now spends less time locked.
  - Uses C++11's `std::mutex` when available.
  - Failure to allocate a block from the runtime will retry once after
    freeing cached allocations.
  - Now respects max-bin, fixing an issue where blocks in excess of max-bin
    were still being retained in the free cache.

## Bug Fixes

- Fix for generic-type reduce-by-key `cub::WarpScan` for SM3x and newer GPUs.
# CUB 1.5.1

## Summary

CUB 1.5.1 is a minor release.

## Bug Fixes

- Fix for incorrect `cub::DeviceRadixSort` output for some small problems on
  SM52 (Maxwell) GPUs.
- Fix for macro redefinition warnings when compiling `thrust::sort`.
# CUB 1.5.0

## Summary

CUB 1.5.0 introduces segmented sort and reduction primitives.

## New Features

- Segmented device-wide operations for device-wide sort and reduction
  primitives.

## Bug Fixes

- #36: `cub::ThreadLoad` generates compiler errors when loading from
  pointer-to-const.
- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
- #25: Fix for incorrect results and crashes when radix sorting 0-length
  problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and
  warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and
  user code was invoking `malloc(0)` on some systems where that returns
  `NULL`.
  CUB assumed the user was asking for the size again and not running the sort.
# CUB 1.4.1

## Summary

CUB 1.4.1 is a minor release.

## Enhancements

- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on bool types.

## Bug Fixes

- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and
  `cub::DeviceReduceByKey`.
- Remove the requirement for callers to define the `CUB_CDP` macro
  when invoking CUB device-wide routines using CUDA dynamic parallelism.
- Fix headers not being included in the proper order (or missing includes)
  for some block-wide functions.
# CUB 1.4.0

## Summary

CUB 1.4.0 adds `cub::DeviceSpmv` and
`cub::DeviceRunLengthEncode::NonTrivialRuns`, improves `cub::DeviceHistogram`,
and introduces support for SM5x (Maxwell) GPUs.

## New Features

- `cub::DeviceSpmv` methods for multiplying sparse matrices by
  dense vectors, load-balanced using a merge-based parallel decomposition.
- `cub::DeviceRadixSort` sorting entry-points that always return
  the sorted output into the specified buffer, as opposed to the
  `cub::DoubleBuffer` in which it could end up in either buffer.
- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting
  offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
  a given sequence.
  Useful for top-down partitioning algorithms like MSD sorting of very-large
  keys.
## Other Enhancements

- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated `cub::DeviceHistogram` implementation that provides the same
  "histogram-even" and "histogram-range" functionality as IPP/NPP.
  Provides extremely fast and, perhaps more importantly, very uniform
  performance response across diverse real-world datasets, including
  pathological (homogeneous) sample distributions.
# CUB 1.3.2

## Summary

CUB 1.3.2 is a minor release.

## Bug Fixes

- Fix `cub::DeviceReduce` where reductions of small problems (small enough to
  only dispatch a single thread block) would run in the default stream (stream
  zero) regardless of whether an alternate stream was specified.

# CUB 1.3.1

## Summary

CUB 1.3.1 is a minor release.

## Bug Fixes

- Workaround for a benign WAW race warning reported by cuda-memcheck
  in `cub::BlockScan` specialized for the `BLOCK_SCAN_WARP_SCANS` algorithm.
- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more
  key bits than the caller specified (up to the nearest radix digit).
- Fix for a ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi)
  and SM3x (Kepler) GPUs.
# CUB 1.3.0

## Summary

CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide
primitives and adds an enhanced version of `cub::WarpScan`.

## Breaking Changes

- CUB's collective (block-wide, warp-wide) primitives underwent a minor
  interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks,
    the interfaces for collective classes are now template-parameterized by
    X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being
    optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`).
    Furthermore, the constructors that accept remapped linear
    thread-identifiers have been removed: all primitives now assume a
    row-major thread-ranking for multidimensional thread blocks.
  - To allow the host program (compiled by the host-pass) to accurately
    determine the device-specific storage requirements for a given collective
    (compiled for each device-pass), the interfaces for collective classes
    are now (optionally) template-parameterized by the desired PTX compute
    capability.
    This is useful when aliasing collective storage to shared memory that has
    been allocated dynamically by the host at the kernel call site.
- Most CUB programs with typical 1D usage should not require any
  changes to accommodate these updates.
## New Features

- Added "combination" `cub::WarpScan` methods for efficiently computing
  both inclusive and exclusive prefix scans (and sums).

## Bug Fixes

- Fix for a bug in `cub::WarpScan` (which affected `cub::BlockScan` and
  `cub::DeviceScan`) where incorrect results (e.g., NaN) would often be
  returned when parameterized for floating-point types (fp32, fp64).
- Workaround for a ptxas error when compiling with the `-G` flag on Linux (for
  debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code
  compiled for SM1x is run on newer GPUs of higher compute-capability: the
  compiler could not tell which memory space was being used by collective
  operations and was mistakenly using global ops instead of shared ops.
# CUB 1.2.3

## Summary

CUB 1.2.3 is a minor release.

## Bug Fixes

- Fixed an access violation bug in `cub::DeviceReduce::ReduceByKey` for
  non-primitive value types.
- Fixed a code-snippet bug in the `ArgIndexInputIteratorT` documentation.

# CUB 1.2.2

## Summary

CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solutions
for examples.

## New Features

- MSVC project solutions for device-wide and block-wide examples.
- New algorithmic variant of `cub::BlockReduce` for improved performance
  when using commutative operators (e.g., numeric addition).

## Bug Fixes

- Inclusion of Thrust headers in a certain order prevented CUB device-wide
  primitives from working properly.
# CUB 1.2.0

## Summary

CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and
`cub::DeviceReduce::RunLengthEncode`, as well as support for CUDA 6.0.

## New Features

- `cub::DeviceReduce::ReduceByKey`.
- `cub::DeviceReduce::RunLengthEncode`.

## Other Enhancements

- Improved `cub::DeviceScan`, `cub::DeviceSelect`, and `cub::DevicePartition`
  performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explained iterator (in)compatibilities with CUDA 5.0 (and older) and
    Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for
  `cub::DeviceScan` (and similar) to be safe for current code run on future
  platforms (now uses proper fences).

## Bug Fixes

- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host
  and device regarding user-defined data types would corrupt tile status.
- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types
  for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for
  the first thread in the block.
- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.
# CUB 1.1.1

## Summary

CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting,
`cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and
`cub::MaxSmOccupancy`.
Additionally, scan and sort performance for older GPUs has been improved and
many bugs have been fixed.

## Breaking Changes

- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing
  cache-modifiers from their interfaces.
  `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator`
  should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that
  behavior.

## New Features

- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`,
  `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator`
  types for loading & storing arbitrary types through the cache hierarchy.
  They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for
  warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given
  kernel function pointer.

## Other Enhancements

- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older
  GPUs (SM1x to SM3x).
- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to
  avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000
  unit tests (per architecture, per platform).

## Bug Fixes

- Fixed miscellaneous `cub::DeviceScan`, `cub::BlockScan`, `cub::DeviceReduce`,
  and `cub::BlockReduce` bugs when operating on non-primitive types on older
  SM1x architectures.
- SHFL-based scans and reductions produced incorrect results for multi-word
  types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were
  entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35
  architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit
  Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than
  `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide
  methods now report the same temporary storage allocation size requirement as
  their host-callable counterparts.
# CUB 1.0.2

## Summary

CUB 1.0.2 is a minor release.

## Bug Fixes

- Corrections to code snippet examples for `cub::BlockLoad`, `cub::BlockStore`,
  and `cub::BlockDiscontinuity`.
- Cleaned up unnecessary/missing header includes.
  You can now safely include a specific .cuh (instead of `cub.cuh`).
- Bug/compilation fixes for `cub::BlockHistogram`.
# CUB 1.0.1

## Summary

CUB 1.0.1 adds `cub::DeviceRadixSort` and `cub::DeviceScan`.
Numerous other performance and correctness fixes are included.

## Breaking Changes

- New collective interface idiom (specialize/construct/invoke).

## New Features

- `cub::DeviceRadixSort`.
  Implements short-circuiting for homogeneous digit passes.
- `cub::DeviceScan`.
  Implements a single-pass "adaptive-lookback" strategy.

## Other Enhancements

- Significantly improved documentation (with example code snippets).
- More extensive regression test suite for aggressively testing collective
  variants.
- Allow non-trivially-constructed types (previously, unions had prevented
  aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger
  than 32 bits).
- Better code generation for 64-bit addressing within
  `cub::BlockLoad`/`cub::BlockStore`.
- `cub::DeviceHistogram` now supports histograms of arbitrary bins.
- Updates to accommodate CUDA 5.5 dynamic parallelism.

## Bug Fixes

- Workarounds for SM10 codegen issues in uncommonly-used
  `cub::WarpScan`/`cub::WarpReduce` specializations.
# CUB 0.9.4

## Summary

CUB 0.9.4 is a minor release.

## Enhancements

- Various documentation updates and corrections.

## Bug Fixes

- Fixed compilation errors for SM1x.
- Fixed compilation errors for some `cub::WarpScan` entrypoints on SM3x and up.
# CUB 0.9.3

## Summary

CUB 0.9.3 adds histogram algorithms and work management utility descriptors.

## New Features

- `cub::DeviceHistogram256`.
- `cub::BlockHistogram256`.
- `cub::BlockScan` algorithm variant `BLOCK_SCAN_RAKING_MEMOIZE`, which
  trades more register consumption for less shared memory I/O.
- `cub::GridQueue` and `cub::GridEvenShare`, work management utility
  descriptors.

## Other Enhancements

- Updates to `cub::BlockRadixRank` to use `cub::BlockScan`, which improves
  performance on SM3x by using SHFL.
- Allow types other than builtin types to be used in `cub::WarpScan::*Sum`
  methods if they only have `operator+` overloaded.
  Previously, they were also required to support assignment from `int(0)`.
- Update `cub::BlockReduce`'s `BLOCK_REDUCE_WARP_REDUCTIONS` algorithm to work
  even when the block size is not an even multiple of the warp size.
- Refactoring of the `cub::DeviceAllocator` interface and the
  `cub::CachingDeviceAllocator` implementation.
# CUB 0.9.2

## Summary

CUB 0.9.2 adds `cub::WarpReduce`.

## New Features

- `cub::WarpReduce`, which uses the SHFL instruction when applicable.
  `cub::BlockReduce` now uses this `cub::WarpReduce` instead of implementing
  its own.

## Enhancements

- Documentation updates and corrections.

## Bug Fixes

- Fixes for 64-bit Linux compilation warnings and errors.
# CUB 0.9.1

## Summary

CUB 0.9.1 is a minor release.

## Bug Fixes

- Fix for ambiguity in `cub::BlockScan::Reduce` between generic reduction and
  summation.
  Summation entrypoints are now called `::Sum()`, similar to the
  convention in `cub::BlockScan`.
- Small edits to documentation and download tracking.

# CUB 0.9.0

## Summary

Initial preview release.
CUB is the first durable, high-performance library of cooperative block-level,
warp-level, and thread-level primitives for CUDA kernel programming.