---
license: mit
language:
- en
---
# Dynamic Hierarchical Sparse Attention (DHSA)
This repository hosts the boundary predictor weights used in *[Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs](https://arxiv.org/pdf/2510.24606)* (NeurIPS 2025 Workshop on Efficient Reasoning).
## Overview
The boundary predictor is a lightweight transformer module trained to dynamically segment long text sequences into variable-length chunks at semantic boundaries. It forms the first stage of Dynamic Hierarchical Sparse Attention (DHSA), which lets large language models process long contexts efficiently by predicting where attention should be computed sparsely.
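To make the segmentation step concrete, here is a minimal sketch of one way per-position boundary predictions could be turned into variable-length chunk spans. The helper name and the 0/1 label convention are illustrative assumptions, not part of the released code:

```python
# Illustrative helper: convert per-position boundary flags into
# (start, end) chunk spans for a downstream sparse-attention stage.
def boundaries_to_chunks(labels):
    """A 1 at position i means a new chunk starts at i; position 0
    always opens the first chunk."""
    starts = [0] + [i for i, flag in enumerate(labels) if flag and i > 0]
    ends = starts[1:] + [len(labels)]
    return list(zip(starts, ends))

print(boundaries_to_chunks([0, 0, 1, 0, 0, 0, 1, 0]))  # [(0, 2), (2, 6), (6, 8)]
```

Each resulting span is one chunk; the sparse-attention stage can then decide which chunk pairs to attend between.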
## Model Architecture
The boundary predictor consists of three main parts:

1. Shared Encoder – Uses attention and pooling layers to capture the left and right context around each token.
2. Feature Fusion – Combines the two contextual features along with their difference, product, and similarity to represent local semantic changes.
3. MLP Classifier – Takes the fused features and predicts whether a given position marks a boundary.

This lightweight design efficiently identifies semantic shifts in long text sequences.
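A minimal NumPy sketch of the feature-fusion step described above may help. The shapes, the feature dimension, and the use of cosine similarity as the similarity term are assumptions for illustration; the released module is a trained transformer component, not this function:

```python
import numpy as np

def fuse_features(left, right, eps=1e-8):
    """Fuse left/right contextual features per position into
    [left, right, difference, product, similarity] (shapes assumed)."""
    sim = (left * right).sum(axis=-1, keepdims=True) / (
        np.linalg.norm(left, axis=-1, keepdims=True)
        * np.linalg.norm(right, axis=-1, keepdims=True)
        + eps
    )
    return np.concatenate([left, right, left - right, left * right, sim], axis=-1)

rng = np.random.default_rng(0)
left = rng.standard_normal((16, 256))   # (positions, feature_dim) - assumed dims
right = rng.standard_normal((16, 256))
fused = fuse_features(left, right)
print(fused.shape)  # (16, 1025): 4 * 256 features + 1 similarity score
```

The MLP classifier would then map each fused row to a boundary / non-boundary prediction for that position.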
## Training Data
The predictor was trained on diverse long-context datasets combining multiple reasoning and QA sources:

* [Long Data Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)
* [Trivia QA](https://huggingface.co/datasets/mandarjoshi/trivia_qa)
* [ChatQA2-Long-SFT](https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data)

Data were automatically annotated using an internal semantic-boundary labeling process based on similarity shifts between consecutive key embeddings.
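One plausible form of such a labeling rule is sketched below: mark a boundary wherever similarity between consecutive key embeddings drops. The cosine metric and the threshold value are assumptions for illustration, not values taken from the paper or the internal pipeline:

```python
import numpy as np

def label_boundaries(keys, threshold=0.5):
    """keys: (seq_len, dim) embeddings; returns 0/1 boundary labels.
    Threshold is a hypothetical value, not a published one."""
    prev, curr = keys[:-1], keys[1:]
    sim = (prev * curr).sum(axis=-1) / (
        np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1) + 1e-8
    )
    labels = np.zeros(len(keys), dtype=int)
    labels[1:] = (sim < threshold).astype(int)  # low similarity => boundary
    return labels

rng = np.random.default_rng(1)
keys = rng.standard_normal((32, 64))
print(label_boundaries(keys).sum(), "boundaries found")
```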
## Citation

If you use this model, please cite:

```
@misc{xiong2025longcontextmodelingdynamichierarchical,
  title={Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs},
  author={Siheng Xiong and Joe Zou and Faramarz Fekri and Yae Jee Cho},
  year={2025},
  eprint={2510.24606},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```