---
license: mit
language:
- en
---

# Dynamic Hierarchical Sparse Attention (DHSA)

This repository hosts the boundary predictor weights used in *[Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs](https://arxiv.org/pdf/2510.24606)* (NeurIPS 2025 Workshop on Efficient Reasoning).

## Overview

The boundary predictor is a lightweight transformer module trained to dynamically segment long text sequences into variable-length chunks based on semantic boundaries. It forms the first stage of Dynamic Hierarchical Sparse Attention (DHSA), enabling large language models to process long contexts efficiently by predicting where to attend sparsely.
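As a toy illustration of how predicted boundaries can induce a sparse attention pattern, the sketch below turns per-token boundary flags into segment ids and a chunk-local attention mask. This is only an assumption for illustration — DHSA's actual attention scheme is hierarchical and is described in the paper, not reproduced here.

```python
import numpy as np

def segment_ids(boundaries):
    """Turn per-token boundary flags (1 = new chunk starts here) into segment ids."""
    return np.cumsum(boundaries)

def block_mask(boundaries):
    """Chunk-local attention mask: a token may attend only within its own chunk."""
    seg = segment_ids(np.asarray(boundaries))
    return seg[:, None] == seg[None, :]

b = [0, 0, 1, 0, 0, 1, 0]          # hypothetical boundary flags for 7 tokens
mask = block_mask(b)
print(segment_ids(np.asarray(b)))  # [0 0 1 1 1 2 2]
```

Variable-length chunks mean the mask's dense blocks track semantic units rather than a fixed window size.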
## Model Architecture

The boundary predictor consists of three main parts:

1. **Shared Encoder** – uses attention and pooling layers to capture the left and right context around each token.
2. **Feature Fusion** – combines the two contextual features along with their difference, product, and similarity to represent local semantic changes.
3. **MLP Classifier** – takes the fused features and predicts whether a given position marks a boundary.

This lightweight design efficiently identifies semantic shifts in long text sequences.
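The fusion and classification steps can be sketched in NumPy as follows. This is a minimal illustration, not the trained module: the hidden sizes, the exact fusion layout, and the random stand-in features are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(left, right):
    """Fuse left/right context features: concat both, their difference,
    their elementwise product, and their cosine similarity."""
    cos = np.sum(left * right, axis=-1, keepdims=True) / (
        np.linalg.norm(left, axis=-1, keepdims=True)
        * np.linalg.norm(right, axis=-1, keepdims=True) + 1e-8
    )
    return np.concatenate([left, right, left - right, left * right, cos], axis=-1)

def mlp_classify(x, w1, b1, w2, b2):
    """Two-layer MLP producing a per-position boundary probability."""
    h = np.maximum(x @ w1 + b1, 0.0)      # ReLU hidden layer
    logits = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> boundary probability

seq_len, d = 16, 32
left = rng.normal(size=(seq_len, d))      # stand-in left-context features
right = rng.normal(size=(seq_len, d))     # stand-in right-context features
fused = fuse(left, right)                 # shape (seq_len, 4*d + 1)

w1 = rng.normal(size=(4 * d + 1, 64)) * 0.1
b1 = np.zeros(64)
w2 = rng.normal(size=(64, 1)) * 0.1
b2 = np.zeros(1)
probs = mlp_classify(fused, w1, b1, w2, b2)
print(probs.shape)  # (16, 1)
```

Thresholding `probs` would yield the per-position boundary flags consumed by the later DHSA stages.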
## Training Data

The predictor was trained on diverse long-context datasets combining multiple reasoning and QA sources:

* [Long Data Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)
* [TriviaQA](https://huggingface.co/datasets/mandarjoshi/trivia_qa)
* [ChatQA2-Long-SFT](https://huggingface.co/datasets/nvidia/ChatQA2-Long-SFT-data)

Data were automatically annotated using an internal semantic-boundary labeling process based on similarity shifts between consecutive key embeddings.
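A similarity-shift labeling rule of this kind might look like the sketch below: mark a boundary wherever the cosine similarity between consecutive key embeddings drops below a threshold. The threshold value and the exact rule are assumptions for illustration; the internal process is not public.

```python
import numpy as np

def label_boundaries(keys, threshold=0.5):
    """Label position i as a boundary when the cosine similarity between
    key embeddings i-1 and i falls below `threshold` (hypothetical rule)."""
    keys = keys / (np.linalg.norm(keys, axis=-1, keepdims=True) + 1e-8)
    sim = np.sum(keys[:-1] * keys[1:], axis=-1)  # cos sim of each neighbour pair
    labels = np.zeros(len(keys), dtype=int)
    labels[1:] = (sim < threshold).astype(int)   # low similarity -> boundary
    return labels

# Two synthetic "segments" with a sharp direction change at the join.
seg_a = np.tile([1.0, 0.0], (5, 1))
seg_b = np.tile([0.0, 1.0], (5, 1))
keys = np.vstack([seg_a, seg_b])
print(label_boundaries(keys))  # boundary fires only at the segment join
```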
## Citation

If you use this model, please cite:

```bibtex
@misc{xiong2025longcontextmodelingdynamichierarchical,
      title={Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs},
      author={Siheng Xiong and Joe Zou and Faramarz Fekri and Yae Jee Cho},
      year={2025},
      eprint={2510.24606},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```