| --- |
| language: |
| - en |
| tags: |
| - charboundary |
| - sentence-boundary-detection |
| - paragraph-detection |
| - legal-text |
| - legal-nlp |
| - text-segmentation |
| - cpu |
| - document-processing |
| - rag |
| license: mit |
| library_name: charboundary |
| pipeline_tag: text-classification |
| datasets: |
| - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries |
| - alea-institute/kl3m-data-snapshot-20250324 |
| metrics: |
| - accuracy |
| - f1 |
| - precision |
| - recall |
| - throughput |
| papers: |
| - https://arxiv.org/abs/2504.04131 |
| --- |
| |
| # CharBoundary Small Model |
|
|
| This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), |
| a fast character-based sentence and paragraph boundary detection system optimized for legal text. |
|
|
| ## Model Details |
|
|
| - **Size**: small |
| - **Model Size**: 3.0 MB (SKOPS compressed) |
| - **Memory Usage**: 1026 MB at runtime |
| - **Training Data**: ~50,000 legal-text samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324) |
| - **Model Type**: Random Forest (32 trees, max depth 16) |
| - **Format**: scikit-learn model (serialized with skops) |
| - **Task**: Character-level boundary detection for text segmentation |
| - **License**: MIT |
| - **Throughput**: ~748K characters/second |
|
|
| ## Usage |
|
|
| > **Important:** When loading models from Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types. |
| > |
| > **Security Note:** The ONNX model variants are recommended in security-sensitive environments as they don't require bypassing skops security measures with `trust_model=True`. See the [ONNX versions](https://huggingface.co/alea-institute/charboundary-small-onnx) for a safer alternative. |
|
|
| ```python |
| # pip install charboundary |
| from huggingface_hub import hf_hub_download |
| from charboundary import TextSegmenter |
| |
| # Download the model |
| model_path = hf_hub_download(repo_id="alea-institute/charboundary-small", filename="model.pkl") |
| |
| # Load the model (trust_model=True is required when loading from external sources) |
| segmenter = TextSegmenter.load(model_path, trust_model=True) |
| |
| # Use the model |
| text = "This is a test sentence. Here's another one!" |
| sentences = segmenter.segment_to_sentences(text) |
| print(sentences) |
| # Output: ['This is a test sentence.', " Here's another one!"] |
| |
| # Segment to spans |
| sentence_spans = segmenter.get_sentence_spans(text) |
| print(sentence_spans) |
| # Output: [(0, 24), (24, 44)] |
| ``` |
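
The spans in the example output can be read as half-open `(start, end)` character offsets into the input. The sketch below only re-uses the offsets printed above (no model is loaded) to show how they map back to the sentence strings; the `cleaned` list is an illustrative addition, not part of the library.

```python
# Offsets copied from the documented example output; no model is loaded here.
text = "This is a test sentence. Here's another one!"
sentence_spans = [(0, 24), (24, 44)]

# Treat each span as a half-open (start, end) pair of character offsets.
sentences = [text[start:end] for start, end in sentence_spans]

# The spans tile the input, so joining the slices restores the original text.
assert "".join(sentences) == text

# Leading whitespace stays with the following sentence; strip it for display.
cleaned = [sentence.strip() for sentence in sentences]
print(cleaned)
```

Because the offsets index into the original string, they can be stored alongside retrieval chunks to point back to the source document without re-segmenting it.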
|
|
| ## Performance |
|
|
| The model is a character-based random forest classifier. Its context window and aggregate evaluation metrics are: |
| - Window Size: 5 characters before, 3 characters after potential boundary |
| - Accuracy: 0.9970 |
| - F1 Score: 0.7730 |
| - Precision: 0.7460 |
| - Recall: 0.9870 |
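
To make the window size concrete, here is a minimal sketch of the context a classifier would see around one candidate boundary character. The function `context_window`, its padding character, and its return shape are hypothetical and chosen for illustration; this is not the feature extraction that CharBoundary itself performs.

```python
def context_window(text, index, before=5, after=3, pad="_"):
    """Return (left, char, right) context around the character at `index`.

    `left` holds up to `before` characters preceding the candidate and
    `right` holds up to `after` characters following it. Both sides are
    padded with `pad` where the text runs out.
    """
    left = text[max(0, index - before):index].rjust(before, pad)
    right = text[index + 1:index + 1 + after].ljust(after, pad)
    return left, text[index], right


text = "This is a test sentence. Here's another one!"

# Candidate boundary at the first period.
print(context_window(text, text.index(".")))

# Candidate boundary at the final character, where the right side is padded.
print(context_window(text, len(text) - 1))
```

With 5 characters before and 3 after, each decision depends only on a short local context, which is one reason a character-level model can run quickly on CPU.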
|
|
| ### Dataset-specific Performance |
|
|
| | Dataset | Precision | F1 | Recall | |
| |---------|-----------|-------|--------| |
| | ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 | |
| | SCOTUS | 0.926 | 0.773 | 0.664 | |
| | Cyber Crime | 0.939 | 0.837 | 0.755 | |
| | BVA | 0.937 | 0.870 | 0.812 | |
| | Intellectual Property | 0.927 | 0.883 | 0.843 | |
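
As a sanity check, each F1 value in the table is the harmonic mean of that row's precision and recall. The short sketch below recomputes it from the published precision and recall figures; the helper name `f1_score` is just an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "ALEA SBD Benchmark": (0.624, 0.845),
    "SCOTUS": (0.926, 0.664),
    "Cyber Crime": (0.939, 0.755),
    "BVA": (0.937, 0.812),
    "Intellectual Property": (0.927, 0.843),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

The recomputed values agree with the F1 column to three decimal places.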
|
|
| ## Available Models |
|
|
| CharBoundary comes in three sizes that trade accuracy against model size, memory use, and throughput: |
|
|
| | Model | Format | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score | |
| |-------|--------|-----------|-------------|------------------------|----------| |
| | Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 | ~748K | 0.773 | |
| | Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 | ~587K | 0.779 | |
| | Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 | ~518K | 0.782 | |
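
For rough capacity planning, the throughput column can be turned into a processing-time estimate. This is a back-of-the-envelope sketch: it assumes the listed figures hold on your hardware and that throughput stays constant with input size, neither of which is guaranteed. The names `THROUGHPUT_CHARS_PER_SEC` and `estimated_seconds` are illustrative.

```python
# Approximate throughput figures copied from the table above (chars/sec).
THROUGHPUT_CHARS_PER_SEC = {
    "small": 748_000,
    "medium": 587_000,
    "large": 518_000,
}


def estimated_seconds(num_chars, model):
    """Estimate processing time, assuming constant throughput."""
    return num_chars / THROUGHPUT_CHARS_PER_SEC[model]


corpus_chars = 1_000_000_000  # hypothetical corpus of one billion characters
for model in THROUGHPUT_CHARS_PER_SEC:
    minutes = estimated_seconds(corpus_chars, model) / 60
    print(f"{model}: about {minutes:.1f} minutes")
```

Measure on your own hardware before relying on such numbers, since actual throughput depends on CPU, text characteristics, and batch size.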
|
|
| ## Paper and Citation |
|
|
| This model is part of the research presented in the following paper: |
|
|
| ```bibtex |
| @article{bommarito2025precise, |
| title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary}, |
| author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian}, |
| journal={arXiv preprint arXiv:2504.04131}, |
| year={2025} |
| } |
| ``` |
|
|
| For more details on the model architecture, training, and evaluation, please see: |
| - [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131) |
| - [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary) |
| - [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries) |
|
|
| ## Contact |
|
|
| This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai). |
|
|
| For technical support, collaboration opportunities, or general inquiries, contact the ALEA Institute through any of the channels below, or create an issue on this repository: |
| |
| - GitHub: https://github.com/alea-institute/kl3m-model-research |
| - Email: [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai) |
| - Website: https://aleainstitute.ai |
|
|
|  |
|
|