---
language:
- en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- legal
- indian-law
- bert
license: cc-by-nc-4.0
model-index:
- name: IN_Lexi_X_BERT
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: Indian Legal NER (enriched)
      type: custom
      split: test
    metrics:
    - name: Micro F1
      type: f1
      value: 0.7773406766325727
    - name: Macro F1
      type: f1
      value: 0.780584439098286
---
# IN_Lexi_X_BERT — Indian Legal NER (Improved)
This repository contains a fine-tuned BERT model (InLegalBERT-style) for Named Entity Recognition on Indian legal texts.
- Task: Token classification (NER)
- Domain: Indian legal documents
- Max sequence length: 512
- Framework: Hugging Face Transformers
## What’s new
This version improves NER performance on the test split compared to the previous push.
- Micro F1: 0.77734
- Macro F1: 0.78058
Per-class highlights (F1): COURT ≈ 0.894, PROVISION ≈ 0.939, STATUTE ≈ 0.925. See compare_checkpoints_report.json for full details.
## Labels
The model predicts BIO-formatted legal entities:
- B-CASE_NUMBER, I-CASE_NUMBER
- B-COURT, I-COURT
- B-DATE, I-DATE
- B-GPE, I-GPE
- B-JUDGE, I-JUDGE
- B-LAWYER, I-LAWYER
- B-ORG, I-ORG
- B-OTHER_PERSON, I-OTHER_PERSON
- B-PETITIONER, I-PETITIONER
- B-PRECEDENT, I-PRECEDENT
- B-PROVISION, I-PROVISION
- B-RESPONDENT, I-RESPONDENT
- B-STATUTE, I-STATUTE
- B-WITNESS, I-WITNESS
- O
These correspond to the config.json id2label mapping packaged with the checkpoint.
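If you work with raw per-token predictions instead of the pipeline's aggregated output, the BIO tags above can be decoded into entity spans with a short helper. The function below is an illustrative sketch (not shipped with the checkpoint); it uses strict decoding, so stray `I-` tags without a matching `B-` are dropped:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) token spans."""
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], i, i + 1]  # open a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = i + 1  # extend the open span
        else:
            # "O" tag, or an I- tag that doesn't continue the open span
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

tags = ["B-PROVISION", "I-PROVISION", "O", "O", "B-STATUTE"]
print(bio_to_spans(tags))
# → [('PROVISION', 0, 2), ('STATUTE', 4, 5)]
```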
## Quick start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "shreyas2809/IN_Lexi_X_BERT"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # aggregates B/I tags into spans
)

text = "In The High Court Of Kerala At Ernakulam ..."
print(ner(text))
```
## Intended use
- Information extraction from Indian legal case texts
- Downstream legal analytics (parties, courts, statutes, provisions, etc.)
## Evaluation
The provided metrics are computed on a held-out test split of the Indian Legal NER dataset (enriched). For a more detailed analysis, generate a confusion matrix and per-class metrics locally.
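One common way to compute micro and macro F1 over entity spans is exact-match span comparison; the sketch below is a plain-Python illustration (not the author's evaluation script) that takes gold and predicted spans as `(label, start, end)` tuples:

```python
from collections import defaultdict

def span_f1(gold, pred):
    """Per-class, micro, and macro F1 over exact-match entity spans."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        (tp if span in gold_set else fp)[span[0]] += 1
    for span in gold_set - pred_set:
        fn[span[0]] += 1
    per_class = {}
    for lbl in set(tp) | set(fp) | set(fn):
        p = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        r = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        per_class[lbl] = 2 * p * r / (p + r) if p + r else 0.0
    # micro: pool counts over all classes; macro: average per-class F1
    t, f_p, f_n = sum(tp.values()), sum(fp.values()), sum(fn.values())
    mp = t / (t + f_p) if t + f_p else 0.0
    mr = t / (t + f_n) if t + f_n else 0.0
    micro = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, micro, macro

gold = [("COURT", 0, 3), ("STATUTE", 10, 12), ("JUDGE", 20, 22)]
pred = [("COURT", 0, 3), ("STATUTE", 10, 13)]  # STATUTE boundary is wrong
per_class, micro, macro = span_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
# → 0.4 0.333
```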
## Limitations
- Domain-specific: optimized for Indian legal language patterns
- Long documents: sequences >512 tokens are truncated
- Class imbalance: some labels are less frequent relative to 'O'
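To work around the 512-token limit, one option is to split long documents into overlapping word windows, run the pipeline on each chunk, and shift the character offsets back before merging. The helper below is an illustrative sketch; the window and stride sizes are assumptions, not repo settings:

```python
def window_words(words, max_words=400, stride=50):
    """Yield (start_index, chunk) pairs of overlapping word windows."""
    step = max_words - stride
    for start in range(0, max(len(words) - stride, 1), step):
        yield start, words[start:start + max_words]

# Usage with the Quick start pipeline (`ner` is the pipeline object):
# for start, chunk in window_words(long_text.split()):
#     for ent in ner(" ".join(chunk)):
#         ...  # shift ent['start']/ent['end'] by the chunk's character
#              # position, then deduplicate entities in the overlap

windows = list(window_words([f"w{i}" for i in range(1000)]))
print([start for start, _ in windows])
# → [0, 350, 700]
```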
## Example output

For an input like `"On query by the Bench about an entry of Rs. 1,31,37,500 on deposit side of Hongkong Bank account of ..."`, the pipeline returns a list of aggregated spans shaped like:

```python
[
    {
        'entity_group': 'ORG',
        'score': 0.98,
        'word': 'Hongkong Bank',
        'start': 86,
        'end': 99,
    },
    # ...
]
```

## Training and data

The model was fine-tuned on Indian legal NER data prepared by the author. BIO-formatted splits are not included here; labels are baked into the checkpoint. To reproduce the evaluation locally, run a confusion-matrix evaluation on your BIO files with a script such as:

```bash
# local usage example (not part of the Hub repo)
python evaluate_confusion_matrix.py \
    --model-dir best \
    --data-path enriched_data/test.bio \
    --output-image reports/test_confusion.png
```

This script aligns word-piece tokens to word-level labels, computes a confusion matrix (optionally excluding the dominant 'O' class), and can export a heatmap.

## Citation

If you use this model in academic or industry work, please cite this repository and the underlying InLegalBERT base.

## License

CC-BY-NC 4.0