---
language:
- en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- legal
- indian-law
- bert
license: cc-by-nc-4.0
model-index:
- name: IN_Lexi_X_BERT
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: Indian Legal NER (enriched)
      type: custom
      split: test
    metrics:
    - name: Micro F1
      type: f1
      value: 0.7773406766325727
    - name: Macro F1
      type: f1
      value: 0.780584439098286
---
# IN_Lexi_X_BERT — Indian Legal NER (Improved)
This repository contains a fine-tuned BERT model (InLegalBERT-style) for Named Entity Recognition on Indian legal texts.
- Task: Token classification (NER)
- Domain: Indian legal documents
- Max sequence length: 512
- Framework: Hugging Face Transformers
## What’s new
This version improves NER performance on the test split compared to the previous push.
- Micro F1: 0.77734
- Macro F1: 0.78058
Per-class highlights (F1): COURT ≈ 0.894, PROVISION ≈ 0.939, STATUTE ≈ 0.925. See compare_checkpoints_report.json for full details.
## Labels
The model predicts BIO-formatted legal entities:
- B-CASE_NUMBER, I-CASE_NUMBER
- B-COURT, I-COURT
- B-DATE, I-DATE
- B-GPE, I-GPE
- B-JUDGE, I-JUDGE
- B-LAWYER, I-LAWYER
- B-ORG, I-ORG
- B-OTHER_PERSON, I-OTHER_PERSON
- B-PETITIONER, I-PETITIONER
- B-PRECEDENT, I-PRECEDENT
- B-PROVISION, I-PROVISION
- B-RESPONDENT, I-RESPONDENT
- B-STATUTE, I-STATUTE
- B-WITNESS, I-WITNESS
- O
These correspond to the config.json id2label mapping packaged with the checkpoint.
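If you work with raw per-token predictions instead of the pipeline's aggregated output, the BIO tags above can be decoded into entity spans with a short helper. The function below is an illustrative sketch (not shipped with the checkpoint); it uses strict decoding, so stray `I-` tags without a matching `B-` are dropped:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) token spans."""
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], i, i + 1]  # open a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = i + 1  # extend the open span
        else:
            # "O" tag, or an I- tag that doesn't continue the open span
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

tags = ["B-PROVISION", "I-PROVISION", "O", "O", "B-STATUTE"]
print(bio_to_spans(tags))
# → [('PROVISION', 0, 2), ('STATUTE', 4, 5)]
```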
## Quick start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "shreyas2809/IN_Lexi_X_BERT"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # aggregates B/I tags into spans
)

text = "In The High Court Of Kerala At Ernakulam ..."
print(ner(text))
```
## Intended use
- Information extraction from Indian legal case texts
- Downstream legal analytics (parties, courts, statutes, provisions, etc.)
## Evaluation
The provided metrics are computed on a held-out test split of the Indian Legal NER dataset (enriched). For a more detailed analysis, generate a confusion matrix and per-class metrics locally.
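One common way to compute micro and macro F1 over entity spans is exact-match span comparison; the sketch below is a plain-Python illustration (not the author's evaluation script) that takes gold and predicted spans as `(label, start, end)` tuples:

```python
from collections import defaultdict

def span_f1(gold, pred):
    """Per-class, micro, and macro F1 over exact-match entity spans."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        (tp if span in gold_set else fp)[span[0]] += 1
    for span in gold_set - pred_set:
        fn[span[0]] += 1
    per_class = {}
    for lbl in set(tp) | set(fp) | set(fn):
        p = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        r = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        per_class[lbl] = 2 * p * r / (p + r) if p + r else 0.0
    # micro: pool counts over all classes; macro: average per-class F1
    t, f_p, f_n = sum(tp.values()), sum(fp.values()), sum(fn.values())
    mp = t / (t + f_p) if t + f_p else 0.0
    mr = t / (t + f_n) if t + f_n else 0.0
    micro = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, micro, macro

gold = [("COURT", 0, 3), ("STATUTE", 10, 12), ("JUDGE", 20, 22)]
pred = [("COURT", 0, 3), ("STATUTE", 10, 13)]  # STATUTE boundary is wrong
per_class, micro, macro = span_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
# → 0.4 0.333
```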
## Limitations
- Domain-specific: optimized for Indian legal language patterns
- Long documents: sequences >512 tokens are truncated
- Class imbalance: some labels are less frequent relative to 'O'
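To work around the 512-token limit, one option is to split long documents into overlapping word windows, run the pipeline on each chunk, and shift the character offsets back before merging. The helper below is an illustrative sketch; the window and stride sizes are assumptions, not repo settings:

```python
def window_words(words, max_words=400, stride=50):
    """Yield (start_index, chunk) pairs of overlapping word windows."""
    step = max_words - stride
    for start in range(0, max(len(words) - stride, 1), step):
        yield start, words[start:start + max_words]

# Usage with the Quick start pipeline (`ner` is the pipeline object):
# for start, chunk in window_words(long_text.split()):
#     for ent in ner(" ".join(chunk)):
#         ...  # shift ent['start']/ent['end'] by the chunk's character
#              # position, then deduplicate entities in the overlap

windows = list(window_words([f"w{i}" for i in range(1000)]))
print([start for start, _ in windows])
# → [0, 350, 700]
```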
## Example output

For an input like `"On query by the Bench about an entry of Rs. 1,31,37,500 on deposit side of Hongkong Bank account of ..."`, the pipeline returns a list of aggregated spans shaped like:

```python
[
    {
        'entity_group': 'ORG',
        'score': 0.98,
        'word': 'Hongkong Bank',
        'start': 86,
        'end': 99,
    },
    # ...
]
```

## Training and data

The model was fine-tuned on Indian legal NER data prepared by the author. BIO-formatted splits are not included here; labels are baked into the checkpoint. To reproduce the evaluation locally, run a confusion-matrix evaluation on your BIO files with a script such as:

```bash
# local usage example (not part of the Hub repo)
python evaluate_confusion_matrix.py \
    --model-dir best \
    --data-path enriched_data/test.bio \
    --output-image reports/test_confusion.png
```

This script aligns word-piece tokens to word-level labels, computes a confusion matrix (optionally excluding the dominant 'O' class), and can export a heatmap.

## Citation

If you use this model in academic or industry work, please cite this repository and the underlying InLegalBERT base.

## License

CC-BY-NC 4.0