tim1900/bert-chunker


tim1900 committed on May 17, 2024

Commit 6d51e4f · verified · 1 Parent(s): 6bcdbb0

Update README.md

Files changed (1)

README.md +64 -3


@@ -1,3 +1,64 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+- zh
+pipeline_tag: token-classification
+---
+# BertChunker
+## Introduction
+BertChunker is a text chunker for retrieval-augmented generation (RAG), trained end to end. It is built on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) with an adapter.
+This repo includes the model checkpoint, the BertChunker class definition file, and all other files needed to run it.
+## Quickstart
+Download this repository, change into its directory, and run the following:
+```python
+import safetensors.torch  # import the submodule explicitly; load_file lives there
+from transformers import AutoConfig, AutoTokenizer
+from modeling_bertchunker import BertChunker
+# load bert tokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "./",
+    padding_side="right",
+    model_max_length=255,
+    trust_remote_code=True,
+)
+# load MiniLM-L6-H384-uncased bert config
+config = AutoConfig.from_pretrained(
+    "./",
+    trust_remote_code=True,
+)
+# initialize model
+model = BertChunker(config)
+device='cuda'  # requires a CUDA-capable GPU
+model.to(device)
+# load parameters
+state_dict = safetensors.torch.load_file("./model.safetensors")
+model.load_state_dict(state_dict)
+# text to be chunked
+text="In the heart of the bustling city, where towering skyscrapers touch the clouds and the symphony \
+    of honking cars never ceases, Sarah, an aspiring novelist, found solace in the quiet corners of the ancient library. \
+    Surrounded by shelves that whispered stories of centuries past, she crafted her own world with words, oblivious to the rush outside.\
+    Dr. Alexander Thompson, aboard the spaceship 'Pandora's Venture', was en route to the newly discovered exoplanet Zephyr-7. \
+    As the lead astrobiologist of the expedition, his mission was to uncover signs of microbial life within the planet's subterranean ice caves. \
+    With each passing light year, the anticipation of unraveling secrets that could alter humanity's\
+     understanding of life in the universe grew ever stronger."
+# chunk the text
+chunks=model.chunk_text(text, tokenizer)
+# print chunks
+for c in chunks:
+    print('------------------')
+    print(c)
+```