enochsjoseph committed on
Commit
ee44f59
·
1 Parent(s): 3280fe4
Files changed (1)
  1. README.md +33 -0
README.md CHANGED
@@ -11,3 +11,36 @@ license: apache-2.0
11
  ---
12
 
13
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
+
15
+ # LLMlocal's Text Tokenization Tool
16
+
17
+ ## Introduction
18
+
19
+ LLMlocal's Text Tokenization Tool is a user-friendly application for splitting large bodies of text into chunks and tokenizing them. It is built with Gradio, a Python library for creating web interfaces for machine learning models.
20
+
21
+ ## Features
22
+ - Tokenization Method Selection: Choose a text tokenization method. The current version supports only the `RecursiveCharacterTextSplitter` method.
23
+ - Customizable Parameters: Set the chunk size, the chunk overlap, and the number of chunks to display.
24
+ - Interactive Interface: Dropdowns, textboxes, and number inputs make the tool easy to use.
25
+ ## Installation
26
+ Before using the tool, make sure Python is installed, then install the required packages with the following command:
27
+
28
+ ```bash
29
+ pip install gradio pandas langchain
30
+ ```
31
+ ## Usage
32
+ To use the tool, follow these simple steps:
33
+
34
+ 1. Launch the Tool: Run the provided Python script. This launches the Gradio interface in your default web browser.
35
+ 2. Select a Tokenization Method: Choose the desired method from the dropdown menu.
36
+ 3. Enter Text: Type or paste the text you want to tokenize into the textbox.
37
+ 4. Set Parameters: Adjust the chunk size, chunk overlap, and number of chunks as needed.
38
+ 5. View Results: The tokenized text is displayed in a table with the chunk number, text chunk, character count, and token count for each chunk.
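
The steps above can be illustrated with a minimal, dependency-free sketch of how chunk size, chunk overlap, and the number of chunks to display interact. It is not the tool's actual code: it assumes plain fixed-size character windows rather than LangChain's `RecursiveCharacterTextSplitter` (which prefers to split on separators), the names `chunk_text` and `build_rows` are hypothetical, and the whitespace word count is only a stand-in for a real tokenizer's token count.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for a recursive splitter: it ignores separators
    and just slides a window of `chunk_size` characters over the text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def build_rows(text, chunk_size, chunk_overlap, num_chunks):
    """Build one table row per chunk, keeping only the first `num_chunks`."""
    rows = []
    chunks = chunk_text(text, chunk_size, chunk_overlap)[:num_chunks]
    for number, chunk in enumerate(chunks, start=1):
        rows.append({
            "Chunk #": number,
            "Text Chunk": chunk,
            "Character Count": len(chunk),
            # Assumption: whitespace-separated words approximate tokens.
            "Token Count": len(chunk.split()),
        })
    return rows


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    for row in build_rows(sample, chunk_size=20, chunk_overlap=5, num_chunks=3):
        print(row)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(rows)` to produce a table with the columns described in the last step.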
39
+ ## Contributing
40
+ Feedback and contributions are always welcome. Feel free to open issues or submit pull requests on the repository.
41
+
42
+ ## License
43
+ This tool is open source and available under the Apache-2.0 license.
44
+
45
+ ## Acknowledgments
46
+ Thanks to the developers of Gradio, pandas, and LangChain for providing the libraries that made this tool possible.