Guilherme Favaron committed on
Commit bbb1e4b · 1 Parent(s): 70c2a42

Add application file
Files changed (3)
  1. README.md +89 -4
  2. app.py +135 -0
  3. requirements.txt +8 -0
README.md CHANGED
@@ -1,8 +1,8 @@
   ---
   title: Token Tortoise
 - emoji:
 - colorFrom: green
 - colorTo: pink
   sdk: gradio
   sdk_version: 5.10.0
   app_file: app.py
@@ -11,4 +11,89 @@ license: mit
   short_description: Bulk Document Token Counter
   ---

 - Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
   ---
   title: Token Tortoise
 + emoji: 🐢
 + colorFrom: pink
 + colorTo: yellow
   sdk: gradio
   sdk_version: 5.10.0
   app_file: app.py

   short_description: Bulk Document Token Counter
   ---

 + # Token Tortoise - Bulk Document Token Counter
 +
 + A powerful and reliable tool for counting tokens across multiple document types simultaneously. Perfect for content creators, developers, and AI practitioners who need to manage token counts for large-scale text processing.
 +
 + ## Features
 +
 + - **Multi-Format Support**: Process multiple files simultaneously in various formats:
 +   - PDF (.pdf)
 +   - Microsoft Word (.docx)
 +   - PowerPoint (.pptx)
 +   - Excel (.xlsx, .xls)
 +   - CSV (.csv)
 +   - Text files (.txt)
 + - **Bulk Processing**: Upload multiple files at once for efficient token counting
 + - **Accurate Counting**: Uses the `tiktoken` encoder (cl100k_base) for precise token counting
 + - **Clear Results**: Get detailed token counts per file and the total count
 + - **User-Friendly Interface**: Clean, intuitive design with instant results
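The multi-format support listed above comes down to routing each upload by its file extension. A minimal, standard-library-only sketch of that routing (the map and function names here are illustrative, not the app's actual API; the real app dispatches to reader functions backed by these libraries):

```python
from pathlib import Path

# Illustrative extension-to-library map mirroring the supported formats.
SUPPORTED = {
    ".pdf": "PyPDF2",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xlsx": "pandas",
    ".xls": "pandas",
    ".csv": "pandas",
    ".txt": "built-in text reading",
}

def library_for(filename: str) -> str:
    """Return the library that would handle a file, or 'unsupported'."""
    # Lowercase the suffix so Report.PDF and report.pdf route identically.
    return SUPPORTED.get(Path(filename).suffix.lower(), "unsupported")

print(library_for("Report.PDF"))  # → PyPDF2
print(library_for("image.png"))   # → unsupported
```

Unrecognized extensions fall through to an "unsupported" result rather than raising, which matches the app's behavior of reporting unsupported files instead of failing the whole batch.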
 + ## Usage
 +
 + 1. Visit [Token Tortoise on Hugging Face](https://huggingface.co/spaces/guifav/token_tortoise)
 + 2. Click the "Upload Files" button or drag and drop your files
 + 3. View the token count results for each file and the total count
 +
 + ## Technical Details
 +
 + - **Token Encoding**: Uses OpenAI's `tiktoken` with cl100k_base encoding
 + - **Document Processing**:
 +   - PDFs: PyPDF2 for text extraction
 +   - Word: python-docx for .docx parsing
 +   - PowerPoint: python-pptx for .pptx parsing
 +   - Excel/CSV: pandas for structured data handling
 + ## Installation for Local Development
 +
 + ```bash
 + git clone https://huggingface.co/spaces/guifav/token_tortoise
 + cd token_tortoise
 + pip install -r requirements.txt
 + python app.py
 + ```
 +
 + ## Requirements
 +
 + ```
 + gradio
 + tiktoken
 + pandas
 + PyPDF2
 + python-docx
 + python-pptx
 + openpyxl
 + ```
 +
 + ## Limitations
 +
 + - Maximum file size: 100MB per file
 + - Text extraction quality depends on document formatting
 + - Some complex document formatting may affect token count accuracy
 +
 + ## Contributing
 +
 + Contributions are welcome! Please feel free to submit a Pull Request.
 +
 + ## License
 +
 + MIT License - see LICENSE file for details
 +
 + ## About
 +
 + Created by [Guilherme Favaron](https://www.guilhermefavaron.com.br)
 + Part of the [MindApps.ai](https://mindapps.ai) suite of AI tools
 +
 + ## Support
 +
 + For issues and feature requests, please visit:
 + [GitHub Issues](https://github.com/GuilhermeFavaron/token-tortoise/issues)
 +
 + Meet the developer: falecom_guilhermefavaron@googlegroups.com
 +
 + More information about AI & Business: www.guilhermefavaron.com.br
 +
 + 🐢 Token Tortoise: Count with confidence, process with precision.
 +
 + ---
app.py ADDED
@@ -0,0 +1,135 @@
 + import gradio as gr
 + import tiktoken
 + import pandas as pd
 + import PyPDF2
 + import docx
 + import pptx
 + import openpyxl  # used by pandas as the .xlsx engine
 + from pathlib import Path
 +
 + def get_encoding():
 +     return tiktoken.get_encoding("cl100k_base")
 +
 + def count_tokens_text(text):
 +     enc = get_encoding()
 +     return len(enc.encode(text))
 +
 + def read_pdf(file):
 +     # Extract text from every page of the PDF.
 +     text = ""
 +     pdf_reader = PyPDF2.PdfReader(file)
 +     for page in pdf_reader.pages:
 +         text += page.extract_text() + "\n"
 +     return text
 +
 + def read_docx(file):
 +     # Concatenate the text of every paragraph in the Word document.
 +     doc = docx.Document(file)
 +     text = ""
 +     for paragraph in doc.paragraphs:
 +         text += paragraph.text + "\n"
 +     return text
 +
 + def read_pptx(file):
 +     # Collect text from every shape on every slide.
 +     prs = pptx.Presentation(file)
 +     text = ""
 +     for slide in prs.slides:
 +         for shape in slide.shapes:
 +             if hasattr(shape, "text"):
 +                 text += shape.text + "\n"
 +     return text
 +
 + def read_excel(file):
 +     df = pd.read_excel(file)
 +     return df.to_string()
 +
 + def read_csv(file):
 +     df = pd.read_csv(file)
 +     return df.to_string()
 +
 + def process_files(files):
 +     results = []
 +     total_tokens = 0
 +
 +     for file in files:
 +         try:
 +             file_ext = Path(file.name).suffix.lower()
 +             file_name = Path(file.name).name
 +
 +             if file_ext == '.pdf':
 +                 text = read_pdf(file)
 +             elif file_ext == '.docx':
 +                 text = read_docx(file)
 +             elif file_ext == '.pptx':
 +                 text = read_pptx(file)
 +             elif file_ext in ['.xlsx', '.xls']:
 +                 text = read_excel(file)
 +             elif file_ext == '.csv':
 +                 text = read_csv(file)
 +             elif file_ext == '.txt':
 +                 # Read the temp file from disk; Gradio file objects expose a path via .name.
 +                 with open(file.name, encoding='utf-8', errors='ignore') as f:
 +                     text = f.read()
 +             else:
 +                 results.append(f"Unsupported file format: {file_name}")
 +                 continue
 +
 +             token_count = count_tokens_text(text)
 +             total_tokens += token_count
 +             results.append(f"File: {file_name} - Token count: {token_count:,}")
 +
 +         except Exception as e:
 +             results.append(f"Error processing {file.name}: {str(e)}")
 +
 +     # Add total tokens to the beginning of results
 +     if total_tokens > 0:
 +         results.insert(0, f"\nTotal tokens across all files: {total_tokens:,}\n")
 +         results.insert(1, "-" * 50)  # separator line
 +
 +     return "\n".join(results)
 +
 + # Custom CSS for Source Sans Pro font
 + custom_css = """
 + @import url('https://fonts.googleapis.com/css2?family=Source+Sans+Pro:wght@400;600&display=swap');
 +
 + body, .gradio-container {
 +     font-family: 'Source Sans Pro', sans-serif !important;
 + }
 +
 + .output-text {
 +     font-family: 'Source Sans Pro', monospace !important;
 +     font-size: 16px !important;
 +     line-height: 1.5 !important;
 + }
 + """
 +
 + # Create Gradio interface
 + with gr.Blocks(css=custom_css) as iface:
 +     gr.Markdown(
 +         """
 +         # 📚 Bulk Token Counter
 +         Upload multiple files (PDF, DOCX, PPTX, XLSX, CSV, TXT) to count their tokens.
 +         """
 +     )
 +
 +     with gr.Row():
 +         file_input = gr.File(
 +             file_count="multiple",
 +             label="Upload Files"
 +         )
 +
 +     with gr.Row():
 +         output = gr.Textbox(
 +             label="Results",
 +             lines=10,
 +             elem_classes=["output-text"]
 +         )
 +
 +     file_input.change(
 +         fn=process_files,
 +         inputs=[file_input],
 +         outputs=[output]
 +     )
 +
 + # Launch the app
 + if __name__ == "__main__":
 +     iface.launch()
requirements.txt ADDED
@@ -0,0 +1,8 @@
 + gradio
 + tiktoken
 + pandas
 + PyPDF2
 + python-docx
 + python-pptx
 + openpyxl
 + plotly