
Guilherme Favaron committed on Jan 7, 2025

Commit

bbb1e4b

1 Parent(s): 70c2a42

Add application file


Files changed (3)

README.md +89 -4
app.py +135 -0
requirements.txt +8 -0

README.md CHANGED

@@ -1,8 +1,8 @@
 ---
 title: Token Tortoise
-emoji: ⚡
-colorFrom: green
-colorTo: pink
+emoji: 🐢
+colorFrom: pink
+colorTo: yellow
 sdk: gradio
 sdk_version: 5.10.0
 app_file: app.py
@@ -11,4 +11,89 @@ license: mit
 short_description: Bulk Document Token Counter
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Token Tortoise - Bulk Document Token Counter
+A tool for counting tokens across several document types in a single upload. Intended for content creators, developers, and AI practitioners who need to track token counts when processing large volumes of text.
+## Features
+- **Multi-Format Support**: Process multiple files simultaneously in various formats:
+  - PDF (.pdf)
+  - Microsoft Word (.docx)
+  - PowerPoint (.pptx)
+  - Excel (.xlsx, .xls)
+  - CSV (.csv)
+  - Text files (.txt)
+- **Bulk Processing**: Upload multiple files at once for efficient token counting
+- **Accurate Counting**: Uses the `tiktoken` `cl100k_base` encoding, so counts are exact for models that tokenize with that encoding
+- **Clear Results**: Get detailed token counts per file and total count
+- **User-Friendly Interface**: Clean, intuitive design with instant results
+## Usage
+1. Visit [Token Tortoise on Hugging Face](https://huggingface.co/spaces/guifav/token_tortoise)
+2. Click the "Upload Files" button or drag and drop your files
+3. View the token count results for each file and the total count
+## Technical Details
+- **Token Encoding**: Uses OpenAI's `tiktoken` with cl100k_base encoding
+- **Document Processing**:
+  - PDFs: PyPDF2 for text extraction
+  - Word: python-docx for .docx parsing
+  - PowerPoint: python-pptx for .pptx parsing
+  - Excel/CSV: pandas for structured data handling
+## Installation for Local Development
+```bash
+git clone https://huggingface.co/spaces/guifav/token_tortoise
+cd token_tortoise
+pip install -r requirements.txt
+python app.py
+```
+## Requirements
+```
+gradio
+tiktoken
+pandas
+PyPDF2
+python-docx
+python-pptx
+openpyxl
+```
+## Limitations
+- Maximum file size: 100MB per file
+- Text extraction quality depends on document formatting
+- Some complex document formatting may affect token count accuracy
+## Contributing
+Contributions are welcome! Please feel free to submit a Pull Request.
+## License
+MIT License - see LICENSE file for details
+## About
+Created by [Guilherme Favaron](https://www.guilhermefavaron.com.br)
+Part of the [MindApps.ai](https://mindapps.ai) suite of AI tools
+## Support
+For issues and feature requests, please visit:
+[GitHub Issues](https://github.com/GuilhermeFavaron/token-tortoise/issues)
+Meet the developer: falecom_guilhermefavaron@googlegroups.com
+More information about AI & Business: www.guilhermefavaron.com.br
+🐢 Token Tortoise: Count with confidence, process with precision.
+---
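
Note on the Limitations section: the README states a 100MB per-file maximum, but app.py in this commit contains no explicit size check. A minimal sketch of such a pre-check follows, assuming the limit means 100 * 1024 * 1024 bytes. The names `MAX_FILE_BYTES` and `check_file_size` are hypothetical and not part of the commit.

```python
from pathlib import Path

# Assumed interpretation of the README's "100MB": 100 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 100 * 1024 * 1024


def check_file_size(path, limit=MAX_FILE_BYTES):
    """Return None if the file is within the limit, else an error string."""
    size = Path(path).stat().st_size
    if size > limit:
        name = Path(path).name
        return f"File too large: {name} ({size:,} bytes, limit {limit:,})"
    return None
```

A caller could run this before extracting text and append the returned message to the results list instead of processing the file.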

app.py ADDED

	@@ -0,0 +1,135 @@

+import gradio as gr
+import tiktoken
+import pandas as pd
+import PyPDF2
+import docx
+import pptx
+import openpyxl
+from pathlib import Path
+import csv
+import io
+def get_encoding():
+    return tiktoken.get_encoding("cl100k_base")
+def count_tokens_text(text):
+    enc = get_encoding()
+    return len(enc.encode(text))
+def read_pdf(file):
+    text = ""
+    pdf_reader = PyPDF2.PdfReader(file)
+    for page in pdf_reader.pages:
+        text += page.extract_text() + "\n"
+    return text
+def read_docx(file):
+    doc = docx.Document(file)
+    text = ""
+    for paragraph in doc.paragraphs:
+        text += paragraph.text + "\n"
+    return text
+def read_pptx(file):
+    prs = pptx.Presentation(file)
+    text = ""
+    for slide in prs.slides:
+        for shape in slide.shapes:
+            if hasattr(shape, "text"):
+                text += shape.text + "\n"
+    return text
+def read_excel(file):
+    df = pd.read_excel(file)
+    return df.to_string()
+def read_csv(file):
+    df = pd.read_csv(file)
+    return df.to_string()
+def process_files(files):
+    # The change event also fires when the upload is cleared, so files may be None
+    if not files:
+        return ""
+    results = []
+    total_tokens = 0
+    for file in files:
+        try:
+            file_ext = Path(file.name).suffix.lower()
+            file_name = Path(file.name).name
+            if file_ext == '.pdf':
+                text = read_pdf(file)
+            elif file_ext == '.docx':
+                text = read_docx(file)
+            elif file_ext == '.pptx':
+                text = read_pptx(file)
+            elif file_ext in ['.xlsx', '.xls']:
+                text = read_excel(file)
+            elif file_ext == '.csv':
+                text = read_csv(file)
+            elif file_ext == '.txt':
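+                # With gr.File's default type="filepath", each item is a path, not an open file handle
+                text = Path(file.name).read_text(encoding='utf-8')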
+            else:
+                results.append(f"Unsupported file format: {file_name}")
+                continue
+            token_count = count_tokens_text(text)
+            total_tokens += token_count
+            results.append(f"File: {file_name} - Token count: {token_count:,}")
+        except Exception as e:
+            results.append(f"Error processing {file.name}: {str(e)}")
+    # Add total tokens to the beginning of results
+    if total_tokens > 0:
+        results.insert(0, f"\nTotal tokens across all files: {total_tokens:,}\n")
+        results.insert(1, "-" * 50)  # Adding a separator line
+    return "\n".join(results)
+# Custom CSS for Source Sans Pro font
+custom_css = """
+@import url('https://fonts.googleapis.com/css2?family=Source+Sans+Pro:wght@400;600&display=swap');
+body, .gradio-container {
+    font-family: 'Source Sans Pro', sans-serif !important;
+}
+.output-text {
+    font-family: 'Source Sans Pro', monospace !important;
+    font-size: 16px !important;
+    line-height: 1.5 !important;
+}
+"""
+# Create Gradio interface
+with gr.Blocks(css=custom_css) as iface:
+    gr.Markdown(
+        """
+        # 📚 Bulk Token Counter
+        Upload multiple files (PDF, DOCX, PPTX, XLSX, CSV, TXT) to count their tokens.
+        """
+    )
+    with gr.Row():
+        file_input = gr.File(
+            file_count="multiple",
+            label="Upload Files"
+        )
+    with gr.Row():
+        output = gr.Textbox(
+            label="Results",
+            lines=10,
+            elem_classes=["output-text"]
+        )
+    file_input.change(
+        fn=process_files,
+        inputs=[file_input],
+        outputs=[output]
+    )
+# Launch the app
+if __name__ == "__main__":
+    iface.launch()
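
Note on `process_files`: the if/elif chain on `file_ext` could also be written as a lookup table from extension to reader function. The sketch below shows that shape under stated assumptions. `READERS`, `read_txt`, `whitespace_count`, and `summarize` are hypothetical names, and the counter is a whitespace-splitting stand-in rather than `tiktoken`, so the numbers it produces are not real token counts. The report strings mirror the ones used in app.py.

```python
from pathlib import Path


def read_txt(path):
    """Read a UTF-8 text file from a path."""
    return Path(path).read_text(encoding="utf-8")


def whitespace_count(text):
    """Stand-in counter (NOT tiktoken): counts whitespace-separated words."""
    return len(text.split())


# Hypothetical registry; in app.py the other entries would be
# read_pdf, read_docx, read_pptx, read_excel and read_csv.
READERS = {".txt": read_txt}


def summarize(paths, readers=READERS, count=whitespace_count):
    """Return (per-file report lines, total count) for the given paths."""
    lines, total = [], 0
    for path in paths:
        name = Path(path).name
        reader = readers.get(Path(path).suffix.lower())
        if reader is None:
            lines.append(f"Unsupported file format: {name}")
            continue
        try:
            n = count(reader(path))
        except Exception as e:  # mirror app.py: report the failure and keep going
            lines.append(f"Error processing {name}: {e}")
            continue
        total += n
        lines.append(f"File: {name} - Token count: {n:,}")
    return lines, total
```

Supporting a new format then means adding one entry to the table, and the counter can be swapped for `count_tokens_text` from app.py without touching the loop.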

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+gradio
+tiktoken
+pandas
+PyPDF2
+python-docx
+python-pptx
+openpyxl
+plotly
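
Note on requirements.txt: `plotly` is listed here but is not imported by app.py in this commit, and several pip names differ from their import names (`python-docx` is imported as `docx`, `python-pptx` as `pptx`). The sketch below checks which packages are importable in the current environment. The first eight entries come from this commit. The `xlrd` entry is an assumption added here, since pandas relies on it to read legacy `.xls` files and it is not in the list above. `PACKAGES` and `missing_packages` are hypothetical names.

```python
from importlib.util import find_spec

# pip name -> import name. The first eight entries come from requirements.txt.
# xlrd is an assumption: pandas uses it as the engine for legacy .xls files.
PACKAGES = {
    "gradio": "gradio",
    "tiktoken": "tiktoken",
    "pandas": "pandas",
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
    "plotly": "plotly",
    "xlrd": "xlrd",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names whose import module cannot be found."""
    return [pip for pip, module in packages.items() if find_spec(module) is None]
```

Calling `missing_packages()` after `pip install -r requirements.txt` gives a quick list of anything still absent before launching the app.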