Upload folder using huggingface_hub
- README_HF.md +39 -0
- requirements.txt +14 -58
README_HF.md
ADDED
@@ -0,0 +1,39 @@
+# Advanced Tokenizer System for LiMp
+
+## Overview
+Sophisticated multi-modal tokenization system with semantic awareness, mathematical processing, and fractal-based tokenization.
+
+## Key Features
+- **Multi-Modal Tokenization**: Traditional, semantic, mathematical, and fractal
+- **High Capacity Processing**: Handles unlimited character counts
+- **Intelligent Chunking**: Semantic-aware with context preservation
+- **Batch Processing**: High-performance parallel processing
+- **Training Data Generation**: Creates high-quality training datasets
+- **Mathematical AI**: Advanced mathematical expression processing
+
+## Quick Start
+```python
+import asyncio
+from advanced_tokenizer_system import AdvancedTokenizer, TokenizerConfig
+
+config = TokenizerConfig()
+tokenizer = AdvancedTokenizer(config)
+
+result = asyncio.run(tokenizer.tokenize("Hello world! x^2 + y^2 = z^2"))  # tokenize() is a coroutine
+print(f"Tokens: {result.total_tokens}")
+```
+
+## Files
+- `advanced_tokenizer_system.py` - Main tokenizer
+- `batch_processing_system.py` - Batch processing
+- `high_capacity_input_processor.py` - Large text processing
+- `intelligent_chunking_processor.py` - Smart chunking
+- `advanced_training_data_generator.py` - Training data
+- `matrix_training_data.jsonl` - Sample data
+
+## Test
+```bash
+python3 working_test.py
+```
+
+Ready for advanced AI tokenization!
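The README's "Intelligent Chunking" bullet mentions semantic-aware chunking with context preservation but does not show what that involves. As a rough illustration only (this is not the code in `intelligent_chunking_processor.py`; the function name `chunk_with_overlap` and its parameters are hypothetical), one simple way to preserve context is to split on sentence boundaries and repeat the last sentence of each chunk at the start of the next:

```python
import re


def chunk_with_overlap(text, max_chars=200, overlap_sentences=1):
    """Group sentences into chunks of roughly max_chars, repeating the last
    `overlap_sentences` sentences of each chunk at the start of the next one.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk before it grows past the limit.
            chunks.append(" ".join(current))
            # Carry trailing sentences forward so the next chunk keeps context.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap("A one. B two. C three. D four.", max_chars=15):
        print(chunk)
```

A chunk can slightly exceed `max_chars` when the carried-over sentence plus the next one is already longer than the limit; a real implementation would need a policy for that case and for sentences longer than a whole chunk.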
requirements.txt
CHANGED
@@ -1,58 +1,14 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-# Async HTTP and networking
-httpx==0.28.1 # Updated from >=0.24.0 - includes security fixes
-aiofiles==25.1.0 # Updated from >=23.2.1
-
-# Database connectivity
-asyncpg==0.30.0 # Updated from >=0.28.0
-psycopg2-binary==2.9.11 # Updated from >=2.9.0 - includes security patches
-
-# Data processing
-pandas==2.3.3 # Updated from >=2.0.0
-pydantic==2.12.0 # Updated from >=2.0.0 - includes validation improvements
-
-# Web framework (for API endpoints)
-fastapi==0.118.3 # Updated from >=0.100.0 - includes security fixes
-uvicorn==0.37.0 # Updated from >=0.23.0 - includes security updates
-
-# Utilities
-python-dateutil==2.9.0.post0 # Updated from >=2.8.0
-python-multipart==0.0.20 # Updated from >=0.0.6
-
-# Development and testing
-pytest==8.4.2 # Updated from >=7.4.0
-pytest-asyncio==1.2.0 # Updated from >=0.21.0
-black==25.9.0 # Updated from >=23.0.0
-flake8==7.3.0 # Updated from >=6.0.0
-
-# Graph/complex networks for emergent modules
-networkx==3.5 # Updated from >=3.1
-
-# Optional dependencies (install separately if needed)
-# sentence-transformers>=2.2.0
-# transformers>=4.30.0
-# torch>=2.0.0
-# faiss-cpu>=1.7.4
-# annoy>=1.17.0
-# hnswlib>=0.7.0
-
-# Numbskull integration - Advanced embedding pipeline
-# Install as editable package from local path
--e /home/kill/numbskull
-
-# Additional dependency for HTTP requests in dual orchestrator
-requests>=2.31.0
+numpy>=1.21.0
+torch>=1.9.0
+asyncio
+pathlib
+dataclasses
+typing
+datetime
+json
+hashlib
+re
+multiprocessing
+threading
+queue
+psutil