---
title: Telegram Analytics Dashboard
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
---

# Telegram JSON Indexer & Analyzer

A high-performance system for indexing, searching, and analyzing Telegram chat exports using SQLite FTS5 and advanced algorithms from a Data Structures course. Includes a full-featured **Web Dashboard** with **AI-powered search**.

```
╔════════════════════════════════════════════════════════════════════════════╗
║                           TELEGRAM CHAT ANALYZER                           ║
║                                                                            ║
║  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────────────────────┐  ║
║  │  JSON   │───▶│ INDEXER │───▶│ SQLite  │───▶│      WEB DASHBOARD      │  ║
║  │ Export  │    │  Bloom  │    │ + FTS5  │    │  ┌──────┬──────┬──────┐ │  ║
║  │         │    │ Filter  │    │         │    │  │Stats │Users │Chat  │ │  ║
║  └─────────┘    └─────────┘    └─────────┘    │  ├──────┼──────┼──────┤ │  ║
║                                               │  │Search│ AI   │Mod   │ │  ║
║                                               │  └──────┴──────┴──────┘ │  ║
║                                               └─────────────────────────┘  ║
╚════════════════════════════════════════════════════════════════════════════╝
```

## Features

### Core Features

- **Full-Text Search** - Fast search with Hebrew support using SQLite FTS5
- **Fuzzy Search** - Find messages even with typos using trigram similarity
- **Similar Message Detection** - LCS algorithm finds duplicates/reposts
- **Conversation Threads** - DFS/BFS traversal reconstructs reply chains
- **User Rankings** - O(log n) rank queries using AVL Rank Tree
- **Time Analytics** - Bucket Sort for efficient histograms
- **Top-K Queries** - Heap-based O(n log k) instead of O(n log n)
- **Percentiles** - O(n) median/percentiles using Selection algorithm

### Web Dashboard

- **Interactive Overview** - Charts, stats, activity graphs
- **User Leaderboard** - Rankings with detailed user profiles
- **Telegram-like Chat View** - Browse all messages like in Telegram
- **Advanced Search** - Full-text + fuzzy search with filters
- **AI-Powered Search** - Natural language queries (Hebrew/English)
- **Moderation Analytics** - Links, mentions, domains analysis
- **Database Updates** - Upload new JSON files via web UI

### AI Search (Free Providers)

- **Ollama** - Local LLM (recommended, 100% free)
- **Groq** - Free API tier available
- **Google Gemini** - Free API tier available

---

## Table of Contents

1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Web Dashboard](#web-dashboard)
4. [AI Search](#ai-search)
5. [Database Updates](#database-updates)
6. [Architecture](#architecture)
7. [Usage Guide](#usage-guide)
8. [Algorithms](#algorithms)
9. [API Reference](#api-reference)
10. [Examples](#examples)

---

## Installation

### Requirements

- Python 3.10 or higher
- No external packages required for core functionality

### Setup

```bash
# Clone or download the project
cd telegram

# Verify Python version
python --version  # Should be 3.10+

# Test the system
python algorithms.py  # Should print "ALL TESTS PASSED!"
```

### Optional: Semantic Search

For AI-powered semantic similarity search:

```bash
pip install numpy faiss-cpu sentence-transformers
```

---

## Quick Start

### Step 1: Export from Telegram

1. Open Telegram Desktop
2. Go to any chat/group
3. Click ⋮ → Export Chat History
4. Select JSON format
5. Save as `result.json`

### Step 2: Index Your Data

```bash
python indexer.py result.json --db telegram.db
```

### Step 3: Launch Web Dashboard

```bash
# Start the dashboard (recommended)
python dashboard.py

# Open in browser: http://localhost:5000
```

### Step 4: Search & Analyze (CLI)

```bash
# Search messages
python search.py "שלום"

# View statistics
python analyzer.py --stats

# Find similar messages
python analyzer.py --similar
```

---

## Web Dashboard

The web dashboard provides a complete visual interface for analyzing your Telegram data.

### Starting the Dashboard

```bash
python dashboard.py

# Or with custom port:
python dashboard.py --port 8080
```

### Dashboard Pages

```
┌──────────────────────────────────────────────────────────────────────┐
│                            WEB DASHBOARD                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  📈 Overview     │ Main statistics, charts, activity graphs          │
│                  │   - Total messages, users, links, media           │
│                  │   - Daily/hourly activity charts                  │
│                  │   - Top users leaderboard                         │
│                                                                      │
│  👥 Users        │ User leaderboard with detailed profiles           │
│                  │   - Ranking by message count                      │
│                  │   - User details modal (hourly activity)          │
│                  │   - Export users to CSV                           │
│                                                                      │
│  💬 Chat         │ Telegram-like message view                        │
│                  │   - Browse all messages chronologically           │
│                  │   - Filter by user, date, media type              │
│                  │   - Click message to view full thread             │
│                  │   - AI search with natural language               │
│                                                                      │
│  🔍 Search       │ Advanced search interface                         │
│                  │   - Full-text search (Hebrew supported)           │
│                  │   - AI-powered natural language search            │
│                  │   - Boolean operators (AND, OR, NOT)              │
│                  │   - Export search results                         │
│                                                                      │
│  🛡️ Moderation   │ Content analytics                                 │
│                  │   - Top shared domains                            │
│                  │   - Most mentioned users                          │
│                  │   - Link sharers leaderboard                      │
│                  │   - Word frequency analysis                       │
│                                                                      │
│  ⚙️ Settings     │ Database management                               │
│                  │   - View database statistics                      │
│                  │   - Upload new JSON files                         │
│                  │   - Automatic duplicate detection                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### Dashboard Features

- **Dark Theme** - Modern dark UI, easy on the eyes
- **RTL Support** - Full Hebrew/Arabic text support
- **Responsive** - Works on mobile and desktop
- **Real-time Charts** - Interactive Chart.js visualizations
- **Export** - Download data as CSV/JSON

---

## AI Search

Ask questions about your chat data in natural language (Hebrew or English).

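
Queries can also be sent programmatically once the dashboard is running and one of the providers described in this section is configured. The sketch below is illustrative only: it builds (but does not send) a request for the `POST /api/ai/search` endpoint with the `{"query": "..."}` body listed in the API Reference. The helper name `build_ai_search_request` is hypothetical, the base URL `http://localhost:5000` is taken from the Quick Start, and the shape of the response is not specified in this README.

```python
import json
import urllib.request


def build_ai_search_request(
    query: str, base_url: str = "http://localhost:5000"
) -> urllib.request.Request:
    """Build a POST request for /api/ai/search.

    Hypothetical helper for illustration; it is not part of the project.
    """
    # Encode as UTF-8 JSON so Hebrew queries are sent unchanged.
    body = json.dumps({"query": query}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ai/search",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_ai_search_request("מי שלח הכי הרבה הודעות?")
print(request.get_method(), request.full_url)

# Sending it requires a running dashboard and a configured AI provider:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```

Building the request separately from sending it keeps the example runnable without a server; uncomment the last lines to perform the actual call.
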
### Setup AI Provider (Free Options)

#### Option 1: Ollama (Recommended - 100% Local & Free)

```bash
# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start Ollama server
ollama serve
```

#### Option 2: Groq (Free API Tier)

```bash
# Get free API key from https://console.groq.com
export GROQ_API_KEY="your_api_key"
```

#### Option 3: Google Gemini (Free API Tier)

```bash
# Get free API key from https://makersuite.google.com/app/apikey
export GEMINI_API_KEY="your_api_key"
```

### AI Search Examples

```
┌──────────────────────────────────────────────────────────────────────┐
│  🤖 AI Search - Natural Language Queries                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Query:  "מי שלח הכי הרבה הודעות?"                                   │
│  Answer: המשתמש הפעיל ביותר הוא דני עם 5,432 הודעות                  │
│                                                                      │
│  Query:  "מתי היו הכי הרבה הודעות?"                                  │
│  Answer: היום הפעיל ביותר היה 15.03.2024 עם 342 הודעות               │
│                                                                      │
│  Query:  "Who mentioned @admin the most?"                            │
│  Answer: User "Mike" mentioned @admin 47 times                       │
│                                                                      │
│  Query:  "הראה הודעות עם קישורים מהשבוע האחרון"                      │
│  Answer: נמצאו 23 הודעות עם קישורים...                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### AI Search API

```python
from ai_search import AISearchEngine

# Initialize with Ollama (local)
ai = AISearchEngine('telegram.db', provider='ollama')

# Or with Groq
ai = AISearchEngine('telegram.db', provider='groq', api_key='your_key')

# Search
result = ai.search("מי הכי פעיל בלילה?")
print(result['answer'])   # Natural language answer
print(result['sql'])      # Generated SQL query
print(result['results'])  # Raw data
```

---

## Database Updates

Update your database with new JSON exports without losing existing data.

### Via Web UI

1. Go to the **Settings** page in the dashboard
2. Drag & drop your new `result.json` file
3. Wait for processing (duplicate detection is automatic)
4. See a summary of the new messages added

### Via CLI

```bash
# Update existing database with new JSON
python indexer.py new_export.json --db telegram.db --update

# What happens:
# 1. Loads existing message IDs into Bloom filter (O(n))
# 2. For each message in JSON:
#    - Check if exists using Bloom filter (O(1))
#    - Only insert if new
# 3. Re-index FTS if needed
# 4. Report: X new messages, Y duplicates skipped
```

### Incremental Update Process

```
┌──────────────────────────────────────────────────────────────────────┐
│                      INCREMENTAL UPDATE PROCESS                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Existing DB              New JSON                                  │
│   ┌─────────────┐          ┌─────────────┐                           │
│   │ msg_1  ✓    │          │ msg_1       │  → Skip (duplicate)       │
│   │ msg_2  ✓    │          │ msg_2       │  → Skip (duplicate)       │
│   │ msg_3  ✓    │          │ msg_5  NEW  │  → Insert                 │
│   │ msg_4  ✓    │          │ msg_6  NEW  │  → Insert                 │
│   └─────────────┘          └─────────────┘                           │
│          │                        │                                  │
│          │      Bloom Filter      │                                  │
│          │     ┌───────────┐      │                                  │
│          └────▶│ O(1) test │◀─────┘                                  │
│                └───────────┘                                         │
│                                                                      │
│   Result: Only msg_5 and msg_6 added (fast!)                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

---

## Architecture

### System Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                              INPUT                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Telegram JSON Export (result.json)                       │  │
│  │  ├── messages[]                                           │  │
│  │  │   ├── id, date, from, text                             │  │
│  │  │   ├── reply_to_message_id                              │  │
│  │  │   └── text_entities[] (links, mentions)                │  │
│  │  └── ...                                                  │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      INDEXER (indexer.py)                       │
│      ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│      │   Batch     │    │   Bloom     │    │   Reply     │      │
│      │ Processing  │    │   Filter    │    │   Graph     │      │
│      │  (1000/tx)  │    │(Dedup O(1)) │    │  Builder    │      │
│      └─────────────┘    └─────────────┘    └─────────────┘      │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         SQLite DATABASE                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ messages            │ FTS5 Index       │ reply_graph      │  │
│  │ ├── id (PK)         │ ├── text_plain   │ ├── parent_id    │  │
│  │ ├── text_plain      │ └── from_name    │ └── child_id     │  │
│  │ ├── from_id         │                  │                  │  │
│  │ ├── date_unixtime   │ entities         │ threads          │  │
│  │ └── ...             │ ├── links        │ └── messages     │  │
│  │                     │ └── mentions     │                  │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                ┌────────────────┼────────────────┐
                ▼                ▼                ▼
         ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
         │   SEARCH    │  │  ANALYZER   │  │   VECTOR    │
         │ (search.py) │  │(analyzer.py)│  │ (optional)  │
         │             │  │             │  │             │
         │ • FTS5+BM25 │  │ • Top-K     │  │ • FAISS     │
         │ • Fuzzy     │  │ • LCS       │  │ • Semantic  │
         │ • Threads   │  │ • Rank Tree │  │ • Clustering│
         │ • LRU Cache │  │ • Percentile│  │             │
         └─────────────┘  └─────────────┘  └─────────────┘
```

### Data Flow

```
JSON Message               Database Tables         Search/Analytics
───────────                ───────────────         ────────────────

{                          ┌─────────────┐
  "id": 548795,      ───▶  │  messages   │  ───▶  Full-text search
  "text": "שלום",          └─────────────┘        User filtering
  "from": "User1",                                Date range queries
  "from_id": "user123", ─▶ ┌─────────────┐
  "date_unixtime": ...,    │    users    │  ───▶  Top users (Heap)
                           └─────────────┘        User rank (Rank Tree)
  "text_entities": [
    {"type": "link", ────▶ ┌─────────────┐
     "text": "url"}        │  entities   │  ───▶  Link analysis
  ],                       └─────────────┘        Mention network

  "reply_to_message_id" ─▶ ┌─────────────┐
                           │ reply_graph │  ───▶  Thread DFS/BFS
}                          └─────────────┘        Conversation view
```

### File Structure

```
telegram/
│
├── dashboard.py            # 🌐 Web Dashboard (Flask)
│   ├── Routes: /, /users, /chat, /search, /moderation, /settings
│   └── API: /api/overview, /api/users, /api/search, /api/update, etc.
│
├── ai_search.py            # 🤖 AI-Powered Search
│   └── AISearchEngine class
│       ├── Natural language to SQL
│       ├── Ollama/Groq/Gemini providers
│       └── Hebrew/English support
│
├── indexer.py              # JSON → SQLite indexer
│   ├── OptimizedIndexer class
│   │   ├── Batch processing (100x faster)
│   │   ├── Bloom filter (duplicate detection)
│   │   └── Graph builder (reply threads)
│   └── IncrementalIndexer class
│       ├── Update existing database
│       ├── Bloom filter duplicate check
│       └── Only insert new messages
│
├── search.py               # Search interface
│   └── TelegramSearch class
│       ├── FTS5 full-text search
│       ├── Fuzzy trigram search
│       ├── LRU query cache
│       └── DFS/BFS thread traversal
│
├── analyzer.py             # Analytics & statistics
│   └── TelegramAnalyzer class
│       ├── LCS similar messages
│       ├── Heap-based Top-K
│       ├── Selection percentiles
│       ├── Rank Tree queries
│       └── Bucket Sort histograms
│
├── data_structures.py      # Core data structures
│   ├── BloomFilter         # O(1) membership test
│   ├── Trie                # O(k) prefix search
│   ├── LRUCache            # O(1) caching
│   ├── ReplyGraph          # DFS/BFS traversal
│   └── TrigramIndex        # Fuzzy matching
│
├── algorithms.py           # Course algorithms
│   ├── LCS                 # Similar message detection
│   ├── TopK (Heap)         # Efficient ranking
│   ├── Selection           # O(n) percentiles
│   ├── RankTree            # O(log n) rank queries
│   └── BucketSort          # Time histograms
│
├── templates/              # 🎨 HTML Templates
│   ├── index.html          # Overview dashboard
│   ├── users.html          # User leaderboard
│   ├── chat.html           # Telegram-like chat view
│   ├── search.html         # Search interface
│   ├── moderation.html     # Content analytics
│   └── settings.html       # Settings & DB update
│
├── static/                 # 📁 Static assets
│   ├── css/style.css       # Dashboard styles
│   └── js/dashboard.js     # Dashboard scripts
│
├── vector_search.py        # Optional: Semantic search
│   └── VectorSearch class (requires FAISS)
│
├── schema.sql              # Database schema
└── telegram.db             # SQLite database (created)
```

---

## Usage Guide

### Web Dashboard (Recommended)

```bash
# Start the dashboard
python dashboard.py

# Custom port
python dashboard.py --port 8080

# Custom database
python dashboard.py --db my_chat.db
```

### Indexing

```bash
# Basic indexing
python indexer.py result.json

# Custom database name
python indexer.py result.json --db my_chat.db

# With trigram index (for fuzzy search)
python indexer.py result.json --build-trigrams

# Larger batch size (faster for big files)
python indexer.py result.json --batch-size 5000

# Update existing database with new JSON (incremental)
python indexer.py new_export.json --db telegram.db --update
```

### Searching

```bash
# Basic search (Hebrew supported)
python search.py "שלום"

# Search with filters
python search.py "מילה" --user user123456 --limit 50

# Date range
python search.py "חדשות" --from-date 2024-01-01 --to-date 2024-12-31

# Fuzzy search (finds typos)
python search.py "שלמ" --fuzzy --threshold 0.3

# View conversation thread
python search.py --thread 548795

# List all links
python search.py --list-links

# List all mentions
python search.py --list-mentions
```

### Analytics

```bash
# General statistics
python analyzer.py --stats

# Top users (Heap-based O(n log k))
python analyzer.py --top-users --limit 10

# Hourly activity
python analyzer.py --hourly

# Daily activity
python analyzer.py --daily

# Top words
python analyzer.py --words --limit 30

# Top domains
python analyzer.py --domains

# Find similar messages (LCS algorithm)
python analyzer.py --similar --threshold 0.7

# Find reposts
python analyzer.py --reposts

# Message length percentiles (Selection algorithm)
python analyzer.py --percentiles

# Response time percentiles
python analyzer.py --response-times

# User rank (Rank Tree O(log n))
python analyzer.py --user-rank user123456

# Get user at rank #5
python analyzer.py --rank 5

# Activity histogram (Bucket Sort)
python analyzer.py --histogram --bucket-size 86400

# Export as JSON
python analyzer.py --stats --json > stats.json
```

---

## Algorithms

### Algorithm Complexity Comparison

```
┌────────────────────┬─────────────────┬─────────────────┬─────────────┐
│ Operation          │ Naive Method    │ Our Algorithm   │ Improvement │
├────────────────────┼─────────────────┼─────────────────┼─────────────┤
│ Top-K users        │ O(n log n) sort │ O(n log k) heap │ ~10x        │
│ Find median        │ O(n log n) sort │ O(n) selection  │ ~5x         │
│ User rank query    │ O(n) scan       │ O(log n) tree   │ ~100x       │
│ Duplicate check    │ O(n) lookup     │ O(1) bloom      │ ~1000x      │
│ Similar messages   │ O(n²m²) naive   │ O(n²m) LCS+DP   │ ~10x        │
│ Time histogram     │ O(n log n) sort │ O(n+k) bucket   │ ~5x         │
│ Thread traversal   │ O(n) repeated   │ O(V+E) DFS/BFS  │ ~10x        │
└────────────────────┴─────────────────┴─────────────────┴─────────────┘
```

### 1. LCS (Longest Common Subsequence)

**Purpose:** Find similar/duplicate messages

```
String 1: "שלום לכולם מה קורה"
String 2: "שלום לכולם מה נשמע"
              ↓
LCS:      "שלום לכולם מה "
Similarity: 77.78%
```

**Algorithm:**

```
    ┌───┬───┬───┬───┬───┬───┐
    │   │ ∅ │ A │ B │ C │ D │     DP Table
    ├───┼───┼───┼───┼───┼───┤
    │ ∅ │ 0 │ 0 │ 0 │ 0 │ 0 │     dp[i][j] = length of LCS
    │ A │ 0 │ 1 │ 1 │ 1 │ 1 │     for first i and j chars
    │ C │ 0 │ 1 │ 1 │ 2 │ 2 │
    │ B │ 0 │ 1 │ 2 │ 2 │ 2 │     Time:  O(m × n)
    │ D │ 0 │ 1 │ 2 │ 2 │ 3 │     Space: O(min(m,n))
    └───┴───┴───┴───┴───┴───┘
```

### 2. Heap-based Top-K

**Purpose:** Find top K items without sorting everything

```
Finding Top 3 from [5,2,8,1,9,3,7,4,6]

Min-Heap (size K=3):

Step 1: [5]         Add 5
Step 2: [2,5]       Add 2
Step 3: [2,5,8]     Add 8 (heap full)
Step 4: [2,5,8]     Skip 1 (< min)
Step 5: [5,9,8]     Replace 2 with 9
Step 6: [5,9,8]     Skip 3 (< min)
Step 7: [7,9,8]     Replace 5 with 7
...

Result: [7,8,9]     Top 3!

Time: O(n log k) vs O(n log n) for full sort
```

### 3. Selection Algorithm (Median of Medians)

**Purpose:** Find k-th element or percentiles in O(n)

```
Find median of [3,1,4,1,5,9,2,6,5,3,5]

┌─────────────────────────────────────────┐
│ Divide into groups of 5:                │
│ [3,1,4,1,5] [9,2,6,5,3] [5]             │
│      ↓           ↓       ↓              │
│ Medians: 3       5       5              │
│                  ↓                      │
│ Median of medians: 5 (pivot)            │
│                  ↓                      │
│ Partition around 5                      │
│ [3,1,4,1,2,3] [5,5,5] [9,6]             │
│  6 elements      3      2               │
│                  ↓                      │
│ Median is the 6th smallest of 11, so    │
│ recurse into the left part → result: 4  │
└─────────────────────────────────────────┘

Time: O(n) guaranteed (not just average!)
```

### 4. Rank Tree (Order Statistics Tree)

**Purpose:** O(log n) rank queries

```
AVL Tree with size augmentation:

                ┌───────────────┐
                │ 150 (size=5)  │
                └───────┬───────┘
               ┌────────┴────────┐
         ┌─────┴─────┐     ┌─────┴─────┐
         │ 100 (s=2) │     │ 250 (s=2) │
         └─────┬─────┘     └─────┬─────┘
         ┌─────┘                 └─────┐
     ┌───┴───┐                     ┌───┴───┐
     │50 (1) │                     │300 (1)│
     └───────┘                     └───────┘

select(3) → 150  (3rd smallest)
rank(150) → 3    (rank of 150)

Time: O(log n) for both operations
```

### 5. Bucket Sort (Time Histograms)

**Purpose:** O(n+k) time-based grouping

```
Messages with timestamps: [1000, 1500, 2500, 1200, 3000]
Bucket size: 1000 seconds

┌─────────┬─────────┬─────────┬─────────┐
│ 0-1000  │1000-2000│2000-3000│3000-4000│
├─────────┼─────────┼─────────┼─────────┤
│         │  1000   │  2500   │  3000   │
│         │  1500   │         │         │
│         │  1200   │         │         │
├─────────┼─────────┼─────────┼─────────┤
│ Count:0 │ Count:3 │ Count:1 │ Count:1 │
└─────────┴─────────┴─────────┴─────────┘

Time: O(n + k) where k = number of buckets
```

### 6. DFS/BFS Thread Traversal

**Purpose:** Reconstruct conversation threads

```
Reply Graph:

    [1] Original message
     │
     ├──[2] Reply to 1
     │    │
     │    ├──[4] Reply to 2
     │    │
     │    └──[5] Reply to 2
     │
     └──[3] Reply to 1

DFS order: [1, 2, 4, 5, 3]  (deep first)
BFS order: [1, 2, 3, 4, 5]  (level by level)

With depth info:
[1] depth=0
  [2] depth=1
    [4] depth=2
    [5] depth=2
  [3] depth=1

Time: O(V + E)
```

---

## API Reference

### Dashboard REST API

The web dashboard exposes a REST API for all operations:

```
┌──────────────────────────────────────────────────────────────────────────────────┐
│                                REST API ENDPOINTS                                │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  GET  /api/overview              Overview statistics                             │
│       ?timeframe=month           (today|yesterday|week|month|year|all)           │
│                                                                                  │
│  GET  /api/users                 User leaderboard                                │
│       ?timeframe=month           Timeframe filter                                │
│       &limit=100                 Max users                                       │
│                                                                                  │
│  GET  /api/user/                 User details                                    │
│       ?timeframe=month           Includes hourly activity                        │
│                                                                                  │
│  GET  /api/search                Full-text search                                │
│       ?q=search_term             Search query                                    │
│       &timeframe=all             Timeframe filter                                │
│       &limit=20&offset=0         Pagination                                      │
│                                                                                  │
│  POST /api/ai/search             AI-powered search                               │
│       {"query": "..."}           Natural language query                          │
│                                                                                  │
│  GET  /api/chat/messages         Chat messages                                   │
│       ?limit=50&offset=0         Pagination                                      │
│       &user_id=...               Filter by user                                  │
│       &from_date=...             Date range                                      │
│                                                                                  │
│  GET  /api/chat/thread/          Get conversation thread                         │
│                                  Returns full thread with DFS                    │
│                                                                                  │
│  GET  /api/top/domains           Top shared domains                              │
│  GET  /api/top/mentions          Top mentioned users                             │
│  GET  /api/top/words             Most frequent words                             │
│                                                                                  │
│  POST /api/update                Update database with JSON                       │
│       (multipart form)           File upload                                     │
│                                                                                  │
│  GET  /api/db/stats              Database statistics                             │
│                                  Size, counts, date range                        │
│                                                                                  │
│  GET  /api/export/users          Export users as CSV                             │
│  GET  /api/export/messages       Export messages as CSV                          │
│                                                                                  │
├──────────────────────────────────────────────────────────────────────────────────┤
│                           ALGORITHM-POWERED ENDPOINTS                            │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  GET  /api/similar/              Find similar messages (LCS algorithm)           │
│       ?threshold=0.7             Similarity threshold                            │
│       &limit=10                  Max results                                     │
│                                  Complexity: O(n*m) n=sample, m=avg length       │
│                                                                                  │
│  GET  /api/analytics/similar     Find all similar pairs in DB                    │
│       ?threshold=0.8             Similarity threshold                            │
│                                  Algorithm: LCS O(n² * m) with early termination │
│                                                                                  │
│  GET  /api/user/rank/            Get user rank (RankTree)                        │
│                                  Complexity: O(log n) vs O(n) SQL scan           │
│                                                                                  │
│  GET  /api/user/by-rank/         Get k-th ranked user (RankTree)                 │
│                                  Algorithm: select(k) O(log n)                   │
│                                                                                  │
│  GET  /api/analytics/histogram   Activity histogram (Bucket Sort)                │
│       ?bucket=86400              Bucket size in seconds                          │
│                                  Complexity: O(n + k) k=number of buckets        │
│                                                                                  │
│  GET  /api/analytics/percentiles Message length stats (Selection)                │
│                                  Algorithm: Selection, O(n) worst case           │
│                                  Returns: min,max,median,p25,p75,p90,p95,p99     │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘
```

### TelegramSearch

```python
from search import TelegramSearch

with TelegramSearch('telegram.db') as search:
    # Full-text search
    results = search.search("שלום", limit=50)

    # With filters
    results = search.search(
        "מילה",
        user_id="user123",
        from_date=1704067200,  # Unix timestamp
        to_date=1735689600,
        has_links=True
    )

    # Fuzzy search
    results = search.fuzzy_search("שלמ", threshold=0.3)

    # Get thread (DFS)
    thread = search.get_thread_dfs(message_id=548795)

    # Get thread with depth
    thread = search.get_thread_with_depth(message_id=548795)
    # Returns: [(message_dict, depth), ...]

    # Autocomplete usernames
    suggestions = search.autocomplete_user("@user")
```

### TelegramAnalyzer

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    # Statistics
    stats = analyzer.get_stats()

    # Top users (Heap-based)
    top_users = analyzer.get_top_users(limit=10)

    # Similar messages (LCS)
    similar = analyzer.find_similar_messages(threshold=0.7)

    # Percentiles (Selection algorithm)
    percentiles = analyzer.get_message_length_stats()
    # Returns: {min, max, median, p25, p75, p90, p95, p99}

    # User rank (Rank Tree)
    rank_info = analyzer.get_user_rank("user123")
    # Returns: {rank, total_users, percentile}

    # Get user by rank
    user = analyzer.get_user_by_rank(5)

    # Histogram (Bucket Sort)
    hist = analyzer.get_activity_histogram(bucket_size=86400)
```

---

## Examples

### Example 1: Find Most Active Hours

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    hourly = analyzer.get_hourly_activity()

    # Find peak hour
    peak_hour = max(hourly, key=hourly.get)
    print(f"Most active hour: {peak_hour}:00 ({hourly[peak_hour]} messages)")
```

### Example 2: Detect Spam/Reposts

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    reposts = analyzer.find_reposts(threshold=0.9)

    for r in reposts[:10]:
        print(f"Similarity: {r['similarity']:.0%}")
        print(f"  User 1: {r['user_1']}")
        print(f"  User 2: {r['user_2']}")
        print(f"  Text: {r['text_preview'][:50]}...")
```

### Example 3: Conversation Thread Analysis

```python
from search import TelegramSearch

with TelegramSearch('telegram.db') as search:
    # Get full thread
    thread = search.get_thread_with_depth(548795)

    print("Conversation thread:")
    for msg, depth in thread:
        indent = "  " * depth
        print(f"{indent}[{msg['from_name']}]: {msg['text_plain'][:50]}")
```

### Example 4: User Ranking

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    # Get rank of specific user
    rank = analyzer.get_user_rank("user123456")
    print(f"Rank: #{rank['rank']} of {rank['total_users']}")
    print(f"Top {rank['percentile']:.1f}%")

    # Get top 3 users
    for i in range(1, 4):
        user = analyzer.get_user_by_rank(i)
        print(f"#{i}: {user['name']} ({user['count']} messages)")
```

---

## Performance

Tested on 100,000 messages:

| Operation | Time |
|-----------|------|
| Indexing | ~10 seconds |
| Full-text search | <10ms |
| Fuzzy search | ~100ms |
| Top-K (k=20) | ~50ms |
| User rank query | <1ms |
| Thread traversal | <5ms |
| Similar messages (1000 sample) | ~2 seconds |

---

## License

MIT License - Free for personal and commercial use.

---

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit changes
4. Push and create a PR

---

## Troubleshooting

### "Module not found" error

```bash
# Make sure you're in the telegram directory
cd /path/to/telegram
python indexer.py result.json
```

### "Database is locked" error

```bash
# Close any other programs using the database
# Or use a different database name
python indexer.py result.json --db telegram2.db
```

### Hebrew text not displaying correctly

```bash
# Ensure your terminal supports UTF-8
export LANG=en_US.UTF-8
```

---

## Credits

Algorithms implemented from the "Data Structures and Introduction to Algorithms" course:

- LCS (Longest Common Subsequence)
- Heap-based Top-K
- Selection Algorithm (Median of Medians)
- Rank Tree (Order Statistics Tree)
- Bucket Sort
- DFS/BFS Graph Traversal
- Bloom Filter
- Trie (Prefix Tree)
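
To make the Bloom filter entry above concrete, here is a minimal sketch of how duplicate detection during an incremental update might work. It is not the project's `BloomFilter` class from `data_structures.py`: the class layout, the SHA-256 based hashing, the default sizes, and the `find_new_ids` helper are all assumptions made for illustration. Because a Bloom filter can report false positives, the sketch confirms every "maybe present" answer against an authoritative set (standing in for a database lookup), so a genuinely new message is never skipped.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter (assumed design, not the project's class)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: int):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def find_new_ids(existing_ids, incoming_ids):
    """Return incoming IDs that are not already stored (hypothetical helper)."""
    bloom = BloomFilter()
    for msg_id in existing_ids:
        bloom.add(msg_id)
    existing = set(existing_ids)  # stands in for a database lookup

    new_ids = []
    for msg_id in incoming_ids:
        # Cheap filter test first; confirm positives to rule out false positives.
        if bloom.might_contain(msg_id) and msg_id in existing:
            continue
        new_ids.append(msg_id)
    return new_ids


print(find_new_ids([1, 2, 3, 4], [1, 2, 5, 6]))  # [5, 6]
```

With the IDs from the Incremental Update Process diagram, only messages 5 and 6 are reported as new. In a real indexer the confirmation step would only run for the rare filter hits, so most new messages would be accepted after the filter test alone.
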