---
title: Telegram Analytics Dashboard
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
---
# Telegram JSON Indexer & Analyzer
A high-performance system for indexing, searching, and analyzing Telegram chat exports, built on SQLite FTS5 and classic algorithms from a Data Structures course. It includes a full-featured **Web Dashboard** with **AI-powered search**.
```
╔══════════════════════════════════════════════════════════════════════════════╗
║ TELEGRAM CHAT ANALYZER ║
║ ║
║ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────────────┐ ║
║ │ JSON │───▶│ INDEXER │───▶│ SQLite │───▶│ WEB DASHBOARD │ ║
║ │ Export │ │ Bloom │ │ + FTS5 │ │ ┌─────┬─────┬─────┐ │ ║
║ │ │ │ Filter │ │ │ │ │Stats│Users│Chat │ │ ║
║ └─────────┘ └─────────┘ └─────────┘ │ ├─────┼─────┼─────┤ │ ║
║ │ │Search│ AI │Mod │ │ ║
║ │ └─────┴─────┴─────┘ │ ║
║ └─────────────────────────┘ ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
## Features
### Core Features
- **Full-Text Search** - Fast search with Hebrew support using SQLite FTS5
- **Fuzzy Search** - Find messages even with typos using trigram similarity
- **Similar Message Detection** - LCS algorithm finds duplicates/reposts
- **Conversation Threads** - DFS/BFS traversal reconstructs reply chains
- **User Rankings** - O(log n) rank queries using AVL Rank Tree
- **Time Analytics** - Bucket Sort for efficient histograms
- **Top-K Queries** - Heap-based O(n log k) instead of O(n log n)
- **Percentiles** - O(n) median/percentiles using Selection algorithm
### Web Dashboard
- **Interactive Overview** - Charts, stats, activity graphs
- **User Leaderboard** - Rankings with detailed user profiles
- **Telegram-like Chat View** - Browse all messages like in Telegram
- **Advanced Search** - Full-text + fuzzy search with filters
- **AI-Powered Search** - Natural language queries (Hebrew/English)
- **Moderation Analytics** - Links, mentions, domains analysis
- **Database Updates** - Upload new JSON files via web UI
### AI Search (Free Providers)
- **Ollama** - Local LLM (recommended, 100% free)
- **Groq** - Free API tier available
- **Google Gemini** - Free API tier available
---
## Table of Contents
1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Web Dashboard](#web-dashboard)
4. [AI Search](#ai-search)
5. [Database Updates](#database-updates)
6. [Architecture](#architecture)
7. [Usage Guide](#usage-guide)
8. [Algorithms](#algorithms)
9. [API Reference](#api-reference)
10. [Examples](#examples)
---
## Installation
### Requirements
- Python 3.10 or higher
- No external packages are required for the core CLI tools (indexing, search, analytics); the web dashboard (`dashboard.py`) is built on Flask
### Setup
```bash
# Clone or download the project
cd telegram
# Verify Python version
python --version # Should be 3.10+
# Test the system
python algorithms.py # Should print "ALL TESTS PASSED!"
```
### Optional: Semantic Search
For AI-powered semantic similarity search:
```bash
pip install numpy faiss-cpu sentence-transformers
```
---
## Quick Start
### Step 1: Export from Telegram
1. Open Telegram Desktop
2. Go to any chat/group
3. Click ⋮ → Export Chat History
4. Select JSON format
5. Save as `result.json`
### Step 2: Index Your Data
```bash
python indexer.py result.json --db telegram.db
```
### Step 3: Launch Web Dashboard
```bash
# Start the dashboard (recommended)
python dashboard.py
# Open in browser: http://localhost:5000
```
### Step 4: Search & Analyze (CLI)
```bash
# Search messages
python search.py "שלום"
# View statistics
python analyzer.py --stats
# Find similar messages
python analyzer.py --similar
```
---
## Web Dashboard
The web dashboard provides a complete visual interface for analyzing your Telegram data.
### Starting the Dashboard
```bash
python dashboard.py
# Or with custom port:
python dashboard.py --port 8080
```
### Dashboard Pages
```
┌─────────────────────────────────────────────────────────────────────────┐
│ WEB DASHBOARD │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 📈 Overview │ Main statistics, charts, activity graphs │
│ │ - Total messages, users, links, media │
│ │ - Daily/hourly activity charts │
│ │ - Top users leaderboard │
│ │
│ 👥 Users │ User leaderboard with detailed profiles │
│ │ - Ranking by message count │
│ │ - User details modal (hourly activity) │
│ │ - Export users to CSV │
│ │
│ 💬 Chat │ Telegram-like message view │
│ │ - Browse all messages chronologically │
│ │ - Filter by user, date, media type │
│ │ - Click message to view full thread │
│ │ - AI search with natural language │
│ │
│ 🔍 Search │ Advanced search interface │
│ │ - Full-text search (Hebrew supported) │
│ │ - AI-powered natural language search │
│ │ - Boolean operators (AND, OR, NOT) │
│ │ - Export search results │
│ │
│ 🛡️ Moderation │ Content analytics │
│ │ - Top shared domains │
│ │ - Most mentioned users │
│ │ - Link sharers leaderboard │
│ │ - Word frequency analysis │
│ │
│ ⚙️ Settings │ Database management │
│ │ - View database statistics │
│ │ - Upload new JSON files │
│ │ - Automatic duplicate detection │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### Dashboard Features
- **Dark Theme** - Modern dark UI, easy on the eyes
- **RTL Support** - Full Hebrew/Arabic text support
- **Responsive** - Works on mobile and desktop
- **Real-time Charts** - Interactive Chart.js visualizations
- **Export** - Download data as CSV/JSON
---
## AI Search
Ask questions about your chat data in natural language (Hebrew or English).
### Setup AI Provider (Free Options)
#### Option 1: Ollama (Recommended - 100% Local & Free)
```bash
# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.2
# Start Ollama server
ollama serve
```
#### Option 2: Groq (Free API Tier)
```bash
# Get free API key from https://console.groq.com
export GROQ_API_KEY="your_api_key"
```
#### Option 3: Google Gemini (Free API Tier)
```bash
# Get free API key from https://makersuite.google.com/app/apikey
export GEMINI_API_KEY="your_api_key"
```
### AI Search Examples
```
┌─────────────────────────────────────────────────────────────────────────┐
│ 🤖 AI Search - Natural Language Queries │
├─────────────────────────────────────────────────────────────────────────┤
│ │
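│ Query: "מי שלח הכי הרבה הודעות?" (Who sent the most messages?) │
│ Answer: המשתמש הפעיל ביותר הוא דני עם 5,432 הודעות │
│ (The most active user is Danny, with 5,432 messages) │
│ │
│ Query: "מתי היו הכי הרבה הודעות?" (When were there the most messages?) │
│ Answer: היום הפעיל ביותר היה 15.03.2024 עם 342 הודעות │
│ (The most active day was 15.03.2024, with 342 messages) │
│ │
│ Query: "Who mentioned @admin the most?" │
│ Answer: User "Mike" mentioned @admin 47 times │
│ │
│ Query: "הראה הודעות עם קישורים מהשבוע האחרון" │
│ (Show messages with links from the last week) │
│ Answer: נמצאו 23 הודעות עם קישורים... (Found 23 messages with links...) │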
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### AI Search API
```python
from ai_search import AISearchEngine
# Initialize with Ollama (local)
ai = AISearchEngine('telegram.db', provider='ollama')
# Or with Groq
ai = AISearchEngine('telegram.db', provider='groq', api_key='your_key')
# Search
result = ai.search("מי הכי פעיל בלילה?")
print(result['answer']) # Natural language answer
print(result['sql']) # Generated SQL query
print(result['results']) # Raw data
```
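Because the engine returns generated SQL, it can be worth guarding how that SQL is executed. The following is a minimal sketch, not the project's actual implementation: `run_generated_sql` is a hypothetical helper, and the `messages` table with a `from_name` column is assumed only for illustration. It accepts a single `SELECT` statement and relies on SQLite's `PRAGMA query_only` to block writes while the query runs.

```python
import sqlite3


def run_generated_sql(conn: sqlite3.Connection, sql: str, max_rows: int = 100):
    """Run one model-generated SELECT statement without allowing writes.

    Hypothetical helper: the check is deliberately conservative and also
    rejects a ';' that appears inside a string literal.
    """
    statement = sql.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        raise ValueError("only a single SELECT statement is allowed")
    # query_only makes this connection reject any statement that writes.
    conn.execute("PRAGMA query_only = ON")
    try:
        return conn.execute(statement).fetchmany(max_rows)
    finally:
        conn.execute("PRAGMA query_only = OFF")
```

The prefix check alone is not a security boundary; the pragma is what actually prevents modification if an unexpected statement gets through.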
---
## Database Updates
Update your database with new JSON exports without losing existing data.
### Via Web UI
1. Go to the **Settings** page in the dashboard
2. Drag and drop your new `result.json` file
3. Wait for processing (duplicate detection is automatic)
4. Review the summary of newly added messages
### Via CLI
```bash
# Update existing database with new JSON
python indexer.py new_export.json --db telegram.db --update
# What happens:
# 1. Loads existing message IDs into Bloom filter (O(n))
# 2. For each message in JSON:
# - Check if exists using Bloom filter (O(1))
# - Only insert if new
# 3. Re-index FTS if needed
# 4. Report: X new messages, Y duplicates skipped
```
### Incremental Update Process
```
┌─────────────────────────────────────────────────────────────────────────┐
│ INCREMENTAL UPDATE PROCESS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Existing DB New JSON │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ msg_1 ✓ │ │ msg_1 │ → Skip (duplicate) │
│ │ msg_2 ✓ │ │ msg_2 │ → Skip (duplicate) │
│ │ msg_3 ✓ │ │ msg_5 NEW │ → Insert │
│ │ msg_4 ✓ │ │ msg_6 NEW │ → Insert │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ │ Bloom Filter │ │
│ │ ┌───────────┐ │ │
│ └─────▶│ O(1) test │◀─────────┘ │
│ └───────────┘ │
│ │
│ Result: Only msg_5 and msg_6 added (fast!) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
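The duplicate check above can be sketched as follows. This is an illustrative version, not the code in `data_structures.py`: the bit-array size, the hash count, and the `split_new_messages` helper are assumptions. A Bloom filter never misses an item it has seen, but it can report false positives, so this sketch confirms every hit against an exact set (standing in for a primary-key lookup) before skipping a message.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive all positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def split_new_messages(existing_ids, incoming_ids):
    """Return (new, skipped) message IDs for an incremental update."""
    exact = set(existing_ids)  # stand-in for a lookup on messages.id
    seen = BloomFilter()
    for msg_id in exact:
        seen.add(msg_id)
    new, skipped = [], []
    for msg_id in incoming_ids:
        # A Bloom miss is definitive; a hit is confirmed exactly.
        if msg_id in seen and msg_id in exact:
            skipped.append(msg_id)
        else:
            new.append(msg_id)
    return new, skipped
```

With the IDs from the diagram, `split_new_messages([1, 2, 3, 4], [1, 2, 5, 6])` separates messages 5 and 6 as new and 1 and 2 as duplicates.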
---
## Architecture
### System Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ INPUT │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Telegram JSON Export (result.json) │ │
│ │ ├── messages[] │ │
│ │ │ ├── id, date, from, text │ │
│ │ │ ├── reply_to_message_id │ │
│ │ │ └── text_entities[] (links, mentions) │ │
│ │ └── ... │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────┬───────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ INDEXER (indexer.py) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Batch │ │ Bloom │ │ Reply │ │
│ │ Processing │ │ Filter │ │ Graph │ │
│ │ (1000/tx) │ │ (Dedup O(1))│ │ Builder │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────┬───────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SQLite DATABASE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ messages │ FTS5 Index │ reply_graph │ │
│ │ ├── id (PK) │ ├── text_plain │ ├── parent_id │ │
│ │ ├── text_plain │ └── from_name │ └── child_id │ │
│ │ ├── from_id │ │ │ │
│ │ ├── date_unixtime │ entities │ threads │ │
│ │ └── ... │ ├── links │ └── messages │ │
│ │ │ └── mentions │ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────┬───────────────────────────────────────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SEARCH │ │ ANALYZER │ │ VECTOR │
│ (search.py) │ │(analyzer.py)│ │ (optional) │
│ │ │ │ │ │
│ • FTS5+BM25 │ │ • Top-K │ │ • FAISS │
│ • Fuzzy │ │ • LCS │ │ • Semantic │
│ • Threads │ │ • Rank Tree │ │ • Clustering│
│ • LRU Cache │ │ • Percentile│ │ │
└─────────────┘ └─────────────┘ └─────────────┘
```
### Data Flow
```
JSON Message Database Tables Search/Analytics
─────────── ─────────────── ────────────────
{ ┌─────────────┐
"id": 548795, ───▶ │ messages │ ───▶ Full-text search
"text": "שלום", └─────────────┘ User filtering
"from": "User1", Date range queries
"from_id": "user123", ─▶ ┌─────────────┐
"date_unixtime": ..., │ users │ ───▶ Top users (Heap)
└─────────────┘ User rank (Rank Tree)
"text_entities": [
{"type": "link", ────▶ ┌─────────────┐
"text": "url"} │ entities │ ───▶ Link analysis
], └─────────────┘ Mention network
"reply_to_message_id" ─▶ ┌─────────────┐
│ reply_graph │ ───▶ Thread DFS/BFS
} └─────────────┘ Conversation view
```
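As a rough illustration of the mapping above, the sketch below flattens one exported message into the fields the tables use. `flatten_message` is a hypothetical helper, not the indexer's real code. It assumes that `text` in a Telegram export may be either a plain string or a list mixing strings and entity objects, and that mentions carry the entity type `"mention"`.

```python
def flatten_message(msg: dict) -> dict:
    """Extract plain text, links, and mentions from one exported message."""
    text = msg.get("text", "")
    if isinstance(text, list):
        # Mixed list: plain strings and {"type": ..., "text": ...} objects.
        text = "".join(
            part if isinstance(part, str) else part.get("text", "")
            for part in text
        )
    entities = msg.get("text_entities", [])
    return {
        "id": msg["id"],
        "from_id": msg.get("from_id"),
        "text_plain": text,
        "links": [e["text"] for e in entities if e.get("type") == "link"],
        "mentions": [e["text"] for e in entities if e.get("type") == "mention"],
        "reply_to": msg.get("reply_to_message_id"),
    }
```

Each returned key corresponds to one destination in the diagram: `text_plain` for the `messages` table and FTS index, `links` and `mentions` for `entities`, and `reply_to` for `reply_graph`.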
### File Structure
```
telegram/
├── dashboard.py # 🌐 Web Dashboard (Flask)
│ └── Routes: /, /users, /chat, /search, /moderation, /settings
│ └── API: /api/overview, /api/users, /api/search, /api/update, etc.
├── ai_search.py # 🤖 AI-Powered Search
│ └── AISearchEngine class
│ ├── Natural language to SQL
│ ├── Ollama/Groq/Gemini providers
│ └── Hebrew/English support
├── indexer.py # JSON → SQLite indexer
│ ├── OptimizedIndexer class
│ │ ├── Batch processing (100x faster)
│ │ ├── Bloom filter (duplicate detection)
│ │ └── Graph builder (reply threads)
│ └── IncrementalIndexer class
│ ├── Update existing database
│ ├── Bloom filter duplicate check
│ └── Only insert new messages
├── search.py # Search interface
│ └── TelegramSearch class
│ ├── FTS5 full-text search
│ ├── Fuzzy trigram search
│ ├── LRU query cache
│ └── DFS/BFS thread traversal
├── analyzer.py # Analytics & statistics
│ └── TelegramAnalyzer class
│ ├── LCS similar messages
│ ├── Heap-based Top-K
│ ├── Selection percentiles
│ ├── Rank Tree queries
│ └── Bucket Sort histograms
├── data_structures.py # Core data structures
│ ├── BloomFilter # O(1) membership test
│ ├── Trie # O(k) prefix search
│ ├── LRUCache # O(1) caching
│ ├── ReplyGraph # DFS/BFS traversal
│ └── TrigramIndex # Fuzzy matching
├── algorithms.py # Course algorithms
│ ├── LCS # Similar message detection
│ ├── TopK (Heap) # Efficient ranking
│ ├── Selection # O(n) percentiles
│ ├── RankTree # O(log n) rank queries
│ └── BucketSort # Time histograms
├── templates/ # 🎨 HTML Templates
│ ├── index.html # Overview dashboard
│ ├── users.html # User leaderboard
│ ├── chat.html # Telegram-like chat view
│ ├── search.html # Search interface
│ ├── moderation.html # Content analytics
│ └── settings.html # Settings & DB update
├── static/ # 📁 Static assets
│ ├── css/style.css # Dashboard styles
│ └── js/dashboard.js # Dashboard scripts
├── vector_search.py # Optional: Semantic search
│ └── VectorSearch class (requires FAISS)
├── schema.sql # Database schema
└── telegram.db # SQLite database (created)
```
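The `LRUCache` listed under `data_structures.py` caches recent query results with O(1) reads and writes. A minimal sketch of that idea, assuming an `OrderedDict`-backed design (the actual class may be implemented differently, for example with a doubly linked list):

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with O(1) get and put."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the key as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Evict the least recently used entry (front of the dict).
            self._items.popitem(last=False)
```

Keyed on the query string plus its filters, such a cache lets a repeated search skip the database entirely.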
---
## Usage Guide
### Web Dashboard (Recommended)
```bash
# Start the dashboard
python dashboard.py
# Custom port
python dashboard.py --port 8080
# Custom database
python dashboard.py --db my_chat.db
```
### Indexing
```bash
# Basic indexing
python indexer.py result.json
# Custom database name
python indexer.py result.json --db my_chat.db
# With trigram index (for fuzzy search)
python indexer.py result.json --build-trigrams
# Larger batch size (faster for big files)
python indexer.py result.json --batch-size 5000
# Update existing database with new JSON (incremental)
python indexer.py new_export.json --db telegram.db --update
```
### Searching
```bash
# Basic search (Hebrew supported)
python search.py "שלום"
# Search with filters
python search.py "מילה" --user user123456 --limit 50
# Date range
python search.py "חדשות" --from-date 2024-01-01 --to-date 2024-12-31
# Fuzzy search (finds typos)
python search.py "שלמ" --fuzzy --threshold 0.3
# View conversation thread
python search.py --thread 548795
# List all links
python search.py --list-links
# List all mentions
python search.py --list-mentions
```
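The `--threshold` value in fuzzy search is a similarity cutoff. One common way to score trigram similarity is the Jaccard index over padded character trigrams, sketched below. This is an assumption about the scoring, not a description of how `TrigramIndex` computes it, and the padding scheme (two leading spaces, one trailing) is borrowed from common practice.

```python
def trigrams(text: str) -> set:
    """Character trigrams of the lowercased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard index of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def fuzzy_filter(query: str, candidates, threshold: float = 0.3):
    """Keep candidates whose similarity to the query meets the threshold."""
    return [c for c in candidates if trigram_similarity(query, c) >= threshold]
```

Under this scoring, short words share few trigrams, so a one-letter typo in a three- or four-letter word can fall below a threshold that a longer word would pass comfortably.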
### Analytics
```bash
# General statistics
python analyzer.py --stats
# Top users (Heap-based O(n log k))
python analyzer.py --top-users --limit 10
# Hourly activity
python analyzer.py --hourly
# Daily activity
python analyzer.py --daily
# Top words
python analyzer.py --words --limit 30
# Top domains
python analyzer.py --domains
# Find similar messages (LCS algorithm)
python analyzer.py --similar --threshold 0.7
# Find reposts
python analyzer.py --reposts
# Message length percentiles (Selection algorithm)
python analyzer.py --percentiles
# Response time percentiles
python analyzer.py --response-times
# User rank (Rank Tree O(log n))
python analyzer.py --user-rank user123456
# Get user at rank #5
python analyzer.py --rank 5
# Activity histogram (Bucket Sort)
python analyzer.py --histogram --bucket-size 86400
# Export as JSON
python analyzer.py --stats --json > stats.json
```
---
## Algorithms
### Algorithm Complexity Comparison
```
┌────────────────────┬─────────────────┬─────────────────┬─────────────┐
│ Operation │ Naive Method │ Our Algorithm │ Improvement │
├────────────────────┼─────────────────┼─────────────────┼─────────────┤
│ Top-K users │ O(n log n) sort │ O(n log k) heap │ ~10x │
│ Find median │ O(n log n) sort │ O(n) selection │ ~5x │
│ User rank query │ O(n) scan │ O(log n) tree │ ~100x │
│ Duplicate check │ O(n) lookup │ O(1) bloom │ ~1000x │
│ Similar messages │ O(n²m²) naive │ O(n²m) LCS+DP │ ~10x │
│ Time histogram │ O(n log n) sort │ O(n+k) bucket │ ~5x │
│ Thread traversal │ O(n) repeated │ O(V+E) DFS/BFS │ ~10x │
└────────────────────┴─────────────────┴─────────────────┴─────────────┘
```
### 1. LCS (Longest Common Subsequence)
**Purpose:** Find similar/duplicate messages
```
String 1: "שלום לכולם מה קורה"
String 2: "שלום לכולם מה נשמע"
LCS: "שלום לכולם מה "
Similarity: 77.78%
```
**Algorithm:**
```
┌───┬───┬───┬───┬───┬───┐
│ │ ∅ │ A │ B │ C │ D │ DP Table
├───┼───┼───┼───┼───┼───┤
│ ∅ │ 0 │ 0 │ 0 │ 0 │ 0 │ dp[i][j] = length of LCS
│ A │ 0 │ 1 │ 1 │ 1 │ 1 │ for first i and j chars
│ C │ 0 │ 1 │ 1 │ 2 │ 2 │
│ B │ 0 │ 1 │ 2 │ 2 │ 2 │ Time: O(m × n)
│ D │ 0 │ 1 │ 2 │ 2 │ 3 │ Space: O(min(m,n))
└───┴───┴───┴───┴───┴───┘
```
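The table above can be computed while keeping only two rows at a time, which is where the O(min(m, n)) space bound comes from. A minimal sketch follows; the similarity ratio (LCS length divided by the longer string's length) is an assumption that happens to reproduce the 77.78% figure in the example, and the project may normalize differently.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row dimension
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """LCS length relative to the longer string (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))
```

For the strings in the DP table, `lcs_length("ACBD", "ABCD")` is 3, matching the bottom-right cell.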
### 2. Heap-based Top-K
**Purpose:** Find top K items without sorting everything
```
Finding Top 3 from [5,2,8,1,9,3,7,4,6]
Min-Heap (size K=3):
Step 1: [5] Add 5
Step 2: [2,5] Add 2
Step 3: [2,5,8] Add 8 (heap full)
Step 4: [2,5,8] Skip 1 (< min)
Step 5: [5,9,8] Replace 2 with 9
Step 6: [5,9,8] Skip 3 (< min)
Step 7: [7,9,8] Replace 5 with 7
...
Result: [7,8,9] Top 3!
Time: O(n log k) vs O(n log n) for full sort
```
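The walkthrough above maps directly onto Python's `heapq` module. The sketch below is illustrative (`top_k` is a hypothetical name, not necessarily the function in `algorithms.py`): it keeps a min-heap of at most k items, so the heap root is always the weakest of the current top k.

```python
import heapq


def top_k(items, k: int) -> list:
    """Return the k largest items in descending order, in O(n log k)."""
    if k <= 0:
        return []
    heap = []
    for item in items:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new item beats the current minimum: swap it in.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

For ranking users, the items can be `(message_count, user_id)` tuples, which compare by count first.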
### 3. Selection Algorithm (Median of Medians)
**Purpose:** Find k-th element or percentiles in O(n)
```
Find median of [3,1,4,1,5,9,2,6,5,3,5]
┌─────────────────────────────────────────┐
│ Divide into groups of 5: │
│ [3,1,4,1,5] [9,2,6,5,3] [5] │
│ ↓ ↓ ↓ │
│ Medians: 3 5 5 │
│ ↓ │
│ Median of medians: 5 (pivot) │
│ ↓ │
│ Partition around 5 │
│ [3,1,4,1,2,3] [5,5,5] [9,6] │
│ 6 elements 3 2 │
│ ↓ │
│ Index 5 (0-based) < 6 → recurse into left part │
│ [3,1,4,1,2,3] → median is 4 │
└─────────────────────────────────────────┘
Time: O(n) guaranteed (not just average!)
```
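A compact sketch of the same procedure is shown below. It is illustrative rather than the project's implementation, and `select` is a hypothetical name. The function returns the element that would sit at 0-based index `k` in sorted order, choosing each pivot as the median of the group medians.

```python
def select(values, k: int):
    """Return the element at 0-based index k of sorted(values) in O(n)."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of 5, then the median of those medians.
        groups = [items[i:i + 5] for i in range(0, len(items), 5)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)
        lower = [x for x in items if x < pivot]
        upper = [x for x in items if x > pivot]
        num_equal = len(items) - len(lower) - len(upper)
        if k < len(lower):
            items = lower
        elif k < len(lower) + num_equal:
            return pivot
        else:
            k -= len(lower) + num_equal
            items = upper
```

For the 11-element list in the diagram, the median is the element at 0-based index 5, and `select([3,1,4,1,5,9,2,6,5,3,5], 5)` returns 4.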
### 4. Rank Tree (Order Statistics Tree)
**Purpose:** O(log n) rank queries
```
AVL Tree with size augmentation:
┌───────────────┐
│ 150 (size=5) │
└───────┬───────┘
┌────────┴────────┐
┌─────┴─────┐ ┌─────┴─────┐
│ 100 (s=2) │ │ 250 (s=2) │
└─────┬─────┘ └─────┬─────┘
┌─────┴ ┌──┴
┌───┴───┐ ┌───┴───┐
│50 (1) │ │300 (1)│
└───────┘ └───────┘
select(3) → 150 (3rd smallest)
rank(150) → 3 (rank of 150)
Time: O(log n) for both operations
```
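The size augmentation is what makes both queries possible. The sketch below shows only that part: it is a plain binary search tree with subtree sizes and no AVL rotations, so the O(log n) bound holds only while the tree happens to stay balanced. Function names are hypothetical and keys are assumed to be distinct.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree


def size(node) -> int:
    return node.size if node else 0


def insert(node, key):
    """Insert key and refresh subtree sizes on the way back up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def rank(node, key) -> int:
    """1-based rank of key: one plus the number of smaller keys."""
    smaller = 0
    while node:
        if key < node.key:
            node = node.left
        elif key > node.key:
            smaller += size(node.left) + 1
            node = node.right
        else:
            return smaller + size(node.left) + 1
    raise KeyError(key)


def select(node, k: int):
    """Return the k-th smallest key (1-based)."""
    while node:
        left = size(node.left)
        if k == left + 1:
            return node.key
        if k <= left:
            node = node.left
        else:
            k -= left + 1
            node = node.right
    raise IndexError(k)
```

Inserting 150, 100, 250, 50, 300 in that order builds the tree drawn above, with `select(root, 3)` returning 150 and `rank(root, 150)` returning 3.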
### 5. Bucket Sort (Time Histograms)
**Purpose:** O(n+k) time-based grouping
```
Messages with timestamps:
[1000, 1500, 2500, 1200, 3000]
Bucket size: 1000 seconds
┌─────────┬─────────┬─────────┬─────────┐
│ 0-1000 │1000-2000│2000-3000│3000-4000│
├─────────┼─────────┼─────────┼─────────┤
│ │ 1000 │ 2500 │ 3000 │
│ │ 1500 │ │ │
│ │ 1200 │ │ │
├─────────┼─────────┼─────────┼─────────┤
│ Count:0 │ Count:3 │ Count:1 │ Count:1 │
└─────────┴─────────┴─────────┴─────────┘
Time: O(n + k) where k = number of buckets
```
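The counting step can be sketched in a few lines. `histogram` and its `start` parameter are hypothetical; the sketch assumes integer timestamps no smaller than `start` and half-open buckets `[lo, lo + bucket_size)`, which is how the diagram places 1000 and 3000.

```python
def histogram(timestamps, bucket_size: int, start: int = 0) -> list:
    """Count timestamps per bucket; bucket i covers
    [start + i * bucket_size, start + (i + 1) * bucket_size)."""
    if not timestamps:
        return []
    num_buckets = (max(timestamps) - start) // bucket_size + 1
    counts = [0] * num_buckets
    for ts in timestamps:
        counts[(ts - start) // bucket_size] += 1
    return counts
```

With real Unix timestamps, passing the earliest timestamp (rounded down to a bucket boundary) as `start` avoids allocating empty buckets back to 1970.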
### 6. DFS/BFS Thread Traversal
**Purpose:** Reconstruct conversation threads
```
Reply Graph:
[1] Original message
├──[2] Reply to 1
│ │
│ ├──[4] Reply to 2
│ │
│ └──[5] Reply to 2
└──[3] Reply to 1
DFS order: [1, 2, 4, 5, 3] (deep first)
BFS order: [1, 2, 3, 4, 5] (level by level)
With depth info:
[1] depth=0
[2] depth=1
[4] depth=2
[5] depth=2
[3] depth=1
Time: O(V + E)
```
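Both traversals can be sketched over a parent-to-children map built from `(message_id, reply_to_message_id)` pairs. The function names here are hypothetical and are not the `ReplyGraph` API.

```python
from collections import deque


def build_children(replies) -> dict:
    """Map each message ID to the IDs that reply to it, in input order."""
    children = {}
    for msg_id, parent_id in replies:
        children.setdefault(msg_id, [])
        if parent_id is not None:
            children.setdefault(parent_id, []).append(msg_id)
    return children


def dfs_with_depth(children: dict, root) -> list:
    """Depth-first order as (message_id, depth) pairs."""
    order, stack = [], [(root, 0)]
    while stack:
        node, depth = stack.pop()
        order.append((node, depth))
        # Push in reverse so the first reply is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, depth + 1))
    return order


def bfs(children: dict, root) -> list:
    """Breadth-first order, level by level."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order
```

The iterative DFS avoids Python's recursion limit on very long reply chains.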
---
## API Reference
### Dashboard REST API
The web dashboard exposes a REST API for all operations:
```
┌─────────────────────────────────────────────────────────────────────────┐
│ REST API ENDPOINTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GET /api/overview Overview statistics │
│ ?timeframe=month (today|yesterday|week|month|year|all) │
│ │
│ GET /api/users User leaderboard │
│ ?timeframe=month Timeframe filter │
│ &limit=100 Max users │
│ │
│ GET /api/user/<user_id> User details │
│ ?timeframe=month Includes hourly activity │
│ │
│ GET /api/search Full-text search │
│ ?q=search_term Search query │
│ &timeframe=all Timeframe filter │
│ &limit=20&offset=0 Pagination │
│ │
│ POST /api/ai/search AI-powered search │
│ {"query": "..."} Natural language query │
│ │
│ GET /api/chat/messages Chat messages │
│ ?limit=50&offset=0 Pagination │
│ &user_id=... Filter by user │
│ &from_date=... Date range │
│ │
│ GET /api/chat/thread/<id> Get conversation thread │
│ Returns full thread with DFS │
│ │
│ GET /api/top/domains Top shared domains │
│ GET /api/top/mentions Top mentioned users │
│ GET /api/top/words Most frequent words │
│ │
│ POST /api/update Update database with JSON │
│ (multipart form) File upload │
│ │
│ GET /api/db/stats Database statistics │
│ Size, counts, date range │
│ │
│ GET /api/export/users Export users as CSV │
│ GET /api/export/messages Export messages as CSV │
│ │
├─────────────────────────────────────────────────────────────────────────┤
│ ALGORITHM-POWERED ENDPOINTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GET /api/similar/<id> Find similar messages (LCS algorithm) │
│ ?threshold=0.7 Similarity threshold │
│ &limit=10 Max results │
│ Complexity: O(n*m) n=sample, m=avg length │
│ │
│ GET /api/analytics/similar Find all similar pairs in DB │
│ ?threshold=0.8 Similarity threshold │
│ Algorithm: LCS O(n² * m) with early termination │
│ │
│ GET /api/user/rank/<id> Get user rank (RankTree) │
│ Complexity: O(log n) vs O(n) SQL scan │
│ │
│ GET /api/user/by-rank/<k> Get k-th ranked user (RankTree) │
│ Algorithm: select(k) O(log n) │
│ │
│ GET /api/analytics/histogram Activity histogram (Bucket Sort) │
│ ?bucket=86400 Bucket size in seconds │
│ Complexity: O(n + k) k=number of buckets │
│ │
│ GET /api/analytics/percentiles Message length stats (Selection) │
│ Algorithm: Median-of-medians selection, O(n) worst case │
│ Returns: min,max,median,p25,p75,p90,p95,p99 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
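A small client for these endpoints can be sketched with the standard library alone. Several things here are assumptions: the base URL is taken from the Quick Start (`http://localhost:5000`), the responses are assumed to be JSON, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5000"  # assumption: local dashboard address


def api_url(path: str, **params) -> str:
    """Build an endpoint URL, dropping parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def get_json(path: str, **params):
    """GET an endpoint and decode the (assumed) JSON response."""
    with urlopen(api_url(path, **params), timeout=10) as resp:
        return json.load(resp)


def ai_search(query: str):
    """POST a natural language query to /api/ai/search."""
    body = json.dumps({"query": query}).encode("utf-8")
    req = Request(
        api_url("/api/ai/search"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, `get_json("/api/search", q="hello", limit=20, offset=0)` requests the first page of full-text results, provided the dashboard is running.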
### TelegramSearch
```python
from search import TelegramSearch
with TelegramSearch('telegram.db') as search:
# Full-text search
results = search.search("שלום", limit=50)
# With filters
results = search.search(
"מילה",
user_id="user123",
from_date=1704067200, # Unix timestamp
to_date=1735689600,
has_links=True
)
# Fuzzy search
results = search.fuzzy_search("שלמ", threshold=0.3)
# Get thread (DFS)
thread = search.get_thread_dfs(message_id=548795)
# Get thread with depth
thread = search.get_thread_with_depth(message_id=548795)
# Returns: [(message_dict, depth), ...]
# Autocomplete usernames
suggestions = search.autocomplete_user("@user")
```
### TelegramAnalyzer
```python
from analyzer import TelegramAnalyzer
with TelegramAnalyzer('telegram.db') as analyzer:
# Statistics
stats = analyzer.get_stats()
# Top users (Heap-based)
top_users = analyzer.get_top_users(limit=10)
# Similar messages (LCS)
similar = analyzer.find_similar_messages(threshold=0.7)
# Percentiles (Selection algorithm)
percentiles = analyzer.get_message_length_stats()
# Returns: {min, max, median, p25, p75, p90, p95, p99}
# User rank (Rank Tree)
rank_info = analyzer.get_user_rank("user123")
# Returns: {rank, total_users, percentile}
# Get user by rank
user = analyzer.get_user_by_rank(5)
# Histogram (Bucket Sort)
hist = analyzer.get_activity_histogram(bucket_size=86400)
```
---
## Examples
### Example 1: Find Most Active Hours
```python
from analyzer import TelegramAnalyzer
with TelegramAnalyzer('telegram.db') as analyzer:
hourly = analyzer.get_hourly_activity()
# Find peak hour
peak_hour = max(hourly, key=hourly.get)
print(f"Most active hour: {peak_hour}:00 ({hourly[peak_hour]} messages)")
```
### Example 2: Detect Spam/Reposts
```python
from analyzer import TelegramAnalyzer
with TelegramAnalyzer('telegram.db') as analyzer:
reposts = analyzer.find_reposts(threshold=0.9)
for r in reposts[:10]:
print(f"Similarity: {r['similarity']:.0%}")
print(f" User 1: {r['user_1']}")
print(f" User 2: {r['user_2']}")
print(f" Text: {r['text_preview'][:50]}...")
```
### Example 3: Conversation Thread Analysis
```python
from search import TelegramSearch
with TelegramSearch('telegram.db') as search:
# Get full thread
thread = search.get_thread_with_depth(548795)
print("Conversation thread:")
for msg, depth in thread:
indent = " " * depth
print(f"{indent}[{msg['from_name']}]: {msg['text_plain'][:50]}")
```
### Example 4: User Ranking
```python
from analyzer import TelegramAnalyzer
with TelegramAnalyzer('telegram.db') as analyzer:
# Get rank of specific user
rank = analyzer.get_user_rank("user123456")
print(f"Rank: #{rank['rank']} of {rank['total_users']}")
print(f"Top {rank['percentile']:.1f}%")
# Get top 3 users
for i in range(1, 4):
user = analyzer.get_user_by_rank(i)
print(f"#{i}: {user['name']} ({user['count']} messages)")
```
---
## Performance
Tested on 100,000 messages:
| Operation | Time |
|-----------|------|
| Indexing | ~10 seconds |
| Full-text search | <10ms |
| Fuzzy search | ~100ms |
| Top-K (k=20) | ~50ms |
| User rank query | <1ms |
| Thread traversal | <5ms |
| Similar messages (1000 sample) | ~2 seconds |
---
## License
MIT License - Free for personal and commercial use.
---
## Contributing
1. Fork the repository
2. Create feature branch
3. Commit changes
4. Push and create PR
---
## Troubleshooting
### "Module not found" error
```bash
# Make sure you're in the telegram directory
cd /path/to/telegram
python indexer.py result.json
```
### "Database is locked" error
```bash
# Close any other programs using the database
# Or use a different database name
python indexer.py result.json --db telegram2.db
```
### Hebrew text not displaying correctly
```bash
# Ensure your terminal supports UTF-8
export LANG=en_US.UTF-8
```
---
## Credits
Algorithms implemented from the "Data Structures and Introduction to Algorithms" course:
- LCS (Longest Common Subsequence)
- Heap-based Top-K
- Selection Algorithm (Median of Medians)
- Rank Tree (Order Statistics Tree)
- Bucket Sort
- DFS/BFS Graph Traversal
- Bloom Filter
- Trie (Prefix Tree)