---

title: Telegram Analytics Dashboard
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
---


# Telegram JSON Indexer & Analyzer

A high-performance system for indexing, searching, and analyzing Telegram chat exports, built on SQLite FTS5 and algorithms from a Data Structures course. Includes a full-featured **Web Dashboard** with **AI-powered search**.

```
╔══════════════════════════════════════════════════════════════════════════════╗
║                         TELEGRAM CHAT ANALYZER                                ║
║                                                                               ║
║  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────────────────────┐    ║
║  │  JSON   │───▶│ INDEXER │───▶│ SQLite  │───▶│     WEB DASHBOARD       │    ║
║  │ Export  │    │ Bloom   │    │ + FTS5  │    │  ┌─────┬─────┬─────┐   │    ║
║  │         │    │ Filter  │    │         │    │  │Stats│Users│Chat │   │    ║
║  └─────────┘    └─────────┘    └─────────┘    │  ├─────┼─────┼─────┤   │    ║
║                                               │  │Search│ AI  │Mod  │   │    ║
║                                               │  └─────┴─────┴─────┘   │    ║
║                                               └─────────────────────────┘    ║
╚══════════════════════════════════════════════════════════════════════════════╝
```

## Features

### Core Features
- **Full-Text Search** - Fast search with Hebrew support using SQLite FTS5
- **Fuzzy Search** - Find messages even with typos using trigram similarity
- **Similar Message Detection** - LCS algorithm finds duplicates/reposts
- **Conversation Threads** - DFS/BFS traversal reconstructs reply chains
- **User Rankings** - O(log n) rank queries using AVL Rank Tree
- **Time Analytics** - Bucket Sort for efficient histograms
- **Top-K Queries** - Heap-based O(n log k) instead of O(n log n)
- **Percentiles** - O(n) median/percentiles using Selection algorithm

### Web Dashboard
- **Interactive Overview** - Charts, stats, activity graphs
- **User Leaderboard** - Rankings with detailed user profiles
- **Telegram-like Chat View** - Browse all messages like in Telegram
- **Advanced Search** - Full-text + fuzzy search with filters
- **AI-Powered Search** - Natural language queries (Hebrew/English)
- **Moderation Analytics** - Links, mentions, domains analysis
- **Database Updates** - Upload new JSON files via web UI

### AI Search (Free Providers)
- **Ollama** - Local LLM (recommended, 100% free)
- **Groq** - Free API tier available
- **Google Gemini** - Free API tier available

---

## Table of Contents

1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Web Dashboard](#web-dashboard)
4. [AI Search](#ai-search)
5. [Database Updates](#database-updates)
6. [Architecture](#architecture)
7. [Usage Guide](#usage-guide)
8. [Algorithms](#algorithms)
9. [API Reference](#api-reference)
10. [Examples](#examples)

---

## Installation

### Requirements

- Python 3.10 or higher
- No external packages required for core functionality (indexing, search, analytics)
- Flask for the web dashboard (`pip install flask`)

### Setup

```bash
# Clone or download the project
cd telegram

# Verify Python version
python --version  # Should be 3.10+

# Test the system
python algorithms.py  # Should print "ALL TESTS PASSED!"
```

### Optional: Semantic Search

For AI-powered semantic similarity search:

```bash
pip install numpy faiss-cpu sentence-transformers
```

---

## Quick Start

### Step 1: Export from Telegram

1. Open Telegram Desktop
2. Go to any chat/group
3. Click ⋮ → Export Chat History
4. Select JSON format
5. Save as `result.json`

### Step 2: Index Your Data

```bash
python indexer.py result.json --db telegram.db
```

### Step 3: Launch Web Dashboard

```bash
# Start the dashboard (recommended)
python dashboard.py

# Open in browser: http://localhost:5000
```

### Step 4: Search & Analyze (CLI)

```bash
# Search messages ("שלום" is Hebrew for "hello")
python search.py "שלום"

# View statistics
python analyzer.py --stats

# Find similar messages
python analyzer.py --similar
```

---

## Web Dashboard

The web dashboard provides a complete visual interface for analyzing your Telegram data.

### Starting the Dashboard

```bash
python dashboard.py
# Or with custom port:
python dashboard.py --port 8080
```

### Dashboard Pages

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           WEB DASHBOARD                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  📈 Overview      │  Main statistics, charts, activity graphs           │
│                   │  - Total messages, users, links, media              │
│                   │  - Daily/hourly activity charts                     │
│                   │  - Top users leaderboard                            │
│                                                                          │
│  👥 Users         │  User leaderboard with detailed profiles            │
│                   │  - Ranking by message count                         │
│                   │  - User details modal (hourly activity)             │
│                   │  - Export users to CSV                              │
│                                                                          │
│  💬 Chat          │  Telegram-like message view                         │
│                   │  - Browse all messages chronologically              │
│                   │  - Filter by user, date, media type                 │
│                   │  - Click message to view full thread                │
│                   │  - AI search with natural language                  │
│                                                                          │
│  🔍 Search        │  Advanced search interface                          │
│                   │  - Full-text search (Hebrew supported)              │
│                   │  - AI-powered natural language search               │
│                   │  - Boolean operators (AND, OR, NOT)                 │
│                   │  - Export search results                            │
│                                                                          │
│  🛡️ Moderation    │  Content analytics                                  │
│                   │  - Top shared domains                               │
│                   │  - Most mentioned users                             │
│                   │  - Link sharers leaderboard                         │
│                   │  - Word frequency analysis                          │
│                                                                          │
│  ⚙️ Settings      │  Database management                                │
│                   │  - View database statistics                         │
│                   │  - Upload new JSON files                            │
│                   │  - Automatic duplicate detection                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

### Dashboard Features

- **Dark Theme** - Modern dark UI, easy on the eyes
- **RTL Support** - Full Hebrew/Arabic text support
- **Responsive** - Works on mobile and desktop
- **Real-time Charts** - Interactive Chart.js visualizations
- **Export** - Download data as CSV/JSON

---

## AI Search

Ask questions about your chat data in natural language (Hebrew or English).

### Setup AI Provider (Free Options)

#### Option 1: Ollama (Recommended - 100% Local & Free)

```bash
# Install Ollama (https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start Ollama server
ollama serve
```

#### Option 2: Groq (Free API Tier)

```bash
# Get free API key from https://console.groq.com
export GROQ_API_KEY="your_api_key"
```

#### Option 3: Google Gemini (Free API Tier)

```bash
# Get free API key from https://makersuite.google.com/app/apikey
export GEMINI_API_KEY="your_api_key"
```

### AI Search Examples

```
┌─────────────────────────────────────────────────────────────────────────┐
│  🤖 AI Search - Natural Language Queries                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Query: "מי שלח הכי הרבה הודעות?"                                       │
│         (Who sent the most messages?)                                    │
│  Answer: המשתמש הפעיל ביותר הוא דני עם 5,432 הודעות                     │
│         (The most active user is Danny with 5,432 messages)              │
│                                                                          │
│  Query: "מתי היו הכי הרבה הודעות?"                                      │
│         (When were there the most messages?)                             │
│  Answer: היום הפעיל ביותר היה 15.03.2024 עם 342 הודעות                  │
│         (The most active day was 15.03.2024 with 342 messages)           │
│                                                                          │
│  Query: "Who mentioned @admin the most?"                                 │
│  Answer: User "Mike" mentioned @admin 47 times                           │
│                                                                          │
│  Query: "הראה הודעות עם קישורים מהשבוע האחרון"                          │
│         (Show messages with links from the last week)                    │
│  Answer: נמצאו 23 הודעות עם קישורים...                                  │
│         (Found 23 messages with links...)                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

### AI Search API

```python
from ai_search import AISearchEngine

# Initialize with Ollama (local)
ai = AISearchEngine('telegram.db', provider='ollama')

# Or with Groq
ai = AISearchEngine('telegram.db', provider='groq', api_key='your_key')

# Search ("Who is most active at night?")
result = ai.search("מי הכי פעיל בלילה?")
print(result['answer'])  # Natural language answer
print(result['sql'])     # Generated SQL query
print(result['results']) # Raw data
```

---

## Database Updates

Update your database with new JSON exports without losing existing data.

### Via Web UI

1. Go to the **Settings** page in the dashboard
2. Drag & drop your new `result.json` file
3. Wait for processing (duplicate detection is automatic)
4. Review the summary of new messages added

### Via CLI

```bash
# Update existing database with new JSON
python indexer.py new_export.json --db telegram.db --update

# What happens:
# 1. Loads existing message IDs into Bloom filter (O(n))
# 2. For each message in JSON:
#    - Check if exists using Bloom filter (O(1))
#    - Only insert if new
# 3. Re-index FTS if needed
# 4. Report: X new messages, Y duplicates skipped
```

### Incremental Update Process

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    INCREMENTAL UPDATE PROCESS                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Existing DB                    New JSON                                 │
│  ┌─────────────┐               ┌─────────────┐                          │
│  │ msg_1 ✓     │               │ msg_1       │ → Skip (duplicate)       │
│  │ msg_2 ✓     │               │ msg_2       │ → Skip (duplicate)       │
│  │ msg_3 ✓     │               │ msg_5  NEW  │ → Insert                 │
│  │ msg_4 ✓     │               │ msg_6  NEW  │ → Insert                 │
│  └─────────────┘               └─────────────┘                          │
│         │                             │                                  │
│         │      Bloom Filter           │                                  │
│         │      ┌───────────┐          │                                  │
│         └─────▶│ O(1) test │◀─────────┘                                  │
│                └───────────┘                                             │
│                                                                          │
│  Result: Only msg_5 and msg_6 added (fast!)                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```
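The update flow above can be sketched as follows. This is a minimal illustration, not the project's `BloomFilter` or `IncrementalIndexer`: the class layout, the sizing formulas, and the helper `split_new_messages` are assumptions. Because a Bloom filter can report false positives, the sketch confirms every "maybe present" answer against an exact lookup (a plain `set` standing in for the database) before skipping a message.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical; not the project's class)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m bits and k hash functions for n items.
        n = max(1, expected_items)
        self.size = max(8, int(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def split_new_messages(existing_ids, incoming_ids):
    """Return (new_ids, duplicate_ids) for an incremental update."""
    known = set(existing_ids)  # stand-in for an exact lookup in the database
    bloom = BloomFilter(expected_items=len(known))
    for message_id in known:
        bloom.add(message_id)

    new_ids, duplicates = [], []
    for message_id in incoming_ids:
        # "Not in filter" is definitive; "in filter" may be a false positive,
        # so it is confirmed against the exact set before skipping.
        if message_id in bloom and message_id in known:
            duplicates.append(message_id)
        else:
            new_ids.append(message_id)
    return new_ids, duplicates


new_ids, duplicates = split_new_messages([1, 2, 3, 4], [1, 2, 5, 6])
print(new_ids, duplicates)
```

The filter answers most "is this new?" checks without touching the database; only the rare positive needs the exact lookup.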

---

## Architecture

### System Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         INPUT                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Telegram JSON Export (result.json)                      │    │
│  │  ├── messages[]                                          │    │
│  │  │   ├── id, date, from, text                           │    │
│  │  │   ├── reply_to_message_id                            │    │
│  │  │   └── text_entities[] (links, mentions)              │    │
│  │  └── ...                                                 │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│                      INDEXER (indexer.py)                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   Batch     │  │   Bloom     │  │   Reply     │              │
│  │  Processing │  │   Filter    │  │   Graph     │              │
│  │  (1000/tx)  │  │ (Dedup O(1))│  │  Builder    │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└─────────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│                    SQLite DATABASE                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  messages          │  FTS5 Index      │  reply_graph    │    │
│  │  ├── id (PK)       │  ├── text_plain  │  ├── parent_id  │    │
│  │  ├── text_plain    │  └── from_name   │  └── child_id   │    │
│  │  ├── from_id       │                  │                 │    │
│  │  ├── date_unixtime │  entities        │  threads        │    │
│  │  └── ...           │  ├── links       │  └── messages   │    │
│  │                    │  └── mentions    │                 │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────┬───────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   SEARCH    │  │  ANALYZER   │  │   VECTOR    │
│ (search.py) │  │(analyzer.py)│  │  (optional) │
│             │  │             │  │             │
│ • FTS5+BM25 │  │ • Top-K     │  │ • FAISS     │
│ • Fuzzy     │  │ • LCS       │  │ • Semantic  │
│ • Threads   │  │ • Rank Tree │  │ • Clustering│
│ • LRU Cache │  │ • Percentile│  │             │
└─────────────┘  └─────────────┘  └─────────────┘
```
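To make the "FTS5+BM25" box concrete, here is a small self-contained sketch using Python's built-in `sqlite3` module. The table name `messages_fts` and the helper `fts_search` are hypothetical, and the real `schema.sql` may be laid out differently. It also assumes the local SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical minimal FTS5 table; column names follow the diagram above.
conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(text_plain, from_name)")

rows = [
    ("שלום לכולם מה קורה", "User1"),
    ("good morning everyone", "User2"),
    ("שלום וברכה", "User3"),
]
with conn:  # a single transaction for the whole batch
    conn.executemany(
        "INSERT INTO messages_fts (text_plain, from_name) VALUES (?, ?)", rows
    )


def fts_search(query, limit=10):
    """Return (from_name, text_plain) rows ordered by BM25 relevance."""
    sql = (
        "SELECT from_name, text_plain FROM messages_fts "
        "WHERE messages_fts MATCH ? "
        "ORDER BY bm25(messages_fts) LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


hits = fts_search("שלום")
for from_name, text in hits:
    print(from_name, text)
```

FTS5's `bm25()` returns lower values for better matches, so ascending order puts the most relevant rows first.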

### Data Flow

```
JSON Message                 Database Tables              Search/Analytics
───────────                  ───────────────              ────────────────

{                           ┌─────────────┐
  "id": 548795,       ───▶  │  messages   │  ───▶  Full-text search
  "text": "שלום",           └─────────────┘        User filtering
  "from": "User1",                                 Date range queries
  "from_id": "user123", ─▶  ┌─────────────┐
  "date_unixtime": ...,     │   users     │  ───▶  Top users (Heap)
                            └─────────────┘        User rank (Rank Tree)
  "text_entities": [
    {"type": "link", ────▶  ┌─────────────┐
     "text": "url"}         │  entities   │  ───▶  Link analysis
  ],                        └─────────────┘        Mention network

  "reply_to_message_id" ─▶  ┌─────────────┐
                            │ reply_graph │  ───▶  Thread DFS/BFS
}                           └─────────────┘        Conversation view
```
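The first arrow in this flow, from a JSON message to flat rows, might look roughly like the sketch below. The helper names `flatten_text` and `extract_rows` and the output keys are assumptions for illustration, not the indexer's actual code. The one detail worth noting is that Telegram exports store `text` either as a plain string or as a list mixing strings and entity objects, so it has to be flattened before indexing.

```python
def flatten_text(text):
    """Join a Telegram `text` field (string, or list of strings/dicts) into one string."""
    if isinstance(text, str):
        return text
    parts = []
    for part in text:
        parts.append(part if isinstance(part, str) else part.get("text", ""))
    return "".join(parts)


def extract_rows(export):
    """Turn an export dict into flat rows (hypothetical row layout)."""
    rows = []
    for msg in export.get("messages", []):
        if msg.get("type", "message") != "message":
            continue  # skip service entries such as joins or pins
        entities = msg.get("text_entities", [])
        rows.append({
            "id": msg["id"],
            "from_id": msg.get("from_id"),
            "text_plain": flatten_text(msg.get("text", "")),
            "links": [e["text"] for e in entities if e.get("type") == "link"],
            "reply_to": msg.get("reply_to_message_id"),
        })
    return rows


sample_export = {
    "messages": [
        {
            "id": 1, "type": "message", "from": "User1", "from_id": "user123",
            "text": ["see ", {"type": "link", "text": "https://example.com"}],
            "text_entities": [
                {"type": "plain", "text": "see "},
                {"type": "link", "text": "https://example.com"},
            ],
        },
        {"id": 2, "type": "service", "action": "pin_message"},
        {
            "id": 3, "type": "message", "from": "User2", "from_id": "user456",
            "text": "ok", "text_entities": [{"type": "plain", "text": "ok"}],
            "reply_to_message_id": 1,
        },
    ]
}

rows = extract_rows(sample_export)
print(rows)
```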

### File Structure

```
telegram/
│
├── dashboard.py        # 🌐 Web Dashboard (Flask)
│   ├── Routes: /, /users, /chat, /search, /moderation, /settings
│   └── API: /api/overview, /api/users, /api/search, /api/update, etc.
│
├── ai_search.py        # 🤖 AI-Powered Search
│   └── AISearchEngine class
│       ├── Natural language to SQL
│       ├── Ollama/Groq/Gemini providers
│       └── Hebrew/English support
│
├── indexer.py          # JSON → SQLite indexer
│   ├── OptimizedIndexer class
│   │   ├── Batch processing (100x faster)
│   │   ├── Bloom filter (duplicate detection)
│   │   └── Graph builder (reply threads)
│   └── IncrementalIndexer class
│       ├── Update existing database
│       ├── Bloom filter duplicate check
│       └── Only insert new messages
│
├── search.py           # Search interface
│   └── TelegramSearch class
│       ├── FTS5 full-text search
│       ├── Fuzzy trigram search
│       ├── LRU query cache
│       └── DFS/BFS thread traversal
│
├── analyzer.py         # Analytics & statistics
│   └── TelegramAnalyzer class
│       ├── LCS similar messages
│       ├── Heap-based Top-K
│       ├── Selection percentiles
│       ├── Rank Tree queries
│       └── Bucket Sort histograms
│
├── data_structures.py  # Core data structures
│   ├── BloomFilter     # O(1) membership test
│   ├── Trie            # O(k) prefix search
│   ├── LRUCache        # O(1) caching
│   ├── ReplyGraph      # DFS/BFS traversal
│   └── TrigramIndex    # Fuzzy matching
│
├── algorithms.py       # Course algorithms
│   ├── LCS             # Similar message detection
│   ├── TopK (Heap)     # Efficient ranking
│   ├── Selection       # O(n) percentiles
│   ├── RankTree        # O(log n) rank queries
│   └── BucketSort      # Time histograms
│
├── templates/          # 🎨 HTML Templates
│   ├── index.html      # Overview dashboard
│   ├── users.html      # User leaderboard
│   ├── chat.html       # Telegram-like chat view
│   ├── search.html     # Search interface
│   ├── moderation.html # Content analytics
│   └── settings.html   # Settings & DB update
│
├── static/             # 📁 Static assets
│   ├── css/style.css   # Dashboard styles
│   └── js/dashboard.js # Dashboard scripts
│
├── vector_search.py    # Optional: Semantic search
│   └── VectorSearch class (requires FAISS)
│
├── schema.sql          # Database schema
└── telegram.db         # SQLite database (created)
```

---

## Usage Guide

### Web Dashboard (Recommended)

```bash
# Start the dashboard
python dashboard.py

# Custom port
python dashboard.py --port 8080

# Custom database
python dashboard.py --db my_chat.db
```

### Indexing

```bash
# Basic indexing
python indexer.py result.json

# Custom database name
python indexer.py result.json --db my_chat.db

# With trigram index (for fuzzy search)
python indexer.py result.json --build-trigrams

# Larger batch size (faster for big files)
python indexer.py result.json --batch-size 5000

# Update existing database with new JSON (incremental)
python indexer.py new_export.json --db telegram.db --update
```

### Searching

```bash
# Basic search (Hebrew supported)
python search.py "שלום"

# Search with filters
python search.py "מילה" --user user123456 --limit 50

# Date range
python search.py "חדשות" --from-date 2024-01-01 --to-date 2024-12-31

# Fuzzy search (finds typos)
python search.py "שלמ" --fuzzy --threshold 0.3

# View conversation thread
python search.py --thread 548795

# List all links
python search.py --list-links

# List all mentions
python search.py --list-mentions
```
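The `--fuzzy --threshold` option relies on trigram similarity. One common way to define it is the Jaccard overlap of the two strings' character trigrams, sketched below. The padding scheme, the Jaccard formula, and the function names are assumptions; the project's `TrigramIndex` may score matches differently.

```python
def trigrams(text):
    """Set of character trigrams, padded so short words still produce some."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_filter(query, candidates, threshold=0.3):
    """Candidates scoring at or above the threshold, best match first."""
    scored = [(trigram_similarity(query, c), c) for c in candidates]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)


matches = fuzzy_filter("helo", ["hello", "world"])
print(matches)
```

A lower threshold tolerates more typos at the cost of more false matches, which is why it is exposed as a parameter.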

### Analytics

```bash
# General statistics
python analyzer.py --stats

# Top users (Heap-based O(n log k))
python analyzer.py --top-users --limit 10

# Hourly activity
python analyzer.py --hourly

# Daily activity
python analyzer.py --daily

# Top words
python analyzer.py --words --limit 30

# Top domains
python analyzer.py --domains

# Find similar messages (LCS algorithm)
python analyzer.py --similar --threshold 0.7

# Find reposts
python analyzer.py --reposts

# Message length percentiles (Selection algorithm)
python analyzer.py --percentiles

# Response time percentiles
python analyzer.py --response-times

# User rank (Rank Tree O(log n))
python analyzer.py --user-rank user123456

# Get user at rank #5
python analyzer.py --rank 5

# Activity histogram (Bucket Sort)
python analyzer.py --histogram --bucket-size 86400

# Export as JSON
python analyzer.py --stats --json > stats.json
```

---

## Algorithms

### Algorithm Complexity Comparison

```
┌────────────────────┬─────────────────┬─────────────────┬─────────────┐
│     Operation      │  Naive Method   │  Our Algorithm  │ Improvement │
├────────────────────┼─────────────────┼─────────────────┼─────────────┤
│ Top-K users        │ O(n log n) sort │ O(n log k) heap │   ~10x      │
│ Find median        │ O(n log n) sort │ O(n) selection  │   ~5x       │
│ User rank query    │ O(n) scan       │ O(log n) tree   │   ~100x     │
│ Duplicate check    │ O(n) lookup     │ O(1) bloom      │   ~1000x    │
│ Similar messages   │ O(n²·2^m) naive │ O(n²m²) LCS+DP  │ exp → poly  │
│ Time histogram     │ O(n log n) sort │ O(n+k) bucket   │   ~5x       │
│ Thread traversal   │ O(n) repeated   │ O(V+E) DFS/BFS  │   ~10x      │
└────────────────────┴─────────────────┴─────────────────┴─────────────┘
```

### 1. LCS (Longest Common Subsequence)

**Purpose:** Find similar/duplicate messages

```
String 1: "שלום לכולם מה קורה"
String 2: "שלום לכולם מה נשמע"
                          ↓
LCS:      "שלום לכולם מה "
Similarity: 77.78%  (14 of 18 characters)
```

**Algorithm:**
```
┌───┬───┬───┬───┬───┬───┐
│   │ ∅ │ A │ B │ C │ D │   DP Table
├───┼───┼───┼───┼───┼───┤
│ ∅ │ 0 │ 0 │ 0 │ 0 │ 0 │   dp[i][j] = length of LCS
│ A │ 0 │ 1 │ 1 │ 1 │ 1 │   for first i and j chars
│ C │ 0 │ 1 │ 1 │ 2 │ 2 │
│ B │ 0 │ 1 │ 2 │ 2 │ 2 │   Time:  O(m × n)
│ D │ 0 │ 1 │ 2 │ 2 │ 3 │   Space: O(min(m,n))
└───┴───┴───┴───┴───┴───┘
```
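The DP table can be computed one row at a time, which is where the O(min(m,n)) space bound comes from. The sketch below is illustrative rather than the code in `algorithms.py`. Dividing the LCS length by the longer string's length is an assumed normalization; it reproduces the 77.78% figure because both example strings have 18 characters.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, keeping only two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner dimension
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


s1 = "שלום לכולם מה קורה"
s2 = "שלום לכולם מה נשמע"
print(lcs_length("ACBD", "ABCD"))
print(f"{lcs_similarity(s1, s2):.2%}")
```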

### 2. Heap-based Top-K

**Purpose:** Find top K items without sorting everything

```
Finding Top 3 from [5,2,8,1,9,3,7,4,6]

Min-Heap (size K=3):

Step 1: [5]           Add 5
Step 2: [2,5]         Add 2
Step 3: [2,5,8]       Add 8 (heap full)
Step 4: [2,5,8]       Skip 1 (< min)
Step 5: [5,9,8]       Replace 2 with 9
Step 6: [5,9,8]       Skip 3 (< min)
Step 7: [7,9,8]       Replace 5 with 7
...
Result: [7,8,9]       Top 3!

Time: O(n log k) vs O(n log n) for full sort
```
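The same walk-through can be written with the standard-library `heapq` module. This is a sketch of the technique, not the project's `TopK` class; the function name `top_k` and its `key` parameter are illustrative.

```python
import heapq


def top_k(items, k, key=lambda x: x):
    """Return the k largest items (largest first) using a size-k min-heap."""
    if k <= 0:
        return []
    heap = []  # entries are (key, index, item); index breaks ties between equal keys
    for index, item in enumerate(items):
        entry = (key(item), index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New item beats the current minimum: swap it in, O(log k).
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in sorted(heap, reverse=True)]


print(top_k([5, 2, 8, 1, 9, 3, 7, 4, 6], 3))

# Ranking users by message count (made-up sample data):
counts = {"dana": 120, "mike": 47, "noa": 310, "tal": 5}
print(top_k(counts.items(), 2, key=lambda pair: pair[1]))
```

The heap never grows past k entries, so each of the n items costs at most O(log k).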

### 3. Selection Algorithm (Median of Medians)

**Purpose:** Find k-th element or percentiles in O(n)

```
Find median of [3,1,4,1,5,9,2,6,5,3,5]

┌─────────────────────────────────────────┐
│  Divide into groups of 5:               │
│  [3,1,4,1,5] [9,2,6,5,3] [5]            │
│       ↓           ↓        ↓            │
│  Medians: 3       5        5            │
│       ↓                                 │
│  Median of medians: 5 (pivot)           │
│       ↓                                 │
│  Partition around 5                     │
│  [3,1,4,1,2,3] [5,5,5] [9,6]            │
│       6 elements  3     2               │
│       ↓                                 │
│  Median = 6th smallest of 11 → in the   │
│  left part; recurse on [3,1,4,1,2,3]    │
│  for its 6th smallest → 4               │
└─────────────────────────────────────────┘

Time: O(n) guaranteed (not just average!)
```
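A compact sketch of median-of-medians selection follows. It is written for readability (it builds new lists rather than partitioning in place) and is not taken from `algorithms.py`; the function name and the 0-based `k` convention are assumptions.

```python
def select(values, k):
    """Return the k-th smallest element (0-based) in worst-case O(n) time."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of 5, then the median of those medians as pivot.
        groups = [items[i:i + 5] for i in range(0, len(items), 5)]
        medians = [sorted(group)[len(group) // 2] for group in groups]
        pivot = select(medians, len(medians) // 2)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        highs = [x for x in items if x > pivot]
        if k < len(lows):
            items = lows  # answer lies in the smaller-than-pivot part
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            items = highs


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
median = select(data, len(data) // 2)
print(median)
```

For the example list, the first pivot is 5, the search continues in the six smaller elements, and the median comes out as 4.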

### 4. Rank Tree (Order Statistics Tree)

**Purpose:** O(log n) rank queries

```
AVL Tree with size augmentation:

           ┌───────────────┐
           │  150 (size=5) │
           └───────┬───────┘
          ┌────────┴────────┐
    ┌─────┴─────┐     ┌─────┴─────┐
    │ 100 (s=2) │     │ 250 (s=2) │
    └─────┬─────┘     └─────┬─────┘
    ┌─────┘                 └─────┐
┌───┴───┐                     ┌───┴───┐
│50 (1) │                     │300 (1)│
└───────┘                     └───────┘

select(3) → 150  (3rd smallest)
rank(150) → 3    (rank of 150)

Time: O(log n) for both operations
```
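The size augmentation is what makes `select` and `rank` work, and it can be shown without the balancing code. The sketch below uses a plain (unbalanced) binary search tree to keep it short; the AVL rotations that guarantee O(log n) height are deliberately omitted, and none of these names come from the project's `RankTree`.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree


def size(node):
    return node.size if node else 0


def insert(node, key):
    """Insert key and refresh subtree sizes (no rebalancing in this sketch)."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    # Equal keys are ignored here for simplicity.
    node.size = 1 + size(node.left) + size(node.right)
    return node


def select(node, k):
    """Return the k-th smallest key (1-based)."""
    while node:
        left = size(node.left)
        if k == left + 1:
            return node.key
        if k <= left:
            node = node.left
        else:
            k -= left + 1
            node = node.right
    raise IndexError("k out of range")


def rank(node, key):
    """Return the 1-based rank of key among the stored keys."""
    smaller = 0
    while node:
        if key < node.key:
            node = node.left
        elif key > node.key:
            smaller += size(node.left) + 1
            node = node.right
        else:
            return smaller + size(node.left) + 1
    raise KeyError(key)


root = None
for key in [150, 100, 250, 50, 300]:
    root = insert(root, key)

print(select(root, 3), rank(root, 150))
```

Both operations follow a single root-to-leaf path, so their cost is proportional to the tree height; with AVL balancing that height is O(log n).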

### 5. Bucket Sort (Time Histograms)

**Purpose:** O(n+k) time-based grouping

```
Messages with timestamps:
[1000, 1500, 2500, 1200, 3000]

Bucket size: 1000 seconds

┌─────────┬─────────┬─────────┬─────────┐
│ 0-1000  │1000-2000│2000-3000│3000-4000│
├─────────┼─────────┼─────────┼─────────┤
│         │ 1000    │  2500   │  3000   │
│         │ 1500    │         │         │
│         │ 1200    │         │         │
├─────────┼─────────┼─────────┼─────────┤
│ Count:0 │ Count:3 │ Count:1 │ Count:1 │
└─────────┴─────────┴─────────┴─────────┘

Time: O(n + k) where k = number of buckets
```
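In code, the bucketing step is a single pass with integer division. This is a minimal sketch with an illustrative function name; each bucket is treated as the half-open interval [start, start + bucket_size), which is how the timestamp 1000 lands in the second bucket above. The optional `start` argument is an assumption added so the example can begin at 0 like the diagram; by default the first bucket is aligned to the earliest timestamp.

```python
def bucket_histogram(timestamps, bucket_size, start=None):
    """Count timestamps per fixed-width bucket; returns (bucket_start, count) pairs."""
    if not timestamps:
        return []
    if start is None:
        # Align the first bucket to the earliest timestamp.
        start = (min(timestamps) // bucket_size) * bucket_size
    num_buckets = (max(timestamps) - start) // bucket_size + 1
    counts = [0] * num_buckets
    for ts in timestamps:
        counts[(ts - start) // bucket_size] += 1  # O(1) per message
    return [(start + i * bucket_size, count) for i, count in enumerate(counts)]


histogram = bucket_histogram([1000, 1500, 2500, 1200, 3000], 1000, start=0)
print(histogram)
```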

### 6. DFS/BFS Thread Traversal

**Purpose:** Reconstruct conversation threads

```
Reply Graph:

    [1] Original message
     │
     ├──[2] Reply to 1
     │   │
     │   ├──[4] Reply to 2
     │   │
     │   └──[5] Reply to 2
     │
     └──[3] Reply to 1

DFS order: [1, 2, 4, 5, 3]  (deep first)
BFS order: [1, 2, 3, 4, 5]  (level by level)

With depth info:
  [1] depth=0
    [2] depth=1
      [4] depth=2
      [5] depth=2
    [3] depth=1

Time: O(V + E)
```
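Both traversals can be sketched over a simple parent-to-children map. The edge format `(child_id, parent_id)` and the function names below are assumptions for illustration; the project's `ReplyGraph` may store the graph differently.

```python
from collections import defaultdict, deque


def build_children(reply_edges):
    """Map each parent message ID to the IDs that reply to it."""
    children = defaultdict(list)
    for child_id, parent_id in reply_edges:
        children[parent_id].append(child_id)
    return children


def dfs_with_depth(children, root):
    """Depth-first order as (message_id, depth) pairs, using an explicit stack."""
    order, stack = [], [(root, 0)]
    while stack:
        node, depth = stack.pop()
        order.append((node, depth))
        # Push in reverse so the first reply is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, depth + 1))
    return order


def bfs(children, root):
    """Breadth-first order: the thread level by level."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


edges = [(2, 1), (3, 1), (4, 2), (5, 2)]  # the reply graph drawn above
children = build_children(edges)

for message_id, depth in dfs_with_depth(children, 1):
    print("  " * depth + f"[{message_id}] depth={depth}")
print(bfs(children, 1))
```

An explicit stack avoids Python's recursion limit on very long reply chains.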

---

## API Reference

### Dashboard REST API

The web dashboard exposes a REST API for all operations:

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         REST API ENDPOINTS                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GET  /api/overview           Overview statistics                        │
│       ?timeframe=month        (today|yesterday|week|month|year|all)      │
│                                                                          │
│  GET  /api/users              User leaderboard                           │
│       ?timeframe=month        Timeframe filter                           │
│       &limit=100              Max users                                  │
│                                                                          │
│  GET  /api/user/<user_id>     User details                               │
│       ?timeframe=month        Includes hourly activity                   │
│                                                                          │
│  GET  /api/search             Full-text search                           │
│       ?q=search_term          Search query                               │
│       &timeframe=all          Timeframe filter                           │
│       &limit=20&offset=0      Pagination                                 │
│                                                                          │
│  POST /api/ai/search          AI-powered search                          │
│       {"query": "..."}        Natural language query                     │
│                                                                          │
│  GET  /api/chat/messages      Chat messages                              │
│       ?limit=50&offset=0      Pagination                                 │
│       &user_id=...            Filter by user                             │
│       &from_date=...          Date range                                 │
│                                                                          │
│  GET  /api/chat/thread/<id>   Get conversation thread                    │
│                               Returns full thread with DFS               │
│                                                                          │
│  GET  /api/top/domains        Top shared domains                         │
│  GET  /api/top/mentions       Top mentioned users                        │
│  GET  /api/top/words          Most frequent words                        │
│                                                                          │
│  POST /api/update             Update database with JSON                  │
│       (multipart form)        File upload                                │
│                                                                          │
│  GET  /api/db/stats           Database statistics                        │
│                               Size, counts, date range                   │
│                                                                          │
│  GET  /api/export/users       Export users as CSV                        │
│  GET  /api/export/messages    Export messages as CSV                     │
│                                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                    ALGORITHM-POWERED ENDPOINTS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GET  /api/similar/<id>       Find similar messages (LCS algorithm)      │
│       ?threshold=0.7          Similarity threshold                       │
│       &limit=10               Max results                                │
│       Complexity: O(n*m²)     n=sample, m=avg length                     │
│                                                                          │
│  GET  /api/analytics/similar  Find all similar pairs in DB               │
│       ?threshold=0.8          Similarity threshold                       │
│       Algorithm: LCS          O(n² * m²) with early termination          │
│                                                                          │
│  GET  /api/user/rank/<id>     Get user rank (RankTree)                   │
│       Complexity: O(log n)    vs O(n) SQL scan                           │
│                                                                          │
│  GET  /api/user/by-rank/<k>   Get k-th ranked user (RankTree)            │
│       Algorithm: select(k)    O(log n)                                   │
│                                                                          │
│  GET  /api/analytics/histogram Activity histogram (Bucket Sort)          │
│       ?bucket=86400           Bucket size in seconds                     │
│       Complexity: O(n + k)    k=number of buckets                        │
│                                                                          │
│  GET  /api/analytics/percentiles Message length stats (Selection)        │
│       Algorithm: Selection    O(n) worst case (median of medians)        │
│       Returns: min,max,median,p25,p75,p90,p95,p99                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```
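Calling these endpoints from a script needs nothing beyond the standard library. The sketch below assumes the dashboard is running at `http://localhost:5000` (the Quick Start default) and that responses are JSON; the helper names `build_url` and `get_json` are illustrative, and the response structure is not documented here, so the example only constructs the request.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumed default from the Quick Start


def build_url(path, **params):
    """Build an endpoint URL, percent-encoding query values (including Hebrew)."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path, **params):
    """GET an endpoint and decode the body, assuming it returns JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)


search_url = build_url("/api/search", q="שלום", timeframe="all", limit=20, offset=0)
print(search_url)

# With the dashboard running, a search would then be:
# results = get_json("/api/search", q="שלום", timeframe="all", limit=20, offset=0)
```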

### TelegramSearch

```python
from search import TelegramSearch

with TelegramSearch('telegram.db') as search:
    # Full-text search
    results = search.search("שלום", limit=50)

    # With filters
    results = search.search(
        "מילה",
        user_id="user123",
        from_date=1704067200,  # Unix timestamp
        to_date=1735689600,
        has_links=True
    )

    # Fuzzy search
    results = search.fuzzy_search("שלמ", threshold=0.3)

    # Get thread (DFS)
    thread = search.get_thread_dfs(message_id=548795)

    # Get thread with depth
    thread = search.get_thread_with_depth(message_id=548795)
    # Returns: [(message_dict, depth), ...]

    # Autocomplete usernames
    suggestions = search.autocomplete_user("@user")
```

### TelegramAnalyzer

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    # Statistics
    stats = analyzer.get_stats()

    # Top users (Heap-based)
    top_users = analyzer.get_top_users(limit=10)

    # Similar messages (LCS)
    similar = analyzer.find_similar_messages(threshold=0.7)

    # Percentiles (Selection algorithm)
    percentiles = analyzer.get_message_length_stats()
    # Returns: {min, max, median, p25, p75, p90, p95, p99}

    # User rank (Rank Tree)
    rank_info = analyzer.get_user_rank("user123")
    # Returns: {rank, total_users, percentile}

    # Get user by rank
    user = analyzer.get_user_by_rank(5)

    # Histogram (Bucket Sort)
    hist = analyzer.get_activity_histogram(bucket_size=86400)
```

---

## Examples

### Example 1: Find Most Active Hours

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    hourly = analyzer.get_hourly_activity()

    # Find peak hour
    peak_hour = max(hourly, key=hourly.get)
    print(f"Most active hour: {peak_hour}:00 ({hourly[peak_hour]} messages)")
```

### Example 2: Detect Spam/Reposts

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    reposts = analyzer.find_reposts(threshold=0.9)

    for r in reposts[:10]:
        print(f"Similarity: {r['similarity']:.0%}")
        print(f"  User 1: {r['user_1']}")
        print(f"  User 2: {r['user_2']}")
        print(f"  Text: {r['text_preview'][:50]}...")
```

### Example 3: Conversation Thread Analysis

```python
from search import TelegramSearch

with TelegramSearch('telegram.db') as search:
    # Get full thread
    thread = search.get_thread_with_depth(548795)

    print("Conversation thread:")
    for msg, depth in thread:
        indent = "  " * depth
        print(f"{indent}[{msg['from_name']}]: {msg['text_plain'][:50]}")
```

### Example 4: User Ranking

```python
from analyzer import TelegramAnalyzer

with TelegramAnalyzer('telegram.db') as analyzer:
    # Get rank of specific user
    rank = analyzer.get_user_rank("user123456")
    print(f"Rank: #{rank['rank']} of {rank['total_users']}")
    print(f"Top {rank['percentile']:.1f}%")

    # Get top 3 users
    for i in range(1, 4):
        user = analyzer.get_user_by_rank(i)
        print(f"#{i}: {user['name']} ({user['count']} messages)")
```

---

## Performance

Tested on 100,000 messages:

| Operation | Time |
|-----------|------|
| Indexing | ~10 seconds |
| Full-text search | <10ms |
| Fuzzy search | ~100ms |
| Top-K (k=20) | ~50ms |
| User rank query | <1ms |
| Thread traversal | <5ms |
| Similar messages (1000 sample) | ~2 seconds |

---

## License

MIT License - Free for personal and commercial use.

---

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push the branch and open a pull request

---

## Troubleshooting

### "Module not found" error
```bash
# Make sure you're in the telegram directory
cd /path/to/telegram
python indexer.py result.json
```

### "Database is locked" error
```bash
# Close any other programs using the database
# Or use a different database name
python indexer.py result.json --db telegram2.db
```

### Hebrew text not displaying correctly
```bash
# Ensure your terminal supports UTF-8
export LANG=en_US.UTF-8
```

---

## Credits

Algorithms implemented from the "Data Structures and Introduction to Algorithms" course:
- LCS (Longest Common Subsequence)
- Heap-based Top-K
- Selection Algorithm (Median of Medians)
- Rank Tree (Order Statistics Tree)
- Bucket Sort
- DFS/BFS Graph Traversal
- Bloom Filter
- Trie (Prefix Tree)