
bsamadi

Update to pixi env

c032460 about 2 months ago


	# Course Project: FileOrganizer

	A CLI tool that uses local LLMs and AI agents to organize files intelligently, with a special focus on research paper management.

	---

	## Overview

	```
	┌────────────────────────────────────────────────────────────┐
	│                     FileOrganizer CLI                      │
	├────────────────────────────────────────────────────────────┤
	│  $ fileorg scan ~/Downloads                                │
	│  $ fileorg organize ~/Papers --strategy=by-topic           │
	│  $ fileorg deduplicate ~/Research --similarity=0.9         │
	└────────────────────────────────────────────────────────────┘
	```

	---

	## Architecture

	```
	Files ──► Content Analysis ──► AI Classification ──► Organized Structure
	               │                        │
	               ▼                        ▼
	        PDF Extraction         Docker Model Runner
	        Metadata Tools             (Local LLM)
	               │                        │
	               └────────►MCP◄───────────┘
	```

	### Data Flow

	```
	┌──────────────┐     ┌──────────────┐     ┌──────────────┐
	│  Files/PDFs  │────►│   Content    │────►│  MCP Server  │
	│   (Input)    │     │  Extraction  │     │   (Tools)    │
	└──────────────┘     └──────────────┘     └──────┬───────┘
	                                                 │
	                                                 ▼
	┌──────────────┐     ┌──────────────┐     ┌──────────────┐
	│  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   │
	│  Structure   │     │   (CrewAI)   │     │   (Docker)   │
	└──────────────┘     └──────────────┘     └──────────────┘
	```

	---

	## Agent System

	| Agent | Role | Tools | Output |
	|-------|------|-------|--------|
	| Scanner Agent | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
	| Classifier Agent | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
	| Organizer Agent | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
	| Deduplicator Agent | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |

	### Agent Workflow

	```
	User Request: "Organize research papers by topic"
	           │
	           ▼
	┌─────────────────────┐
	│    Scanner Agent    │
	│  "What files do we  │
	│   have and what     │
	│   are they about?"  │
	└──────────┬──────────┘
	           │ File Inventory
	           ▼
	┌─────────────────────┐
	│  Classifier Agent   │
	│  "What topics and   │
	│   categories emerge │
	│   from the content?"│
	└──────────┬──────────┘
	           │ Categories
	           ▼
	┌─────────────────────┐
	│   Organizer Agent   │
	│  "Create folder     │
	│   structure and     │
	│   move files"       │
	└──────────┬──────────┘
	           │ Organization Plan
	           ▼
	┌─────────────────────┐
	│  Deduplicator Agent │
	│  "Find and handle   │
	│   duplicate files"  │
	└──────────┬──────────┘
	           │
	           ▼
	  Organized Directory
	```
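
The hand-off between the four stages can be sketched in plain Python, independent of any agent framework. This only illustrates the data flow and is not CrewAI's API: every function name is hypothetical, and a file-extension rule stands in for the LLM-based classifier.

```python
def scanner(names):
    """Stage 1: build the file inventory (here, just a sorted list of names)."""
    return sorted(names)


def classifier(inventory):
    """Stage 2: assign one category per file. An extension rule stands in for the LLM."""
    return {
        name: "papers" if name.lower().endswith(".pdf") else "other"
        for name in inventory
    }


def organizer(categories):
    """Stage 3: turn categories into a plan of (source, destination) pairs."""
    return [(name, f"{category}/{name}") for name, category in categories.items()]


def deduplicator(plan):
    """Stage 4: drop entries whose destination collides, ignoring letter case."""
    seen = set()
    unique = []
    for source, destination in plan:
        key = destination.lower()
        if key not in seen:
            seen.add(key)
            unique.append((source, destination))
    return unique


def run_pipeline(names):
    """Chain the stages in the order shown in the workflow diagram."""
    return deduplicator(organizer(classifier(scanner(names))))
```

Each stage consumes only the previous stage's output, which is what lets the stages be tested in isolation before real agents replace them.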

	---

	## CLI Commands

	### `fileorg scan`

	Scan a directory and analyze its contents.

	```bash
	# Scan a directory
	fileorg scan ~/Downloads

	# Scan with detailed analysis
	fileorg scan ~/Papers --analyze-content

	# Scan and export inventory
	fileorg scan ~/Research --export inventory.json

	# Scan specific file types
	fileorg scan ~/Documents --types pdf,docx,txt
	```

	Options:

	| Flag | Description | Default |
	|------|-------------|---------|
	| `--analyze-content` | Extract and analyze file contents | `false` |
	| `--export` | Export inventory to JSON/CSV | None |
	| `--types` | Comma-separated file extensions to scan | All |
	| `--recursive` | Scan subdirectories | `true` |
	| `--max-depth` | Maximum directory depth | `10` |
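
How `--types`, `--recursive`, and `--max-depth` might interact can be sketched with the standard library. The function name `discover` and the convention that the scanned directory itself has depth 0 are assumptions made for this example, not documented behavior of `fileorg`.

```python
import os
from pathlib import Path


def discover(root, types=None, recursive=True, max_depth=10):
    """Return files under root, filtered by extension and limited by depth."""
    root = Path(root)
    # Normalize extensions so "pdf", ".pdf", and "PDF" all match.
    wanted = {t.lower().lstrip(".") for t in types} if types else None
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)  # root is depth 0
        if not recursive or depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk does not descend
        for name in filenames:
            extension = Path(name).suffix.lower().lstrip(".")
            if wanted is None or extension in wanted:
                found.append(Path(dirpath) / name)
    return sorted(found)
```

Pruning `dirnames` in place stops the walk early, so a shallow scan never touches deep subtrees.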

	### `fileorg organize`

	Organize files using AI-powered strategies.

	```bash
	# Organize by topic (AI-powered)
	fileorg organize ~/Papers --strategy=by-topic

	# Organize by date
	fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"

	# Organize with custom naming
	fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"

	# Dry run to preview changes
	fileorg organize ~/Downloads --dry-run

	# Interactive mode
	fileorg organize ~/Research --interactive
	```

	Options:

	| Flag | Description | Default |
	|------|-------------|---------|
	| `--strategy` | Organization strategy: `by-topic`, `by-date`, `by-type`, `by-author`, `smart` | `smart` |
	| `--rename` | Rename files intelligently | `false` |
	| `--pattern` | Naming pattern for renamed files | `{original}` |
	| `--dry-run` | Preview changes without executing | `false` |
	| `--interactive` | Confirm each action | `false` |
	| `--output` | Output directory | Same as input |
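
A common way to support `--dry-run` is to compute the full move plan first and apply it in a separate step. The sketch below does this for a `by-date` layout using the `%Y/%m` format from the example above. The function names are hypothetical, and using the file's modification time as its date is an assumption.

```python
import shutil
from datetime import datetime
from pathlib import Path


def plan_by_date(files, output_dir, fmt="%Y/%m"):
    """Map each file to output_dir/<formatted mtime>/<name>; nothing is moved."""
    plan = []
    for path in files:
        path = Path(path)
        stamp = datetime.fromtimestamp(path.stat().st_mtime)
        plan.append((path, Path(output_dir) / stamp.strftime(fmt) / path.name))
    return plan


def apply_plan(plan, dry_run=False):
    """Execute the moves, or only describe them when dry_run is set."""
    actions = []
    for source, destination in plan:
        actions.append(f"{source.name} -> {destination}")
        if not dry_run:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination))
    return actions
```

Because the plan is plain data, the same list can drive a preview, an interactive confirmation loop, or the real move.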

	### `fileorg deduplicate`

	Find and handle duplicate files.

	```bash
	# Find duplicates by hash
	fileorg deduplicate ~/Downloads

	# Find similar files (content-based)
	fileorg deduplicate ~/Papers --similarity=0.9

	# Auto-delete duplicates (keep newest)
	fileorg deduplicate ~/Photos --auto-delete --keep=newest

	# Move duplicates to folder
	fileorg deduplicate ~/Documents --move-to=./duplicates
	```

	Options:

	| Flag | Description | Default |
	|------|-------------|---------|
	| `--similarity` | Similarity threshold (0.0-1.0) for content matching | `1.0` (exact) |
	| `--method` | Detection method: `hash`, `content`, `metadata` | `hash` |
	| `--auto-delete` | Automatically delete duplicates | `false` |
	| `--keep` | Which to keep: `newest`, `oldest`, `largest`, `smallest` | `newest` |
	| `--move-to` | Move duplicates to directory instead of deleting | None |
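
The default `hash` method can be illustrated with `hashlib`: files with the same digest are grouped, and one member of each group is kept. This is a minimal sketch under assumptions: SHA-256 is chosen here for illustration, the function names are hypothetical, and only the `newest` and `oldest` keep rules are shown because exact duplicates always have the same size.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, keep="newest"):
    """Group files by digest and return (kept, duplicates) for each group."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(Path(path))
    pick = {"newest": max, "oldest": min}[keep]
    results = []
    for members in groups.values():
        if len(members) > 1:
            kept = pick(members, key=lambda p: p.stat().st_mtime)
            results.append((kept, [m for m in members if m != kept]))
    return results
```

The returned duplicate lists can then be deleted or moved, which corresponds to `--auto-delete` and `--move-to` respectively.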

	### `fileorg research`

	Special commands for research paper management.

	```bash
	# Extract metadata from PDFs
	fileorg research extract ~/Papers

	# Generate bibliography
	fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib

	# Find related papers
	fileorg research related "attention mechanisms" --in ~/Papers

	# Create reading list
	fileorg research reading-list ~/Papers --topic "transformers" --order=citations
	```

	Options:

	| Flag | Description | Default |
	|------|-------------|---------|
	| `--format` | Bibliography format: `bibtex`, `apa`, `mla` | `bibtex` |
	| `--output` | Output file path | `stdout` |
	| `--order` | Sort order: `date`, `citations`, `relevance` | `relevance` |
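
For the `bibtex` format, each paper's extracted metadata has to be rendered as one entry. A minimal sketch follows, assuming the metadata arrives as a dict with `title`, `authors`, and `year` keys. The dict shape, the `@article` entry type, and the citation-key scheme are all choices made for this example.

```python
def to_bibtex(meta):
    """Render one metadata dict as a BibTeX @article entry."""
    # Citation key: lowercase surname of the first author plus the year.
    first_author = meta["authors"][0].split()[-1].lower()
    key = f"{first_author}{meta['year']}"
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),  # BibTeX separates authors with "and"
        "year": str(meta["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"
```

A real implementation would also need to escape special characters in titles and resolve key collisions when one author has several papers in the same year.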

	### `fileorg config`

	Manage configuration settings.

	```bash
	# Show current config
	fileorg config show

	# Set LLM model
	fileorg config set llm.model "llama3.2:3b"

	# Set default strategy
	fileorg config set organize.default_strategy "by-topic"

	# Reset to defaults
	fileorg config reset
	```
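
Keys such as `llm.model` address nested tables in the TOML file. One way `config set` could resolve them against an in-memory dict is sketched below. The helper name `set_dotted` is hypothetical.

```python
def set_dotted(config, dotted_key, value):
    """Set config['a']['b'] = value for the key 'a.b', creating tables as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
        if not isinstance(node, dict):
            # Refuse to overwrite a scalar with a table.
            raise TypeError(f"{name!r} is not a table in {dotted_key!r}")
    node[leaf] = value
    return config
```

For example, `set_dotted(cfg, "organize.default_strategy", "by-topic")` creates the `organize` table if it does not exist yet.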

	### `fileorg stats`

	Show statistics about files and organization.

	```bash
	# Show directory statistics
	fileorg stats ~/Papers

	# Show organization suggestions
	fileorg stats ~/Downloads --suggest

	# Export statistics
	fileorg stats ~/Research --export stats.json
	```

	---

	## Configuration

	Configuration is stored in `~/.config/fileorg/config.toml` or `./fileorg.toml` in the project directory.

	```toml
	[fileorg]
	version = "1.0.0"

	[llm]
	provider = "docker" # docker, ollama, openai
	model = "llama3.2:3b"
	temperature = 0.7
	max_tokens = 4096
	base_url = "http://localhost:11434"

	[llm.docker]
	runtime = "nvidia" # nvidia, cpu
	memory_limit = "8g"

	[agents]
	verbose = false
	max_iterations = 10

	[agents.scanner]
	role = "File Scanner"
	goal = "Discover and catalog all files with metadata"

	[agents.classifier]
	role = "Content Classifier"
	goal = "Categorize files by content and context"

	[agents.organizer]
	role = "File Organizer"
	goal = "Create optimal folder structure and organize files"

	[agents.deduplicator]
	role = "Duplicate Detector"
	goal = "Find and handle duplicate files efficiently"

	[organize]
	default_strategy = "smart"
	create_backups = true
	backup_dir = "./.fileorg_backup"

	[organize.naming]
	sanitize = true
	max_length = 255
	replace_spaces = "_"

	[research]
	extract_metadata = true
	auto_rename = true
	naming_pattern = "{year}_{author}_{title}"
	generate_bibliography = true

	[deduplication]
	default_method = "hash"
	similarity_threshold = 0.95
	auto_delete = false
	keep_strategy = "newest"

	[pdf]
	extract_text = true
	extract_metadata = true
	ocr_enabled = false # Enable OCR for scanned PDFs

	[observability]
	enabled = true
	provider = "langfuse" # langfuse, langsmith, console
	trace_agents = true
	log_tokens = true
	```

	---

	## Docker Stack

	### docker-compose.yml

	```yaml
	version: "3.9"

	services:
	  # Local LLM served by an Ollama container
	  llm:
	    image: ollama/ollama:latest
	    runtime: nvidia
	    environment:
	      - OLLAMA_HOST=0.0.0.0
	    volumes:
	      - ollama_data:/root/.ollama
	    ports:
	      - "11434:11434"
	    deploy:
	      resources:
	        reservations:
	          devices:
	            - driver: nvidia
	              count: 1
	              capabilities: [gpu]
	    healthcheck:
	      # `ollama list` avoids depending on curl, which the ollama image may not include
	      test: ["CMD", "ollama", "list"]
	      interval: 30s
	      timeout: 10s
	      retries: 3

	  # MCP Server for file operations and PDF tools
	  mcp-server:
	    build:
	      context: ./src/fileorg/mcp
	      dockerfile: Dockerfile
	    environment:
	      - MCP_PORT=3000
	    volumes:
	      - ./workspace:/workspace
	    ports:
	      - "3000:3000"
	    depends_on:
	      - llm

	  # Main application (for containerized usage)
	  fileorg:
	    build:
	      context: .
	      dockerfile: Dockerfile
	    environment:
	      - LLM_BASE_URL=http://llm:11434
	      - MCP_SERVER_URL=http://mcp-server:3000
	    volumes:
	      - ./workspace:/workspace
	      - ./config:/config:ro
	    depends_on:
	      llm:
	        condition: service_healthy
	      mcp-server:
	        condition: service_started
	    profiles:
	      - cli

	volumes:
	  ollama_data:
	```

	### Running the Stack

	```bash
	# Start LLM and MCP server
	docker compose up -d llm mcp-server

	# Pull the model (first time only)
	docker compose exec llm ollama pull llama3.2:3b

	# Run FileOrganizer commands
	docker compose run --rm fileorg scan /workspace/papers
	docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
	docker compose run --rm fileorg deduplicate /workspace/downloads

	# Or run locally with Docker backend
	fileorg scan ~/Papers
	fileorg organize ~/Papers --strategy=by-topic
	fileorg deduplicate ~/Downloads
	```

	---

	## Project Structure

	```
	fileorg/
	├── pyproject.toml  # pixi/uv project config
	├── pixi.lock
	├── docker-compose.yml  # Full stack orchestration
	├── Dockerfile
	├── fileorg.toml  # Default configuration
	├── README.md
	│
	├── src/
	│   └── fileorg/
	│       ├── __init__.py
	│       ├── __main__.py  # Entry point
	│       ├── cli.py  # Typer CLI commands
	│       ├── config.py  # TOML configuration loader
	│       │
	│       ├── scanner/  # File discovery and analysis
	│       │   ├── __init__.py
	│       │   ├── discovery.py  # File system traversal
	│       │   ├── metadata.py  # Metadata extraction
	│       │   ├── pdf_reader.py  # PDF text/metadata extraction
	│       │   └── hashing.py  # File hashing utilities
	│       │
	│       ├── classifier/  # Content classification
	│       │   ├── __init__.py
	│       │   ├── embeddings.py  # Generate embeddings
	│       │   ├── clustering.py  # Topic clustering
	│       │   ├── categorizer.py  # AI-powered categorization
	│       │   └── similarity.py  # Content similarity
	│       │
	│       ├── organizer/  # File organization
	│       │   ├── __init__.py
	│       │   ├── strategies.py  # Organization strategies
	│       │   ├── naming.py  # File naming logic
	│       │   ├── structure.py  # Directory structure creation
	│       │   └── mover.py  # Safe file operations
	│       │
	│       ├── deduplicator/  # Duplicate detection
	│       │   ├── __init__.py
	│       │   ├── hash_based.py  # Hash-based detection
	│       │   ├── content_based.py  # Content similarity detection
	│       │   └── handler.py  # Duplicate handling
	│       │
	│       ├── research/  # Research paper tools
	│       │   ├── __init__.py
	│       │   ├── extractor.py  # PDF metadata extraction
	│       │   ├── bibliography.py  # Bibliography generation
	│       │   ├── citation.py  # Citation parsing
	│       │   └── scholar.py  # Academic search integration
	│       │
	│       ├── agents/  # CrewAI agents
	│       │   ├── __init__.py
	│       │   ├── crew.py  # Crew orchestration
	│       │   ├── scanner.py  # Scanner agent
	│       │   ├── classifier.py  # Classifier agent
	│       │   ├── organizer.py  # Organizer agent
	│       │   └── deduplicator.py  # Deduplicator agent
	│       │
	│       ├── tools/  # Agent tools
	│       │   ├── __init__.py
	│       │   ├── file_tools.py  # File operation tools
	│       │   ├── pdf_tools.py  # PDF processing tools
	│       │   ├── search_tools.py  # Search and query tools
	│       │   └── analysis.py  # Content analysis tools
	│       │
	│       ├── mcp/  # MCP server
	│       │   ├── __init__.py
	│       │   ├── server.py  # MCP server implementation
	│       │   ├── tools.py  # MCP tool definitions
	│       │   └── Dockerfile  # MCP server container
	│       │
	│       ├── llm/  # LLM integration
	│       │   ├── __init__.py
	│       │   ├── client.py  # LLM client (Docker/Ollama/OpenAI)
	│       │   └── prompts.py  # Prompt templates
	│       │
	│       └── observability/  # Logging & tracing
	│           ├── __init__.py
	│           ├── tracing.py  # Distributed tracing
	│           └── metrics.py  # Token/cost tracking
	│
	├── tests/
	│   ├── __init__.py
	│   ├── conftest.py  # Pytest fixtures
	│   ├── test_cli.py
	│   ├── test_scanner.py
	│   ├── test_classifier.py
	│   ├── test_organizer.py
	│   ├── test_deduplicator.py
	│   ├── test_research.py
	│   └── fixtures/
	│       ├── sample_papers/
	│       │   ├── paper1.pdf
	│       │   ├── paper2.pdf
	│       │   └── paper3.pdf
	│       ├── sample_files/
	│       └── expected_outputs/
	│
	├── workspace/  # Working directory
	│   └── .gitkeep
	│
	└── docs/  # Documentation (Quarto)
	    ├── _quarto.yml
	    ├── index.qmd
	    └── chapters/
	```

	---

	## Technology Stack

	| Category | Tools |
	|----------|-------|
	| Package Management | pixi, uv |
	| CLI Framework | Typer, Rich |
	| Local LLM | Docker Model Runner, Ollama |
	| LLM Framework | LangChain |
	| Multi-Agent | CrewAI |
	| MCP | Docker MCP Toolkit |
	| PDF Processing | pypdf (successor to PyPDF2), pdfplumber |
	| Embeddings | sentence-transformers |
	| File Operations | pathlib, shutil |
	| Hashing | hashlib, xxhash |
	| Metadata | exifread, mutagen |
	| Similarity | scikit-learn, faiss |
	| Observability | Langfuse, OpenTelemetry |
	| Testing | pytest, DeepEval |
	| Containerization | Docker, Docker Compose |

	---

	## Example Usage

	### End-to-End Workflow

	```bash
	# 1. Start the Docker stack
	docker compose up -d

	# 2. Scan your messy Downloads folder
	fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json

	# 3. Preview the plan produced by the smart strategy
	fileorg organize ~/Downloads --strategy=smart --dry-run
	# Review the plan, then execute
	fileorg organize ~/Downloads --strategy=smart

	# 4. Organize research papers by topic
	fileorg scan ~/Papers --types=pdf --analyze-content
	fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"

	# 5. Find and handle duplicates
	fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates

	# 6. Extract metadata and generate bibliography
	fileorg research extract ~/Papers
	fileorg research bibliography ~/Papers --format=bibtex --output=references.bib

	# 7. Create a reading list on a specific topic
	fileorg research reading-list ~/Papers --topic "transformers" --order=citations

	# 8. View statistics
	fileorg stats ~/Papers
	```

	### Research Paper Organization Example

	```bash
	# Before:
	~/Papers/
	├── paper_final.pdf
	├── attention_is_all_you_need.pdf
	├── bert_paper.pdf
	├── gpt3.pdf
	├── vision_transformer.pdf
	├── download (1).pdf
	├── download (2).pdf
	└── thesis_draft_v5.pdf

	# Run organization
	fileorg organize ~/Papers --strategy=by-topic --rename

	# After:
	~/Papers/
	├── Natural_Language_Processing/
	│   ├── Transformers/
	│   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf
	│   │   ├── 2018_Devlin_BERT_Pretraining.pdf
	│   │   └── 2020_Brown_GPT3_Language_Models.pdf
	│   └── Other/
	│       └── 2023_Smith_Thesis_Draft.pdf
	├── Computer_Vision/
	│   └── Transformers/
	│       └── 2020_Dosovitskiy_Vision_Transformer.pdf
	└── Uncategorized/
	    └── 2024_Unknown_Document.pdf
	```

	### Duplicate Detection Example

	```bash
	# Find exact duplicates
	fileorg deduplicate ~/Downloads
	# Found 15 duplicate files (45 MB)
	# • download.pdf (3 copies)
	# • image.jpg (2 copies)
	# • report.docx (2 copies)

	# Find similar papers (different versions)
	fileorg deduplicate ~/Papers --similarity=0.9 --method=content
	# Found 3 similar file groups:
	# • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
	# • bert_preprint.pdf, bert_final.pdf (98% similar)

	# Auto-cleanup (keep newest)
	fileorg deduplicate ~/Downloads --auto-delete --keep=newest
	# ✓ Deleted 15 duplicate files, freed 45 MB
	```

	---

	## Learning Outcomes

	By building FileOrganizer, learners will be able to:

	1. ✅ Set up modern Python projects with pixi and reproducible environments
	2. ✅ Build professional CLI tools with Typer and Rich
	3. ✅ Run local LLMs using Docker Model Runner
	4. ✅ Process and extract content from PDF files
	5. ✅ Build MCP servers to connect AI agents to file systems
	6. ✅ Design multi-agent systems with CrewAI
	7. ✅ Implement content-based similarity and clustering
	8. ✅ Generate embeddings for semantic search
	9. ✅ Handle file operations safely with backups and dry-run modes
	10. ✅ Implement observability for AI applications
	11. ✅ Test non-deterministic systems effectively
	12. ✅ Deploy self-hosted AI applications with Docker Compose

	---

	## Advanced Features

	### Smart Organization Strategy

	The `smart` strategy uses AI to analyze file content and context to determine the best organization approach:

	```python
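	# Pseudocode for smart strategy; the agent objects and the
	# attributes read from `file_analysis` are illustrative names
	def smart_organize(files, scanner_agent, classifier_agent, organizer_agent):
	    # 1. Analyze file types and content
	    file_analysis = scanner_agent.analyze(files)

	    # 2. Determine optimal strategy
	    if file_analysis.mostly_pdfs_with_academic_content:
	        strategy = "by-topic-hierarchical"
	    elif file_analysis.mostly_media_files:
	        strategy = "by-date-and-type"
	    elif file_analysis.mixed_work_documents:
	        strategy = "by-project-and-date"
	    else:
	        # Assumed fallback so `strategy` is always defined
	        strategy = "by-type"

	    # 3. Execute with AI-powered categorization
	    classifier_agent.categorize(files, strategy)
	    organizer_agent.execute(strategy)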
	```
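
Before any LLM is involved, the strategy decision can be approximated from file extensions alone. The runnable sketch below is a stand-in heuristic, not the AI-based analysis described above: the `MEDIA` set, the 50% majority threshold, and the `by-type` fallback are all assumptions.

```python
from collections import Counter
from pathlib import Path

# Extensions treated as media for this example (assumed list).
MEDIA = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".mov"}


def choose_strategy(paths, majority=0.5):
    """Pick a strategy name from the share of PDFs and media files."""
    suffixes = Counter(Path(p).suffix.lower() for p in paths)
    total = sum(suffixes.values())
    if total == 0:
        return "by-type"
    pdf_share = suffixes[".pdf"] / total
    media_share = sum(suffixes[s] for s in MEDIA) / total
    if pdf_share > majority:
        return "by-topic"
    if media_share > majority:
        return "by-date"
    return "by-type"
```

A heuristic like this is cheap to unit-test and could serve as a fallback when the local model is unavailable.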

	### Research Paper Features

	Special handling for academic PDFs:

	- Metadata Extraction: Title, authors, year, abstract, keywords
	- Citation Parsing: Extract and parse references
	- Smart Naming: `{year}_{first_author}_{short_title}.pdf`
	- Topic Clustering: Group papers by research area
	- Citation Network: Identify related papers
	- Bibliography Generation: BibTeX, APA, MLA formats
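
The smart naming pattern can be sketched as a small function. It assumes the year, first author, and title have already been extracted. The function name, the five-word title limit, and the character-based length cap are choices made for this example.

```python
import re


def paper_filename(year, first_author, title, max_words=5, max_length=255):
    """Build '{year}_{first_author}_{short_title}.pdf' from filesystem-safe parts."""
    # [^\W_]+ matches runs of letters and digits, including non-ASCII letters.
    author_words = re.findall(r"[^\W_]+", first_author)
    title_words = re.findall(r"[^\W_]+", title)[:max_words]
    surname = author_words[-1] if author_words else "Unknown"
    short_title = "_".join(title_words) if title_words else "Untitled"
    stem = f"{year}_{surname}_{short_title}"
    # Truncate the stem so the full name, extension included, fits the limit.
    return stem[: max_length - len(".pdf")] + ".pdf"
```

Note that the cap here counts characters, while many filesystems limit names to 255 bytes, so titles with non-ASCII characters may need a stricter limit.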

	### Deduplication Strategies

	Multiple methods for finding duplicates:

	1. Hash-based: Exact file matches (fastest)
	2. Content-based: Similar content using embeddings
	3. Metadata-based: Same title/author but different files
	4. Fuzzy matching: Handle renamed or modified files
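
The content-based method compares documents by vector similarity. To keep the example dependency-free, the sketch below swaps the embeddings for plain word-count vectors. The cosine-similarity and thresholding steps are the same either way. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Turn text into a bag-of-words vector (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(documents, threshold=0.9):
    """Return (name_a, name_b, score) for each pair scoring at or above threshold."""
    vectors = {name: term_counts(text) for name, text in documents.items()}
    names = sorted(vectors)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(vectors[first], vectors[second])
            if score >= threshold:
                pairs.append((first, second, round(score, 3)))
    return pairs
```

The pairwise loop is quadratic in the number of documents, which is why larger collections usually rely on an approximate nearest-neighbor index such as faiss.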

	---

	This project serves as the main example in the [Learning Path](../learning-path.md) for building AI-powered CLI tools.