# Course Project: FileOrganizer

**A CLI tool that uses local LLMs and AI agents to intelligently organize files, with special focus on research paper management.**

---

## Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                       FileOrganizer CLI                         │
├─────────────────────────────────────────────────────────────────┤
│  $ fileorg scan ~/Downloads                                     │
│  $ fileorg organize ~/Papers --strategy=by-topic                │
│  $ fileorg deduplicate ~/Research --similarity=0.9              │
└─────────────────────────────────────────────────────────────────┘
```

---

## Architecture

```
Files ──► Content Analysis ──► AI Classification ──► Organized Structure
               │                      │
               ▼                      ▼
         PDF Extraction       Docker Model Runner
         Metadata Tools           (Local LLM)
               │                      │
               └───────► MCP ◄────────┘
```

### Data Flow

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Files/PDFs  │────►│   Content    │────►│  MCP Server  │
│   (Input)    │     │  Extraction  │     │   (Tools)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   │
│  Structure   │     │   (CrewAI)   │     │   (Docker)   │
└──────────────┘     └──────────────┘     └──────────────┘
```

---

## Agent System

| Agent | Role | Tools | Output |
|-------|------|-------|--------|
| **Scanner Agent** | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| **Classifier Agent** | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| **Organizer Agent** | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| **Deduplicator Agent** | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |

### Agent Workflow

```
User Request: "Organize research papers by topic"
          │
          ▼
┌─────────────────────┐
│   Scanner Agent     │
│ "What files do we   │
│  have and what      │
│  are they about?"   │
└──────────┬──────────┘
           │ File Inventory
           ▼
┌─────────────────────┐
│  Classifier Agent   │
│ "What topics and    │
│  categories emerge  │
│  from the content?" │
└──────────┬──────────┘
           │ Categories
           ▼
┌─────────────────────┐
│   Organizer Agent   │
│ "Create folder      │
│  structure and      │
│  move files"        │
└──────────┬──────────┘
           │ Organization Plan
           ▼
┌─────────────────────┐
│ Deduplicator Agent  │
│ "Find and handle    │
│  duplicate files"   │
└──────────┬──────────┘
           │
           ▼
  Organized Directory
```

---

## CLI Commands

### `fileorg scan`

Scan a directory and analyze its contents.

```bash
# Scan a directory
fileorg scan ~/Downloads

# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content

# Scan and export inventory
fileorg scan ~/Research --export inventory.json

# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--analyze-content` | Extract and analyze file contents | `false` |
| `--export` | Export inventory to JSON/CSV | None |
| `--types` | Comma-separated file extensions to scan | All |
| `--recursive` | Scan subdirectories | `true` |
| `--max-depth` | Maximum directory depth | `10` |

### `fileorg organize`

Organize files using AI-powered strategies.
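Internally, each organization strategy can be modeled as a small callable that maps a file to a relative destination path; the organizer then only has to apply the plan. The sketch below is hypothetical (the `Strategy` alias and these two helpers are illustrative, not the shipped API):

```python
from datetime import datetime
from pathlib import Path
from typing import Callable

# A strategy maps a file to a relative destination path.
# (Hypothetical sketch -- names are illustrative, not the real API.)
Strategy = Callable[[Path], Path]

def by_date(fmt: str = "%Y/%m") -> Strategy:
    """Bucket files by modification time, e.g. 2024/03/report.pdf."""
    def plan(file: Path) -> Path:
        mtime = datetime.fromtimestamp(file.stat().st_mtime)
        return Path(mtime.strftime(fmt)) / file.name
    return plan

def by_type() -> Strategy:
    """Bucket files by extension, e.g. pdf/report.pdf."""
    def plan(file: Path) -> Path:
        ext = file.suffix.lstrip(".").lower() or "no_extension"
        return Path(ext) / file.name
    return plan
```

Keeping strategies as pure path-planning functions is also what makes `--dry-run` cheap: the plan can be printed without touching the filesystem.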
```bash
# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic

# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"

# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"

# Dry run to preview changes
fileorg organize ~/Downloads --dry-run

# Interactive mode
fileorg organize ~/Research --interactive
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--strategy` | Organization strategy: `by-topic`, `by-date`, `by-type`, `by-author`, `smart` | `smart` |
| `--rename` | Rename files intelligently | `false` |
| `--pattern` | Naming pattern for renamed files | `{original}` |
| `--dry-run` | Preview changes without executing | `false` |
| `--interactive` | Confirm each action | `false` |
| `--output` | Output directory | Same as input |

### `fileorg deduplicate`

Find and handle duplicate files.

```bash
# Find duplicates by hash
fileorg deduplicate ~/Downloads

# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9

# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest

# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--similarity` | Similarity threshold (0.0-1.0) for content matching | `1.0` (exact) |
| `--method` | Detection method: `hash`, `content`, `metadata` | `hash` |
| `--auto-delete` | Automatically delete duplicates | `false` |
| `--keep` | Which to keep: `newest`, `oldest`, `largest`, `smallest` | `newest` |
| `--move-to` | Move duplicates to directory instead of deleting | None |

### `fileorg research`

Special commands for research paper management.
```bash
# Extract metadata from PDFs
fileorg research extract ~/Papers

# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib

# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers

# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--format` | Bibliography format: `bibtex`, `apa`, `mla` | `bibtex` |
| `--output` | Output file path | `stdout` |
| `--order` | Sort order: `date`, `citations`, `relevance` | `relevance` |

### `fileorg config`

Manage configuration settings.

```bash
# Show current config
fileorg config show

# Set LLM model
fileorg config set llm.model "llama3.2:3b"

# Set default strategy
fileorg config set organize.default_strategy "by-topic"

# Reset to defaults
fileorg config reset
```

### `fileorg stats`

Show statistics about files and organization.

```bash
# Show directory statistics
fileorg stats ~/Papers

# Show organization suggestions
fileorg stats ~/Downloads --suggest

# Export statistics
fileorg stats ~/Research --export stats.json
```

---

## Configuration

Configuration is stored in `~/.config/fileorg/config.toml` or `./fileorg.toml` in the project directory.
```toml
[fileorg]
version = "1.0.0"

[llm]
provider = "docker"  # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"

[llm.docker]
runtime = "nvidia"  # nvidia, cpu
memory_limit = "8g"

[agents]
verbose = false
max_iterations = 10

[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"

[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"

[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"

[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"

[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"

[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"

[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true

[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"

[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false  # Enable OCR for scanned PDFs

[observability]
enabled = true
provider = "langfuse"  # langfuse, langsmith, console
trace_agents = true
log_tokens = true
```

---

## Docker Stack

### docker-compose.yml

```yaml
services:
  # Local LLM server (Ollama)
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MCP server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm

  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli

volumes:
  ollama_data:
```

### Running the Stack

```bash
# Start LLM and MCP server
docker compose up -d llm mcp-server

# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b

# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads

# Or run locally with the Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
```

---

## Project Structure

```
fileorg/
├── pyproject.toml           # pixi/uv project config
├── pixi.lock
├── docker-compose.yml       # Full stack orchestration
├── Dockerfile
├── fileorg.toml             # Default configuration
├── README.md
│
├── src/
│   └── fileorg/
│       ├── __init__.py
│       ├── __main__.py          # Entry point
│       ├── cli.py               # Typer CLI commands
│       ├── config.py            # TOML configuration loader
│       │
│       ├── scanner/             # File discovery and analysis
│       │   ├── __init__.py
│       │   ├── discovery.py     # File system traversal
│       │   ├── metadata.py      # Metadata extraction
│       │   ├── pdf_reader.py    # PDF text/metadata extraction
│       │   └── hashing.py       # File hashing utilities
│       │
│       ├── classifier/          # Content classification
│       │   ├── __init__.py
│       │   ├── embeddings.py    # Generate embeddings
│       │   ├── clustering.py    # Topic clustering
│       │   ├── categorizer.py   # AI-powered categorization
│       │   └── similarity.py    # Content similarity
│       │
│       ├── organizer/           # File organization
│       │   ├── __init__.py
│       │   ├── strategies.py    # Organization strategies
│       │   ├── naming.py        # File naming logic
│       │   ├── structure.py     # Directory structure creation
│       │   └── mover.py         # Safe file operations
│       │
│       ├── deduplicator/        # Duplicate detection
│       │   ├── __init__.py
│       │   ├── hash_based.py    # Hash-based detection
│       │   ├── content_based.py # Content similarity detection
│       │   └── handler.py       # Duplicate handling
│       │
│       ├── research/            # Research paper tools
│       │   ├── __init__.py
│       │   ├── extractor.py     # PDF metadata extraction
│       │   ├── bibliography.py  # Bibliography generation
│       │   ├── citation.py      # Citation parsing
│       │   └── scholar.py       # Academic search integration
│       │
│       ├── agents/              # CrewAI agents
│       │   ├── __init__.py
│       │   ├── crew.py          # Crew orchestration
│       │   ├── scanner.py       # Scanner agent
│       │   ├── classifier.py    # Classifier agent
│       │   ├── organizer.py     # Organizer agent
│       │   └── deduplicator.py  # Deduplicator agent
│       │
│       ├── tools/               # Agent tools
│       │   ├── __init__.py
│       │   ├── file_tools.py    # File operation tools
│       │   ├── pdf_tools.py     # PDF processing tools
│       │   ├── search_tools.py  # Search and query tools
│       │   └── analysis.py      # Content analysis tools
│       │
│       ├── mcp/                 # MCP server
│       │   ├── __init__.py
│       │   ├── server.py        # MCP server implementation
│       │   ├── tools.py         # MCP tool definitions
│       │   └── Dockerfile       # MCP server container
│       │
│       ├── llm/                 # LLM integration
│       │   ├── __init__.py
│       │   ├── client.py        # LLM client (Docker/Ollama/OpenAI)
│       │   └── prompts.py       # Prompt templates
│       │
│       └── observability/       # Logging & tracing
│           ├── __init__.py
│           ├── tracing.py       # Distributed tracing
│           └── metrics.py       # Token/cost tracking
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py              # Pytest fixtures
│   ├── test_cli.py
│   ├── test_scanner.py
│   ├── test_classifier.py
│   ├── test_organizer.py
│   ├── test_deduplicator.py
│   ├── test_research.py
│   └── fixtures/
│       ├── sample_papers/
│       │   ├── paper1.pdf
│       │   ├── paper2.pdf
│       │   └── paper3.pdf
│       ├── sample_files/
│       └── expected_outputs/
│
├── workspace/                   # Working directory
│   └── .gitkeep
│
└── docs/                        # Documentation (Quarto)
    ├── _quarto.yml
    ├── index.qmd
    └── chapters/
```

---

## Technology Stack

| Category | Tools |
|----------|-------|
| **Package Management** | pixi, uv |
| **CLI Framework** | Typer, Rich |
| **Local LLM** | Docker Model Runner, Ollama |
| **LLM Framework** | LangChain |
| **Multi-Agent** | CrewAI |
| **MCP** | Docker MCP Toolkit |
| **PDF Processing** | pypdf, pdfplumber |
| **Embeddings** | sentence-transformers |
| **File Operations** | pathlib, shutil |
| **Hashing** | hashlib, xxhash |
| **Metadata** | exifread, mutagen |
| **Similarity** | scikit-learn, faiss |
| **Observability** | Langfuse, OpenTelemetry |
| **Testing** | pytest, DeepEval |
| **Containerization** | Docker, Docker Compose |

---

## Example Usage

### End-to-End Workflow

```bash
# 1. Start the Docker stack
docker compose up -d

# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json

# 3. Organize files by type and date
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart

# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"

# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates

# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib

# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

# 8. View statistics
fileorg stats ~/Papers
```

### Research Paper Organization Example

```bash
# Before:
~/Papers/
├── paper_final.pdf
├── attention_is_all_you_need.pdf
├── bert_paper.pdf
├── gpt3.pdf
├── vision_transformer.pdf
├── download (1).pdf
├── download (2).pdf
└── thesis_draft_v5.pdf

# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename

# After:
~/Papers/
├── Natural_Language_Processing/
│   ├── Transformers/
│   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf
│   │   ├── 2018_Devlin_BERT_Pretraining.pdf
│   │   └── 2020_Brown_GPT3_Language_Models.pdf
│   └── Other/
│       └── 2023_Smith_Thesis_Draft.pdf
├── Computer_Vision/
│   └── Transformers/
│       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf
```

### Duplicate Detection Example

```bash
# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
# • download.pdf (3 copies)
# • image.jpg (2 copies)
# • report.docx (2 copies)

# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 3 similar file groups:
# • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
# • bert_preprint.pdf, bert_final.pdf (98% similar)

# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# ✓ Deleted 15 duplicate files, freed 45 MB
```

---

## Learning Outcomes

By building FileOrganizer, learners will be able to:

1. ✅ Set up modern Python projects with pixi and reproducible environments
2. ✅ Build professional CLI tools with Typer and Rich
3. ✅ Run local LLMs using Docker Model Runner
4. ✅ Process and extract content from PDF files
5. ✅ Build MCP servers to connect AI agents to file systems
6. ✅ Design multi-agent systems with CrewAI
7. ✅ Implement content-based similarity and clustering
8. ✅ Generate embeddings for semantic search
9. ✅ Handle file operations safely with backups and dry-run modes
10. ✅ Implement observability for AI applications
11. ✅ Test non-deterministic systems effectively
12. ✅ Deploy self-hosted AI applications with Docker Compose

---

## Advanced Features

### Smart Organization Strategy

The `smart` strategy uses AI to analyze file content and context to determine the best organization approach:

```python
# Pseudocode for the smart strategy
def smart_organize(files):
    # 1. Analyze file types and content
    analysis = scanner_agent.analyze(files)

    # 2. Determine the optimal strategy from the analysis
    if analysis.mostly_pdfs_with_academic_content:
        strategy = "by-topic-hierarchical"
    elif analysis.mostly_media_files:
        strategy = "by-date-and-type"
    else:  # mixed work documents
        strategy = "by-project-and-date"

    # 3. Execute with AI-powered categorization
    categories = classifier_agent.categorize(files, strategy)
    organizer_agent.execute(categories, strategy)
```

### Research Paper Features

Special handling for academic PDFs:

- **Metadata Extraction**: Title, authors, year, abstract, keywords
- **Citation Parsing**: Extract and parse references
- **Smart Naming**: `{year}_{first_author}_{short_title}.pdf`
- **Topic Clustering**: Group papers by research area
- **Citation Network**: Identify related papers
- **Bibliography Generation**: BibTeX, APA, MLA formats

### Deduplication Strategies

Multiple methods for finding duplicates:

1. **Hash-based**: Exact file matches (fastest)
2. **Content-based**: Similar content using embeddings
3. **Metadata-based**: Same title/author but different files
4. **Fuzzy matching**: Handle renamed or modified files

---

*This project serves as the main example in the [Learning Path](../learning-path.md) for building AI-powered CLI tools.*
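As an appendix, the hash-based deduplication method described above can be sketched in a few lines of standard-library Python: group candidates by file size first (a free pre-filter, since files of different sizes cannot be identical), then confirm matches with a content hash. A sketch, not the shipped implementation:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of byte-identical files under root."""
    # Pass 1: bucket by size; unique sizes can be skipped entirely.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for f in root.rglob("*"):
        if f.is_file():
            by_size[f.stat().st_size].append(f)

    # Pass 2: hash only the files that share a size with another file.
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for f in same_size:
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            by_digest[digest].append(f)

    return [group for group in by_digest.values() if len(group) > 1]
```

A production version would stream large files in chunks instead of `read_bytes()`, and a faster non-cryptographic hash such as xxhash (already in the technology stack) would do for duplicate detection.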