diff --git a/README.md b/README.md
index bd9ae759eefd3ad1c6f18283a1bac215ecefb299..b26ef6c813a5ab314160a506162593d6c1007113 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,66 @@
+# ✨ Ink Vision: Advanced HTR Pipeline ✨
+
+Welcome to **Ink Vision**, a state-of-the-art Handwritten Text Recognition (HTR) system. This isn't just a simple OCR wrapper; it's a modular, **3-Step Intelligent Pipeline** designed to handle messy, real-world handwriting with precision.
+
+---
+
+## 🚀 The 3-Step Hybrid Architecture
+
+To improve accuracy, we split the logic into three distinct, hot-swappable stages:
+
+### 1️⃣ Step 1: Pre-Processor (Computer Vision & DL)
+Before the AI reads the text, we "clean" the image to remove noise, shadows, and artifacts.
+- **OpenCV + LightCNN (Denoising)**: Denoising combines OpenCV and LightCNN. OpenCV handles adaptive thresholding, binarization, green-channel extraction (to make red ink "pop"), and non-local means denoising. LightCNN runs alongside it; its architecture is designed for image restoration from noisy → clean pairs. In its current form the CNN contributes little, but it remains part of the denoising pipeline.
+- **Deskewing**: Automatic rotation correction levels slanted handwriting before it reaches the OCR engine.
+
+### 2️⃣ Step 2: HTR Engine (Sequence Modeling)
+The core recognition happens here. We use a **CRAFT + ResNet + LSTM** architecture:
+- **Detection**: CRAFT identifies individual character regions and groups them into words.
+- **Recognition**: A Deep Residual Network extracts visual features, which are then sequenced by an LSTM to understand the flow of handwriting.
+- **Ensemble Strategy**: The app runs dual inference, once on the raw image and once on the cleaned image, so that text missed in one pass can still be recovered from the other.
+
+### 3️⃣ Step 3: Post-Processor (NLP Semantic Judge)
+Raw OCR output is often "noisy." This stage acts as a human-like editor:
+- **Contextual Spellchecker**: Fixes common OCR typos while preserving original capitalization.
+- **Merging Logic**: Automatically joins split words (e.g., `import dance` -> `importance`).
+- **Semantic Judge (BERT Tiny)**: We've integrated a lightweight BERT model that understands English grammar. It scores sentences based on **"Meaning."** If the OCR produces a jumbled mess, the Semantic Judge selects the most grammatically coherent version.
+
 ---
-title: Vision
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: Streamlit template space
-license: mit
+
+## 🧠 Training Your Own Models
+
+We've provided a full suite of training scripts to keep the system evolving:
+
+### 🖼️ CNN Denoising Training
+Located in `training/train_denoiser.py`.
+- **The Why**: Denoising uses both OpenCV and LightCNN. Math-based filters (OpenCV) sometimes blur thin handwriting. A trained CNN "understands" what a stroke should look like and can reconstruct it.
+- **How to use**: Run `generate_dataset.py` to create synthetic training data, then run `train_denoiser.py` to train your own weights.
+
+### ✍️ NLP Corpus Training
+Located in `training/train_nlp.py`.
+- **The Why**: If you frequently write about specific topics (e.g., Medical, History), the NLP stage needs to know the "rare" words of that domain.
+- **How to use**: Provide your own text corpus to the script, and it will tune the dictionary and semantic probabilities to favor your specific domain.
+
+---
+
+## 🛠️ Installation & Setup
+
+1. **Install Dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+2. **Run the Application**:
+   ```bash
+   streamlit run app.py
+   ```
+
 ---
 
-# Welcome to Streamlit!
+## 📦 Core Technology Stack
 
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
+- **OpenCV + LightCNN**: Denoising. OpenCV provides bitwise masking, adaptive thresholding, and non-local means; LightCNN provides DL-based denoising alongside it.
+- **PyTorch**: Powers the CNN Denoiser and the BERT Semantic Judge.
+- **Transformers**: Provides the contextual intelligence for the NLP layer.
+- **Streamlit**: A high-performance, premium UI with Glassmorphism and animated gradients.
 
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).
+*Built with ❤️ by the RCO Team.*
diff --git a/__pycache__/crnn_model.cpython-311.pyc b/__pycache__/crnn_model.cpython-311.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..07f663d2920e825c5f51c850ca04b2b3f4bc6f48
Binary files /dev/null and b/__pycache__/crnn_model.cpython-311.pyc differ
diff --git a/app.py b/app.py
new file mode 100644
index 0000000000000000000000000000000000000000..783ce6e61247feec702a3d1952360f1eff423650
--- /dev/null
+++ b/app.py
@@ -0,0 +1,306 @@
+import streamlit as st
+import torch
+import torchvision.transforms as transforms
+from PIL import Image
+from pillow_heif import register_heif_opener
+import numpy as np
+import os
+from io import BytesIO
+from googletrans import Translator, LANGUAGES
+from gtts import gTTS
+
+# Register HEIC support for PIL
+register_heif_opener()
+from streamlit_cropper import st_cropper
+import easyocr
+st.set_page_config(page_title="INK VISION", page_icon="✨", layout="wide")
+
+# Custom CSS for the stunning animated background and glassmorphic UI
+st.markdown("""
+
+
+
+

✨ HTR ✨

+

Experience the magic of handwritten word recognition.

+
+""", unsafe_allow_html=True)
+
+from pipeline.preprocessor import DocumentPreprocessor
+from pipeline.ocr_engine import HTREngine
+from pipeline.postprocessor import NLPCorrector
+
+# Initialise translator once
+translator = Translator()
+
+# Simple helpers for state
+if "extracted_text" not in st.session_state:
+    st.session_state["extracted_text"] = ""
+if "translated_text" not in st.session_state:
+    st.session_state["translated_text"] = ""
+if "target_lang" not in st.session_state:
+    st.session_state["target_lang"] = "en"
+
+@st.cache_resource(show_spinner="Booting up 3-Step HTR Pipeline (CV + OCR + NLP)...")
+def load_pipeline():
+    p = DocumentPreprocessor()
+    e = HTREngine(languages=['en'])
+    n = NLPCorrector(use_ml=True)
+    return p, e, n
+
+preprocessor, engine, nlp_corrector = load_pipeline()
+
+col1, col2 = st.columns(2)
+
+target_image = None
+
+with col1:
+    st.markdown("### 📸 Input your masterpiece")
+    input_method = st.radio("Choose Input Method", ["Upload Image", "Take a Photo"], horizontal=True)
+
+    if input_method == "Upload Image":
+        uploaded_file = st.file_uploader("Upload a handwritten word image", type=["png", "jpg", "jpeg", "heic", "webp"])
+        if uploaded_file is not None:
+            raw_image = Image.open(uploaded_file).convert("RGB")
+
+            # Resize image to a standard width so both cropper and st.image match in size
+            target_width = 700
+            if raw_image.width != target_width:
+                ratio = target_width / float(raw_image.width)
+                raw_image = raw_image.resize((target_width, int(raw_image.height * ratio)))
+
+            if st.checkbox("✨ Crop Image", key="crop_upload"):
+                st.markdown("✨ **Crop the word below:**")
+                target_image = st_cropper(raw_image, realtime_update=True, box_color='#ff007f', key="upload_crop")
+            else:
+                target_image = raw_image
+                st.image(target_image, caption="Uploaded Image")
+    else:
+        camera_photo = st.camera_input("Take a picture of a handwritten word")
+        if camera_photo is not None:
+            raw_image = Image.open(camera_photo).convert("RGB")
+
+            # Resize image to a standard width so both cropper and st.image match in size
+            target_width = 700
+            if raw_image.width != target_width:
+                ratio = target_width / float(raw_image.width)
+                raw_image = raw_image.resize((target_width, int(raw_image.height * ratio)))
+
+            if st.checkbox("✨ Crop Image", key="crop_camera"):
+                st.markdown("✨ **Crop the word below:**")
+                target_image = st_cropper(raw_image, realtime_update=True, box_color='#ff007f', key="camera_crop")
+            else:
+                target_image = raw_image
+                st.image(target_image, caption="Captured Image")
+
+with col2:
+    st.markdown("### 🪄 Magic Result")
+
+    extracted_text = st.session_state.get("extracted_text", "")
+    translated_text = st.session_state.get("translated_text", "")
+
+    if target_image is not None:
+        if st.button("✨ Extract Text"):
+            with st.spinner("Applying Deep Learning OCR algorithms..."):
+                if engine is None:
+                    st.error("Pipeline failed to initialize.")
+                else:
+                    # --- STREAM A: RAW OCR (No Preprocessing) ---
+                    try:
+                        raw_ocr_output = engine.extract_text(np.array(target_image))
+                        raw_stream_text = nlp_corrector.correct_spelling(raw_ocr_output)
+                    except Exception:
+                        raw_stream_text = ""
+
+                    # --- STREAM B: 3-STEP PIPELINE (Pre-Processed) ---
+                    try:
+                        # 1. Computer Vision Pre-Processing
+                        cleaned_image_array = preprocessor.process(target_image)
+                        # 2. Deep Learning OCR Engine
+                        p_ocr_output = engine.extract_text(cleaned_image_array)
+                        # 3. NLP Post-Processing
+                        clean_stream_text = nlp_corrector.correct_spelling(p_ocr_output)
+                    except Exception:
+                        clean_stream_text = ""
+
+                    # --- THE ENSEMBLE JUDGE ---
+                    # The judge picks the version that sounds most like real English
+                    extracted_text = nlp_corrector.judge_best_output(raw_stream_text, clean_stream_text)
+
+                    if extracted_text.strip() == "":
+                        st.warning("Oops! I couldn't find any text. Try a clearer image.")
+                        extracted_text = ""
+                    else:
+                        st.success("Ensemble Magic! Winner selected from Dual-Stream analysis.")
+                        with st.expander("Show AI Reasoning (Ensemble Comparison)"):
+                            st.write(f"**Stream A (Raw Image):** {raw_stream_text}")
+                            st.write(f"**Stream B (Cleaned Image):** {clean_stream_text}")
+
+                    st.session_state["extracted_text"] = extracted_text
+                    st.session_state["translated_text"] = ""
+
+        # Editable original text
+        st.session_state["extracted_text"] = st.text_area(
+            "You can edit the result here:",
+            value=st.session_state.get("extracted_text", ""),
+            height=150,
+        )
+
+        st.markdown("### 🌐 Translation & Voice")
+
+        # Language selection
+        lang_keys = sorted(LANGUAGES.keys())
+        default_index = lang_keys.index(st.session_state.get("target_lang", "en"))
+        target_lang = st.selectbox(
+            "Choose target language",
+            options=lang_keys,
+            index=default_index,
+            format_func=lambda k: LANGUAGES[k].title(),
+        )
+        st.session_state["target_lang"] = target_lang
+
+        with st.expander("Show available languages"):
+            st.write(", ".join(f"{code} – {name.title()}" for code, name in LANGUAGES.items()))
+
+        col_translate, col_speak = st.columns(2)
+
+        with col_translate:
+            if st.button("🌍 Translate into other language"):
+                if st.session_state["extracted_text"].strip():
+                    try:
+                        result = translator.translate(
+                            st.session_state["extracted_text"],
+                            dest=target_lang,
+                        )
+                        st.session_state["translated_text"] = result.text
+                    except Exception as e:
+                        st.error(f"Translation failed: {e}")
+                else:
+                    st.warning("Please extract or type some text first.")
+
+        with col_speak:
+            if st.button("🔊 Speak text (original & translated)"):
+                original = st.session_state.get("extracted_text", "").strip()
+                translated = st.session_state.get("translated_text", "").strip()
+
+                if not original and not translated:
+                    st.warning("Nothing to speak. Please extract or translate text first.")
+                else:
+                    # Speak original (English assumed)
+                    if original:
+                        try:
+                            buf = BytesIO()
+                            gTTS(text=original, lang="en").write_to_fp(buf)
+                            buf.seek(0)
+                            st.audio(buf.read(), format="audio/mp3")
+                        except Exception as e:
+                            st.error(f"Failed to generate audio for original text: {e}")
+
+                    # Speak translated
+                    if translated:
+                        try:
+                            buf_tr = BytesIO()
+                            gTTS(text=translated, lang=target_lang).write_to_fp(buf_tr)
+                            buf_tr.seek(0)
+                            st.audio(buf_tr.read(), format="audio/mp3")
+                        except Exception as e:
+                            st.error(f"Failed to generate audio for translated text: {e}")
+
+        if st.session_state.get("translated_text", "").strip():
+            st.text_area(
+                "Translated text:",
+                value=st.session_state["translated_text"],
+                height=150,
+            )
+
+    else:
+        st.info("Waiting for an image to work my magic...")
diff --git a/dataset/clean/000.jpg b/dataset/clean/000.jpg new file mode 100644 index 0000000000000000000000000000000000000000..606789ffe2818bde7f16bdcfaaeacfde3da087c9 Binary files /dev/null and b/dataset/clean/000.jpg differ diff --git a/dataset/clean/001.jpg b/dataset/clean/001.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5de4369d9ceeabe4607d51229cce6ffc71bbee0e Binary files /dev/null and b/dataset/clean/001.jpg differ diff --git a/dataset/clean/002.jpg b/dataset/clean/002.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3f369608e532947a027596325363401b14c7938c Binary files /dev/null and b/dataset/clean/002.jpg differ diff --git a/dataset/clean/003.jpg b/dataset/clean/003.jpg new file mode 100644 index 0000000000000000000000000000000000000000..dbb9c5f7f6f0cc93a5ed61361c9c30189bac0564 Binary files /dev/null and b/dataset/clean/003.jpg differ diff --git a/dataset/clean/004.jpg b/dataset/clean/004.jpg new file mode 100644 index 0000000000000000000000000000000000000000..02fd87222694ef60b880cdef4c312a8c29c122d9 Binary files /dev/null and b/dataset/clean/004.jpg differ diff --git a/dataset/clean/005.jpg
b/dataset/clean/005.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d49dd3c26a24271e29f0a2e66527e4dbfd4561cc Binary files /dev/null and b/dataset/clean/005.jpg differ diff --git a/dataset/clean/006.jpg b/dataset/clean/006.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9dbe1986293b9b008ee597e254bf2cd341f54dc3 Binary files /dev/null and b/dataset/clean/006.jpg differ diff --git a/dataset/clean/007.jpg b/dataset/clean/007.jpg new file mode 100644 index 0000000000000000000000000000000000000000..658ae5383209768201abc0ad7f7d8279d4d24f7e Binary files /dev/null and b/dataset/clean/007.jpg differ diff --git a/dataset/clean/008.jpg b/dataset/clean/008.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4c6e55e8d958befc7ed5bca535592514eb6d2f4d Binary files /dev/null and b/dataset/clean/008.jpg differ diff --git a/dataset/clean/009.jpg b/dataset/clean/009.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a88d1c488991a403f0b5682f844bc78ee9631e49 Binary files /dev/null and b/dataset/clean/009.jpg differ diff --git a/dataset/clean/010.jpg b/dataset/clean/010.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f9f0e7e31c5497e966800ebad05bc42200235ac4 Binary files /dev/null and b/dataset/clean/010.jpg differ diff --git a/dataset/clean/011.jpg b/dataset/clean/011.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e2437a5c8e950f203b7a9f6c5a2049769d64888f Binary files /dev/null and b/dataset/clean/011.jpg differ diff --git a/dataset/clean/012.jpg b/dataset/clean/012.jpg new file mode 100644 index 0000000000000000000000000000000000000000..91cab13f035279e4e587561b74fa3e688e29c093 Binary files /dev/null and b/dataset/clean/012.jpg differ diff --git a/dataset/clean/013.jpg b/dataset/clean/013.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8cc98e3d4a280931f423a697c4f3512f62acb519 Binary files /dev/null and 
b/dataset/clean/013.jpg differ diff --git a/dataset/clean/014.jpg b/dataset/clean/014.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2f0023cd6521d07e22472c58fda8c84bcad569df Binary files /dev/null and b/dataset/clean/014.jpg differ diff --git a/dataset/clean/015.jpg b/dataset/clean/015.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fc4b1de18e8c604caba975ca9ea72c87c9fec806 Binary files /dev/null and b/dataset/clean/015.jpg differ diff --git a/dataset/clean/016.jpg b/dataset/clean/016.jpg new file mode 100644 index 0000000000000000000000000000000000000000..19f2ee2a6042c35a9f3e47c40333dac89c265c1b Binary files /dev/null and b/dataset/clean/016.jpg differ diff --git a/dataset/clean/017.jpg b/dataset/clean/017.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f51b9e1041a572dbaf2aff6a3a565f5e81d10b6e Binary files /dev/null and b/dataset/clean/017.jpg differ diff --git a/dataset/clean/018.jpg b/dataset/clean/018.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2079398e5a4dfde9012472eb4307858d75a83088 Binary files /dev/null and b/dataset/clean/018.jpg differ diff --git a/dataset/clean/019.jpg b/dataset/clean/019.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6fa7e2e8bb1fb6844aacb90938ae1e43dd4d6807 Binary files /dev/null and b/dataset/clean/019.jpg differ diff --git a/dataset/clean/020.jpg b/dataset/clean/020.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e4a2142fc256f195e2a20e7d382658a74cdf54f7 Binary files /dev/null and b/dataset/clean/020.jpg differ diff --git a/dataset/clean/021.jpg b/dataset/clean/021.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a3c50487efef482e996b2995b23f28a11dbbcfa5 Binary files /dev/null and b/dataset/clean/021.jpg differ diff --git a/dataset/clean/022.jpg b/dataset/clean/022.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..3336118fa1679c668da9839347b6404becc5ba96 Binary files /dev/null and b/dataset/clean/022.jpg differ diff --git a/dataset/clean/023.jpg b/dataset/clean/023.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d2ff5f71cd98ebe3bc009999dbb669512aadfe04 Binary files /dev/null and b/dataset/clean/023.jpg differ diff --git a/dataset/clean/024.jpg b/dataset/clean/024.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c30e5d2dc1860dbade80e93b5a0e2dbbbc1a658f Binary files /dev/null and b/dataset/clean/024.jpg differ diff --git a/dataset/clean/025.jpg b/dataset/clean/025.jpg new file mode 100644 index 0000000000000000000000000000000000000000..022d08b370cbc1c4940c9408bd8e2df8a1f6ba6a Binary files /dev/null and b/dataset/clean/025.jpg differ diff --git a/dataset/clean/026.jpg b/dataset/clean/026.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5f004e46fad4badf6f83988a117bc0a809fd445e Binary files /dev/null and b/dataset/clean/026.jpg differ diff --git a/dataset/clean/027.jpg b/dataset/clean/027.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b52f63c7f5fbfee27a11612a732190a1201a1a19 Binary files /dev/null and b/dataset/clean/027.jpg differ diff --git a/dataset/clean/028.jpg b/dataset/clean/028.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d758f74c8ff2abd63affeddbaa7f2f4125f720e0 Binary files /dev/null and b/dataset/clean/028.jpg differ diff --git a/dataset/clean/029.jpg b/dataset/clean/029.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3fac68ac4c3300494ba06c68163f48689054b3c1 Binary files /dev/null and b/dataset/clean/029.jpg differ diff --git a/dataset/clean/030.jpg b/dataset/clean/030.jpg new file mode 100644 index 0000000000000000000000000000000000000000..232829d628716175cbe8b0fe23e9cc0cc8bb00e5 Binary files /dev/null and b/dataset/clean/030.jpg differ diff --git a/dataset/clean/031.jpg 
b/dataset/clean/031.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8542081782c92d2aa36f8c5b16c7d401e0c12824 Binary files /dev/null and b/dataset/clean/031.jpg differ diff --git a/dataset/clean/032.jpg b/dataset/clean/032.jpg new file mode 100644 index 0000000000000000000000000000000000000000..017041d390b8aa703ffaa6637e3eee5d32a82c0e Binary files /dev/null and b/dataset/clean/032.jpg differ diff --git a/dataset/clean/033.jpg b/dataset/clean/033.jpg new file mode 100644 index 0000000000000000000000000000000000000000..583584fa5497f2dee08329657b48d32a5c420055 Binary files /dev/null and b/dataset/clean/033.jpg differ diff --git a/dataset/clean/034.jpg b/dataset/clean/034.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8feb44b86633c6773fdad004deb2ef0942f425f5 Binary files /dev/null and b/dataset/clean/034.jpg differ diff --git a/dataset/clean/035.jpg b/dataset/clean/035.jpg new file mode 100644 index 0000000000000000000000000000000000000000..599318ca4e4502310df8d783175b7c8b3b2b2a09 Binary files /dev/null and b/dataset/clean/035.jpg differ diff --git a/dataset/clean/036.jpg b/dataset/clean/036.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3954d7ccb385862db404ced34559ca3f51bdfac2 Binary files /dev/null and b/dataset/clean/036.jpg differ diff --git a/dataset/clean/037.jpg b/dataset/clean/037.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8fbd2335f331d78e42a002c46b17b546af9c7a7b Binary files /dev/null and b/dataset/clean/037.jpg differ diff --git a/dataset/clean/038.jpg b/dataset/clean/038.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2701f160c2a900723bcab6785b62bb16dd23d3e8 Binary files /dev/null and b/dataset/clean/038.jpg differ diff --git a/dataset/clean/039.jpg b/dataset/clean/039.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b95a790e1bdb71cb03d16e3b0200d81d56b18c18 Binary files /dev/null and 
b/dataset/clean/039.jpg differ diff --git a/dataset/clean/040.jpg b/dataset/clean/040.jpg new file mode 100644 index 0000000000000000000000000000000000000000..429ea9c870b5cf2fd815de0a17a79a9a0805bbde Binary files /dev/null and b/dataset/clean/040.jpg differ diff --git a/dataset/clean/041.jpg b/dataset/clean/041.jpg new file mode 100644 index 0000000000000000000000000000000000000000..37f81862b79854a64161f3a87e7284c4e184f4fc Binary files /dev/null and b/dataset/clean/041.jpg differ diff --git a/dataset/clean/042.jpg b/dataset/clean/042.jpg new file mode 100644 index 0000000000000000000000000000000000000000..127e21f3f345b396756b1aa43646ba33373bbb5c Binary files /dev/null and b/dataset/clean/042.jpg differ diff --git a/dataset/clean/043.jpg b/dataset/clean/043.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9390a824a5606968894559ffc79d81aac2ea465a Binary files /dev/null and b/dataset/clean/043.jpg differ diff --git a/dataset/clean/044.jpg b/dataset/clean/044.jpg new file mode 100644 index 0000000000000000000000000000000000000000..12e4d0963862076f176fa9d3ad2e46c8824082f1 Binary files /dev/null and b/dataset/clean/044.jpg differ diff --git a/dataset/clean/045.jpg b/dataset/clean/045.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6f35577c276735b3c06674ae9f68e332964a1ec2 Binary files /dev/null and b/dataset/clean/045.jpg differ diff --git a/dataset/clean/046.jpg b/dataset/clean/046.jpg new file mode 100644 index 0000000000000000000000000000000000000000..20bf414d016639e947b00692e297b1340030c573 Binary files /dev/null and b/dataset/clean/046.jpg differ diff --git a/dataset/clean/047.jpg b/dataset/clean/047.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e1bf2122d17f245f72d4cb47671ed842df582d9c Binary files /dev/null and b/dataset/clean/047.jpg differ diff --git a/dataset/clean/048.jpg b/dataset/clean/048.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..1b839574e57729a411942d76d33c99e2f8d9fa13 Binary files /dev/null and b/dataset/clean/048.jpg differ diff --git a/dataset/clean/049.jpg b/dataset/clean/049.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7f5d5fdbe0865c192114b03bbe1b74e9d6f12b76 Binary files /dev/null and b/dataset/clean/049.jpg differ diff --git a/dataset/clean/050.jpg b/dataset/clean/050.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5b9bb31588809f0a25acb0340ffe6e8c0c2d8e9d Binary files /dev/null and b/dataset/clean/050.jpg differ diff --git a/dataset/clean/051.jpg b/dataset/clean/051.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6c8c87161d50a0f2e40725438f27b60ad7e7cfe1 Binary files /dev/null and b/dataset/clean/051.jpg differ diff --git a/dataset/clean/052.jpg b/dataset/clean/052.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e73c99ccee7900d1a0dbdc2b98804d453fab788c Binary files /dev/null and b/dataset/clean/052.jpg differ diff --git a/dataset/clean/053.jpg b/dataset/clean/053.jpg new file mode 100644 index 0000000000000000000000000000000000000000..050e4e0635a6d7d4d56a5936837865d3125b7209 Binary files /dev/null and b/dataset/clean/053.jpg differ diff --git a/dataset/clean/054.jpg b/dataset/clean/054.jpg new file mode 100644 index 0000000000000000000000000000000000000000..01483b1ea4c7afb3d16a4d3265db89f6f2d21edc Binary files /dev/null and b/dataset/clean/054.jpg differ diff --git a/dataset/clean/055.jpg b/dataset/clean/055.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3851b74d7075a8b7ab71fc83d1b0ab8b225e53e0 Binary files /dev/null and b/dataset/clean/055.jpg differ diff --git a/dataset/clean/056.jpg b/dataset/clean/056.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b34d0459296e86ecc9e9a6fa99b7f6449adaa8ac Binary files /dev/null and b/dataset/clean/056.jpg differ diff --git a/dataset/clean/057.jpg 
b/dataset/clean/057.jpg new file mode 100644 index 0000000000000000000000000000000000000000..08341bddd260000c6d292cfcf009787a8a23e686 Binary files /dev/null and b/dataset/clean/057.jpg differ diff --git a/dataset/clean/058.jpg b/dataset/clean/058.jpg new file mode 100644 index 0000000000000000000000000000000000000000..269acc1d0335245c28aa97107c41aa2ff64909ea Binary files /dev/null and b/dataset/clean/058.jpg differ diff --git a/dataset/clean/059.jpg b/dataset/clean/059.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d508e58798fc67bd307b24a7314ce2ee4e7403c5 Binary files /dev/null and b/dataset/clean/059.jpg differ diff --git a/dataset/clean/060.jpg b/dataset/clean/060.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fd33d0e0bc8ca398d2a785373fffd5b0c36fad9c Binary files /dev/null and b/dataset/clean/060.jpg differ diff --git a/dataset/clean/061.jpg b/dataset/clean/061.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4a43d8e905dfa1dd53ab8cbb94ef5ae422576186 Binary files /dev/null and b/dataset/clean/061.jpg differ diff --git a/dataset/clean/062.jpg b/dataset/clean/062.jpg new file mode 100644 index 0000000000000000000000000000000000000000..158a2883789a5eac5a502e36ab4c06aff56658e1 Binary files /dev/null and b/dataset/clean/062.jpg differ diff --git a/dataset/clean/063.jpg b/dataset/clean/063.jpg new file mode 100644 index 0000000000000000000000000000000000000000..050d88999ed88b23f7b8e708e64929b4dd632f6b Binary files /dev/null and b/dataset/clean/063.jpg differ diff --git a/dataset/clean/064.jpg b/dataset/clean/064.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8d0d43ced526bede0c98f38e35217d3f3582ccd2 Binary files /dev/null and b/dataset/clean/064.jpg differ diff --git a/dataset/clean/065.jpg b/dataset/clean/065.jpg new file mode 100644 index 0000000000000000000000000000000000000000..41875d7d23b59fe7789107ab5bae89e6ff94bb86 Binary files /dev/null and 
b/dataset/clean/065.jpg differ diff --git a/dataset/clean/066.jpg b/dataset/clean/066.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d6db8026eec7b4fc1fecbc6c79c5d07f72e52964 Binary files /dev/null and b/dataset/clean/066.jpg differ diff --git a/dataset/clean/067.jpg b/dataset/clean/067.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b3a4dec453524bb981f94b519eac91d97e2452ae Binary files /dev/null and b/dataset/clean/067.jpg differ diff --git a/dataset/clean/068.jpg b/dataset/clean/068.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7b311a5fb00c4c7f03de8586413f5bc7116c85e4 Binary files /dev/null and b/dataset/clean/068.jpg differ diff --git a/dataset/clean/069.jpg b/dataset/clean/069.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5e04a6fa1e4af03f4df11337fd9183667f8ef02d Binary files /dev/null and b/dataset/clean/069.jpg differ diff --git a/dataset/clean/070.jpg b/dataset/clean/070.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3eddf407a24db59ade5b9895dcf4ab1b797d13c7 Binary files /dev/null and b/dataset/clean/070.jpg differ diff --git a/dataset/clean/071.jpg b/dataset/clean/071.jpg new file mode 100644 index 0000000000000000000000000000000000000000..107b0be4677eafc0543537e4b51401fa40630d0f Binary files /dev/null and b/dataset/clean/071.jpg differ diff --git a/dataset/clean/072.jpg b/dataset/clean/072.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1c8fc0b4eaab34a71a0717f175aa93c5f8f41659 Binary files /dev/null and b/dataset/clean/072.jpg differ diff --git a/dataset/clean/073.jpg b/dataset/clean/073.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c809636e1486910762d2c61ee8e52b43f7edca31 Binary files /dev/null and b/dataset/clean/073.jpg differ diff --git a/dataset/clean/074.jpg b/dataset/clean/074.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..37e0dd18cb768a5866e1dde41803a45860f8049d Binary files /dev/null and b/dataset/clean/074.jpg differ diff --git a/dataset/clean/075.jpg b/dataset/clean/075.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c984d0aa189fbadd1d6928c2ee89284b5254d1b3 Binary files /dev/null and b/dataset/clean/075.jpg differ diff --git a/dataset/clean/076.jpg b/dataset/clean/076.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b4c0527c90a84763651864b0bc583ca11bd05783 Binary files /dev/null and b/dataset/clean/076.jpg differ diff --git a/dataset/clean/077.jpg b/dataset/clean/077.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d23e651fa622efba394b7610c90be6348d81ce97 Binary files /dev/null and b/dataset/clean/077.jpg differ diff --git a/dataset/clean/078.jpg b/dataset/clean/078.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4d710811bf2ed2eef0f7ee1d6f612200640ddd00 Binary files /dev/null and b/dataset/clean/078.jpg differ diff --git a/dataset/clean/079.jpg b/dataset/clean/079.jpg new file mode 100644 index 0000000000000000000000000000000000000000..eb6810bd4797325bae5cd7024beee3ec790486e9 Binary files /dev/null and b/dataset/clean/079.jpg differ diff --git a/dataset/clean/080.jpg b/dataset/clean/080.jpg new file mode 100644 index 0000000000000000000000000000000000000000..818eabc1d651bfc25a43b394b52e4135db6dd63b Binary files /dev/null and b/dataset/clean/080.jpg differ diff --git a/dataset/clean/081.jpg b/dataset/clean/081.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a33f6f8c8a619fa02dac6842e7ecc134d9d95cbf Binary files /dev/null and b/dataset/clean/081.jpg differ diff --git a/dataset/clean/082.jpg b/dataset/clean/082.jpg new file mode 100644 index 0000000000000000000000000000000000000000..671ca29ad408e394f8107ef8d245652b91ce1aae Binary files /dev/null and b/dataset/clean/082.jpg differ diff --git a/dataset/clean/083.jpg 
b/dataset/clean/083.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9abde83763510cb8833b93747106279dd4e2ee1d Binary files /dev/null and b/dataset/clean/083.jpg differ diff --git a/dataset/clean/084.jpg b/dataset/clean/084.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8b75925fc83f88fd5dce12d57e4a9d11431e12a3 Binary files /dev/null and b/dataset/clean/084.jpg differ diff --git a/dataset/clean/085.jpg b/dataset/clean/085.jpg new file mode 100644 index 0000000000000000000000000000000000000000..436372f417c12a77c38675e6dc7743f0938b95d8 Binary files /dev/null and b/dataset/clean/085.jpg differ diff --git a/dataset/clean/086.jpg b/dataset/clean/086.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3c03a3bd5602a3154aa7c7dde350ba0fc3db097b Binary files /dev/null and b/dataset/clean/086.jpg differ diff --git a/dataset/clean/087.jpg b/dataset/clean/087.jpg new file mode 100644 index 0000000000000000000000000000000000000000..01a4628c25bbd90b4cbb9235492139ac1923592f Binary files /dev/null and b/dataset/clean/087.jpg differ diff --git a/dataset/clean/088.jpg b/dataset/clean/088.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b801e70c81f1c4903ec2f0cc43fd56def50c3aef Binary files /dev/null and b/dataset/clean/088.jpg differ diff --git a/dataset/clean/089.jpg b/dataset/clean/089.jpg new file mode 100644 index 0000000000000000000000000000000000000000..59dc51bcfcc617eff9f40adc40a627cc5ad4228c Binary files /dev/null and b/dataset/clean/089.jpg differ diff --git a/dataset/clean/090.jpg b/dataset/clean/090.jpg new file mode 100644 index 0000000000000000000000000000000000000000..62ba34696b0a7e7b4d53e5f6d4df83a48bb1ee0f Binary files /dev/null and b/dataset/clean/090.jpg differ diff --git a/dataset/clean/091.jpg b/dataset/clean/091.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e4bea952975ff37119169ae923bc77b7bb61b66d Binary files /dev/null and 
b/dataset/clean/091.jpg differ diff --git a/dataset/clean/092.jpg b/dataset/clean/092.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a4cd4ac7d7f9e614c9379b67e00d7f0342d8b039 Binary files /dev/null and b/dataset/clean/092.jpg differ diff --git a/dataset/clean/093.jpg b/dataset/clean/093.jpg new file mode 100644 index 0000000000000000000000000000000000000000..10a6e995738761ca3ec1faaf187bc7f6b409d7df Binary files /dev/null and b/dataset/clean/093.jpg differ diff --git a/dataset/clean/094.jpg b/dataset/clean/094.jpg new file mode 100644 index 0000000000000000000000000000000000000000..73aef3817d4a0f12229e824697da3dfde11673e7 Binary files /dev/null and b/dataset/clean/094.jpg differ diff --git a/dataset/clean/095.jpg b/dataset/clean/095.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4e45ee24a4e4b69ae6aecbd3662f85f6b80bce91 Binary files /dev/null and b/dataset/clean/095.jpg differ diff --git a/dataset/clean/096.jpg b/dataset/clean/096.jpg new file mode 100644 index 0000000000000000000000000000000000000000..969ab02cf56b71f39a5e089e132f52d58ee29a06 Binary files /dev/null and b/dataset/clean/096.jpg differ diff --git a/dataset/clean/097.jpg b/dataset/clean/097.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4896e145cae2f25803f6114fb23e420b892759f7 Binary files /dev/null and b/dataset/clean/097.jpg differ diff --git a/dataset/clean/098.jpg b/dataset/clean/098.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f37360c22a7b8e92dd0f5e5bd33bec970fa45d90 Binary files /dev/null and b/dataset/clean/098.jpg differ diff --git a/dataset/clean/099.jpg b/dataset/clean/099.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d78dc6598ac47e2e57319cfbe1fc10609eed1e2c Binary files /dev/null and b/dataset/clean/099.jpg differ diff --git a/dataset/clean/100.jpg b/dataset/clean/100.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..e70c2f6874f5779708c3f8ea7384dad840e40f19 Binary files /dev/null and b/dataset/clean/100.jpg differ diff --git a/dataset/clean/101.jpg b/dataset/clean/101.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5829f49d8cbfc3cc0dc6acdfe73739cbaf6bd447 Binary files /dev/null and b/dataset/clean/101.jpg differ diff --git a/dataset/clean/102.jpg b/dataset/clean/102.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6a0cfad84278e89522495d73096301bf2a9c7ba8 Binary files /dev/null and b/dataset/clean/102.jpg differ diff --git a/dataset/clean/103.jpg b/dataset/clean/103.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c94ffbd80b78b3d8e349e7673b68c52096568057 Binary files /dev/null and b/dataset/clean/103.jpg differ diff --git a/dataset/clean/104.jpg b/dataset/clean/104.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9dd59bcafd8818676ffc05db7f3cfc721201c380 Binary files /dev/null and b/dataset/clean/104.jpg differ diff --git a/dataset/clean/105.jpg b/dataset/clean/105.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c337afd8e6ecacb963b3227285fcb130289b6015 Binary files /dev/null and b/dataset/clean/105.jpg differ diff --git a/dataset/clean/106.jpg b/dataset/clean/106.jpg new file mode 100644 index 0000000000000000000000000000000000000000..40360afb25115263128118117dfa91faa92cdc65 Binary files /dev/null and b/dataset/clean/106.jpg differ diff --git a/dataset/clean/107.jpg b/dataset/clean/107.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6ba3ea3e17efb14211598aa05f86d98292cc61a5 Binary files /dev/null and b/dataset/clean/107.jpg differ diff --git a/dataset/clean/108.jpg b/dataset/clean/108.jpg new file mode 100644 index 0000000000000000000000000000000000000000..afe0aed628684ac523a23d12b0b6544f92e9049b Binary files /dev/null and b/dataset/clean/108.jpg differ diff --git a/dataset/clean/109.jpg 
b/dataset/clean/109.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b094481461036bc98a89a2f263c9191b1e0a8194 Binary files /dev/null and b/dataset/clean/109.jpg differ diff --git a/dataset/clean/110.jpg b/dataset/clean/110.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b2391fb00b10e3d40bed85d3b9f38e6ff7d4caa5 Binary files /dev/null and b/dataset/clean/110.jpg differ diff --git a/dataset/clean/111.jpg b/dataset/clean/111.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d99b1e7bee1054995a6f840a9eb34fca04582c02 Binary files /dev/null and b/dataset/clean/111.jpg differ diff --git a/dataset/clean/112.jpg b/dataset/clean/112.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5c8496d65c4808ba88930872d9da2865d107a7c4 Binary files /dev/null and b/dataset/clean/112.jpg differ diff --git a/dataset/clean/113.jpg b/dataset/clean/113.jpg new file mode 100644 index 0000000000000000000000000000000000000000..376f2cc5572344c6bbeecd86e1be64c28ea10602 Binary files /dev/null and b/dataset/clean/113.jpg differ diff --git a/dataset/clean/114.jpg b/dataset/clean/114.jpg new file mode 100644 index 0000000000000000000000000000000000000000..02abbabc104026ffbb4471211c2cb60aaa93a74e Binary files /dev/null and b/dataset/clean/114.jpg differ diff --git a/dataset/clean/115.jpg b/dataset/clean/115.jpg new file mode 100644 index 0000000000000000000000000000000000000000..860a5b3422a66146f546b82dbb9eef1746dd3742 Binary files /dev/null and b/dataset/clean/115.jpg differ diff --git a/dataset/clean/116.jpg b/dataset/clean/116.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8255b4e877afdfe873936ee5b31eb7a08c7d9855 Binary files /dev/null and b/dataset/clean/116.jpg differ diff --git a/dataset/clean/117.jpg b/dataset/clean/117.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b0a0ae8d5ebe67cf731c2c732ce7e9d914841db3 Binary files /dev/null and 
b/dataset/clean/117.jpg differ diff --git a/dataset/clean/118.jpg b/dataset/clean/118.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c44f7e2e2d71901e32db845d7d2fd28c194f015e Binary files /dev/null and b/dataset/clean/118.jpg differ diff --git a/dataset/clean/119.jpg b/dataset/clean/119.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ff44d13085e01795e9bfd16258534fd6b859b8b2 Binary files /dev/null and b/dataset/clean/119.jpg differ diff --git a/dataset/clean/120.jpg b/dataset/clean/120.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b4214c3689f3be100cf374583065fcdc6e5f17db Binary files /dev/null and b/dataset/clean/120.jpg differ diff --git a/dataset/clean/121.jpg b/dataset/clean/121.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6d35fab280b21e4c04a46de9890e4fed605472ee Binary files /dev/null and b/dataset/clean/121.jpg differ diff --git a/dataset/clean/122.jpg b/dataset/clean/122.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b8bcc670cde9ee2e8533b3296c5fbedc23dd1120 Binary files /dev/null and b/dataset/clean/122.jpg differ diff --git a/dataset/clean/123.jpg b/dataset/clean/123.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0c546fe344495c0bf3d5fd2bc0ceda90e1fbda09 Binary files /dev/null and b/dataset/clean/123.jpg differ diff --git a/dataset/clean/124.jpg b/dataset/clean/124.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ecfd2ec13ad346c8f05c5c562acd0a3670edaed8 Binary files /dev/null and b/dataset/clean/124.jpg differ diff --git a/dataset/clean/125.jpg b/dataset/clean/125.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7e52e9c7059eb315cf5f9d73b361d88b91bf2689 Binary files /dev/null and b/dataset/clean/125.jpg differ diff --git a/dataset/clean/126.jpg b/dataset/clean/126.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..531e6537085b7d953a4afd13515503303d38881a Binary files /dev/null and b/dataset/clean/126.jpg differ diff --git a/dataset/clean/127.jpg b/dataset/clean/127.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e0d526874989ce230163e02507aa78aece179ced Binary files /dev/null and b/dataset/clean/127.jpg differ diff --git a/dataset/clean/128.jpg b/dataset/clean/128.jpg new file mode 100644 index 0000000000000000000000000000000000000000..24b8bb94ee6aea669564d8573584f3cd387ab580 Binary files /dev/null and b/dataset/clean/128.jpg differ diff --git a/dataset/clean/129.jpg b/dataset/clean/129.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e03d2926eff1afcfd2611436c30a0f5138909d0b Binary files /dev/null and b/dataset/clean/129.jpg differ diff --git a/dataset/clean/130.jpg b/dataset/clean/130.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2093cfadba0e44b1309c44df25c99aa2dd3de3b7 Binary files /dev/null and b/dataset/clean/130.jpg differ diff --git a/dataset/clean/131.jpg b/dataset/clean/131.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7646570277b7ce25c1e05653cd77108434f0a05c Binary files /dev/null and b/dataset/clean/131.jpg differ diff --git a/dataset/clean/132.jpg b/dataset/clean/132.jpg new file mode 100644 index 0000000000000000000000000000000000000000..da3306402182447b7e5e4e8c539daf4074a09929 Binary files /dev/null and b/dataset/clean/132.jpg differ diff --git a/dataset/clean/133.jpg b/dataset/clean/133.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0c27d367923da2a9175d8833f2d23c9932642300 Binary files /dev/null and b/dataset/clean/133.jpg differ diff --git a/dataset/clean/134.jpg b/dataset/clean/134.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7d3de9a78f15d5626c091102350d6a6cbef84fb6 Binary files /dev/null and b/dataset/clean/134.jpg differ diff --git a/dataset/clean/135.jpg 
b/dataset/clean/135.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c0061b13251378b311d554883e39e4c5b0e7046e Binary files /dev/null and b/dataset/clean/135.jpg differ diff --git a/dataset/clean/136.jpg b/dataset/clean/136.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b30a247019389fe84fc2a421ed855192e4c5252f Binary files /dev/null and b/dataset/clean/136.jpg differ diff --git a/dataset/clean/137.jpg b/dataset/clean/137.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bd62749a0e32898d3b5b9cdf72bbe95c7e646d58 Binary files /dev/null and b/dataset/clean/137.jpg differ diff --git a/dataset/clean/138.jpg b/dataset/clean/138.jpg new file mode 100644 index 0000000000000000000000000000000000000000..955d3f1bab4d78080e2c80b68539c25a0841565c Binary files /dev/null and b/dataset/clean/138.jpg differ diff --git a/dataset/clean/139.jpg b/dataset/clean/139.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4a6ed164a0ca42b69b75ca8aa12c94e0ced2eb1a Binary files /dev/null and b/dataset/clean/139.jpg differ diff --git a/dataset/clean/140.jpg b/dataset/clean/140.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0dde00f4614f0fa3979231ab855f35a24021dbea Binary files /dev/null and b/dataset/clean/140.jpg differ diff --git a/dataset/clean/141.jpg b/dataset/clean/141.jpg new file mode 100644 index 0000000000000000000000000000000000000000..073c8e4e501b58dc331de1bfe0dea2ebbf5e1305 Binary files /dev/null and b/dataset/clean/141.jpg differ diff --git a/dataset/clean/142.jpg b/dataset/clean/142.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c8da0fafe41d895eafe56a8ba94b521c28f306a3 Binary files /dev/null and b/dataset/clean/142.jpg differ diff --git a/dataset/clean/143.jpg b/dataset/clean/143.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fbaf1a79e9e5ea57a313d0ca719f3321e7552426 Binary files /dev/null and 
b/dataset/clean/143.jpg differ diff --git a/dataset/clean/144.jpg b/dataset/clean/144.jpg new file mode 100644 index 0000000000000000000000000000000000000000..85244e4d0bdf251b14eb15a41bd1ae60c6f68573 Binary files /dev/null and b/dataset/clean/144.jpg differ diff --git a/dataset/clean/145.jpg b/dataset/clean/145.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4c060b8d6323c228a625702dceff313b70ad4c0b Binary files /dev/null and b/dataset/clean/145.jpg differ diff --git a/dataset/clean/146.jpg b/dataset/clean/146.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0f7594cab71d7c9483e485555a73b604aa558e94 Binary files /dev/null and b/dataset/clean/146.jpg differ diff --git a/dataset/clean/147.jpg b/dataset/clean/147.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e3bb0c57d2be168fe4a09d385ff4fcec58612f9e Binary files /dev/null and b/dataset/clean/147.jpg differ diff --git a/dataset/clean/148.jpg b/dataset/clean/148.jpg new file mode 100644 index 0000000000000000000000000000000000000000..88b20ac320d77e314439b284918c9c6cd8d0e8c0 Binary files /dev/null and b/dataset/clean/148.jpg differ diff --git a/dataset/clean/149.jpg b/dataset/clean/149.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5e6c2f17a6db654a1bc03e8f94d2b09df2ab257c Binary files /dev/null and b/dataset/clean/149.jpg differ diff --git a/dataset/noisy/000.jpg b/dataset/noisy/000.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f884ebd22102e0e2e5e823cf9a76d7a8586a035a Binary files /dev/null and b/dataset/noisy/000.jpg differ diff --git a/dataset/noisy/001.jpg b/dataset/noisy/001.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b16e357a2dcaf1885d78e5a9b007a06a31062168 Binary files /dev/null and b/dataset/noisy/001.jpg differ diff --git a/dataset/noisy/002.jpg b/dataset/noisy/002.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..842cb29e95e79fc2309841557603bd9d67a10347 Binary files /dev/null and b/dataset/noisy/002.jpg differ diff --git a/dataset/noisy/003.jpg b/dataset/noisy/003.jpg new file mode 100644 index 0000000000000000000000000000000000000000..96d2bd053414041fc8edaa03e0383362a2790c48 Binary files /dev/null and b/dataset/noisy/003.jpg differ diff --git a/dataset/noisy/004.jpg b/dataset/noisy/004.jpg new file mode 100644 index 0000000000000000000000000000000000000000..63de6da62a8453c38597ab3665a5b3f9d4dd9868 Binary files /dev/null and b/dataset/noisy/004.jpg differ diff --git a/dataset/noisy/005.jpg b/dataset/noisy/005.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bccea79b0da24ad2b23c4673e1bf19a87388c77f Binary files /dev/null and b/dataset/noisy/005.jpg differ diff --git a/dataset/noisy/006.jpg b/dataset/noisy/006.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0900fb901b0d5491ce4dd4f75ca0c8cda1645129 Binary files /dev/null and b/dataset/noisy/006.jpg differ diff --git a/dataset/noisy/007.jpg b/dataset/noisy/007.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d60670bea3aeec4cb1841c3fb40b7704b4a0524d Binary files /dev/null and b/dataset/noisy/007.jpg differ diff --git a/dataset/noisy/008.jpg b/dataset/noisy/008.jpg new file mode 100644 index 0000000000000000000000000000000000000000..46e315338928039ad3696c6a64c11c92a7d2b629 Binary files /dev/null and b/dataset/noisy/008.jpg differ diff --git a/dataset/noisy/009.jpg b/dataset/noisy/009.jpg new file mode 100644 index 0000000000000000000000000000000000000000..355f4ac7679e9d004e71ee116ed292bfa0dd7c2c Binary files /dev/null and b/dataset/noisy/009.jpg differ diff --git a/dataset/noisy/010.jpg b/dataset/noisy/010.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8cd23bd38b366bdcc7d0c7c248c92b2681880067 Binary files /dev/null and b/dataset/noisy/010.jpg differ diff --git a/dataset/noisy/011.jpg 
b/dataset/noisy/011.jpg new file mode 100644 index 0000000000000000000000000000000000000000..12985804fcf64ad1c0572238e465c00677a8d426 Binary files /dev/null and b/dataset/noisy/011.jpg differ diff --git a/dataset/noisy/012.jpg b/dataset/noisy/012.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e1a5209e11f71513ad344579c65bfa1c3ab447fe Binary files /dev/null and b/dataset/noisy/012.jpg differ diff --git a/dataset/noisy/013.jpg b/dataset/noisy/013.jpg new file mode 100644 index 0000000000000000000000000000000000000000..40865f962714ba71d9964bf589a9cc0be8b40b3b Binary files /dev/null and b/dataset/noisy/013.jpg differ diff --git a/dataset/noisy/014.jpg b/dataset/noisy/014.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3c64f8d3c46f17c7d7d0c4fa35813536f536124e Binary files /dev/null and b/dataset/noisy/014.jpg differ diff --git a/dataset/noisy/015.jpg b/dataset/noisy/015.jpg new file mode 100644 index 0000000000000000000000000000000000000000..cee50f8d7742512f360e0cbd2e7e203e2765c002 Binary files /dev/null and b/dataset/noisy/015.jpg differ diff --git a/dataset/noisy/016.jpg b/dataset/noisy/016.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f6a24f8dd014db1e75e79fba0fa5b070e2848100 Binary files /dev/null and b/dataset/noisy/016.jpg differ diff --git a/dataset/noisy/017.jpg b/dataset/noisy/017.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b3a8a68c138d9abc3d41933ab66569d6ea2d082a Binary files /dev/null and b/dataset/noisy/017.jpg differ diff --git a/dataset/noisy/018.jpg b/dataset/noisy/018.jpg new file mode 100644 index 0000000000000000000000000000000000000000..529f7341ec8001504cef945a45d2436239cee053 Binary files /dev/null and b/dataset/noisy/018.jpg differ diff --git a/dataset/noisy/019.jpg b/dataset/noisy/019.jpg new file mode 100644 index 0000000000000000000000000000000000000000..42413ba01f5274c91c87d0d788a28305fceec830 Binary files /dev/null and 
b/dataset/noisy/019.jpg differ diff --git a/dataset/noisy/020.jpg b/dataset/noisy/020.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7cbac3c58e0cb88e7b315ced983fca34a2636849 Binary files /dev/null and b/dataset/noisy/020.jpg differ diff --git a/dataset/noisy/021.jpg b/dataset/noisy/021.jpg new file mode 100644 index 0000000000000000000000000000000000000000..00cf23350c559d654b8cc4fb39d4e7bed5974391 Binary files /dev/null and b/dataset/noisy/021.jpg differ diff --git a/dataset/noisy/022.jpg b/dataset/noisy/022.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c1e2066460af7c64d3477dcb35fa975240b5c154 Binary files /dev/null and b/dataset/noisy/022.jpg differ diff --git a/dataset/noisy/023.jpg b/dataset/noisy/023.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6831d8ff0e37790fb6458c54ad31caba198164ad Binary files /dev/null and b/dataset/noisy/023.jpg differ diff --git a/dataset/noisy/024.jpg b/dataset/noisy/024.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1cfce593daf0eb5af793dbeb9c75b573acf014b4 Binary files /dev/null and b/dataset/noisy/024.jpg differ diff --git a/dataset/noisy/025.jpg b/dataset/noisy/025.jpg new file mode 100644 index 0000000000000000000000000000000000000000..05d104f9ec719e33564218a15b5c8c359be3baad Binary files /dev/null and b/dataset/noisy/025.jpg differ diff --git a/dataset/noisy/026.jpg b/dataset/noisy/026.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4acf7c2371cad1b47d53fbe24893a51af0eef8f4 Binary files /dev/null and b/dataset/noisy/026.jpg differ diff --git a/dataset/noisy/027.jpg b/dataset/noisy/027.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8f1e5b80e0da25e0644058004730e6d8c86766b2 Binary files /dev/null and b/dataset/noisy/027.jpg differ diff --git a/dataset/noisy/028.jpg b/dataset/noisy/028.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..1b2c03ff158aba463bf783d9373a4f853c272bce Binary files /dev/null and b/dataset/noisy/028.jpg differ diff --git a/dataset/noisy/029.jpg b/dataset/noisy/029.jpg new file mode 100644 index 0000000000000000000000000000000000000000..eb934a662efea4001edcd4633d132f9a39d9ead6 Binary files /dev/null and b/dataset/noisy/029.jpg differ diff --git a/dataset/noisy/030.jpg b/dataset/noisy/030.jpg new file mode 100644 index 0000000000000000000000000000000000000000..70f3230551f7e8a295bce2d74e19bbc452af8816 Binary files /dev/null and b/dataset/noisy/030.jpg differ diff --git a/dataset/noisy/031.jpg b/dataset/noisy/031.jpg new file mode 100644 index 0000000000000000000000000000000000000000..edb9b9c78cc1bbd50607a857a0fd1b04da88b998 Binary files /dev/null and b/dataset/noisy/031.jpg differ diff --git a/dataset/noisy/032.jpg b/dataset/noisy/032.jpg new file mode 100644 index 0000000000000000000000000000000000000000..866c09bdc14878be0e72d7f6ce2cb4deb1fb434c Binary files /dev/null and b/dataset/noisy/032.jpg differ diff --git a/dataset/noisy/033.jpg b/dataset/noisy/033.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7c21da06f6f94389db9bd28a36a22235fde576ee Binary files /dev/null and b/dataset/noisy/033.jpg differ diff --git a/dataset/noisy/034.jpg b/dataset/noisy/034.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a10bf87d7194145edcc8cbfe5533f6917380eca4 Binary files /dev/null and b/dataset/noisy/034.jpg differ diff --git a/dataset/noisy/035.jpg b/dataset/noisy/035.jpg new file mode 100644 index 0000000000000000000000000000000000000000..470de8411da8f4a5cfcbde90e1e938d83230bd66 Binary files /dev/null and b/dataset/noisy/035.jpg differ diff --git a/dataset/noisy/036.jpg b/dataset/noisy/036.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c72687db092085c6917fb5236585d4c86b3d3738 Binary files /dev/null and b/dataset/noisy/036.jpg differ diff --git a/dataset/noisy/037.jpg 
b/dataset/noisy/037.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0eb221ee4b127adfb80cc5668ee72c1b350c159d Binary files /dev/null and b/dataset/noisy/037.jpg differ diff --git a/dataset/noisy/038.jpg b/dataset/noisy/038.jpg new file mode 100644 index 0000000000000000000000000000000000000000..dfd5c43df5a938366c8e1ce557c43182af7afcaa Binary files /dev/null and b/dataset/noisy/038.jpg differ diff --git a/dataset/noisy/039.jpg b/dataset/noisy/039.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2b5789eaa90d7446c817bbda673cd5be64a2c540 Binary files /dev/null and b/dataset/noisy/039.jpg differ diff --git a/dataset/noisy/040.jpg b/dataset/noisy/040.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ce73a11f4fe8e809b9f4ff03161c6840619aede7 Binary files /dev/null and b/dataset/noisy/040.jpg differ diff --git a/dataset/noisy/041.jpg b/dataset/noisy/041.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3f0d20d3dc4740662ce95d4e6c4e0cbc267ae409 Binary files /dev/null and b/dataset/noisy/041.jpg differ diff --git a/dataset/noisy/042.jpg b/dataset/noisy/042.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e9f840bc4407a23089ad99f02afd5495bbb2656f Binary files /dev/null and b/dataset/noisy/042.jpg differ diff --git a/dataset/noisy/043.jpg b/dataset/noisy/043.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bf5cf082ccfd286c819af0d84cc9451a23dc34dc Binary files /dev/null and b/dataset/noisy/043.jpg differ diff --git a/dataset/noisy/044.jpg b/dataset/noisy/044.jpg new file mode 100644 index 0000000000000000000000000000000000000000..db782b801bcfbe097d9a3ead7bffeddb75e747a2 Binary files /dev/null and b/dataset/noisy/044.jpg differ diff --git a/dataset/noisy/045.jpg b/dataset/noisy/045.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5d4e68faced3e13c4b32ddfde5c2f518ce0d96ba Binary files /dev/null and 
b/dataset/noisy/045.jpg differ diff --git a/dataset/noisy/046.jpg b/dataset/noisy/046.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9850f735771296210cac52e441183898284fc9e3 Binary files /dev/null and b/dataset/noisy/046.jpg differ diff --git a/dataset/noisy/047.jpg b/dataset/noisy/047.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e430dce689c201b22bd896c2be3670b458015edb Binary files /dev/null and b/dataset/noisy/047.jpg differ diff --git a/dataset/noisy/048.jpg b/dataset/noisy/048.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3752ba8f7441f34cedc70dd6afb3db9d2f0a5131 Binary files /dev/null and b/dataset/noisy/048.jpg differ diff --git a/dataset/noisy/049.jpg b/dataset/noisy/049.jpg new file mode 100644 index 0000000000000000000000000000000000000000..59aa100101f8582cd5fde424e78ea8bb3663ba2f Binary files /dev/null and b/dataset/noisy/049.jpg differ diff --git a/dataset/noisy/050.jpg b/dataset/noisy/050.jpg new file mode 100644 index 0000000000000000000000000000000000000000..870fd2e41f21579fd175fa31db7bc140d76c8eee Binary files /dev/null and b/dataset/noisy/050.jpg differ diff --git a/dataset/noisy/051.jpg b/dataset/noisy/051.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d7210f0e55b0848d5506b0e74ec6d80001fecf23 Binary files /dev/null and b/dataset/noisy/051.jpg differ diff --git a/dataset/noisy/052.jpg b/dataset/noisy/052.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f03ac7c4f1fe244eef66eced9fd36765500acc45 Binary files /dev/null and b/dataset/noisy/052.jpg differ diff --git a/dataset/noisy/053.jpg b/dataset/noisy/053.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fcc7150df19029c3f1ea50d1d86e4e06ca763bdc Binary files /dev/null and b/dataset/noisy/053.jpg differ diff --git a/dataset/noisy/054.jpg b/dataset/noisy/054.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..3adfc41d6c02e4f31c32a96114b50fd5fe0ec603 Binary files /dev/null and b/dataset/noisy/054.jpg differ diff --git a/dataset/noisy/055.jpg b/dataset/noisy/055.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b3fd6a4dcdbe888a1bfcaedc909e0c8bf3fc19e2 Binary files /dev/null and b/dataset/noisy/055.jpg differ diff --git a/dataset/noisy/056.jpg b/dataset/noisy/056.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d813d8bddcf9ef89a7952565a642c3b2f03e3581 Binary files /dev/null and b/dataset/noisy/056.jpg differ diff --git a/dataset/noisy/057.jpg b/dataset/noisy/057.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ad2814c0d6a3ee8927c464f5c08ac0fab8ff91ff Binary files /dev/null and b/dataset/noisy/057.jpg differ diff --git a/dataset/noisy/058.jpg b/dataset/noisy/058.jpg new file mode 100644 index 0000000000000000000000000000000000000000..233c4b214952b13490c903f60761726e81c5a67b Binary files /dev/null and b/dataset/noisy/058.jpg differ diff --git a/dataset/noisy/059.jpg b/dataset/noisy/059.jpg new file mode 100644 index 0000000000000000000000000000000000000000..52e93cf45d3363b85b6452926a888064e17cc8ec Binary files /dev/null and b/dataset/noisy/059.jpg differ diff --git a/dataset/noisy/060.jpg b/dataset/noisy/060.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2a9288e9c879cad2c76b6c21f84c48b0cd85030c Binary files /dev/null and b/dataset/noisy/060.jpg differ diff --git a/dataset/noisy/061.jpg b/dataset/noisy/061.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d544821c5d61f831f49bb1257a9371bfa46653e3 Binary files /dev/null and b/dataset/noisy/061.jpg differ diff --git a/dataset/noisy/062.jpg b/dataset/noisy/062.jpg new file mode 100644 index 0000000000000000000000000000000000000000..081bda5523d26646471ce70ffc0fc04ef55daecc Binary files /dev/null and b/dataset/noisy/062.jpg differ diff --git a/dataset/noisy/063.jpg 
b/dataset/noisy/063.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7b165e949c2b76803c6bae24200df60ba64f30ff Binary files /dev/null and b/dataset/noisy/063.jpg differ diff --git a/dataset/noisy/064.jpg b/dataset/noisy/064.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2a72cbde804436a3d6cea4830249a538366169ed Binary files /dev/null and b/dataset/noisy/064.jpg differ diff --git a/dataset/noisy/065.jpg b/dataset/noisy/065.jpg new file mode 100644 index 0000000000000000000000000000000000000000..be358add9df0b25acf4d243b3701447e5d5d8b3f Binary files /dev/null and b/dataset/noisy/065.jpg differ diff --git a/dataset/noisy/066.jpg b/dataset/noisy/066.jpg new file mode 100644 index 0000000000000000000000000000000000000000..89654fb6b87b945121cebddc2fdd76d90e92c614 Binary files /dev/null and b/dataset/noisy/066.jpg differ diff --git a/dataset/noisy/067.jpg b/dataset/noisy/067.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7335e9be4f8f77bf60296451ecc7a634713c7a4d Binary files /dev/null and b/dataset/noisy/067.jpg differ diff --git a/dataset/noisy/068.jpg b/dataset/noisy/068.jpg new file mode 100644 index 0000000000000000000000000000000000000000..33dd5fbc619a21c01fa23ac2234dbe6944785ce1 Binary files /dev/null and b/dataset/noisy/068.jpg differ diff --git a/dataset/noisy/069.jpg b/dataset/noisy/069.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fb25181bdf8e92203748ac04f89bbf1cd9a195c3 Binary files /dev/null and b/dataset/noisy/069.jpg differ diff --git a/dataset/noisy/070.jpg b/dataset/noisy/070.jpg new file mode 100644 index 0000000000000000000000000000000000000000..92442afb078829f74b3079286f8bb60179d6d32f Binary files /dev/null and b/dataset/noisy/070.jpg differ diff --git a/dataset/noisy/071.jpg b/dataset/noisy/071.jpg new file mode 100644 index 0000000000000000000000000000000000000000..fab7a24c9fe45c50f34f55a444ce34038d93a188 Binary files /dev/null and 
b/dataset/noisy/071.jpg differ diff --git a/dataset/noisy/072.jpg b/dataset/noisy/072.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e5c508fa532fcfe2624996127ee421b42d6b3bdc Binary files /dev/null and b/dataset/noisy/072.jpg differ diff --git a/dataset/noisy/073.jpg b/dataset/noisy/073.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bbd893e01c5f07fefaa98994d9bf8404e8644422 Binary files /dev/null and b/dataset/noisy/073.jpg differ diff --git a/dataset/noisy/074.jpg b/dataset/noisy/074.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9d27c7e9dde5601bde3d5bb89f97a62b62ce730a Binary files /dev/null and b/dataset/noisy/074.jpg differ diff --git a/dataset/noisy/075.jpg b/dataset/noisy/075.jpg new file mode 100644 index 0000000000000000000000000000000000000000..957fc1edd9e77909ae90b46f94802074061081d4 Binary files /dev/null and b/dataset/noisy/075.jpg differ diff --git a/dataset/noisy/076.jpg b/dataset/noisy/076.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4a056ff3cb0e39dcd76a4caa731ccd1f5a25bc4e Binary files /dev/null and b/dataset/noisy/076.jpg differ diff --git a/dataset/noisy/077.jpg b/dataset/noisy/077.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0743bd6a0528624218da7da3c8eff31f55994175 Binary files /dev/null and b/dataset/noisy/077.jpg differ diff --git a/dataset/noisy/078.jpg b/dataset/noisy/078.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8baeb86aea2f6d35699cf5750d34e5cdda3ea250 Binary files /dev/null and b/dataset/noisy/078.jpg differ diff --git a/dataset/noisy/079.jpg b/dataset/noisy/079.jpg new file mode 100644 index 0000000000000000000000000000000000000000..761a3e13fbab66877e56f04457616642e1b88515 Binary files /dev/null and b/dataset/noisy/079.jpg differ diff --git a/dataset/noisy/080.jpg b/dataset/noisy/080.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..e634e57db90401735d19591d668bfc411b7e0673 Binary files /dev/null and b/dataset/noisy/080.jpg differ diff --git a/dataset/noisy/081.jpg b/dataset/noisy/081.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4e992fe718dabb28fe93c24ee0d7cd9e6b650e04 Binary files /dev/null and b/dataset/noisy/081.jpg differ diff --git a/dataset/noisy/082.jpg b/dataset/noisy/082.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a28e33e5f4de8b8744d315446bf691a7feadfd05 Binary files /dev/null and b/dataset/noisy/082.jpg differ diff --git a/dataset/noisy/083.jpg b/dataset/noisy/083.jpg new file mode 100644 index 0000000000000000000000000000000000000000..882f004727fa308ee9f17be957bd07b8ffbbab4f Binary files /dev/null and b/dataset/noisy/083.jpg differ diff --git a/dataset/noisy/084.jpg b/dataset/noisy/084.jpg new file mode 100644 index 0000000000000000000000000000000000000000..06dfac34d66b3ca5151ca59a54a95030c7681983 Binary files /dev/null and b/dataset/noisy/084.jpg differ diff --git a/dataset/noisy/085.jpg b/dataset/noisy/085.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d0f969e15394ca6bd491fcf0086494347ac6ff94 Binary files /dev/null and b/dataset/noisy/085.jpg differ diff --git a/dataset/noisy/086.jpg b/dataset/noisy/086.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6ee3fe5aa8843b7e1d78571ec6510299a0166f79 Binary files /dev/null and b/dataset/noisy/086.jpg differ diff --git a/dataset/noisy/087.jpg b/dataset/noisy/087.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8a427b9882dc0688ece1e591b36f465d0e2dcc20 Binary files /dev/null and b/dataset/noisy/087.jpg differ diff --git a/dataset/noisy/088.jpg b/dataset/noisy/088.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8bab3c37c9aa01ca7291c6ede4cfd54976a91d15 Binary files /dev/null and b/dataset/noisy/088.jpg differ diff --git a/dataset/noisy/089.jpg 
b/dataset/noisy/089.jpg new file mode 100644 index 0000000000000000000000000000000000000000..c8e6027d9a01921a2ead495b6a01cc5cd92502ae Binary files /dev/null and b/dataset/noisy/089.jpg differ diff --git a/dataset/noisy/090.jpg b/dataset/noisy/090.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2598a2def2bb79b65a7c11b6fb73101506546cb2 Binary files /dev/null and b/dataset/noisy/090.jpg differ diff --git a/dataset/noisy/091.jpg b/dataset/noisy/091.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0e4073db0102277e6e77e4a6acf91d5d08a0f47a Binary files /dev/null and b/dataset/noisy/091.jpg differ diff --git a/dataset/noisy/092.jpg b/dataset/noisy/092.jpg new file mode 100644 index 0000000000000000000000000000000000000000..64ecd9296b44a1bd6b25ceeddf4e7b93c301e637 Binary files /dev/null and b/dataset/noisy/092.jpg differ diff --git a/dataset/noisy/093.jpg b/dataset/noisy/093.jpg new file mode 100644 index 0000000000000000000000000000000000000000..cde9bb11e860a9d980ed6389c4c6a5374e38405d Binary files /dev/null and b/dataset/noisy/093.jpg differ diff --git a/dataset/noisy/094.jpg b/dataset/noisy/094.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2492b410bc451d8a9a3505939385038c55d97fdd Binary files /dev/null and b/dataset/noisy/094.jpg differ diff --git a/dataset/noisy/095.jpg b/dataset/noisy/095.jpg new file mode 100644 index 0000000000000000000000000000000000000000..422668c391ab76cb3c8bdc4b78b2537224b6c469 Binary files /dev/null and b/dataset/noisy/095.jpg differ diff --git a/dataset/noisy/096.jpg b/dataset/noisy/096.jpg new file mode 100644 index 0000000000000000000000000000000000000000..454db93bfed40765081ea3d59dc9d39892810d32 Binary files /dev/null and b/dataset/noisy/096.jpg differ diff --git a/dataset/noisy/097.jpg b/dataset/noisy/097.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f5df891e661c8999948d6b6677c63e2194f393ee Binary files /dev/null and 
b/dataset/noisy/097.jpg differ diff --git a/dataset/noisy/098.jpg b/dataset/noisy/098.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b6acc2edc18b718a92c36b7bf52075a2cb487586 Binary files /dev/null and b/dataset/noisy/098.jpg differ diff --git a/dataset/noisy/099.jpg b/dataset/noisy/099.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4b773777a9971e2771289b4c8d409e51b439bdb3 Binary files /dev/null and b/dataset/noisy/099.jpg differ diff --git a/dataset/noisy/100.jpg b/dataset/noisy/100.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9135195ab2590c07fe1dd9a94cc8ef0d6cbe95e8 Binary files /dev/null and b/dataset/noisy/100.jpg differ diff --git a/dataset/noisy/101.jpg b/dataset/noisy/101.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d93472ed6d970e8fcaaac522a42fd67ea3796dad Binary files /dev/null and b/dataset/noisy/101.jpg differ diff --git a/dataset/noisy/102.jpg b/dataset/noisy/102.jpg new file mode 100644 index 0000000000000000000000000000000000000000..94c3c14e68513b66b935a89f1c7a27696bcaaaf4 Binary files /dev/null and b/dataset/noisy/102.jpg differ diff --git a/dataset/noisy/103.jpg b/dataset/noisy/103.jpg new file mode 100644 index 0000000000000000000000000000000000000000..eb0a9ef895cdb62782ffbbcd3a9f392151561acb Binary files /dev/null and b/dataset/noisy/103.jpg differ diff --git a/dataset/noisy/104.jpg b/dataset/noisy/104.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e7707670cf938d99baa641a5ee93eba773d3709a Binary files /dev/null and b/dataset/noisy/104.jpg differ diff --git a/dataset/noisy/105.jpg b/dataset/noisy/105.jpg new file mode 100644 index 0000000000000000000000000000000000000000..014ad76651c694b520d03e26daeaa6ea385ee3ef Binary files /dev/null and b/dataset/noisy/105.jpg differ diff --git a/dataset/noisy/106.jpg b/dataset/noisy/106.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..fdfcf036d4e22bf1e9bda00d6d2b6187ac081b7a Binary files /dev/null and b/dataset/noisy/106.jpg differ diff --git a/dataset/noisy/107.jpg b/dataset/noisy/107.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1ddec85f6a10d4aa17cd8876c4918368ef82d140 Binary files /dev/null and b/dataset/noisy/107.jpg differ diff --git a/dataset/noisy/108.jpg b/dataset/noisy/108.jpg new file mode 100644 index 0000000000000000000000000000000000000000..1223d54acfb5e2d7c1933a3e3be51393e27bcd84 Binary files /dev/null and b/dataset/noisy/108.jpg differ diff --git a/dataset/noisy/109.jpg b/dataset/noisy/109.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b9d3f6ae6b74eac36a151bd8ea93d8e047f7ae90 Binary files /dev/null and b/dataset/noisy/109.jpg differ diff --git a/dataset/noisy/110.jpg b/dataset/noisy/110.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a8ad5aac7b4089a1f8959faabcaea7eacaf73f3c Binary files /dev/null and b/dataset/noisy/110.jpg differ diff --git a/dataset/noisy/111.jpg b/dataset/noisy/111.jpg new file mode 100644 index 0000000000000000000000000000000000000000..68bc6073e5f431f677701f92ec14baa79fa647f9 Binary files /dev/null and b/dataset/noisy/111.jpg differ diff --git a/dataset/noisy/112.jpg b/dataset/noisy/112.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ccd55e7717c24f7c95ad20b245741bcc61713f4e Binary files /dev/null and b/dataset/noisy/112.jpg differ diff --git a/dataset/noisy/113.jpg b/dataset/noisy/113.jpg new file mode 100644 index 0000000000000000000000000000000000000000..156582c095caf0f7b923de37a36f039afaedd48d Binary files /dev/null and b/dataset/noisy/113.jpg differ diff --git a/dataset/noisy/114.jpg b/dataset/noisy/114.jpg new file mode 100644 index 0000000000000000000000000000000000000000..25b555ec1b56dbb08d5587ce20b800f9bb158d2d Binary files /dev/null and b/dataset/noisy/114.jpg differ diff --git a/dataset/noisy/115.jpg 
b/dataset/noisy/115.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bea8bba82912245eec99359acad0a8b2b03a5e51 Binary files /dev/null and b/dataset/noisy/115.jpg differ diff --git a/dataset/noisy/116.jpg b/dataset/noisy/116.jpg new file mode 100644 index 0000000000000000000000000000000000000000..501ba4afa09578d79b78966368258c7713f9e463 Binary files /dev/null and b/dataset/noisy/116.jpg differ diff --git a/dataset/noisy/117.jpg b/dataset/noisy/117.jpg new file mode 100644 index 0000000000000000000000000000000000000000..74d630b47d33de0dc9258d665e3c326c844b2864 Binary files /dev/null and b/dataset/noisy/117.jpg differ diff --git a/dataset/noisy/118.jpg b/dataset/noisy/118.jpg new file mode 100644 index 0000000000000000000000000000000000000000..56cb2bcdd104f765dade36840b3dc0e684ef01df Binary files /dev/null and b/dataset/noisy/118.jpg differ diff --git a/dataset/noisy/119.jpg b/dataset/noisy/119.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ea929f3bedb8c2a34892cd15edc86f80d309a766 Binary files /dev/null and b/dataset/noisy/119.jpg differ diff --git a/dataset/noisy/120.jpg b/dataset/noisy/120.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ec7193479efd17b750b3fddf13a23944ea0e58fd Binary files /dev/null and b/dataset/noisy/120.jpg differ diff --git a/dataset/noisy/121.jpg b/dataset/noisy/121.jpg new file mode 100644 index 0000000000000000000000000000000000000000..812941705d06069c0be77a41f4b60be6f5874785 Binary files /dev/null and b/dataset/noisy/121.jpg differ diff --git a/dataset/noisy/122.jpg b/dataset/noisy/122.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2acc7f3df5b30992604be32ac721f832d76dfbf6 Binary files /dev/null and b/dataset/noisy/122.jpg differ diff --git a/dataset/noisy/123.jpg b/dataset/noisy/123.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7da04f70ac6d8c8b1f22b15fbad9f6ff7db9939e Binary files /dev/null and 
b/dataset/noisy/123.jpg differ diff --git a/dataset/noisy/124.jpg b/dataset/noisy/124.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9886b22901271cebb719196de89315bf2ba38181 Binary files /dev/null and b/dataset/noisy/124.jpg differ diff --git a/dataset/noisy/125.jpg b/dataset/noisy/125.jpg new file mode 100644 index 0000000000000000000000000000000000000000..26783c0fed516690ad214eeda7d59b04cd2bd54f Binary files /dev/null and b/dataset/noisy/125.jpg differ diff --git a/dataset/noisy/126.jpg b/dataset/noisy/126.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4eef38392ad6c9a89bfa663202362888f8b475b2 Binary files /dev/null and b/dataset/noisy/126.jpg differ diff --git a/dataset/noisy/127.jpg b/dataset/noisy/127.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8db066d0b129ae322f2e20406f1ccfd734feec18 Binary files /dev/null and b/dataset/noisy/127.jpg differ diff --git a/dataset/noisy/128.jpg b/dataset/noisy/128.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6a4bcb3a83e1b715bf33d7e1b803d0d870390ad9 Binary files /dev/null and b/dataset/noisy/128.jpg differ diff --git a/dataset/noisy/129.jpg b/dataset/noisy/129.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6789c227dcb265989a23ee4a1dacc2d76ed4917a Binary files /dev/null and b/dataset/noisy/129.jpg differ diff --git a/dataset/noisy/130.jpg b/dataset/noisy/130.jpg new file mode 100644 index 0000000000000000000000000000000000000000..ecc3a50c0a65e0c3333b6a298efd8f668c4b9bbb Binary files /dev/null and b/dataset/noisy/130.jpg differ diff --git a/dataset/noisy/131.jpg b/dataset/noisy/131.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2047124c9dfc4adb9318dfec5f649dacc60acca7 Binary files /dev/null and b/dataset/noisy/131.jpg differ diff --git a/dataset/noisy/132.jpg b/dataset/noisy/132.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..6a956ac163d317bb4479c4dd41e633d4a0728d2c Binary files /dev/null and b/dataset/noisy/132.jpg differ diff --git a/dataset/noisy/133.jpg b/dataset/noisy/133.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d8f84e98b3c52ba2ca555e48b1b8866e4cac4e0c Binary files /dev/null and b/dataset/noisy/133.jpg differ diff --git a/dataset/noisy/134.jpg b/dataset/noisy/134.jpg new file mode 100644 index 0000000000000000000000000000000000000000..2d61f1bde8d153098d3bedb3acc56c79e2b637a6 Binary files /dev/null and b/dataset/noisy/134.jpg differ diff --git a/dataset/noisy/135.jpg b/dataset/noisy/135.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3282159202bcf5bfe42039c5458b2cbe7ab03021 Binary files /dev/null and b/dataset/noisy/135.jpg differ diff --git a/dataset/noisy/136.jpg b/dataset/noisy/136.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7c833f50937b60249f8602c0084a0caeba838cbc Binary files /dev/null and b/dataset/noisy/136.jpg differ diff --git a/dataset/noisy/137.jpg b/dataset/noisy/137.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d49081cf47e53fd9830b3fe697a928d5fd3695d6 Binary files /dev/null and b/dataset/noisy/137.jpg differ diff --git a/dataset/noisy/138.jpg b/dataset/noisy/138.jpg new file mode 100644 index 0000000000000000000000000000000000000000..0a575af0aca2ebf0578c691d87fef01e59c22f4e Binary files /dev/null and b/dataset/noisy/138.jpg differ diff --git a/dataset/noisy/139.jpg b/dataset/noisy/139.jpg new file mode 100644 index 0000000000000000000000000000000000000000..96e8aad13dae0da675cdf60591e3ab344b349b1f Binary files /dev/null and b/dataset/noisy/139.jpg differ diff --git a/dataset/noisy/140.jpg b/dataset/noisy/140.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e4c26b626097b2b5a3e9f2308c2670a279153821 Binary files /dev/null and b/dataset/noisy/140.jpg differ diff --git a/dataset/noisy/141.jpg 
b/dataset/noisy/141.jpg new file mode 100644 index 0000000000000000000000000000000000000000..90a125da379dad5816c5b9689234bab72a02ef0d Binary files /dev/null and b/dataset/noisy/141.jpg differ diff --git a/dataset/noisy/142.jpg b/dataset/noisy/142.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a29a503333209be7339be3ee21456093459ddb96 Binary files /dev/null and b/dataset/noisy/142.jpg differ diff --git a/dataset/noisy/143.jpg b/dataset/noisy/143.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d781aa36b86deb88933cbb0b436bb949c7706859 Binary files /dev/null and b/dataset/noisy/143.jpg differ diff --git a/dataset/noisy/144.jpg b/dataset/noisy/144.jpg new file mode 100644 index 0000000000000000000000000000000000000000..f1cdce9c3a422cd5af5530353f456742e2141665 Binary files /dev/null and b/dataset/noisy/144.jpg differ diff --git a/dataset/noisy/145.jpg b/dataset/noisy/145.jpg new file mode 100644 index 0000000000000000000000000000000000000000..985cd4752f1d92ea88cf0bbbb74f6745cd54cff4 Binary files /dev/null and b/dataset/noisy/145.jpg differ diff --git a/dataset/noisy/146.jpg b/dataset/noisy/146.jpg new file mode 100644 index 0000000000000000000000000000000000000000..b25c9f0699c601047a03e3c1ace6ce0ee73855a1 Binary files /dev/null and b/dataset/noisy/146.jpg differ diff --git a/dataset/noisy/147.jpg b/dataset/noisy/147.jpg new file mode 100644 index 0000000000000000000000000000000000000000..7f68c48e3b1d1e47a2a6e953a5a5a054e2b70716 Binary files /dev/null and b/dataset/noisy/147.jpg differ diff --git a/dataset/noisy/148.jpg b/dataset/noisy/148.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6ec453c6a55438dcd4c45bc33a47b4db7833dbee Binary files /dev/null and b/dataset/noisy/148.jpg differ diff --git a/dataset/noisy/149.jpg b/dataset/noisy/149.jpg new file mode 100644 index 0000000000000000000000000000000000000000..170f228660a01fa10e898896fcc6cd6d28ed1628 Binary files /dev/null and 
b/dataset/noisy/149.jpg differ diff --git a/ink_vision_engine.py b/ink_vision_engine.py new file mode 100644 index 0000000000000000000000000000000000000000..eff9bf12d1696d448d207cdb159f9750d5c734f3 --- /dev/null +++ b/ink_vision_engine.py @@ -0,0 +1,86 @@ +import torch +import torch.nn as nn + +class CRAFT_Demonstration(nn.Module): + + def __init__(self): + super().__init__() + # In reality, this is a deep ResNet-based U-Net architecture. + self.feature_extractor = nn.Conv2d(3, 64, kernel_size=3, padding=1) + self.heatmap_predictor = nn.Conv2d(64, 2, kernel_size=1) + + def forward(self, image): + features = self.feature_extractor(image) + # Returns [Region Score, Affinity Score] + return self.heatmap_predictor(features) + + +class VGG_FeatureExtractor(nn.Module): + + def __init__(self, input_channel=1, output_channel=256): + super(VGG_FeatureExtractor, self).__init__() + self.ConvNet = nn.Sequential( + nn.Conv2d(input_channel, 64, 3, 1, 1), nn.ReLU(True), + nn.MaxPool2d(2, 2), + nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(True), + nn.MaxPool2d(2, 2), + nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(True), + nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(True), + nn.MaxPool2d((2, 1), (2, 1)), + nn.Conv2d(256, output_channel, 3, 1, 1, bias=False), + nn.BatchNorm2d(output_channel), nn.ReLU(True) + ) + + def forward(self, input): + return self.ConvNet(input) + + +class BidirectionalLSTM(nn.Module): + + def __init__(self, input_size, hidden_size, output_size): + super(BidirectionalLSTM, self).__init__() + self.rnn = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True) + self.linear = nn.Linear(hidden_size * 2, output_size) + + def forward(self, input): + recurrent, _ = self.rnn(input) + output = self.linear(recurrent) # Contextual Features mapped to Classes + return output + + +class CRNN_Model(nn.Module): + + def __init__(self, num_classes=97): + super(CRNN_Model, self).__init__() + + self.FeatureExtraction = VGG_FeatureExtractor(input_channel=1, output_channel=256) + 
self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1)) + + self.SequenceModeling = nn.Sequential( + BidirectionalLSTM(256, 256, 256), + BidirectionalLSTM(256, 256, 256) + ) + + self.Prediction = nn.Linear(256, num_classes) + + def forward(self, image_tensor): + visual_feature = self.FeatureExtraction(image_tensor) + visual_feature = self.AdaptiveAvgPool(visual_feature.permute(0, 3, 1, 2)).squeeze(3) + + contextual_feature = self.SequenceModeling(visual_feature) + + prediction = self.Prediction(contextual_feature.contiguous()) + return prediction + + +def CTCDecoder(predictions): + + max_probs = torch.argmax(predictions, dim=2)[0] # Greedy path of the first sequence; predictions is [batch, time, classes] + + final_string = [] + for i in range(len(max_probs)): + if max_probs[i] != 0 and (i == 0 or max_probs[i] != max_probs[i-1]): + final_string.append(str(max_probs[i].item())) + + return "".join(final_string) + diff --git a/pipeline/README.md b/pipeline/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4cfa4555dfef3c8e3096d1ffa190a8b26259fc87 --- /dev/null +++ b/pipeline/README.md @@ -0,0 +1,29 @@ +# Modular 3-Step HTR Pipeline + +This directory contains a modular Machine Learning pipeline for Handwritten Text Recognition (HTR). It is designed to be easily extensible, allowing you to plug your own trained PyTorch/TensorFlow models into any of the 3 steps. + +## Pipeline Architecture + +1. **Step 1: The Preprocessor (`preprocessor.py`)** + - **Current Logic:** OpenCV handles denoising by default (non-local means denoising, adaptive Gaussian thresholding, and Green-Channel extraction), and contour-based MinAreaRect rotation handles deskewing. An optional LightCNN denoiser can be enabled with `use_dl_cnn=True` once trained weights are available. + - **How to Swap:** To use a custom Deep Learning model (like a trained UNet for binarization or a CNN for deskewing), open `preprocessor.py`. Initialize your PyTorch/Keras model inside the `__init__()` function. 
Then, inside `binarize_and_denoise()` or `deskew()`, replace the `cv2` logic with your model's forward inference pass (e.g., `return my_unet_model(image)`). + +2. **Step 2: The OCR Engine (`ocr_engine.py`)** + - **Current Logic:** Wraps the `EasyOCR` library, configured for handwriting (`text_threshold=0.35`, `link_threshold=0.4`, `mag_ratio=1.5`, paragraph mode). + - **How to Swap:** You can point EasyOCR to your own fine-tuned weights by passing `model_storage_directory='path/to/models'` when initializing the reader. If you want to use an entirely different architecture (e.g., Microsoft TrOCR), replace the `self.reader` initialization with your TrOCR Hugging Face setup, and update the `extract_text()` function to call your model's generation function instead. + +3. **Step 3: The Postprocessor (`postprocessor.py`)** + - **Current Logic:** Uses an offline dictionary-based `SpellChecker` combined with regex cleaning, split-word merging, and a heuristic grammar pass. When `transformers` is installed, a small BERT model (`google/bert_uncased_L-2_H-128_A-2`) additionally scores candidate outputs in the ensemble judge. + - **How to Swap:** If you train a custom Transformer on OCR errors (e.g., fine-tuning T5 or BART to map "he110 th3re" -> "hello there"), replace the `self.corrector` logic in `__init__()` with a Hugging Face pipeline that loads your local model directory: `pipeline("text2text-generation", model="path/to/my/finetuned/model")`. 
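The swap points above can be pictured as three callables chained together. The sketch below is a minimal illustration under stated assumptions, not the code in this directory: `HTRPipeline`, `fake_preprocess`, `fake_recognize`, and `fake_correct` are hypothetical names, and the stand-in stages return canned values instead of calling OpenCV, EasyOCR, or a language model.

```python
from typing import Any, Callable


class HTRPipeline:
    """Hypothetical wrapper that chains preprocess -> recognize -> correct."""

    def __init__(
        self,
        preprocess: Callable[[str], Any],
        recognize: Callable[[Any], str],
        correct: Callable[[str], str],
    ) -> None:
        # Each stage is a plain callable, so any of them can be replaced
        # without touching the other two.
        self.preprocess = preprocess
        self.recognize = recognize
        self.correct = correct

    def run(self, image_path: str) -> str:
        image = self.preprocess(image_path)   # Step 1: clean the image
        raw_text = self.recognize(image)      # Step 2: read the text
        return self.correct(raw_text)         # Step 3: repair the text


# Stand-in stages (assumptions for illustration only).
def fake_preprocess(image_path: str) -> dict:
    return {"path": image_path}


def fake_recognize(image: dict) -> str:
    return "he110 th3re"


def fake_correct(text: str) -> str:
    # Toy character substitutions standing in for a real corrector.
    return text.replace("110", "llo").replace("3", "e")


pipeline = HTRPipeline(fake_preprocess, fake_recognize, fake_correct)
print(pipeline.run("note.jpg"))  # hello there
```

Swapping a stage then amounts to passing a different callable, for example a method of your own model wrapper in place of `fake_correct`.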
+ +## Execution + +Ensure your dependencies are installed: +```bash +pip install -r requirements.txt +``` + +Run a test image through the pipeline: +```bash +python pipeline/main.py --image path/to/handwritten_test.jpg +``` diff --git a/pipeline/__init__.py b/pipeline/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6574e109d95dbe9873281c783c72b526b503c1a0 --- /dev/null +++ b/pipeline/__init__.py @@ -0,0 +1,3 @@ +""" +HTR Pipeline Package +""" diff --git a/pipeline/__pycache__/__init__.cpython-311.pyc b/pipeline/__pycache__/__init__.cpython-311.pyc new file mode 100644 index 0000000000000000000000000000000000000000..b643cf3cb72637e1855feedf93074da15cd03b65 Binary files /dev/null and b/pipeline/__pycache__/__init__.cpython-311.pyc differ diff --git a/pipeline/__pycache__/ocr_engine.cpython-311.pyc b/pipeline/__pycache__/ocr_engine.cpython-311.pyc new file mode 100644 index 0000000000000000000000000000000000000000..e3fb1b66e6ca0bc22f378b6fa45b0f0634921677 Binary files /dev/null and b/pipeline/__pycache__/ocr_engine.cpython-311.pyc differ diff --git a/pipeline/__pycache__/postprocessor.cpython-311.pyc b/pipeline/__pycache__/postprocessor.cpython-311.pyc new file mode 100644 index 0000000000000000000000000000000000000000..c725b699ea88863e6e33aa82eace6cbe22c8d112 Binary files /dev/null and b/pipeline/__pycache__/postprocessor.cpython-311.pyc differ diff --git a/pipeline/__pycache__/preprocessor.cpython-311.pyc b/pipeline/__pycache__/preprocessor.cpython-311.pyc new file mode 100644 index 0000000000000000000000000000000000000000..21bc5622f4f6c3fadc0c39322f56d982e89d1e4d Binary files /dev/null and b/pipeline/__pycache__/preprocessor.cpython-311.pyc differ diff --git a/pipeline/main.py b/pipeline/main.py new file mode 100644 index 0000000000000000000000000000000000000000..b37a3b9b811ed03f29d956480fd9882694e5cdb1 --- /dev/null +++ b/pipeline/main.py @@ -0,0 +1,64 @@ +import sys +import os + +# Add the parent directory to the path so the 
pipeline module can be imported +sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) + +from pipeline.preprocessor import DocumentPreprocessor +from pipeline.ocr_engine import HTREngine +from pipeline.postprocessor import NLPCorrector +from PIL import Image + +def run_pipeline(image_path): + """ + Executes the 3-Step Modular HTR Pipeline on a test image. + """ + print("==================================================") + print(f"Starting HTR Pipeline for: {image_path}") + print("==================================================") + + # Boot up the modules (This usually happens once on server start) + preprocessor = DocumentPreprocessor() + engine = HTREngine(languages=['en']) + nlp_corrector = NLPCorrector(use_ml=True) + + print("\n[STEP 1] Running Computer Vision Pre-Processing...") + try: + cleaned_image_array = preprocessor.process(image_path) + print(" -> Image binarized, denoised, and deskewed successfully.") + except Exception as e: + print(f" -> ERROR in Preprocessing: {e}") + return + + print("\n[STEP 2] Running Deep Learning OCR Engine...") + try: + raw_text = engine.extract_text(cleaned_image_array) + print(f" -> Raw Output: '{raw_text}'") + except Exception as e: + print(f" -> ERROR in OCR Engine: {e}") + return + + print("\n[STEP 3] Running NLP Post-Processing Contextual Correction...") + try: + final_text = nlp_corrector.correct_spelling(raw_text) + print("==================================================") + print(f"FINAL POLISHED RESULT: '{final_text}'") + print("==================================================") + except Exception as e: + print(f" -> ERROR in NLP Correction: {e}") + +if __name__ == "__main__": + # Test execution script + import argparse + parser = argparse.ArgumentParser(description="Run the Modular HTR Pipeline") + parser.add_argument("--image", type=str, required=True, help="Path to the handwritten image file.") + + args, unknown = parser.parse_known_args() + + if args.image: + if 
os.path.exists(args.image): + run_pipeline(args.image) + else: + print(f"Image not found at path: {args.image}") + else: + print("Please provide an image using python pipeline/main.py --image path/to/image.jpg") diff --git a/pipeline/ocr_engine.py b/pipeline/ocr_engine.py new file mode 100644 index 0000000000000000000000000000000000000000..33eb13270f40ad182c7ba225ea6693c0919ae43a --- /dev/null +++ b/pipeline/ocr_engine.py @@ -0,0 +1,84 @@ +import easyocr +import numpy as np + +class HTREngine: + """ + STEP 2: THE ENGINE (EasyOCR) + This module integrates EasyOCR to extract raw text and digits from the cleaned images. + + HOT-SWAP ML MODELS HERE: + To swap locally trained weights for EasyOCR, place your .pth and .yaml config + in a directory and initialize the reader with: `model_storage_directory='path/to/models'`. + Alternatively, replace EasyOCR here completely with another engine (like TrOCR or PaddleOCR). + """ + + def __init__(self, languages=['en']): + # The reader is initialized with default parameters; the aggressive recognition + # settings are applied per call in extract_text(), after the preprocessor has cleaned the image. + print(f"Initializing EasyOCR Engine for {languages}...") + self.reader = easyocr.Reader(languages) + + def extract_text(self, image_input): + """ + Extracts text from the preprocessed image array. + """ + # Read the image in paragraph mode with parameters tuned for handwriting: + # - text_threshold: Lowered to catch faint red ink handwriting. + # - link_threshold: Link confidence threshold used when joining characters into words. + # - mag_ratio: Magnifies the image before detection. + # - paragraph: True to ensure correct top-to-bottom reading flow. + results = self.reader.readtext( + image_input, + text_threshold=0.35, # Aggressively catch faint strokes + link_threshold=0.4, + mag_ratio=1.5, + paragraph=True + ) + + if not results: + return "" + + # NATIVE PARAGRAPH MODE: + # In paragraph mode, results is a list of [bbox, text] pairs. 
+ # However, to make the reading order more reliable (Top-to-Bottom then Left-to-Right), + # we apply a spatial sort. + + if len(results) > 1: + # 1. Sort by top-left Y coordinate as a primary pass + results.sort(key=lambda x: x[0][0][1]) + + lines = [] + for res in results: + bbox = res[0] + # Use the vertical center of the box for most stable line-matching + y_center = (bbox[0][1] + bbox[2][1]) / 2 + + # Check if this word can join an existing line + joined = False + for line in lines: + # Representative height of the line + line_h = np.mean([item[0][2][1] - item[0][0][1] for item in line]) + line_y_avg = np.mean([(item[0][0][1] + item[0][2][1]) / 2 for item in line]) + + # If the word's center is within 50% of the line's average height, join it. + if abs(y_center - line_y_avg) < (line_h * 0.5): + line.append(res) + joined = True + break + + if not joined: + lines.append([res]) + + # 2. Sort words within each clustered line by X-coordinate + for line in lines: + line.sort(key=lambda x: x[0][0][0]) + + # 3. Sort the lines themselves by the top-left Y-coordinate of their FIRST word + # (line[0] is a [bbox, text] pair, so the top-left Y is line[0][0][0][1]) + lines.sort(key=lambda line: line[0][0][0][1]) + + # 4. 
Flatten back into results + results = [word for line in lines for word in line] + + raw_text = " ".join([text for (bbox, text) in results]) + return raw_text diff --git a/pipeline/postprocessor.py b/pipeline/postprocessor.py new file mode 100644 index 0000000000000000000000000000000000000000..ec23e1544472a8614c33af43a68feac5c18bf60d --- /dev/null +++ b/pipeline/postprocessor.py @@ -0,0 +1,230 @@ +from spellchecker import SpellChecker +import re +import os + +try: + from transformers import AutoTokenizer, AutoModelForSequenceClassification + import torch + HAS_TRANSFORMERS = True +except ImportError: + HAS_TRANSFORMERS = False + +class NLPCorrector: + """ + STEP 3: THE POST-PROCESSOR (NLP) + This module takes the raw, potentially misspelled output from the OCR engine + and uses a Natural Language Processing (NLP) technique to fix the text. + + HOT-SWAP ML MODELS HERE: + Currently, this uses a robust, offline dictionary distance-based SpellChecker for lightning-fast results. + To swap in a heavy Deep Learning LM (like T5 or BERT): + 1. Load your local model via `AutoModelForSeq2SeqLM.from_pretrained('path/to/model')` + 2. Replace the `self.corrector` logic below. + """ + + def __init__(self, use_ml=True): + self.use_ml = use_ml + if self.use_ml: + print("Initializing NLP Post-Processor (Dictionary & Context Model)...") + self.corrector = SpellChecker() + + # 4. 
INITIALIZE SEMANTIC JUDGE (BERT Tiny) + # This model is very small (~17MB) but understands English grammar and "meaning" + if HAS_TRANSFORMERS: + try: + print("Initializing Semantic Judge (BERT Tiny)...") + # Using a tiny model to avoid CPU/Memory overload + self.model_name = "google/bert_uncased_L-2_H-128_A-2" + self.tokenizer = AutoTokenizer.from_pretrained(self.model_name) + self.semantic_model = AutoModelForSequenceClassification.from_pretrained(self.model_name) + self.semantic_model.eval() + except Exception as e: + print(f"Semantic Judge failed to load: {e}") + self.semantic_model = None + else: + self.semantic_model = None + + def basic_clean(self, text): + """ Quick regex cleanup for common OCR artifact characters. """ + # Remove multiple spaces + text = re.sub(r'\s+', ' ', text) + # Remove weird symbols except standard punctuation + text = re.sub(r'[^\w\s\.,!\?-]', '', text) + return text.strip() + + def correct_spelling(self, ocr_text): + """ + Takes raw OCR text and attempts contextual reconstruction. + """ + cleaned = self.basic_clean(ocr_text) + + if not cleaned: + return "" + + if not self.use_ml: + return cleaned + + # Instead of a heavy LLM, we use a rapid NLP deterministic approach + words = cleaned.split() + fixed_words = [] + for word in words: + # We ignore single digits or characters that are fine + if len(word) <= 1 or word.isdigit(): + fixed_words.append(word) + continue + + # 1. Start with raw lowercase + clean_word = word.lower() + + # 2. Skip if it's a known correctly spelled word or short + if clean_word in self.corrector.word_frequency or len(clean_word) < 2: + fixed_words.append(word) + continue + + # 3. 
HIGH ERROR HANDLING: Find candidates with edit distance + # If the OCR is messy, we look for the most likely candidates + candidates = self.corrector.candidates(clean_word) + + # If we have strong candidates, pick the best one + if candidates and len(candidates) > 0: + # We prefer the 'correction' but can also inspect full candidate list + correction = self.corrector.correction(clean_word) + else: + correction = clean_word + + if correction and correction != clean_word: + # Retain the user's original capitalization rules! + if word.isupper(): + fixed_words.append(correction.upper()) + elif word.istitle(): + fixed_words.append(correction.capitalize()) + else: + fixed_words.append(correction) + else: + fixed_words.append(word) + + # 4. SPLIT WORD MERGING (e.g., 'import dance' -> 'importance') + # We perform a second pass to see if joining adjacent words creates a valid one. + final_pass = [] + skip_next = False + for i in range(len(fixed_words)): + if skip_next: + skip_next = False + continue + + if i + 1 < len(fixed_words): + joined = (fixed_words[i] + fixed_words[i+1]).lower() + # If the joined version exists in the dictionary but the individual ones were questionable + if joined in self.corrector.word_frequency: + # Check if the originals were also in dictionary + orig_1 = fixed_words[i].lower() in self.corrector.word_frequency + orig_2 = fixed_words[i+1].lower() in self.corrector.word_frequency + + if not (orig_1 and orig_2): # If at least one was a 'broken' piece + final_pass.append(joined.upper() if fixed_words[i].isupper() else joined) + skip_next = True + continue + + final_pass.append(fixed_words[i]) + + fixed_text = " ".join(final_pass) + + # 5. GRAMMATICAL PASS (Helping Verbs & Articles) + # We ensure that common small words (is, am, are, the, a) are correctly + # spaced and positioned based on basic English syntax rules. 
+ fixed_text = self.grammatical_pass(fixed_text) + + return fixed_text + + def grammatical_pass(self, text): + """ + Lightweight heuristic pass to repair broken grammar patterns + common in OCR (e.g., 'has is' -> 'has its', 'i s' -> 'is'). + """ + # Fix common helping verb fragments + replacements = { + r'\bha s\b': 'has', + r'\bi s\b': 'is', + r'\ba re\b': 'are', + r'\bwa s\b': 'was', + r'\bha ve\b': 'have', + r'\bha d\b': 'had', + r'\bt he\b': 'the', + r'\bh as\b': 'has', + r'\bi t s\b': 'its', + r'\ba n d\b': 'and' + } + for pattern, replacement in replacements.items(): + text = re.sub(pattern, replacement, text, flags=re.IGNORECASE) + + # Deduplicate common double helping verbs (OCR often double-reads) + text = re.sub(r'\b(has|is|was|were|had|have)\b\s+\b\1\b', r'\1', text, flags=re.IGNORECASE) + + # Rewrite the specific "has is" misread as "has its" + # (This handles the user's specific "has is" error case) + text = re.sub(r'\bhas is\b', 'has its', text, flags=re.IGNORECASE) + + return text.strip() + + def score_semantic(self, text): + """ + Calculates a 'Coherence Score' for a sentence using BERT. + Intended as a rough proxy for how much 'sense' a sentence makes grammatically. + """ + if not self.semantic_model or not text.strip(): + return 0.5 # Neutral fallback + + try: + inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True) + with torch.no_grad(): + outputs = self.semantic_model(**inputs) + # We use the maximum softmax probability over the logits as a proxy for 'confidence' + score = torch.softmax(outputs.logits, dim=1).max().item() + return score + except Exception: + return 0.5 + + def judge_best_output(self, text_a, text_b): + """ + ENSEMBLE JUDGE (Advanced) + Combines Dictionary Density and BERT-based Semantic Meaning. + Stream A (Raw Image) is slightly preferred when scores are close. + Final returned text has all digit characters removed (text-only output). 
+        """
+        def calculate_score(text):
+            if not text or text.strip() == "":
+                return 0
+
+            words = text.split()
+
+            # 1. Dictionary Match (share of words found in the dictionary)
+            matches = sum(1 for w in words if w.lower() in self.corrector.word_frequency)
+            density = matches / len(words) if words else 0
+
+            # 2. Semantic Coherence (Meaning check)
+            semantic_weight = self.score_semantic(text)
+
+            # 3. Length Bonus (saturates at 50 characters)
+            length_factor = min(len(text) / 50.0, 1.0)
+
+            # Weighted average
+            # Emphasize dictionary and length, de-emphasize semantic model
+            total_score = (density * 0.5) + (semantic_weight * 0.2) + (length_factor * 0.3)
+            return total_score
+
+        score_a = calculate_score(text_a)
+        score_b = calculate_score(text_b)
+
+        print(f"Ensemble Judge (Semantic) -> Stream A: {score_a:.2f}, Stream B: {score_b:.2f}")
+
+        # If both are non-empty and Stream A reaches at least 90% of Stream B's score, prefer the raw-image stream (Stream A)
+        if score_a > 0 and score_b > 0 and score_a >= score_b * 0.9:
+            chosen = text_a
+        else:
+            chosen = text_a if score_a > score_b else text_b
+
+        # Remove all digit characters from the final output (text-only); treat None as empty
+        chosen_no_digits = re.sub(r'\d+', '', chosen or "")
+        # Normalize extra spaces after removing digits
+        chosen_no_digits = re.sub(r'\s+', ' ', chosen_no_digits).strip()
+        return chosen_no_digits
diff --git a/pipeline/preprocessor.py b/pipeline/preprocessor.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cfeee53356a8de8dadfa6f0bd6f6631af14558b
--- /dev/null
+++ b/pipeline/preprocessor.py
@@ -0,0 +1,156 @@
+import cv2
+import numpy as np
+from PIL import Image
+import torch
+import torch.nn as nn
+import torchvision.transforms as transforms
+
+# ---------------------------------------------------------
+# OPTIONAL: LIGHTWEIGHT CNN FOR DL-BASED PREPROCESSING
+# ---------------------------------------------------------
+class LightCNN_Denoiser(nn.Module):
+    """
+    A lightweight Convolutional Neural Network for denoising.
+    An optional alternative to the OpenCV denoising path in DocumentPreprocessor.
+    Once trained on pairs of (Messy Image -> Clean Image), it can stand in for
+    the OpenCV mathematical denoising algorithms.
+    """
+    def __init__(self):
+        super().__init__()
+        # Simple AutoEncoder style CNN
+        self.encoder = nn.Sequential(
+            nn.Conv2d(3, 16, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.MaxPool2d(2)
+        )
+        self.decoder = nn.Sequential(
+            nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2),
+            nn.Sigmoid() # Scale pixels between 0 and 1
+        )
+
+    def forward(self, x):
+        x = self.encoder(x)
+        x = self.decoder(x)
+        return x
+
+# ---------------------------------------------------------
+# MAIN PREPROCESSOR MODULE
+# ---------------------------------------------------------
+class DocumentPreprocessor:
+    """
+    STEP 1: THE PRE-PROCESSOR (Computer Vision / Deep Learning)
+    This module cleans messy, handwritten images before they hit the OCR engine.
+    Denoising runs through OpenCV by default, or through LightCNN when enabled.
+
+    HOT-SWAP ML MODELS HERE:
+    The OpenCV path is the default. Set `use_dl_cnn=True` to switch to
+    the LightCNN path (requires trained weights).
+    """
+
+    def __init__(self, use_dl_cnn=False):
+        self.use_dl_cnn = use_dl_cnn
+        if self.use_dl_cnn:
+            print("Loading LightCNN Deep Learning Preprocessor...")
+            self.cnn_model = LightCNN_Denoiser()
+            # self.cnn_model.load_state_dict(torch.load('weights/lightcnn_weights.pth'))  # path written by training/train_denoiser.py
+            self.cnn_model.eval()
+            self.transform = transforms.Compose([
+                transforms.ToTensor()
+            ])
+
+    def binarize_and_denoise(self, image):
+        """
+        Denoises with LightCNN when `use_dl_cnn` is set, otherwise with OpenCV.
+        Removes shadows and enhances contrast.
+        Specifically optimized for Red Ink on White Paper.
+        """
+        if self.use_dl_cnn:
+            # DL APPROACH (Showcase only)
+            tensor_img = self.transform(image).unsqueeze(0)
+            with torch.no_grad():
+                cleaned_tensor = self.cnn_model(tensor_img)
+            cleaned_img = cleaned_tensor.squeeze().permute(1, 2, 0).numpy()
+            cleaned_img = (cleaned_img * 255).astype(np.uint8)
+            return cleaned_img
+        else:
+            # CV2 ENHANCED APPROACH
+            # 1. Use the GREEN channel for grayscale.
+            # (Red ink absorbs green light, so it appears darkest in the green channel)
+            gray = image[:,:,1] if len(image.shape) == 3 else image
+
+            # --- NOISE/GHOSTING FILTER (ADAPTIVE) ---
+            # Increase block size to 101 to ignore local ink thickness and focus on global lighting.
+            # Reduce C to 10 (less aggressive subtraction) to keep ink solid.
+            thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                           cv2.THRESH_BINARY, 101, 10)
+
+            # OR with the mask to KEEP the ink (black in thresh) at its original intensity
+            # while making the background (white in thresh) perfectly white.
+            filtered = cv2.bitwise_or(gray, thresh)
+
+            # 2. Apply CLAHE (Contrast Limited Adaptive Histogram Equalization)
+            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
+            enhanced = clahe.apply(filtered)
+
+            # 3. Light Denoising
+            denoised = cv2.fastNlMeansDenoising(enhanced, None, h=5, templateWindowSize=7, searchWindowSize=21)
+
+            return cv2.cvtColor(denoised, cv2.COLOR_GRAY2BGR)
+
+    def deskew(self, image):
+        """
+        Automatically straightens the image/text without rotating it sideways.
+        """
+        # Convert to grayscale for bounding box detection
+        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
+
+        # Invert the image (text is white, background is black for rotation angles)
+        gray_inv = cv2.bitwise_not(gray)
+
+        # Threshold to get text coordinates
+        thresh = cv2.threshold(gray_inv, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
+
+        # Grab all non-zero pixels as (x, y) points; np.where yields (row, col) order
+        coords = np.column_stack(np.where(thresh > 0)[::-1])
+        if len(coords) == 0:
+            return image
+
+        # Find minimum bounding rectangle which gives us the angle
+        rect = cv2.minAreaRect(coords)
+        angle = rect[-1]
+
+        # Normalize the angle for OpenCV 4.5+, which returns values between 0 and 90 degrees.
+        # We don't want to rotate a horizontal box 90 degrees!
+        if angle > 45:
+            angle = angle - 90
+
+        # Rotate the original colored image to deskew
+        (h, w) = image.shape[:2]
+        center = (w // 2, h // 2)
+        M = cv2.getRotationMatrix2D(center, angle, 1.0)
+
+        # Use white border for padding during rotation
+        rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_CONSTANT, borderValue=(255, 255, 255))
+
+        return rotated
+
+    def process(self, image_input):
+        """
+        Main pipeline entry point for CV operations.
+        Accepts a PIL Image, filepath, or NumPy array and returns a cleaned NumPy array.
+        """
+        if isinstance(image_input, Image.Image):
+            img = np.array(image_input)
+            img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR) # Convert to CV2 standard
+        elif isinstance(image_input, str):
+            img = cv2.imread(image_input)
+        else:
+            img = image_input
+
+        # 1. Binarize and Remove Shadows (Non-destructive)
+        cleaned = self.binarize_and_denoise(img)
+
+        # We skip explicit CV2 deskewing because EasyOCR's CRAFT detector
+        # is natively capable of detecting angled text, and global MinAreaRect
+        # breaks if the text forms a vertical column.
+        return cleaned
diff --git a/requirements.txt b/requirements.txt
index 28d994e22f8dd432b51df193562052e315ad95f7..93fc7a7f72945c6c1dc5ea4cbf82fab5995d8b6b 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,13 @@
-altair
-pandas
-streamlit
\ No newline at end of file
+streamlit
+easyocr
+Pillow
+numpy
+opencv-python-headless
+streamlit-cropper
+torch
+torchvision
+pyspellchecker
+pillow-heif
+transformers
+googletrans==4.0.0rc1
+gTTS
diff --git a/training/generate_dataset.py b/training/generate_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d63829f17d16ed03fbf8f0ec6f4e543ab74a473
--- /dev/null
+++ b/training/generate_dataset.py
@@ -0,0 +1,68 @@
+import os
+import cv2
+import numpy as np
+
+def generate_synthetic_data(num_samples=100, output_dir="dataset"):
+    """
+    Generates a small synthetic dataset for training the LightCNN denoiser.
+    Creates pairs of (Clean Image, Noisy Image) with simulated shadows, salt-and-pepper noise, and blur.
+    """
+    clean_dir = os.path.join(output_dir, "clean")
+    noisy_dir = os.path.join(output_dir, "noisy")
+
+    os.makedirs(clean_dir, exist_ok=True)
+    os.makedirs(noisy_dir, exist_ok=True)
+
+    print(f"Generating {num_samples} synthetic training pairs...")
+
+    for i in range(num_samples):
+        # 1. Create a clean digital "handwritten" image
+        # White background
+        img = np.ones((128, 512, 3), dtype=np.uint8) * 255
+
+        # Draw some random text to simulate handwriting
+        text = f"Sample Text {np.random.randint(1000, 9999)}"
+        font = cv2.FONT_HERSHEY_SIMPLEX
+        thickness = np.random.randint(2, 5)
+        # Random position
+        x, y = np.random.randint(10, 50), np.random.randint(50, 90)
+        cv2.putText(img, text, (x, y), font, 1.5, (0, 0, 0), thickness, cv2.LINE_AA)
+
+        # Save the clean Ground Truth (y)
+        clean_path = os.path.join(clean_dir, f"{i:03d}.jpg")
+        cv2.imwrite(clean_path, img)
+
+        # 2. Add realistic noise to simulate a bad photo (x)
+        noisy = img.copy()
+
+        # Add a random gradient shadow (the lit region ends at a random x; the blur softens its edge)
+        h, w = noisy.shape[:2]
+        gradient = np.zeros((h, w, 3), dtype=np.float32)
+        cv2.rectangle(gradient, (0, 0), (np.random.randint(w // 4, w), h), (np.random.randint(50, 150),)*3, -1)
+        gradient = cv2.GaussianBlur(gradient, (101, 101), 0)
+        noisy = cv2.addWeighted(noisy, 0.7, gradient.astype(np.uint8), 0.3, 0)
+
+        # Add salt and pepper noise
+        s_vs_p = 0.5
+        amount = 0.04
+        noisy_pixels = np.random.rand(h, w)
+        # Salt
+        noisy[noisy_pixels < amount * s_vs_p] = 255
+        # Pepper
+        noisy[noisy_pixels > 1 - amount * (1 - s_vs_p)] = 0
+
+        # Add a slight blur to simulate bad focus
+        if np.random.rand() > 0.5:
+            noisy = cv2.GaussianBlur(noisy, (5, 5), 0)
+
+        # Save the dirty input (x)
+        noisy_path = os.path.join(noisy_dir, f"{i:03d}.jpg")
+        cv2.imwrite(noisy_path, noisy)
+
+    print(f"Dataset generated in '{output_dir}'.")
+    print(f" Clean labels (y): {clean_dir}")
+    print(f" Noisy inputs (x): {noisy_dir}")
+
+if __name__ == "__main__":
+    # Create 150 samples for a quick toy training run
+    generate_synthetic_data(num_samples=150)
diff --git a/training/generate_nlp_data.py b/training/generate_nlp_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..49ec3e5fdcc5c1c820d631a1b5f0aac6dfdcca7f
--- /dev/null
+++ b/training/generate_nlp_data.py
@@ -0,0 +1,34 @@
+import os
+import json
+
+def generate_nlp_dataset(output_file="training/nlp_data.json"):
+    """
+    Creates a pairing of (Common OCR Mistakes -> Correct Word).
+    This dataset can be used to 'train' the spellchecker's dictionary
+    or fine-tune a specialized NLP model.
+    """
+    data = [
+        {"input": "th3", "target": "the"},
+        {"input": "p3ople", "target": "people"},
+        {"input": "v0ice", "target": "voice"},
+        {"input": "no4hij", "target": "nothing"},
+        {"input": "Ia", "target": "in"},
+        {"input": "0f", "target": "of"},
+        {"input": "joshu4", "target": "joshua"},
+        {"input": "he11o", "target": "hello"},
+        {"input": "w0r1d", "target": "world"},
+        {"input": "re-ling", "target": "feeling"},
+        {"input": "odia", "target": "who"},
+        {"input": "wheo", "target": "who"},
+        {"input": "4!", "target": "voice"},
+        {"input": "314", "target": "a"}
+    ]
+
+    os.makedirs(os.path.dirname(output_file), exist_ok=True)
+    with open(output_file, 'w') as f:
+        json.dump(data, f, indent=4)
+
+    print(f"NLP Dataset created: {output_file}")
+
+if __name__ == "__main__":
+    generate_nlp_dataset()
diff --git a/training/nlp_data.json b/training/nlp_data.json
new file mode 100644
index 0000000000000000000000000000000000000000..736dafa6a6d6d54215b4098bb81b6fa4eaea3e10
--- /dev/null
+++ b/training/nlp_data.json
@@ -0,0 +1,58 @@
+[
+    {
+        "input": "th3",
+        "target": "the"
+    },
+    {
+        "input": "p3ople",
+        "target": "people"
+    },
+    {
+        "input": "v0ice",
+        "target": "voice"
+    },
+    {
+        "input": "no4hij",
+        "target": "nothing"
+    },
+    {
+        "input": "Ia",
+        "target": "in"
+    },
+    {
+        "input": "0f",
+        "target": "of"
+    },
+    {
+        "input": "joshu4",
+        "target": "joshua"
+    },
+    {
+        "input": "he11o",
+        "target": "hello"
+    },
+    {
+        "input": "w0r1d",
+        "target": "world"
+    },
+    {
+        "input": "re-ling",
+        "target": "feeling"
+    },
+    {
+        "input": "odia",
+        "target": "who"
+    },
+    {
+        "input": "wheo",
+        "target": "who"
+    },
+    {
+        "input": "4!",
+        "target": "voice"
+    },
+    {
+        "input": "314",
+        "target": "a"
+    }
+]
\ No newline at end of file
diff --git a/training/train_denoiser.py b/training/train_denoiser.py
new file mode 100644
index 0000000000000000000000000000000000000000..5518041b61f879d62ad8455367d04f14130c52e9
--- /dev/null
+++ b/training/train_denoiser.py
@@ -0,0 +1,105 @@
+import os
+import sys
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from torch.utils.data import Dataset, DataLoader
+from torchvision import transforms
+from PIL import Image
+
+# Import the architecture we defined in the pipeline
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from pipeline.preprocessor import LightCNN_Denoiser
+
+class DenoiserDataset(Dataset):
+    """
+    Loads pairs of (Noisy Input -> Clean Output) from the synthetic dataset.
+    """
+    def __init__(self, dataset_dir="dataset"):
+        self.clean_dir = os.path.join(dataset_dir, "clean")
+        self.noisy_dir = os.path.join(dataset_dir, "noisy")
+        self.image_files = sorted(os.listdir(self.clean_dir))
+
+        self.transform = transforms.Compose([
+            # Resize for consistent CNN batching
+            transforms.Resize((64, 256)),
+            transforms.ToTensor()
+        ])
+
+    def __len__(self):
+        return len(self.image_files)
+
+    def __getitem__(self, idx):
+        filename = self.image_files[idx]
+
+        clean_img = Image.open(os.path.join(self.clean_dir, filename)).convert("RGB")
+        noisy_img = Image.open(os.path.join(self.noisy_dir, filename)).convert("RGB")
+
+        clean_tensor = self.transform(clean_img)
+        noisy_tensor = self.transform(noisy_img)
+
+        return noisy_tensor, clean_tensor
+
+def train_model():
+    print("==================================================")
+    print("Initializing LightCNN Denoising Training Showcase")
+    print("==================================================")
+
+    # Check for GPU
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"Using device: {device}")
+
+    # 1. Load Data
+    print("Loading synthetic dataset...")
+    try:
+        dataset = DenoiserDataset()
+        dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
+    except FileNotFoundError:
+        print("ERROR: Dataset not found. Please run generate_dataset.py first!")
+        return
+
+    # 2. Initialize Model
+    model = LightCNN_Denoiser().to(device)
+    criterion = nn.MSELoss() # Measure the difference between pixels
+    optimizer = optim.Adam(model.parameters(), lr=0.001)
+
+    epochs = 5
+
+    # 3. Training Loop
+    print(f"Starting training for {epochs} epochs...")
+    model.train()
+
+    for epoch in range(epochs):
+        running_loss = 0.0
+        for noisy_inputs, clean_targets in dataloader:
+            noisy_inputs = noisy_inputs.to(device)
+            clean_targets = clean_targets.to(device)
+
+            # Zero gradients
+            optimizer.zero_grad()
+
+            # Forward pass
+            outputs = model(noisy_inputs)
+
+            # Calculate pixel error
+            loss = criterion(outputs, clean_targets)
+
+            # Backward pass and optimize
+            loss.backward()
+            optimizer.step()
+
+            running_loss += loss.item()
+
+        print(f"Epoch [{epoch+1}/{epochs}] - Loss: {running_loss/len(dataloader):.4f}")
+
+    # 4. Save Weights
+    os.makedirs("weights", exist_ok=True)
+    save_path = "weights/lightcnn_weights.pth"
+    torch.save(model.state_dict(), save_path)
+
+    print("==================================================")
+    print(f"Training Complete. Showcase weights saved to: {save_path}")
+    print("To use these weights, set use_dl_cnn=True and enable the load_state_dict call in pipeline/preprocessor.py")
+
+if __name__ == "__main__":
+    train_model()
diff --git a/training/train_nlp.py b/training/train_nlp.py
new file mode 100644
index 0000000000000000000000000000000000000000..309734a31a72e885f171e3d667a25c57936d50b2
--- /dev/null
+++ b/training/train_nlp.py
@@ -0,0 +1,41 @@
+import json
+import os
+import sys
+
+# Add project root to path
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from pipeline.postprocessor import NLPCorrector
+
+def train_nlp_logic():
+    print("==================================================")
+    print("Initializing NLP Weights Update (OCR Error Tuning)")
+    print("==================================================")
+
+    nlp = NLPCorrector()
+
+    # Load the dataset
+    data_file = "training/nlp_data.json"
+    if not os.path.exists(data_file):
+        print("Dataset not found. Run generate_nlp_data.py first.")
+        return
+
+    with open(data_file, 'r') as f:
+        dataset = json.load(f)
+
+    print(f"Learning from {len(dataset)} OCR error patterns...")
+
+    # We 'train' the dictionary-based model by increasing the
+    # frequency/probability of the target words so they are chosen
+    # more aggressively when a mistake like 'no4hij' is found.
+    for pair in dataset:
+        target = pair['target']
+        # Register the correct word with the dictionary (one addition per pair);
+        # the boost applies to this in-memory NLPCorrector instance only
+        nlp.corrector.word_frequency.add(target)
+
+    print("Successfully tuned NLP probabilities for your handwriting!")
+    print("Showcase: The model has now 'learned' that '0f' is likely 'of'.")
+    print("==================================================")
+
+if __name__ == "__main__":
+    train_nlp_logic()
diff --git a/vector.jpeg b/vector.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..d529cc660ccc455132c2ef9294e322c1d212a77a
Binary files /dev/null and b/vector.jpeg differ
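
The frequency boost that `training/train_nlp.py` aims for can be illustrated without any dependencies. The sketch below is a minimal, hypothetical stand-in and not pyspellchecker's implementation: `edits1`, `correct`, and the toy counts in `freq` are assumptions made purely for illustration. It shows how raising a target word's count changes which edit-distance-1 candidate wins for the misread `wheo` listed in `training/nlp_data.json`.

```python
from collections import Counter
import string


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, freq):
    """Pick the most frequent known word within one edit of `word`."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=lambda w: freq[w]) if candidates else word


# Hypothetical toy counts; a real dictionary holds corpus-derived frequencies.
freq = Counter({"who": 5, "when": 9, "whey": 1})

before = correct("wheo", freq)  # "when" outranks "who" with these counts
freq["who"] += 10               # boost the target word, as the training step intends
after = correct("wheo", freq)   # "who" now has the highest count

print(f"before boost: {before}, after boost: {after}")
```

In the repository the corresponding call is `nlp.corrector.word_frequency.add(target)`. The sketch only mirrors the idea of frequency-ranked candidates and makes no claim about how pyspellchecker ranks them internally.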