pemix09
/

polish_document_type_classifier

LiteRT

pemix09 committed on Jan 20

Commit

ed0f15d

1 Parent(s): 68352bc

Upload folder using huggingface_hub

Files changed (1)

learn_with_history_visualisation.ipynb +9 -28

@@ -367,41 +367,22 @@
    "id": "fe8ce873",
    "metadata": {},
    "source": [
     "\n",
-    "Analyzing your Confusion Matrix, it's clear that the classifier is performing remarkably well given the complexity of having 31 different document categories. The strong diagonal line indicates that for most classes, the model is predicting the document type correctly.\n",
     "\n",
-    "Key Takeaways from the Matrix:\n",
-    "Best Performing Categories: Classes like contract (12 correct), educationdocument (8 correct), taxdocument (8 correct), and invoice (7 correct) show high accuracy with very few misclassifications.\n",
     "\n",
-    "Minor Confusions:\n",
     "\n",
-    "courtdocument is occasionally mistaken for contract (1 instance). This makes sense as legal language can be very similar.\n",
     "\n",
-    "idcard was once confused with cv, likely due to both containing personal names and profile-like information.\n",
     "\n",
-    "medicaldocument seems to be a slight \"magnet\" for errors; documents like vaccinationcard and medicaldocument itself (predicted as referral) show some overlap in medical terminology.\n",
     "\n",
-    "Data Sparsity: Many categories (like bankstatement or birthcertificate) have very few samples in this validation set (only 1). While the model got them right, more data would be needed to confirm its stability for these specific types."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0d159f73",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from huggingface_hub import HfApi\n",
-    "\n",
-    "api = HfApi()\n",
-    "\n",
-    "# Upload the entire folder with the new version\n",
-    "api.upload_folder(\n",
-    "    folder_path=\"../\",\n",
-    "    repo_id=\"twoja_nazwa/nazwa-modelu\",\n",
-    "    repo_type=\"model\", # or \"dataset\"\n",
-    "    commit_message=\"Aktualizacja modelu v2\"\n",
-    ")"
    ]
   }
  ],

    "id": "fe8ce873",
    "metadata": {},
    "source": [
+    "Confusion Matrix Analysis\n",
+    "An examination of the confusion matrix shows that the classifier performs well given the complexity of a 31-class document categorization task. The prominent diagonal indicates that predictions agree with the ground truth for most categories.\n",
     "\n",
+    "Key Findings:\n",
     "\n",
+    "High-Performing Categories: The model demonstrates superior discriminative capabilities for classes such as contract (12 correct), educationdocument (8), taxdocument (8), and invoice (7). These categories show high classification accuracy with negligible misclassification rates, suggesting the model has successfully captured their distinct structural or linguistic features.\n",
     "\n",
+    "Inter-class Ambiguities:\n",
     "\n",
+    "A minor degree of confusion was observed between courtdocument and contract. This overlap is conceptually justified, as both categories frequently employ specialized legal terminology and formal syntactical structures, leading to high lexical similarity.\n",
     "\n",
+    "The misclassification of an idcard as a cv suggests that the model may be responding to shared attributes, specifically the presence of personal identifiers and profile-oriented information layouts.\n",
     "\n",
+    "The medicaldocument class acts as a slight focal point for errors regarding related sub-categories (e.g., vaccinationcard and referral). This indicates a high degree of semantic overlap in medical terminology, which presents a challenge for fine-grained classification.\n",
     "\n",
+    "Data Sparsity and Generalization: Several categories, including bankstatement and birthcertificate, are represented by a limited number of samples within the validation set. While the model correctly identified these instances, the statistical significance of these results remains constrained. Further validation utilizing a more balanced and extensive dataset is required to confirm the model’s stability and generalizability across these sparsely represented classes."
    ]
   }
  ],
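The notebook text in this diff reads per-class results directly off the confusion matrix and warns that some classes have very few validation samples. A minimal sketch of how such a reading could be computed is shown below. It assumes the common convention that rows are true classes and columns are predicted classes. The function name `per_class_report`, the `min_support` threshold, and the toy counts are hypothetical illustrations, not the notebook's code or its actual matrix.

```python
from typing import Dict, List


def per_class_report(
    matrix: List[List[int]], labels: List[str], min_support: int = 5
) -> Dict[str, dict]:
    """Summarise a confusion matrix (assumed: rows = true class, columns = predicted class)."""
    report = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        support = sum(matrix[i])                   # true samples of this class
        predicted = sum(row[i] for row in matrix)  # times this class was predicted
        report[label] = {
            "support": support,
            "recall": correct / support if support else None,
            "precision": correct / predicted if predicted else None,
            # Flag classes whose scores rest on too few samples to be trusted.
            "low_support": support < min_support,
        }
    return report


# Toy counts for illustration only; this is not the notebook's real matrix.
labels = ["contract", "courtdocument", "bankstatement"]
matrix = [
    [12, 0, 0],  # contract: all 12 predicted correctly
    [1, 4, 0],   # courtdocument: 1 mistaken for contract
    [0, 0, 1],   # bankstatement: a single validation sample
]

report = per_class_report(matrix, labels)
for label, stats in report.items():
    print(label, stats)
```

In this toy case `bankstatement` has perfect recall but is flagged as low support, which mirrors the diff's caveat that a correct prediction on a single sample says little about stability.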