---
language:
- en
license: apache-2.0
base_model:
- sentence-transformers/all-MiniLM-L6-v2
datasets:
- itsjhuang/watsonx-docs-document-type
tags:
- text-classification
- embeddings
- technical-documentation
metrics:
- accuracy
- f1
---

# Watsonx Docs Document Type Classifier

Binary classifier for IBM Watsonx technical documentation pages. Given a documentation page, the model predicts whether it is:

- `conceptual` (0): primarily used to understand or look up information
- `how-to` (1): primarily used to complete a procedure or fix a problem

## Model Details

| Property | Value |
|---|---|
| Base embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Classifier | LinearSVC (C=1.0, max_iter=2000) |
| Training dataset | [itsjhuang/watsonx-docs-document-type](https://huggingface.co/datasets/itsjhuang/watsonx-docs-document-type) |
| Input | title + first 800 words of document |
| Output | `conceptual` or `how-to` |

## Evaluation Results

Three conditions were trained and evaluated. The best model (B) was selected by test macro F1.

| Condition | Embedding Model | Classifier | Train Acc | Train F1 | Test Acc | Test F1 |
|---|---|---|---:|---:|---:|---:|
| A | all-MiniLM-L6-v2 | Logistic Regression | 0.879 | 0.879 | 0.817 | 0.817 |
| B ✅ | all-MiniLM-L6-v2 | LinearSVC | 0.971 | 0.971 | 0.867 | 0.867 |
| C | bge-small-en-v1.5 | Logistic Regression | 0.864 | 0.864 | 0.833 | 0.833 |

Confusion matrices for each condition are available in the repository files.
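
For readers unfamiliar with the selection metric, here is a minimal sketch of macro F1 for the two classes: the unweighted mean of the per-class F1 scores. The function name `macro_f1` and the toy labels are illustrative assumptions, not the evaluation code or data behind the table above.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 (0 = conceptual, 1 = how-to)."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, macro F1 does not let the larger class dominate the score the way plain accuracy can.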

## Usage

```python
import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Use the same embedding model the classifier was trained on.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
clf = joblib.load("best_model.joblib")


def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def predict(text):
    """Return a {label: score} dict for one document (title + first 800 words)."""
    embedding = embedder.encode([text], convert_to_numpy=True)
    scores = clf.decision_function(embedding)[0]
    # A binary LinearSVC returns one signed margin; expand it to one score per class.
    if np.ndim(scores) == 0:
        scores = np.array([-scores, scores])
    # Softmax over SVM margins yields uncalibrated confidence scores.
    probs = softmax(scores)
    labels = ["conceptual", "how-to"]
    return dict(zip(labels, probs))
```

## Limitations

- Trained on IBM Watsonx documentation only; may not generalize to other technical documentation domains.
- The label boundary between weakly procedural pages and conceptual capability descriptions remains a residual source of error.

## Source Dataset

The training data is derived from [`ibm-research/watsonxDocsQA`](https://huggingface.co/datasets/ibm-research/watsonxDocsQA), which is licensed under Apache 2.0.
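
When classifying pages from this or a similar source, the input format listed under Model Details (title + first 800 words) has to be assembled before calling `predict`. The sketch below shows one way to do that. The helper name `build_input`, whitespace-based word splitting, and the newline separator between title and body are all assumptions; the card does not specify how the training inputs were joined or tokenized.

```python
def build_input(title, body, max_words=800):
    """Join a page title with the first `max_words` words of its body.

    Assumptions: words are whitespace-delimited, and the title and body
    are separated by a single newline.
    """
    words = body.split()
    truncated = " ".join(words[:max_words])
    return f"{title}\n{truncated}".strip()


# Hypothetical page, invented for illustration.
text = build_input(
    "Creating a deployment space",
    "To create a deployment space, open the navigation menu and select Deployments.",
)
```

The resulting string can then be passed to `predict(text)` from the Usage section.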