Watsonx Docs Document Type Classifier

Binary classifier for IBM Watsonx technical documentation pages.
Given a documentation page, the model predicts whether it is:

  • conceptual (0): primarily used to understand or look up information
  • how-to (1): primarily used to complete a procedure or fix a problem

Model Details

Base embeddings: sentence-transformers/all-MiniLM-L6-v2
Classifier: LinearSVC (C=1.0, max_iter=2000)
Training dataset: itsjhuang/watsonx-docs-document-type
Input: title + first 800 words of document
Output: conceptual or how-to
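
The Input entry above can be read as a simple truncation rule. The following is a minimal sketch, assuming words are split on whitespace and the title is joined to the body with a newline. The helper name `build_input`, the separator, and the tokenization are illustrative assumptions, not confirmed details of the training pipeline.

```python
def build_input(title: str, body: str, max_words: int = 800) -> str:
    """Join a page title with the first `max_words` words of its body.

    Hypothetical helper: whitespace tokenization and the newline separator
    are assumptions; the model card only states "title + first 800 words".
    """
    words = body.split()  # split on any run of whitespace
    truncated = " ".join(words[:max_words])
    return f"{title.strip()}\n{truncated}".strip()
```

The returned string is what would be passed to the embedding model. If the original preprocessing used a different separator or tokenizer, predictions may differ slightly.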

Evaluation Results

Three conditions were trained and evaluated. The best model (B) was selected by test macro F1.

Condition  Embedding Model    Classifier           Train Acc  Train F1  Test Acc  Test F1
A          all-MiniLM-L6-v2   Logistic Regression  0.879      0.879     0.817     0.817
B ✅       all-MiniLM-L6-v2   LinearSVC            0.971      0.971     0.867     0.867
C          bge-small-en-v1.5  Logistic Regression  0.864      0.864     0.833     0.833

Confusion matrices for each condition are available in the repository files.
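
For readers unfamiliar with the selection metric, here is a minimal pure-Python sketch of a confusion matrix and macro F1 for the two labels. The toy labels are made up for illustration and are not taken from the real test set. The evaluation code used for the table above is not shown in this card, so treat this as an equivalent formulation rather than the author's script.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def macro_f1(y_true, y_pred, n_classes=2):
    """Unweighted mean of the per-class F1 scores."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    f1_scores = []
    for c in range(n_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n_classes)) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes

# Toy labels for illustration only (0 = conceptual, 1 = how-to).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))    # [[2, 1], [1, 2]]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```

Because macro F1 averages the per-class scores without weighting by class size, both labels count equally even if the test set is imbalanced.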

Usage

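import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Use the same embedding model as in training, and load the trained
# classifier from a local copy of best_model.joblib.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
clf = joblib.load("best_model.joblib")

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict(text):
    # `text` should be the page title followed by the first 800 words of the page.
    embedding = embedder.encode([text], convert_to_numpy=True)
    scores = clf.decision_function(embedding)[0]
    # A binary LinearSVC returns a single signed margin; expand it to two class scores.
    if np.ndim(scores) == 0:
        scores = np.array([-scores, scores])
    # Softmax over SVM margins gives relative scores, not calibrated probabilities.
    probs = softmax(scores)
    labels = ["conceptual", "how-to"]
    return {label: float(p) for label, p in zip(labels, probs)}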
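
The two-class softmax in `predict` reduces to a logistic function of the SVM margin: softmax over [-s, s] gives sigmoid(2s) for the how-to label. The sketch below shows that identity in plain Python and adds an optional abstain rule for pages near the decision boundary. The function names and the `min_abs_margin` threshold are hypothetical and untuned. The resulting scores are not calibrated probabilities.

```python
import math

def how_to_score(margin: float) -> float:
    """Softmax over [-margin, margin], i.e. sigmoid(2 * margin)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-2.0 * margin))
    e = math.exp(2.0 * margin)
    return e / (1.0 + e)

def label_with_abstain(margin: float, min_abs_margin: float = 0.25) -> str:
    """Return a label, or "uncertain" when the margin is near the boundary.

    `min_abs_margin` is a hypothetical threshold, not a tuned value.
    """
    if abs(margin) < min_abs_margin:
        return "uncertain"
    return "how-to" if margin > 0 else "conceptual"
```

Here `margin` stands for the scalar returned by `clf.decision_function(embedding)[0]`. An abstain band like this is one way to flag borderline pages for manual review.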

Limitations

  • Trained on IBM Watsonx documentation only; may not generalize to other technical documentation domains.
  • The label boundary between weakly procedural pages and conceptual capability descriptions remains a residual source of error.

Source Dataset

Derived from ibm-research/watsonxDocsQA, licensed under Apache 2.0.

