CreateDebate Counter-Argument Retrieval Models

This repository contains two fine-tuned bi-encoder models for counter-argument retrieval: given an argument, retrieve the most relevant counter-argument from a corpus of debate arguments. Both models are fine-tuned from sentence-transformers/all-mpnet-base-v2 on a curated subset of the CreateDebate Arguments Dataset, using MultipleNegativesRankingLoss with in-batch negatives.

The two models correspond to two different retrieval settings:

| Model folder | Setting | Ground truth | Use case |
| --- | --- | --- | --- |
| biencoder_scenario1 | Targeted retrieval | The author-tagged Disputed reply to a given argument | Find the specific rebuttal that was written in direct response to a claim |
| biencoder_scenario2 | General retrieval | Any argument from the opposing side of the same debate | Find any plausible counter-argument on the same topic, regardless of whether it directly addresses the specific claim |

These models were trained and evaluated as part of a paper introducing the CreateDebate Arguments Dataset. See the Limitations section before using these models in any downstream application.

Repository Structure

.
├── biencoder_scenario1/    # targeted retrieval model (sentence-transformers format)
├── biencoder_scenario2/    # general retrieval model (sentence-transformers format)
β”œβ”€β”€ corpus_df.pkl           # argument metadata for the evaluation corpus (pandas DataFrame)
└── corpus_embs.pt          # pre-computed embeddings for the evaluation corpus

corpus_df.pkl and corpus_embs.pt are provided so the usage example below works without re-encoding the corpus. If you want to retrieve from your own corpus, encode it with the relevant model and skip these two files.

Usage

Install dependencies:

pip install sentence-transformers huggingface_hub pandas torch numpy

Run a query against either model:

import textwrap
import numpy as np
import pandas as pd
import torch
from pathlib import Path
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

BOLD, CYAN, GREEN, YELLOW, RESET = "\033[1m", "\033[96m", "\033[92m", "\033[93m", "\033[0m"


def retrieve(
    query: str,
    model: SentenceTransformer,
    corpus_ids: list,
    corpus_embs: np.ndarray,
    corpus_df: pd.DataFrame,
    top_k: int = 10,
    exclude_ids: set = None,
) -> list:
    """
    Encode query, score against corpus, return top-k results.

    Parameters
    ----------
    query        : the query argument text
    exclude_ids  : set of argument IDs to exclude from results
                   (e.g. to avoid returning the query itself if it's in the corpus)
    """
    query_emb = model.encode(
        [query],
        normalize_embeddings=True,
        convert_to_numpy=True,
    )[0]

    # Score all corpus arguments
    scores = corpus_embs @ query_emb            # (N,)

    # Rank descending
    ranked_indices = np.argsort(scores)[::-1]

    # Build result list, skipping excluded IDs
    id_to_row = corpus_df.set_index("argumentId").to_dict("index")
    results   = []

    for idx in ranked_indices:
        arg_id = corpus_ids[idx]

        if exclude_ids and arg_id in exclude_ids:
            continue

        row = id_to_row.get(arg_id, {})
        results.append({
            "rank":          len(results) + 1,
            "argumentId":    arg_id,
            "score":         float(scores[idx]),
            "argumentBody":  row.get("argumentBody", ""),
            "argumentSide":  row.get("argumentSide", ""),
            "argumentTag":   row.get("argumentTag", ""),
            "debateTitle":   row.get("debateTitle", ""),
            "debateUrl":     row.get("debateUrl", ""),
            "depth":         row.get("depth", -1),
            "username":      row.get("username", ""),
        })

        if len(results) >= top_k:
            break

    return results


def display_results(query: str, results: list, mode: str):
    width = 80

    print("\n" + "=" * width)
    print(f"{BOLD}{CYAN}  QUERY ARGUMENT{RESET}")
    print("=" * width)
    print(textwrap.fill(query, width=width, initial_indent="  ", subsequent_indent="  "))

    print("\n" + "=" * width)
    label = "TARGETED COUNTER-ARGUMENTS (Disputed)" if mode == "targeted" \
            else "GENERAL COUNTER-ARGUMENTS (Side-Based)"
    print(f"{BOLD}{GREEN}  TOP {len(results)} {label}{RESET}")
    print("=" * width)

    for res in results:
        tag_str   = f" [{res['argumentTag']}]" if res.get("argumentTag") else ""
        depth_str = f"depth={res['depth']}"
        score_str = f"score={res['score']:.4f}"

        print(f"\n{BOLD}  Rank {res['rank']}{RESET}  |  {score_str}  |  {depth_str}{tag_str}")
        print(f"  {YELLOW}Side:{RESET} {res['argumentSide']}")
        print(f"  {YELLOW}Debate:{RESET} {res['debateTitle']}")
        print()
        body = res["argumentBody"]
        wrapped = textwrap.fill(
            body, width=width - 4,
            initial_indent="    ",
            subsequent_indent="    "
        )
        print(wrapped)
        print("  " + "-" * (width - 2))

    print()


if __name__ == "__main__":
    DEVICE     = "cuda" if torch.cuda.is_available() else "cpu"
    TOP_K      = 10
    MODE       = "targeted"   # or "general"
    query      = "Capital punishment is justified as a deterrent to serious crime."

    # Download the repo (models + corpus files) and cache it locally
    repo_path = Path(snapshot_download(repo_id="azza1625/counter-argument-retrieval"))

    MODEL_PATHS = {
        "targeted": repo_path / "biencoder_scenario1",
        "general":  repo_path / "biencoder_scenario2",
    }

    model_path  = MODEL_PATHS[MODE]
    corpus_df   = pd.read_pickle(repo_path / "corpus_df.pkl")
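    corpus_embs = torch.load(repo_path / "corpus_embs.pt", map_location="cpu", weights_only=False)
    # retrieve() expects a NumPy array; convert if the file holds a torch.Tensor
    if isinstance(corpus_embs, torch.Tensor):
        corpus_embs = corpus_embs.detach().cpu().numpy()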

    print(f"\nLoading {MODE} model from {model_path}...")
    model = SentenceTransformer(str(model_path), device=DEVICE)

    corpus_ids = corpus_df["argumentId"].tolist()

    results = retrieve(
        query=query,
        model=model,
        corpus_ids=corpus_ids,
        corpus_embs=corpus_embs,
        corpus_df=corpus_df,
        top_k=TOP_K,
    )

    display_results(query, results, MODE)

To retrieve from your own corpus instead of the bundled one, encode your arguments with the same model and skip corpus_df.pkl / corpus_embs.pt:

corpus_texts = [...]       # list of argument strings
corpus_ids   = [...]       # matching list of IDs

corpus_embs = model.encode(
    corpus_texts,
    batch_size=128,
    normalize_embeddings=True,
    convert_to_numpy=True,
)
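
The reported evaluation restricts candidates to the query's own debate, and you may want the same behaviour on your own corpus. The following is a minimal sketch, not part of the released code: `top_k_within_debate` and the parallel `debate_ids` list are hypothetical names introduced here, and the toy 2-D vectors stand in for real model embeddings.

```python
import numpy as np


def top_k_within_debate(query_emb, corpus_embs, corpus_ids, debate_ids,
                        target_debate, top_k=10, exclude_ids=None):
    """Rank only the candidates from `target_debate`, skipping `exclude_ids`."""
    exclude_ids = exclude_ids or set()
    # Indices of candidates that belong to the debate and are not excluded
    keep = np.array([
        i for i, (arg_id, deb) in enumerate(zip(corpus_ids, debate_ids))
        if deb == target_debate and arg_id not in exclude_ids
    ], dtype=int)
    if keep.size == 0:
        return []
    # Embeddings are L2-normalised, so the dot product is cosine similarity
    scores = corpus_embs[keep] @ query_emb
    order = np.argsort(-scores)[:top_k]
    return [(corpus_ids[keep[j]], float(scores[j])) for j in order]


# Toy data (illustrative only): four unit vectors in 2-D, two debates
corpus_ids = ["a1", "a2", "b1", "b2"]
debate_ids = ["d1", "d1", "d2", "d2"]
corpus_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]])
query_emb = corpus_embs[0]          # pretend "a1" is the query

hits = top_k_within_debate(query_emb, corpus_embs, corpus_ids, debate_ids,
                           target_debate="d1", exclude_ids={"a1"})
print(hits)
```

Passing the query's own ID in `exclude_ids` keeps it from being returned as its own best match when the query is already part of the corpus.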

Training Details

  • Base model: sentence-transformers/all-mpnet-base-v2
  • Loss: MultipleNegativesRankingLoss (in-batch negatives)
  • Batch size: 64
  • Learning rate: 2e-5
  • Epochs: 3, with linear warmup over 10% of training steps
  • Checkpoint selection: best epoch by MRR@10 on the validation split
  • Training data: curated subset of the CreateDebate Arguments Dataset (84,872 arguments, 578 debates), split at the debate level (75/10/15) to prevent topic leakage between train, validation, and test
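
A debate-level split assigns whole debates, rather than individual arguments, to each partition. The sketch below illustrates the idea with the 75/10/15 ratios listed above. The shuffling seed, the rounding, and the `split_debates` helper are assumptions made for illustration and may differ from the procedure actually used.

```python
import random


def split_debates(debate_ids, train=0.75, val=0.10, seed=0):
    """Assign whole debates to train/val/test so no debate spans two splits."""
    unique = sorted(set(debate_ids))
    random.Random(seed).shuffle(unique)   # seed is an arbitrary, assumed value
    n_train = round(len(unique) * train)
    n_val = round(len(unique) * val)
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),   # remainder, roughly 15%
    }


# 578 placeholder debate IDs, matching the number of debates in the curated subset
splits = split_debates([f"debate_{i}" for i in range(578)])
print({name: len(ids) for name, ids in splits.items()})
```

Every argument then inherits the split of its debate, which is what prevents a topic seen in training from reappearing at validation or test time.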

biencoder_scenario1 is trained on targeted query-positive pairs, where the positive is the author-tagged Disputed reply to the query argument (44,569 pairs). biencoder_scenario2 is trained on general query-positive pairs, where the positive is any argument from the opposing side of the same debate (80,885 pairs, capped at 500 per debate).
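
To make the role of in-batch negatives concrete, here is a small NumPy sketch of the objective behind MultipleNegativesRankingLoss: for each query in a batch, its paired positive is the correct class and every other positive in the batch acts as a negative. It is an illustration rather than the training code. The `mnr_loss` name is hypothetical, and the scale of 20 is the sentence-transformers default, which we assume was left unchanged.

```python
import numpy as np


def mnr_loss(query_embs, positive_embs, scale=20.0):
    """Cross-entropy over scaled cosine similarities; row i's positive is column i."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                             # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each query's own positive
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))            # random stand-ins for embeddings

aligned = mnr_loss(queries, queries)         # every positive matches its query
shuffled = mnr_loss(queries, queries[::-1])  # positives paired with the wrong query
print(aligned, shuffled)
```

The loss is low when each query is most similar to its own positive and high when another item in the batch scores better. This is also why larger batches supply more negatives per query.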

Evaluation Results

Both models were evaluated on held-out test debates, with candidates restricted to arguments from the same debate as the query.

Scenario 1 (targeted retrieval), evaluated with biencoder_scenario1:

| Metric | Score |
| --- | --- |
| MRR | 0.444 |
| Recall@1 | 0.301 |
| Recall@5 | 0.605 |
| Recall@10 | 0.711 |
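
For readers who want to compute comparable numbers on their own rankings, the sketch below shows the standard single-ground-truth definitions of MRR and Recall@k. The helper names and toy rankings are illustrative assumptions; the paper's evaluation script may differ in details such as rank cutoffs or tie handling.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold item, or 0.0 if it was not retrieved."""
    for rank, arg_id in enumerate(ranked_ids, start=1):
        if arg_id == gold_id:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the single gold item is in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


# Toy rankings: (ranked candidate IDs, gold counter-argument ID)
runs = [
    (["c3", "c1", "c2"], "c3"),   # gold at rank 1
    (["c5", "c4", "c6"], "c4"),   # gold at rank 2
    (["c7", "c8", "c9"], "c0"),   # gold not retrieved
]
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
r_at_1 = sum(recall_at_k(r, g, 1) for r, g in runs) / len(runs)
print(f"MRR={mrr:.3f}  Recall@1={r_at_1:.3f}")
```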

Scenario 2 (general retrieval), evaluated with biencoder_scenario2:

| Metric | Score |
| --- | --- |
| Recall@10 | 0.077 |
| Recall@20 | 0.157 |

The low Recall@10/20 on Scenario 2 reflects the size of the positive pool (a mean of ~83 valid opposing-side arguments per query), not necessarily a failure of the model relative to other approaches; a BM25 lexical baseline performs comparably on this setting. See the accompanying paper for full baseline comparisons and a discussion of why Scenario 2 is difficult for all evaluated methods.
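
A short worked example shows why a large positive pool caps these scores. It assumes that Recall@k is the fraction of all valid positives that appear in the top k. That definition is our assumption for illustration, so check the paper for the exact metric. Under it, even a perfect ranker cannot exceed k divided by the pool size.

```python
def recall_at_k_multi(ranked_ids, positive_ids, k):
    """Fraction of all positives that appear in the top k (assumed definition)."""
    if not positive_ids:
        return 0.0
    hits = sum(1 for arg_id in ranked_ids[:k] if arg_id in positive_ids)
    return hits / len(positive_ids)


# A perfect ranker over a pool of 83 positives (the reported mean pool size),
# followed by 50 made-up negatives
positives = {f"p{i}" for i in range(83)}
perfect_ranking = [f"p{i}" for i in range(83)] + [f"n{i}" for i in range(50)]

ceilings = {k: recall_at_k_multi(perfect_ranking, positives, k) for k in (10, 20)}
for k, ceiling in ceilings.items():
    print(f"best possible Recall@{k} with 83 positives: {ceiling:.3f}")
```

With 83 positives the ceiling is about 0.120 at k=10 and 0.241 at k=20, so the reported scores are best read against those bounds rather than against 1.0.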

Querying biencoder_scenario1 does not require debate-level context (title/description) to be included in the query text. In our experiments, adding debate context to the query consistently hurt retrieval performance for both lexical and neural models when all candidates come from the same debate, since it adds shared vocabulary that doesn't help distinguish between candidates. We recommend passing only the argument text itself as the query.

Limitations and Considerations

A qualitative analysis of biencoder_scenario1 identified three recurring failure patterns that are useful to know before relying on these models:

  1. Pragmatic and contextual rebuttals are hard to retrieve. Counter-arguments that operate at a pragmatic level (e.g. asking for clarification, or reframing the premise of an argument rather than directly engaging its content) tend to have low semantic similarity to the query and are often missed, even when they are the most effective real-world response.

  2. Single ground-truth labels can be ambiguous. In many cases, several arguments in the candidate pool are equally valid counter-arguments, but only one is the labeled ground truth (the one tagged as a direct reply). A model that retrieves a different, equally valid counter-argument is still scored as wrong. Retrieval metrics on this dataset should therefore be read as a lower bound on true model capability.

  3. The training and evaluation data contain non-substantive exchanges. Because the source dataset comes from a naturally occurring, unfiltered debate platform, some "Disputed" replies are personal remarks, platform notices, or off-topic dismissals rather than genuine counter-arguments. These cases are unanswerable by any text-only retrieval model and contribute to the error rate.

These models are intended for research on computational argumentation and counter-argument retrieval. They are not intended to be used as a standalone fact-checking, content moderation, or debate-arbitration tool, and outputs should not be treated as an authoritative judgment of which side of an argument is "correct."

License

These models are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, matching the license of the underlying training data.
