CreateDebate Counter-Argument Retrieval Models
This repository contains two fine-tuned bi-encoder models for counter-argument retrieval: given an argument, retrieve the most relevant counter-argument from a corpus of debate arguments. Both models are fine-tuned from sentence-transformers/all-mpnet-base-v2 on a curated subset of the CreateDebate Arguments Dataset, using MultipleNegativesRankingLoss with in-batch negatives.
The two models correspond to two different retrieval settings:
| Model folder | Setting | Ground truth | Use case |
|---|---|---|---|
| biencoder_scenario1 | Targeted retrieval | The author-tagged Disputed reply to a given argument | Find the specific rebuttal that was written in direct response to a claim |
| biencoder_scenario2 | General retrieval | Any argument from the opposing side of the same debate | Find any plausible counter-argument on the same topic, regardless of whether it directly addresses the specific claim |
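To make the difference between the two ground-truth definitions concrete, here is a minimal sketch that derives positives for both settings from a few toy records. The names argumentId, argumentSide, and argumentTag match columns of corpus_df; parentId and debateId are hypothetical fields assumed here to link a reply to its parent and to group arguments by debate, and the records themselves are invented for illustration.

```python
from collections import defaultdict

# Toy records, invented for illustration.
# "parentId" and "debateId" are hypothetical field names.
ARGUMENTS = [
    {"argumentId": "a1", "debateId": "d1", "argumentSide": "Yes", "argumentTag": "", "parentId": None},
    {"argumentId": "a2", "debateId": "d1", "argumentSide": "No", "argumentTag": "Disputed", "parentId": "a1"},
    {"argumentId": "a3", "debateId": "d1", "argumentSide": "No", "argumentTag": "", "parentId": None},
    {"argumentId": "a4", "debateId": "d1", "argumentSide": "Yes", "argumentTag": "", "parentId": "a1"},
]


def targeted_positives(arguments):
    """Scenario 1: map each argument ID to the replies tagged Disputed."""
    positives = defaultdict(list)
    for arg in arguments:
        if arg["argumentTag"] == "Disputed" and arg["parentId"] is not None:
            positives[arg["parentId"]].append(arg["argumentId"])
    return dict(positives)


def general_positives(arguments):
    """Scenario 2: map each argument ID to every opposing-side argument
    in the same debate."""
    by_debate = defaultdict(list)
    for arg in arguments:
        by_debate[arg["debateId"]].append(arg)
    positives = {}
    for group in by_debate.values():
        for arg in group:
            positives[arg["argumentId"]] = [
                other["argumentId"]
                for other in group
                if other["argumentSide"] != arg["argumentSide"]
            ]
    return positives
```

In this toy debate, a1 has one targeted positive (a2) but two general positives (a2 and a3), which is why the general setting has a much larger positive pool per query.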
These models were trained and evaluated as part of a paper introducing the CreateDebate Arguments Dataset. See the Limitations section before using these models in any downstream application.
Repository Structure
.
├── biencoder_scenario1/   # targeted retrieval model (sentence-transformers format)
├── biencoder_scenario2/   # general retrieval model (sentence-transformers format)
├── corpus_df.pkl          # argument metadata for the evaluation corpus (pandas DataFrame)
└── corpus_embs.pt         # pre-computed embeddings for the evaluation corpus
corpus_df.pkl and corpus_embs.pt are provided so the usage example below works without re-encoding the corpus. If you want to retrieve from your own corpus, encode it with the relevant model and skip these two files.
Usage
Install dependencies:
pip install sentence-transformers huggingface_hub pandas torch numpy
Run a query against either model:
import textwrap
import numpy as np
import pandas as pd
import torch
from pathlib import Path
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
BOLD, CYAN, GREEN, YELLOW, RESET = "\033[1m", "\033[96m", "\033[92m", "\033[93m", "\033[0m"


def retrieve(
    query: str,
    model: SentenceTransformer,
    corpus_ids: list,
    corpus_embs: np.ndarray,
    corpus_df: pd.DataFrame,
    top_k: int = 10,
    exclude_ids: set = None,
) -> list:
    """
    Encode query, score against corpus, return top-k results.

    Parameters
    ----------
    query       : the query argument text
    exclude_ids : set of argument IDs to exclude from results
                  (e.g. to avoid returning the query itself if it's in the corpus)
    """
    query_emb = model.encode(
        [query],
        normalize_embeddings=True,
        convert_to_numpy=True,
    )[0]

    # Score all corpus arguments
    scores = corpus_embs @ query_emb  # (N,)

    # Rank descending
    ranked_indices = np.argsort(scores)[::-1]

    # Build result list, skipping excluded IDs
    id_to_row = corpus_df.set_index("argumentId").to_dict("index")
    results = []
    for idx in ranked_indices:
        arg_id = corpus_ids[idx]
        if exclude_ids and arg_id in exclude_ids:
            continue
        row = id_to_row.get(arg_id, {})
        results.append({
            "rank": len(results) + 1,
            "argumentId": arg_id,
            "score": float(scores[idx]),
            "argumentBody": row.get("argumentBody", ""),
            "argumentSide": row.get("argumentSide", ""),
            "argumentTag": row.get("argumentTag", ""),
            "debateTitle": row.get("debateTitle", ""),
            "debateUrl": row.get("debateUrl", ""),
            "depth": row.get("depth", -1),
            "username": row.get("username", ""),
        })
        if len(results) >= top_k:
            break
    return results


def display_results(query: str, results: list, mode: str):
    width = 80
    print("\n" + "=" * width)
    print(f"{BOLD}{CYAN} QUERY ARGUMENT{RESET}")
    print("=" * width)
    print(textwrap.fill(query, width=width, initial_indent=" ", subsequent_indent=" "))
    print("\n" + "=" * width)
    label = "TARGETED COUNTER-ARGUMENTS (Disputed)" if mode == "targeted" \
        else "GENERAL COUNTER-ARGUMENTS (Side-Based)"
    print(f"{BOLD}{GREEN} TOP {len(results)} {label}{RESET}")
    print("=" * width)
    for res in results:
        tag_str = f" [{res['argumentTag']}]" if res.get("argumentTag") else ""
        depth_str = f"depth={res['depth']}"
        score_str = f"score={res['score']:.4f}"
        print(f"\n{BOLD} Rank {res['rank']}{RESET} | {score_str} | {depth_str}{tag_str}")
        print(f" {YELLOW}Side:{RESET} {res['argumentSide']}")
        print(f" {YELLOW}Debate:{RESET} {res['debateTitle']}")
        print()
        body = res["argumentBody"]
        wrapped = textwrap.fill(
            body, width=width - 4,
            initial_indent=" ",
            subsequent_indent=" "
        )
        print(wrapped)
        print(" " + "-" * (width - 2))
    print()


if __name__ == "__main__":
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    TOP_K = 10
    MODE = "targeted"  # or "general"
    query = "Capital punishment is justified as a deterrent to serious crime."

    # Download the repo (models + corpus files) and cache it locally
    repo_path = Path(snapshot_download(repo_id="azza1625/counter-argument-retrieval"))
    MODEL_PATHS = {
        "targeted": repo_path / "biencoder_scenario1",
        "general": repo_path / "biencoder_scenario2",
    }
    model_path = MODEL_PATHS[MODE]

    corpus_df = pd.read_pickle(repo_path / "corpus_df.pkl")
    corpus_embs = torch.load(repo_path / "corpus_embs.pt", weights_only=False)
    # retrieve() scores with NumPy, so convert if the file holds a torch tensor
    if isinstance(corpus_embs, torch.Tensor):
        corpus_embs = corpus_embs.cpu().numpy()

    print(f"\nLoading {MODE} model from {model_path}...")
    model = SentenceTransformer(str(model_path), device=DEVICE)

    # Assumes corpus_df rows are in the same order as the rows of corpus_embs
    corpus_ids = corpus_df["argumentId"].tolist()
    results = retrieve(
        query=query,
        model=model,
        corpus_ids=corpus_ids,
        corpus_embs=corpus_embs,
        corpus_df=corpus_df,
        top_k=TOP_K,
    )
    display_results(query, results, MODE)
To retrieve from your own corpus instead of the bundled one, encode your arguments with the same model and skip corpus_df.pkl / corpus_embs.pt:
corpus_texts = [...]  # list of argument strings
corpus_ids = [...]    # matching list of IDs
corpus_embs = model.encode(
    corpus_texts,
    batch_size=128,
    normalize_embeddings=True,
    convert_to_numpy=True,
)
Training Details
- Base model: sentence-transformers/all-mpnet-base-v2
- Loss: MultipleNegativesRankingLoss (in-batch negatives)
- Batch size: 64
- Learning rate: 2e-5
- Epochs: 3, with linear warmup over 10% of training steps
- Checkpoint selection: best epoch by MRR@10 on the validation split
- Training data: curated subset of the CreateDebate Arguments Dataset (84,872 arguments, 578 debates), split at the debate level (75/10/15) to prevent topic leakage between train, validation, and test
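Splitting at the debate level means every argument from a given debate lands in exactly one split. The following is a minimal sketch of one way to do this, not the procedure actually used: the function name split_by_debate, the seed, and the rounding of split sizes are all assumptions, and the card does not say whether the 75/10/15 ratio was applied to debate counts or argument counts.

```python
import random


def split_by_debate(debate_ids, ratios=(0.75, 0.10, 0.15), seed=0):
    """Assign whole debates to train/val/test so no debate spans two splits.

    Hypothetical helper: shuffles the unique debate IDs with a fixed seed
    and cuts the shuffled list by debate count.
    """
    unique = sorted(set(debate_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(ratios[0] * len(unique))
    n_val = round(ratios[1] * len(unique))
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),  # remainder goes to test
    }


# Example with 578 placeholder debate IDs (the real IDs are not shown here).
splits = split_by_debate(range(578))
```

An argument is then assigned to whichever split contains its debate ID, so queries, positives, and in-batch negatives for a test debate are never seen during training.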
biencoder_scenario1 is trained on targeted query-positive pairs, where the positive is the author-tagged Disputed reply to the query argument (44,569 pairs). biencoder_scenario2 is trained on general query-positive pairs, where the positive is any argument from the opposing side of the same debate (80,885 pairs, capped at 500 per debate).
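Given such query-positive pairs, a loss with in-batch negatives treats the positive paired with each query as the correct answer and the positives of all other queries in the batch as negatives. The sketch below is a plain-Python illustration of that idea (scaled cosine similarity followed by cross-entropy), not the sentence-transformers implementation. The scale of 20.0 mirrors that library's default; the value used to train these models is not stated, so treat it as an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_embs, positive_embs, scale=20.0):
    """Mean cross-entropy where positive i is the correct 'class' for query i
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, pos) for pos in positive_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

With a batch size of 64, each query is contrasted against 63 in-batch negatives under this formulation.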
Evaluation Results
Both models were evaluated on held-out test debates, with candidates restricted to arguments from the same debate as the query.
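Restricting candidates to the query's own debate can be sketched as a filter applied before ranking. This is an illustrative outline only: the function name, the (arg_id, debate_id, vector) layout, and the choice to drop the query itself from its candidate list are assumptions, and vectors are assumed to be L2-normalized so that the dot product equals cosine similarity.

```python
def rank_within_debate(query_vec, query_debate, corpus, exclude_id=None):
    """Rank candidates from the query's debate by dot product.

    corpus: iterable of (arg_id, debate_id, vector) triples (hypothetical layout).
    exclude_id: optional ID to drop, e.g. the query argument itself.
    """
    scored = []
    for arg_id, debate_id, vec in corpus:
        if debate_id != query_debate or arg_id == exclude_id:
            continue
        score = sum(q * c for q, c in zip(query_vec, vec))
        scored.append((score, arg_id))
    # Highest score first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arg_id for _, arg_id in scored]
```

Because every candidate shares the query's topic, the ranking has to separate arguments by stance and content rather than by subject matter.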
Scenario 1 (targeted retrieval), evaluated with biencoder_scenario1:
| Metric | Score |
|---|---|
| MRR | 0.444 |
| Recall@1 | 0.301 |
| Recall@5 | 0.605 |
| Recall@10 | 0.711 |
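For readers who want to compute comparable numbers on their own rankings, here is a minimal sketch of MRR and Recall@k for the single-ground-truth case. It assumes one gold counter-argument per query and an untruncated reciprocal rank; whether the MRR reported above is cut off at rank 10 is not stated, and the function names are illustrative.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold item, or 0.0 if it is not in the ranking."""
    for rank, arg_id in enumerate(ranked_ids, start=1):
        if arg_id == gold_id:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold item is in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def evaluate(rankings, gold, ks=(1, 5, 10)):
    """rankings: query ID -> ranked candidate IDs; gold: query ID -> gold ID."""
    n = len(rankings)
    metrics = {
        "MRR": sum(reciprocal_rank(r, gold[q]) for q, r in rankings.items()) / n
    }
    for k in ks:
        hits = sum(hit_at_k(r, gold[q], k) for q, r in rankings.items())
        metrics[f"Recall@{k}"] = hits / n
    return metrics
```

With a single gold item per query, Recall@k reduces to the share of queries whose gold counter-argument appears in the top k.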
Scenario 2 (general retrieval), evaluated with biencoder_scenario2:
| Metric | Score |
|---|---|
| Recall@10 | 0.077 |
| Recall@20 | 0.157 |
The low Recall@10/20 on Scenario 2 reflects the size of the positive pool (a mean of ~83 valid opposing-side arguments per query), not necessarily a failure of the model relative to other approaches; a BM25 lexical baseline performs comparably on this setting. See the accompanying paper for full baseline comparisons and a discussion of why Scenario 2 is difficult for all evaluated methods.
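The effect of pool size can be illustrated with a short calculation. The sketch assumes Recall@k is the fraction of a query's positives that appear in the top k, which is one common definition and is not confirmed by this card. Under that assumption, a query with 83 positives can reach at most 10/83 ≈ 0.120 at k=10 and 20/83 ≈ 0.241 at k=20, even with a perfect ranking. Since ~83 is a mean, per-query ceilings vary, so these figures are indicative only.

```python
def recall_at_k(ranked_ids, positive_ids, k):
    """Fraction of a query's positives that appear in the top k."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    hits = sum(1 for arg_id in ranked_ids[:k] if arg_id in positives)
    return hits / len(positives)


def recall_ceiling(k, num_positives):
    """Best achievable Recall@k: at most k positives fit in the top k."""
    return min(k, num_positives) / num_positives


# A perfect ranking over a pool of 83 positives (toy IDs) still hits the ceiling.
toy_positives = [f"p{i}" for i in range(83)]
perfect_ranking = toy_positives + ["n0", "n1"]
best_at_10 = recall_at_k(perfect_ranking, toy_positives, 10)
```

Reading the Scenario 2 scores against such a ceiling, rather than against 1.0, gives a fairer sense of how much headroom remains.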
Querying biencoder_scenario1 does not require debate-level context (title/description) to be included in the query text. In our experiments, adding debate context to the query consistently hurt retrieval performance for both lexical and neural models when all candidates come from the same debate, since it adds shared vocabulary that doesn't help distinguish between candidates. We recommend passing only the argument text itself as the query.
Limitations and Considerations
A qualitative analysis of biencoder_scenario1 identified three recurring failure patterns that are useful to know before relying on these models:
Pragmatic and contextual rebuttals are hard to retrieve. Counter-arguments that operate at a pragmatic level (e.g. asking for clarification, or reframing the premise of an argument rather than directly engaging its content) tend to have low semantic similarity to the query and are often missed, even when they are the most effective real-world response.
Single ground truth labels can be ambiguous. In many cases, several arguments in the candidate pool are equally valid counter-arguments, but only one is the labeled ground truth (the one tagged as a direct reply). A model that retrieves a different, also-valid counter-argument is being scored as wrong. Retrieval metrics on this dataset should be read as a lower bound on true model capability.
The training and evaluation data contains non-substantive exchanges. Because the source dataset is a naturally occurring, unfiltered debate platform, some "Disputed" replies are personal remarks, platform notices, or off-topic dismissals rather than genuine counter-arguments. These cases are unanswerable by any text-only retrieval model and contribute to the error rate.
These models are intended for research on computational argumentation and counter-argument retrieval. They are not intended to be used as a standalone fact-checking, content moderation, or debate-arbitration tool, and outputs should not be treated as an authoritative judgment of which side of an argument is "correct."
License
These models are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, matching the license of the underlying training data.