WIBA-Extract-V1 / README.md

armaniii

Model card v3: step-by-step gated-access walkthrough, separate GPU/CPU quickstarts with hardware requirements, batch processing with tqdm progress bar

5ed7e6b verified 22 days ago

	---
	library_name: transformers
	base_model: meta-llama/Meta-Llama-3-8B
	license: llama3
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- argument-mining
	- topic-extraction
	- claim-extraction
	- computational-social-science
	- llama
	- 4-bit
	- bitsandbytes
	- wiba
	---

	# WIBA Claim Topic Extraction (Llama-3-8B, pre-quantized 4-bit)

	Topic extraction model: given an argumentative sentence or passage, it generates the topic being argued (a short phrase naming the person, place, thing, entity, or idea at issue), or `No Topic` if the text is not an argument. The topic may be explicit in the text or implicit and inferred from context.

	This is Stage 2 of the [WIBA (What Is Being Argued?)](https://arxiv.org/abs/2405.00828) argument mining pipeline:

	| Stage | Task | Model | Type |
	|---|---|---|---|
	| 1. Detect | Is this text an argument? | [armaniii/llama-3-8b-argument-detection](https://huggingface.co/armaniii/llama-3-8b-argument-detection) | LoRA adapter (sequence classification, 2 labels) |
	| 2. Extract | What topic is being argued? | this repo | Fine-tuned causal LM (pre-quantized 4-bit) |
	| 3. Stance | What position does it take on the topic? | [armaniii/llama-stance-classification](https://huggingface.co/armaniii/llama-stance-classification) | LoRA adapter (sequence classification, 3 labels) |

	- 📄 Paper: [WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining](https://arxiv.org/abs/2405.00828)
	- 💻 Code: [github.com/Armaniii/WIBA](https://github.com/Armaniii/WIBA)
	- 🌐 Platform: [wiba.dev](https://wiba.dev)

	## What this repo contains (full model, stored 4-bit quantized)

	This repo is a complete, self-contained fine-tuned model — no base download, no adapter. But unlike a normal fp16 checkpoint, the weights are stored pre-quantized with bitsandbytes NF4 (the format the WIBA platform serves in production):

	| File | Purpose |
	|---|---|
	| `model-0000*-of-00002.safetensors` + index | ~6 GB total. Linear-layer weights as packed 4-bit (uint8) with `absmax`/`quant_map` quantization metadata; embeddings and `lm_head` in float16 |
	| `config.json` | Model config including the `quantization_config` (bnb NF4, blocksize 64, compute dtype fp16) that tells transformers how to load the 4-bit weights |
	| `generation_config.json` | Default generation settings |
	| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | Llama-3 tokenizer |

	Practical consequences:

	- `bitsandbytes` is a hard requirement — the checkpoint cannot be loaded without it.
	- Do not try to remove/override `quantization_config` to get fp16: the stored weights themselves are 4-bit packed, so there is no full-precision copy in this repo. To obtain higher-precision weights, load 4-bit first and call `model.dequantize()` (see below).
	- The weights take only ~6 GB of VRAM, so the model fits on small GPUs (the hardware table below recommends ≥8 GB free).
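
The ~6 GB figure and the packed shape reported in the load-error note can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: the layer dimensions are assumed from the public Llama-3-8B base configuration (they are not stated in this README), one fp32 `absmax` value per 64-weight block is assumed, and the helper names are illustrative.

```python
# Assumed Llama-3-8B dimensions (public base-model config, not from this README).
VOCAB, HIDDEN, LAYERS, INTERMEDIATE, KV_DIM = 128_256, 4_096, 32, 14_336, 1_024
BLOCKSIZE = 64  # block size given in this README's quantization_config description


def packed_numel(rows: int, cols: int) -> int:
    """uint8 elements needed to store rows*cols 4-bit values (two per byte)."""
    return rows * cols // 2


def linear_params_per_layer() -> int:
    """Weights in the quantized linear layers of one decoder block."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q/o plus k/v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
    return attention + mlp


def estimate_checkpoint_gb() -> float:
    """Rough on-disk size in GB (1e9 bytes) under the assumptions above."""
    linear = LAYERS * linear_params_per_layer()
    packed_bytes = linear // 2                  # 4 bits per weight
    absmax_bytes = (linear // BLOCKSIZE) * 4    # assumed fp32 scale per block
    fp16_bytes = 2 * VOCAB * HIDDEN * 2         # embeddings + lm_head, 2 bytes each
    return (packed_bytes + absmax_bytes + fp16_bytes) / 1e9


print(f"estimated size: {estimate_checkpoint_gb():.1f} GB")
print(f"packed 4096x4096 layer: {packed_numel(4096, 4096)} uint8 values")
```

Under these assumptions a 4096×4096 projection packs into 8,388,608 bytes, which is the first dimension that appears in the shape error quoted under "Tested configurations".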

	## Before you start

	No gated access needed — unlike the detect and stance stages, this repo is fully self-contained (no Meta base model to download), so there is no license gate, no account, and no token required. The first run downloads ~6 GB with progress bars, cached afterward in `~/.cache/huggingface`.

	## Hardware requirements — pick your setup

	| Setup | What you need | Speed |
	|---|---|---|
	| GPU (recommended) | NVIDIA GPU with ≥8 GB free VRAM, `pip install bitsandbytes` | fast; this is the wiba.dev production configuration |
	| CPU only | ~25 GB free RAM, no GPU; loads 4-bit then dequantizes (see below) | ~1–2 min per text on 16 cores |

	⚠️ Do not run `generate()` directly on the 4-bit model on a CPU: bitsandbytes' CPU 4-bit kernels are single-threaded and a single sentence takes over an hour (measured). Use the dequantize recipe below instead.
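
The choice between the two setups can be written as a small decision helper. This is a sketch that only encodes the thresholds from the table above; `choose_setup` and its return labels are hypothetical names, not part of the WIBA code.

```python
def choose_setup(has_cuda: bool, free_vram_gb: float, free_ram_gb: float) -> str:
    """Pick a loading strategy from the hardware table's thresholds.

    Returns one of three illustrative labels:
      "gpu-4bit"        load the 4-bit checkpoint on the GPU as-is
      "cpu-dequantize"  load 4-bit on CPU, then dequantize to bfloat16
      "insufficient"    neither requirement from the table is met
    """
    if has_cuda and free_vram_gb >= 8:
        return "gpu-4bit"
    if free_ram_gb >= 25:
        return "cpu-dequantize"
    return "insufficient"


print(choose_setup(has_cuda=True, free_vram_gb=12, free_ram_gb=16))
print(choose_setup(has_cuda=False, free_vram_gb=0, free_ram_gb=32))
```

In practice the inputs could come from `torch.cuda.is_available()` and `torch.cuda.mem_get_info()`; how free RAM is measured is left to the caller.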

	## Quickstart — GPU

	```bash
	pip install torch transformers accelerate bitsandbytes
	```

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	REPO = "armaniii/llama-3-8b-claim-topic-extraction"

	tokenizer = AutoTokenizer.from_pretrained(REPO)
	tokenizer.pad_token_id = tokenizer.eos_token_id
	tokenizer.padding_side = "left"

	# quantization_config ships in config.json — transformers loads the 4-bit
	# weights automatically (~6 GB VRAM)
	model = AutoModelForCausalLM.from_pretrained(REPO, device_map="auto", low_cpu_mem_usage=True)
	model.eval()
	```

	## Quickstart — CPU (no GPU)

	`bitsandbytes` is still required (the checkpoint is stored 4-bit), but after loading, dequantize to bfloat16 so generation runs on all CPU cores (verified: ~25 GB RAM peak, then ~1–2 min per text on 16 cores):

	```python
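	# Reuses the imports, REPO, and tokenizer from the GPU quickstart above.
	model = AutoModelForCausalLM.from_pretrained(REPO, device_map="cpu", low_cpu_mem_usage=True)
	# Convert the 4-bit weights to bfloat16 so generation can use all CPU cores.
	model = model.dequantize().to(torch.bfloat16)
	model.eval()
	torch.set_num_threads(16)  # match your core count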
	```

	### Prompt format (must match training)

	The model expects the Llama-3 chat header format with the WIBA topic-extraction system prompt. Generation should be cut off after a few tokens, because topics are short:

	```python
	SYSTEM_PROMPT = """You are a helpful assistant that is specialized in a single task. If the sentence provided is an argument, decide what the topic being argued is using the rules and steps below.
	Rules:
	1. An argument is a sentence that must contain a claim AND AT LEAST ONE premise(i.e evidence) supporting that assertion or claim.
	2. A claim is the position being taken in the argument.
	3. A premise is a statement that provides evidence to support the claim.
	4. In order for a sentence to be an argument it must contain a claim AND at least one premise.
	5. If the sentence does not contain a claim AND does not provide any premises to support the claim, then it is a non-argument.
	6. If the sentence provided is an argument, then there must be a single topic being argued that is regarding a person, place, thing, entity, or abstract idea. The topic being argued may be explicitly stated OR it may be implicit and must be inferred from the context of the argument.
	7. If the sentence provided is a non-argument, then there is no topic being argued.

	Steps:
	1. Decide if the sentence provided is an argument or non-argument using the Rules provided.
	2. If the sentence is an argument, output only the topic being argued and your task is finished.
	3. If the sentence is a non-argument, only output: No Topic and your task is finished.
	4. If the sentence provided is a non-argument, then there is no topic being argued and you should only output: No Topic
	5. Let us think through the problem step by step carefully following all the rules outlined."""

	def extract_topic(text: str) -> str:
	    # The Llama-3 chat headers must match the training format exactly.
	    prompt = (
	        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
	        + SYSTEM_PROMPT
	        + "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
	        + text
	        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
	    )
	    enc = tokenizer(prompt, return_tensors="pt").to(model.device)
	    with torch.no_grad():
	        out = model.generate(**enc, max_new_tokens=8, pad_token_id=128009)
	    # Decode only the newly generated tokens, not the prompt.
	    return tokenizer.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True).strip()

	print(extract_topic("We must act on climate change because temperatures are rising."))
	# -> climate change
	print(extract_topic("The weather is nice today."))
	# -> No Topic
	print(extract_topic("Abortion should remain legal because bodily autonomy is a fundamental right."))
	# -> abortion
	```

	(Outputs above are actual verified predictions, not illustrations.)

	The original implementation uses the equivalent `pipeline("text-generation", ..., max_new_tokens=8, pad_token_id=128009)` and takes the text after the final `assistant<|end_header_id|>\n\n` marker; the function above does the same thing with `generate`.
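
The marker-based variant described above can be sketched without loading the model. The code below only shows the string handling: building the prompt from the header pieces and keeping what follows the final assistant marker. The helper names are illustrative, and it assumes the pipeline returns the prompt together with the generated text.

```python
SYSTEM_HEADER = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
USER_HEADER = "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT_HEADER = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
ASSISTANT_MARKER = "assistant<|end_header_id|>\n\n"


def build_prompt(system_prompt: str, text: str) -> str:
    """Assemble the Llama-3 chat-header prompt used for topic extraction."""
    return SYSTEM_HEADER + system_prompt + USER_HEADER + text + ASSISTANT_HEADER


def topic_from_full_text(full_text: str) -> str:
    """Keep only the text after the final assistant marker."""
    tail = full_text.rsplit(ASSISTANT_MARKER, 1)[-1]
    # Drop a trailing end-of-turn token if the decoder kept special tokens.
    return tail.replace("<|eot_id|>", "").strip()


prompt = build_prompt("SYSTEM PROMPT GOES HERE", "We must act on climate change.")
# Stand-in for pipeline output: the prompt followed by a generated topic.
fake_output = prompt + "climate change<|eot_id|>"
print(topic_from_full_text(fake_output))
```

Splitting on the last marker rather than the first keeps the parse correct even if the user text happens to contain the marker string.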

	### Output

	- An argumentative input → a short topic phrase (e.g. `climate change`, `abortion`, as in the verified outputs above)
	- A non-argument input → the literal string `No Topic`
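
Downstream code usually wants to treat `No Topic` as "nothing extracted" rather than as a topic string. A minimal sketch, assuming a case-insensitive comparison and treating an empty generation as no topic (both are choices made here, not behavior documented for the WIBA code):

```python
from typing import Optional

NO_TOPIC = "No Topic"  # literal string the model emits for non-arguments


def normalize_topic(raw: str) -> Optional[str]:
    """Map the model's raw output to a topic string, or None for non-arguments."""
    cleaned = raw.strip()
    if not cleaned or cleaned.casefold() == NO_TOPIC.casefold():
        return None
    return cleaned


print(normalize_topic("climate change"))
print(normalize_topic("No Topic"))
```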

	## Batch processing many texts (with a progress bar)

	Model downloads show progress bars automatically; generation doesn't, so wrap your loop in `tqdm` (installed with transformers) exactly as the original WIBA serving code does:

	```python
	from tqdm import tqdm

	texts = ["...", "..."] # your data
	topics = [extract_topic(t) for t in tqdm(texts)]
	```
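
For true batched generation (several prompts per `generate` call), the quickstart's tokenizer settings matter: prompts are padded on the left so that every sequence ends at the same position and new tokens continue from real text. The sketch below illustrates the chunking and the left-padding on plain token-id lists. It uses a batch size of 2, the value this README reports for the WIBA implementation, and pad id 128009 from the `generate` call above; the helper names are illustrative.

```python
from typing import Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], batch_size: int = 2) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def left_pad(seqs: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id lists on the left; return (input_ids, attention_mask)."""
    width = max(len(s) for s in seqs)
    ids = [[pad_id] * (width - len(s)) + s for s in seqs]
    mask = [[0] * (width - len(s)) + [1] * len(s) for s in seqs]
    return ids, mask


for batch in batched(["text one", "text two", "text three"]):
    print(list(batch))

ids, mask = left_pad([[11, 12, 13], [21]], pad_id=128009)
print(ids)
print(mask)
```

With the real tokenizer, `tokenizer(list_of_prompts, padding=True, return_tensors="pt")` performs the padding step once `padding_side = "left"` is set, as in the quickstart.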

	## Getting full-precision weights

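	The repo stores no fp16 copy, but you can dequantize after loading. This needs enough memory for the fp16 model (~16 GB) and is the same call the CPU quickstart uses. Dequantizing converts the stored 4-bit values to floating point; it does not undo the rounding applied when the weights were quantized: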

	```python
	model = AutoModelForCausalLM.from_pretrained(REPO, device_map="auto")
	model = model.dequantize() # bnb 4-bit -> floating point
	```
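
To see what dequantization can and cannot recover, here is a toy blockwise absmax quantizer. It is a simplified illustration: it uses 16 evenly spaced levels, whereas NF4 uses a non-uniform code book, and it is not the bitsandbytes implementation. The point it demonstrates carries over: the dequantized values are floating point again, but they sit on the quantization grid rather than at the original values.

```python
from typing import List, Tuple

LEVELS = 16  # 4 bits -> 16 codes


def quantize_block(values: List[float]) -> Tuple[List[int], float]:
    """Scale a block by its absolute maximum and round to one of 16 uniform codes."""
    absmax = max(abs(v) for v in values) or 1.0
    codes = [round((v / absmax + 1) / 2 * (LEVELS - 1)) for v in values]
    return codes, absmax


def dequantize_block(codes: List[int], absmax: float) -> List[float]:
    """Map codes back to floats in [-absmax, absmax]."""
    return [(c / (LEVELS - 1) * 2 - 1) * absmax for c in codes]


original = [0.10, -0.52, 0.33, 0.80]
codes, absmax = quantize_block(original)
restored = dequantize_block(codes, absmax)
for before, after in zip(original, restored):
    print(f"{before:+.4f} -> {after:+.4f}")
```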

	## Tested configurations

	| Stack | Versions | Status |
	|---|---|---|
	| Modern (2026) | torch 2.5.1, transformers 5.12.0, accelerate 1.14.0, bitsandbytes 0.49.2 | ✅ verified (4-bit load, generation, and `dequantize()` path) |

	Notes:
	- Without `bitsandbytes` installed, `from_pretrained` raises immediately (the checkpoint is pre-quantized).
	- Attempting to load with the `quantization_config` removed fails with shape errors (`ckpt torch.Size([8388608, 1]) vs model torch.Size([4096, 4096])`) — the stored weights really are 4-bit packed.
	- CPU-only machines: the 4-bit load works (~4 GB RAM, bitsandbytes ships a CPU backend) but 4-bit inference on CPU is single-threaded and impractically slow. For CPU inference, load 4-bit, then `model.dequantize()` and cast to `torch.bfloat16`. For real use, a CUDA GPU (~6 GB VRAM) is the practical choice.
	- `use_fast=False` (which the original 2024 serving code passed) is silently ignored on transformers 5.x — slow tokenizers were removed; the default fast tokenizer is correct.
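
Because a missing `bitsandbytes` install only surfaces when `from_pretrained` is called, a preflight check can give a clearer message up front. A minimal sketch using only the standard library; the package list mirrors the quickstart's `pip install` line and `missing_packages` is an illustrative name.

```python
import importlib.util
from typing import Iterable, List

REQUIRED = ("torch", "transformers", "accelerate", "bitsandbytes")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install before loading the model: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

This only confirms that the packages can be found; it does not check versions or whether a CUDA build is installed.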

	## How it's used in the WIBA implementation

	In the WIBA serving code, this model backs the `/api/extract` endpoint at [wiba.dev](https://wiba.dev). Texts that Stage 1 classified as `Argument` are passed here to name the topic; the (text, topic) pair is then passed to Stage 3 ([stance classification](https://huggingface.co/armaniii/llama-stance-classification)) to determine whether the argument is in favor of or against that topic. For batch processing the implementation streams prompts through the pipeline with `batch_size=2` and left-padding.
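
The three-stage flow described above can be sketched as plain function composition. The stage functions are passed in as callables, so the sketch makes no claim about how the real endpoints are implemented. The stub stages and the `favor` label in the usage lines are placeholders, and skipping Stage 3 when Stage 2 returns `No Topic` is an assumption made here.

```python
from typing import Callable, Dict, Optional


def run_wiba(
    text: str,
    detect: Callable[[str], str],
    extract: Callable[[str], str],
    stance: Callable[[str, str], str],
) -> Dict[str, Optional[str]]:
    """Run detect -> extract -> stance, stopping early for non-arguments."""
    result: Dict[str, Optional[str]] = {"text": text, "topic": None, "stance": None}
    if detect(text) != "Argument":   # Stage 1 gate
        return result
    topic = extract(text)            # Stage 2 (this model)
    if topic == "No Topic":          # assumed: no stance without a topic
        return result
    result["topic"] = topic
    result["stance"] = stance(text, topic)  # Stage 3 on the (text, topic) pair
    return result


# Placeholder stages for illustration only; real stages would call the models.
demo = run_wiba(
    "We must act on climate change because temperatures are rising.",
    detect=lambda t: "Argument" if "because" in t else "Other",
    extract=lambda t: "climate change",
    stance=lambda t, topic: "favor",
)
print(demo)
```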

	## Citation

	```bibtex
	@article{irani2024wiba,
	title={WIBA: What Is Being Argued? A Comprehensive Approach to Argument Mining},
	author={Irani, Arman and Park, Ju Yeon and Esterling, Kevin and Faloutsos, Michalis},
	journal={arXiv preprint arXiv:2405.00828},
	year={2024}
	}
	```

	## Notes

	- Fine-tuned from `meta-llama/Meta-Llama-3-8B` (Llama 3 license applies). The weights here are already fine-tuned; the base model is not required.
	- Internal fine-tune lineage: `llama_cte_v3`.