Report for textattack/bert-base-uncased-SST-2

by giskard-bot - opened Mar 25, 2024

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 5 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

You can find the full version of the scan report here.

👉Performance issues (4)

For records in the dataset where text contains "movie", the Precision is 8.81% lower than the global Precision.

Level	Data slice	Metric	Deviation
medium 🟡	`text` contains "movie"	Precision = 0.837	-8.81% than global
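For readers who want to sanity-check a row like this one, here is a minimal sketch of how a slice metric and its deviation could be computed. It assumes that the deviation is relative, i.e. (slice - global) / global, and that `LABEL_1` is treated as the positive class; neither is stated in the report. The rows are made up for illustration and are not taken from sst2.

```python
def precision(labels, preds, positive="LABEL_1"):
    """Precision = TP / (TP + FP), treating `positive` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if p == positive and y == positive)
    fp = sum(1 for y, p in zip(labels, preds) if p == positive and y != positive)
    return tp / (tp + fp) if tp + fp else float("nan")


def relative_deviation(slice_metric, global_metric):
    """Signed relative change of the slice metric versus the global metric."""
    return (slice_metric - global_metric) / global_metric


# Made-up rows: (text, true label, predicted label). Not real sst2 data.
rows = [
    ("a great movie", "LABEL_1", "LABEL_1"),
    ("a dull movie", "LABEL_0", "LABEL_1"),
    ("wonderful acting", "LABEL_1", "LABEL_1"),
    ("charming and funny", "LABEL_1", "LABEL_1"),
    ("a waste of time", "LABEL_0", "LABEL_0"),
]


def precision_on(subset):
    """Precision over a list of (text, label, prediction) rows."""
    return precision([y for _, y, _ in subset], [p for _, _, p in subset])


global_precision = precision_on(rows)
movie_precision = precision_on([r for r in rows if "movie" in r[0]])
deviation = relative_deviation(movie_precision, global_precision)
print(f"slice={movie_precision:.3f} global={global_precision:.3f} deviation={deviation:+.2%}")

# Assumption: if the report's deviation is relative, the reported slice
# precision (0.837) and deviation (-8.81%) imply this global precision.
implied_global = 0.837 / (1 - 0.0881)
print(f"implied global precision: {implied_global:.3f}")
```

Under the relative-deviation assumption, the numbers in the row above would correspond to a global Precision of roughly 0.918.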

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	label	Predicted `label`
69	this one is definitely one to skip , even for horror movie fanatics .	LABEL_0	LABEL_1 (p = 0.95)
172	it seems like i have been waiting my whole life for this movie and now i ca n't wait for the sequel .	LABEL_1	LABEL_0 (p = 0.72)
509	a movie that successfully crushes a best selling novel into a timeframe that mandates that you avoid the godzilla sized soda .	LABEL_1	LABEL_0 (p = 0.91)

For records in the dataset where text_length(text) < 82.500 AND text_length(text) >= 73.500, the Recall is 6.97% lower than the global Recall.

Level	Data slice	Metric	Deviation
medium 🟡	`text_length(text)` < 82.500 AND `text_length(text)` >= 73.500	Recall = 0.870	-6.97% than global
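The length-based slices can be reproduced in a similar way. The sketch below assumes that `text_length(text)` is the plain character count of the string (`len(text)`) and that `LABEL_1` is the positive class; both are assumptions, and the rows are synthetic placeholders rather than sst2 records.

```python
def recall(labels, preds, positive="LABEL_1"):
    """Recall = TP / (TP + FN), treating `positive` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    return tp / (tp + fn) if tp + fn else float("nan")


def in_length_slice(text, low, high):
    """True when low <= len(text) < high (assumed meaning of text_length)."""
    return low <= len(text) < high


# Synthetic rows: (text, true label, predicted label). Only the length matters here.
rows = [
    ("a" * 76, "LABEL_1", "LABEL_0"),
    ("b" * 80, "LABEL_1", "LABEL_1"),
    ("c" * 82, "LABEL_0", "LABEL_1"),
    ("d" * 40, "LABEL_1", "LABEL_1"),
    ("e" * 83, "LABEL_1", "LABEL_1"),
]

# Slice bounds copied from the table row above.
sliced = [r for r in rows if in_length_slice(r[0], 73.5, 82.5)]

slice_recall = recall([y for _, y, _ in sliced], [p for _, _, p in sliced])
global_recall = recall([y for _, y, _ in rows], [p for _, _, p in rows])
print(f"{len(sliced)} rows in slice, recall={slice_recall:.3f}, global={global_recall:.3f}")
```

The half-integer bounds (73.500, 82.500) behave like integer cut points: with this definition the slice covers lengths 74 through 82 inclusive.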

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
93	if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .	76	LABEL_1	LABEL_0 (p = 0.59)
142	what better message than ` love thyself ' could young women of any size receive ?	82	LABEL_1	LABEL_0 (p = 0.98)
411	i do n't mind having my heartstrings pulled , but do n't treat me like a fool .	80	LABEL_0	LABEL_1 (p = 0.95)

For records in the dataset where text_length(text) >= 165.500 AND text_length(text) < 183.500, the Recall is 6.73% lower than the global Recall.

Level	Data slice	Metric	Deviation
medium 🟡	`text_length(text)` >= 165.500 AND `text_length(text)` < 183.500	Recall = 0.872	-6.73% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
266	a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors .	179	LABEL_1	LABEL_0 (p = 0.85)
282	while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer	166	LABEL_1	LABEL_0 (p = 1.00)
292	the story and the friendship proceeds in such a way that you 're watching a soap opera rather than a chronicle of the ups and downs that accompany lifelong friendships .	170	LABEL_0	LABEL_1 (p = 0.88)

For records in the dataset where text_length(text) < 98.500 AND text_length(text) >= 86.500, the Precision is 6.21% lower than the global Precision.

Level	Data slice	Metric	Deviation
medium 🟡	`text_length(text)` < 98.500 AND `text_length(text)` >= 86.500	Precision = 0.861	-6.21% than global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
115	sam mendes has become valedictorian at the school for soft landings and easy ways out .	88	LABEL_0	LABEL_1 (p = 0.98)
230	reign of fire looks as if it was made without much thought -- and is best watched that way .	93	LABEL_1	LABEL_0 (p = 1.00)
519	moretti 's compelling anatomy of grief and the difficult process of adapting to loss .	87	LABEL_0	LABEL_1 (p = 1.00)

👉Robustness issues (1)

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.

Level	Metric	Transformation	Deviation
major 🔴	Fail rate = 0.125	Add typos	100/800 tested samples (12.5%) changed prediction after perturbation
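To illustrate what a fail rate of this kind measures, here is a rough sketch of a perturbation check. The `add_typos` function below is a simplified stand-in (random swaps, drops, and replacements), not the transformation the scan actually uses, and `toy_predict` is a hypothetical keyword classifier standing in for the real model.

```python
import random


def add_typos(text, rate=0.05, seed=0):
    """Simplified typo injector: swap, drop, or replace letters at random."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("swap", "drop", "replace"))
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # transpose with the next character
                i += 2
                continue
            if op == "drop":
                i += 1  # skip this character entirely
                continue
            # "replace", or a "swap" on the last character, falls through here
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            out.append(c)
        i += 1
    return "".join(out)


def fail_rate(predict, texts, perturb):
    """Return (changed, tested): predictions flipped among texts the perturbation altered."""
    changed = tested = 0
    for text in texts:
        perturbed = perturb(text)
        if perturbed == text:
            continue  # nothing was perturbed, so the sample is not counted
        tested += 1
        if predict(perturbed) != predict(text):
            changed += 1
    return changed, tested


def toy_predict(text):
    """Hypothetical stand-in for the model: positive only if 'good' appears."""
    return "LABEL_1" if "good" in text else "LABEL_0"


texts = ["a good film", "a bad film", "good fun"]
changed, tested = fail_rate(toy_predict, texts, lambda t: add_typos(t, rate=0.3, seed=1))
print(f"{changed}/{tested} predictions changed")

# The figure reported above is the same ratio: 100 changed out of 800 tested.
reported_fail_rate = 100 / 800
```

Passing any prediction function that maps a string to a label in place of `toy_predict` would give a comparable, though not identical, robustness estimate.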

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

	text	Add typos(text)	Original prediction	Prediction after perturbation
16	the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .	the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma .	LABEL_1 (p = 1.00)	LABEL_0 (p = 0.89)
22	holden caulfield did it better .	holdsn caulfkeld did t better .	LABEL_1 (p = 0.99)	LABEL_0 (p = 0.98)
36	the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .	he weight of the piec e hte unerring professionalism of the chilly production , and the fascination embeded in the lurid topic prove rrcommendatioh enough .	LABEL_1 (p = 1.00)	LABEL_0 (p = 0.98)

We've generated test suites according to your scan results! Check out the Test Suite in our Giskard Space and the Giskard Documentation to learn more about how to test your model.

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
