Report for textattack/bert-base-uncased-SST-2
Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 5 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
You can find the full version of the scan report here.
👉Performance issues (4)
For records in the dataset where text contains "movie", the Precision is 8.81% lower than the global Precision.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text contains "movie" | Precision = 0.837 | -8.81% than global |
Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | label | Predicted label |
|---|---|---|---|
| 69 | this one is definitely one to skip , even for horror movie fanatics . | LABEL_0 | LABEL_1 (p = 0.95) |
| 172 | it seems like i have been waiting my whole life for this movie and now i ca n't wait for the sequel . | LABEL_1 | LABEL_0 (p = 0.72) |
| 509 | a movie that successfully crushes a best selling novel into a timeframe that mandates that you avoid the godzilla sized soda . | LABEL_1 | LABEL_0 (p = 0.91) |
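
The slice comparison above can be reproduced by hand. The following is a minimal sketch, assuming the deviation is relative, i.e. `(slice - global) / global`, and that `LABEL_1` is the positive class. The helper names and the toy records are hypothetical and are not taken from the scan itself.

```python
def precision(y_true, y_pred, positive="LABEL_1"):
    """Precision = TP / (TP + FP) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else float("nan")


def relative_deviation(slice_metric, global_metric):
    """Assumed definition of the 'Deviation' column: relative change vs. global."""
    return (slice_metric - global_metric) / global_metric


def slice_report(records, predicate, positive="LABEL_1"):
    """Compare precision on the records matching `predicate` with global precision.

    `records` is a list of dicts with the keys "text", "label" and "pred".
    """
    sliced = [r for r in records if predicate(r["text"])]
    global_p = precision([r["label"] for r in records],
                         [r["pred"] for r in records], positive)
    slice_p = precision([r["label"] for r in sliced],
                        [r["pred"] for r in sliced], positive)
    return {
        "slice_size": len(sliced),
        "global": global_p,
        "slice": slice_p,
        "deviation": relative_deviation(slice_p, global_p),
    }


# Toy data for illustration only (not rows from sst2).
records = [
    {"text": "a great movie", "label": "LABEL_1", "pred": "LABEL_1"},
    {"text": "skip this movie", "label": "LABEL_0", "pred": "LABEL_1"},
    {"text": "wonderful acting", "label": "LABEL_1", "pred": "LABEL_1"},
    {"text": "dull and lifeless", "label": "LABEL_0", "pred": "LABEL_0"},
    {"text": "a charming film", "label": "LABEL_1", "pred": "LABEL_1"},
]

report = slice_report(records, lambda text: "movie" in text)
print(report)
```

To check the finding on the real data, replace `records` with the validation split and the model's predictions, and keep the `"movie" in text` predicate.
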
For records in the dataset where text_length(text) < 82.500 AND text_length(text) >= 73.500, the Recall is 6.97% lower than the global Recall.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 82.500 AND text_length(text) >= 73.500 | Recall = 0.870 | -6.97% than global |
Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 93 | if steven soderbergh 's ` solaris ' is a failure it is a glorious failure . | 76 | LABEL_1 | LABEL_0 (p = 0.59) |
| 142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | LABEL_1 | LABEL_0 (p = 0.98) |
| 411 | i do n't mind having my heartstrings pulled , but do n't treat me like a fool . | 80 | LABEL_0 | LABEL_1 (p = 0.95) |
For records in the dataset where text_length(text) >= 165.500 AND text_length(text) < 183.500, the Recall is 6.73% lower than the global Recall.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) >= 165.500 AND text_length(text) < 183.500 | Recall = 0.872 | -6.73% than global |
Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 266 | a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors . | 179 | LABEL_1 | LABEL_0 (p = 0.85) |
| 282 | while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer | 166 | LABEL_1 | LABEL_0 (p = 1.00) |
| 292 | the story and the friendship proceeds in such a way that you 're watching a soap opera rather than a chronicle of the ups and downs that accompany lifelong friendships . | 170 | LABEL_0 | LABEL_1 (p = 0.88) |
For records in the dataset where text_length(text) < 98.500 AND text_length(text) >= 86.500, the Precision is 6.21% lower than the global Precision.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | text_length(text) < 98.500 AND text_length(text) >= 86.500 | Precision = 0.861 | -6.21% than global |
Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | LABEL_0 | LABEL_1 (p = 0.98) |
| 230 | reign of fire looks as if it was made without much thought -- and is best watched that way . | 93 | LABEL_1 | LABEL_0 (p = 1.00) |
| 519 | moretti 's compelling anatomy of grief and the difficult process of adapting to loss . | 87 | LABEL_0 | LABEL_1 (p = 1.00) |
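
The report gives each slice metric and its deviation but not the global values. As a rough sanity check, they can be backed out from the four findings above. This sketch assumes the deviation is relative, `(slice - global) / global`; the function name is hypothetical. Under that assumption the two Precision findings and the two Recall findings each imply nearly the same global value (roughly 0.918 and 0.935), which is consistent with a relative definition. The inputs are rounded in the report, so treat the results as approximate.

```python
# (metric, slice value, reported relative deviation), copied from the tables above.
findings = [
    ("Precision", 0.837, -0.0881),
    ("Recall", 0.870, -0.0697),
    ("Recall", 0.872, -0.0673),
    ("Precision", 0.861, -0.0621),
]


def implied_global(slice_value, deviation):
    """Invert deviation = (slice - global) / global to recover the global metric."""
    return slice_value / (1.0 + deviation)


implied = [(name, implied_global(value, dev)) for name, value, dev in findings]

for name, value in implied:
    print(f"{name}: implied global ~ {value:.3f}")
```
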
👉Robustness issues (1)
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Metric | Transformation | Deviation |
|---|---|---|---|
| major 🔴 | Fail rate = 0.125 | Add typos | 100/800 tested samples (12.5%) changed prediction after perturbation |
Taxonomy
avid-effect:performance:P0201
🔍✨Examples
| | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 16 | the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . | the ekotions are raw andw ill strike a nerve with anyone wgo 's ever had family trauma . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.89) |
| 22 | holden caulfield did it better . | holdsn caulfkeld did t better . | LABEL_1 (p = 0.99) | LABEL_0 (p = 0.98) |
| 36 | the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough . | he weight of the piec e hte unerring professionalism of the chilly production , and the fascination embeded in the lurid topic prove rrcommendatioh enough . | LABEL_1 (p = 1.00) | LABEL_0 (p = 0.98) |
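
The fail rate in the table is the number of changed predictions divided by the number of tested samples (100/800 = 0.125). The sketch below shows that calculation on toy data. `swap_inner_chars` is a deterministic stand-in for the "Add typos" transformation, and `toy_predict` is a hypothetical keyword classifier; neither is the scan's implementation. The sketch also assumes that samples left unchanged by the perturbation are not counted as tested.

```python
def swap_inner_chars(text):
    """Toy typo: swap the 2nd and 3rd characters of alphabetic words of length >= 4."""
    words = []
    for word in text.split(" "):
        if len(word) >= 4 and word.isalpha():
            word = word[0] + word[2] + word[1] + word[3:]
        words.append(word)
    return " ".join(words)


def toy_predict(text):
    """Hypothetical stand-in for the model: positive if a keyword is present."""
    return "LABEL_1" if ("great" in text or "fun" in text) else "LABEL_0"


def fail_rate(predict, texts, perturb):
    """Return (changed, tested): prediction flips among texts the perturbation altered."""
    changed = tested = 0
    for text in texts:
        perturbed = perturb(text)
        if perturbed == text:
            continue  # perturbation had no effect, so the sample is not tested
        tested += 1
        if predict(text) != predict(perturbed):
            changed += 1
    return changed, tested


texts = ["a great film", "fun for all", "dull plot twists", "bad"]
changed, tested = fail_rate(toy_predict, texts, swap_inner_chars)
print(f"{changed}/{tested} tested samples changed prediction")
```

To check the finding on the real model, pass its prediction function as `predict` and a random typo generator as `perturb`.
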
We've generated test suites according to your scan results! Check out the Test Suite in our Giskard Space and the Giskard Documentation to learn more about how to test your model.
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.