protecttors committed on
Commit ea648ad · verified · 1 Parent(s): 90b22c5

Model card


Model card based on the demo files that are part of the repository.

Files changed (1)
  1. README.md +294 -0
README.md ADDED
@@ -0,0 +1,294 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: gguf
6
+ tags:
7
+ - conversational
8
+ - aibom
9
+ - sample
10
+ - qwen2
11
+ - gguf
12
+ pipeline_tag: text-generation
13
+ base_model: Qwen/Qwen2-0.5B
14
+ ---
15
+
16
+ # Model Card for protecttors/sample-files
17
+
18
+ `protecttors/sample-files` is a **security vulnerability demonstration repository** containing model artifacts across multiple serialization formats (GGUF, PyTorch `.bin`, and Python pickle `.pkl`). It is NOT a production inference model. It exists to enable AI/ML supply chain security tooling, AIBOM (AI Bill of Materials) generation, unsafe file detection, and red teaming of model ingestion pipelines.
19
+
20
+ > ⚠️ **Security Notice:** Four files in this repository are intentionally unsafe by design. Do not load them in production environments; handle them only inside an isolated sandbox.
21
+
22
+ ---
23
+
24
+ ## Model Details
25
+
26
+ ### Model Description
27
+
28
+ This repository serves as a controlled artifact fixture for security practitioners, AIBOM tooling, and professionals working on AI supply chain integrity. It bundles three classes of model file formats (GGUF quantized LLM weights, PyTorch binary weights, and pickle-serialized ML objects) to provide ground-truth positive samples for scanners, AIBOM generators, and VEX (Vulnerability Exploitability eXchange) authoring tools.
29
+
30
+ The `.pkl` and `.bin` files may contain synthetic or deliberately modified artifacts constructed for security research purposes and do not represent validated trained weights.
31
+
32
+ - **Developed by:** [Protecttors](https://huggingface.co/protecttors)
33
+ - **Model type:** GGUF, Python pickle (`.pkl`), PyTorch binary (`.bin`)
34
+ - **Language(s) (NLP):** English
35
+ - **License:** Apache 2.0
36
+ - **Finetuned from model [optional]:** [Qwen/Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B)
37
+
38
+ ### Model Sources [optional]
39
+
40
+ - **Repository:** https://huggingface.co/protecttors/sample-files
41
+ - **Paper [optional]:** Not applicable
42
+ - **Demo [optional]:** Not applicable
43
+
44
+ ---
45
+
46
+ ## Uses
47
+
48
+ The artifacts in this repository are test fixtures for security practitioners, AIBOM tooling, and professionals working on AI supply chain integrity.
49
+
50
+ ### Direct Use
51
+
52
+ This repository is intended for use as a **test artifact collection** for:
53
+
54
+ - **AIBOM/SBOM developers and integrators:** validating that tools correctly enumerate model components, serialization formats, embedded metadata, and dependency graphs across GGUF, `.bin`, and `.pkl` formats.
55
+ - **Vulnerability scanner developers and testers:** verifying that scanners flag unsafe pickle deserialization payloads, embedded executable code, or malformed model headers.
56
+ - **Red teamers and penetration testers:** simulating adversarial model artifacts in controlled environments to test model registry ingestion pipelines, CI/CD gates, and serving infrastructure.
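A minimal sketch of the enumeration step an AIBOM tool might perform on a local clone: walk the directory, label each file by extension, and record a SHA-256 digest that a VEX document could later reference. The `FORMAT_BY_SUFFIX` mapping and the record field names are assumptions made for illustration, not part of any AIBOM standard or of this repository. Only file bytes are hashed; nothing is deserialized.

```python
import hashlib
from pathlib import Path

# Hypothetical extension-to-format mapping; extend it for other formats.
FORMAT_BY_SUFFIX = {
    ".gguf": "GGUF",
    ".bin": "PyTorch binary",
    ".pkl": "Python pickle",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are never read whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Return one record per regular file under root, sorted by path."""
    root = Path(root)
    records = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        records.append({
            "path": path.relative_to(root).as_posix(),
            "format": FORMAT_BY_SUFFIX.get(path.suffix.lower(), "unknown"),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return records
```

Calling `inventory("sample-files")` on a clone also walks the `.git` directory, so a real tool would likely filter it out; files with unrecognized extensions are reported as `unknown` rather than dropped.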
57
+
58
+ ### Downstream Use [optional]
59
+
60
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
61
+ Not recommended for any downstream application.
62
+
63
+ ### Out-of-Scope Use
64
+
65
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
66
+
67
+ - **Production inference:** These files are not quality-evaluated and must not be used for real-world text generation.
68
+ - **Fine-tuning or transfer learning:** No training provenance or dataset documentation is available for these artifacts.
69
+ - **Weaponization:** Adapting the intentionally unsafe artifacts in this repository to create novel malware or exploit code is strictly prohibited and outside the intended scope of this research.
70
+ - **Use by non-security practitioners without supervision:** Users unfamiliar with the risks of loading untrusted `.pkl` or `.bin` files should not interact with these artifacts directly.
71
+
72
+ ---
73
+
74
+ ## Bias, Risks, and Limitations
75
+
76
+
77
+ **Dual-use risk:** Publishing intentionally unsafe model artifacts carries inherent dual-use risk. The same samples that enable defenders to test scanners can, in principle, serve as reference material for adversaries. This is mitigated by ensuring the repository does not contain functional exploits, only detection-oriented samples.
78
+
79
+ **Pickle deserialization risk:** Python pickle (`.pkl`) files can embed arbitrary executable Python code. Loading these files outside an isolated environment could result in code execution on the host system.
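To illustrate how a pickle can be triaged without executing it, the sketch below uses the standard library's `pickletools.genops` to walk the opcode stream and report opcodes that import or call objects during unpickling. The opcode set in `SUSPICIOUS_OPCODES` is an assumption chosen for this example and is not how `picklescan` or `modelscan` decide; legitimate pickles of class instances also contain these opcodes, so a hit means "inspect further", not "malicious".

```python
import pickletools

# Opcodes that import a global or invoke a callable while unpickling.
# This set is an illustrative assumption, not a complete detection rule.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def suspicious_opcodes(data):
    """Return (opcode name, byte offset) pairs found in a pickle byte string.

    The bytes are only parsed, never unpickled, so no embedded code runs.
    """
    findings = []
    for opcode, _arg, position in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            findings.append((opcode.name, position))
    return findings


def scan_pickle_file(path):
    """Read a file as raw bytes and list its suspicious opcodes."""
    with open(path, "rb") as handle:
        return suspicious_opcodes(handle.read())
```

`pickletools.genops` raises `ValueError` on a truncated or malformed stream, which a caller may want to treat as a finding in its own right.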
80
+
81
+ **No quality guarantees on GGUF weights:** The quantized Qwen2 weights have not been evaluated for factual accuracy, coherence, or safety alignment. They inherit any biases present in the Qwen2-0.5B base model.
82
+
83
+ **Scanner false negative risk:** Not all security scanning tools may flag all 4 unsafe files in this repository. Absence of a scanner alert does not imply safety.
84
+
85
+ **Format coverage is intentionally narrow:** This repository covers three file formats (GGUF, PyTorch `.bin`, pickle). It does not represent the full range of model formats that ingestion pipelines encounter (e.g., ONNX, SafeTensors, TFLite, CoreML).
86
+
87
+ ### Recommendations
88
+
89
+
90
+ - Always load artifacts from this repository inside an **isolated, sandboxed environment** (container or VM with no network access, no credentials, no access to sensitive filesystem paths).
91
+ - Prefer **SafeTensors** over `.pkl` or `.bin` in production pipelines — SafeTensors does not support arbitrary code execution during deserialization.
92
+ - Run **pickle scanning** (e.g., `picklescan`, `modelscan`) on any `.pkl` artifact before loading.
93
+ - Validate GGUF file headers before inference to detect unexpected metadata or embedded payloads.
94
+ - Treat the 4 flagged files in this repository as **ground-truth positives** when calibrating detection thresholds.
95
+ - Users building AIBOM tooling should verify their tools enumerate all three format directories and correctly surface the 4 flagged files.
96
+
97
+ ---
98
+
99
+ ## How to Get Started with the Model
100
+
101
+ Use the following **only in an isolated sandbox environment**:
102
+
103
+ ```bash
104
+ # Clone the repository
105
+ git clone https://huggingface.co/protecttors/sample-files
106
+
107
+ # Inspect file structure
108
+ ls -lh sample-files/gguf_diffusion_model/
109
+ ls -lh sample-files/ml_pkl_file/
110
+ ls -lh sample-files/torch_bin_model/
111
+
112
+ # Run pickle scan on pkl artifacts (install: pip install modelscan)
113
+ modelscan -p sample-files/ml_pkl_file/
114
+
115
+ # Inspect GGUF header without loading weights
116
+ python -c "
117
+ with open('sample-files/gguf_diffusion_model/<file>.gguf', 'rb') as f:
118
+     magic = f.read(4)
119
+     print('Magic bytes:', magic)
120
+ "
121
+ ```
122
+
123
+ > Do **not** run `pickle.load()` or `torch.load()` directly on these files outside a sandbox.
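Going one step beyond the magic-byte check, the following sketch reads the fixed-size part of a GGUF header: the magic, the format version, the tensor count, and the metadata key-value count. It assumes the little-endian GGUF v2/v3 layout, in which both counts are 64-bit unsigned integers; version 1 files used 32-bit counts and would be misread. The function name `read_gguf_header` is hypothetical. Only the first 24 bytes are read, so no weights are loaded.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Parse the fixed GGUF header fields without touching tensor data.

    Assumes the little-endian v2/v3 layout (an assumption of this sketch).
    """
    with open(path, "rb") as handle:
        header = handle.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file, or header is truncated")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

An implausibly large tensor or key-value count is a cheap signal that a file is malformed, and it can be checked before handing the file to any full parser.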
124
+
125
+ ---
126
+
127
+ ## Training Details
128
+
129
+ ### Training Data
130
+
131
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
132
+
133
+ Not applicable. This repository is not the output of a training run.
134
+
135
+ ### Training Procedure
136
+
137
+ Not applicable.
138
+
139
+ #### Preprocessing [optional]
140
+
141
+ Not applicable.
142
+
143
+ #### Training Hyperparameters
144
+
145
+ - **Training regime:** Not applicable — no training was performed by the Protecttors organization for this repository.
146
+
147
+ #### Speeds, Sizes, Times [optional]
148
+
149
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
150
+
151
+ | Artifact Directory | Approximate Size | Format |
152
+ |---|---|---|
153
+ | `gguf_diffusion_model/` | ~280 MB | GGUF |
154
+ | `ml_pkl_file/` | Variable | Python Pickle |
155
+ | `torch_bin_model/` | Variable | PyTorch Binary |
156
+ | **Total repository** | **~301 MB** | Mixed |
157
+
158
+ ---
159
+
160
+ ## Evaluation
161
+
162
+ <!-- This section describes the evaluation protocols and provides the results. -->
163
+
164
+ ### Testing Data, Factors & Metrics
165
+
166
+ #### Testing Data
167
+
168
+ <!-- This should link to a Dataset Card if possible. -->
169
+
170
+ This repository is itself a test dataset for security tooling. It is not evaluated on NLP benchmarks.
171
+
172
+ #### Factors
173
+
174
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
175
+
176
+ The relevant evaluation factors for tooling using this repository are:
177
+
178
+ - **Format coverage:** Does the scanner/AIBOM tool correctly handle all three artifact formats?
179
+ - **Detection recall:** Are all 4 unsafe files surfaced by the tool?
180
+ - **False positive rate:** Does the tool produce spurious alerts on safe files?
181
+ - **Metadata extraction fidelity:** Does the AIBOM tool correctly extract architecture, parameter count, quantization type, and license from GGUF metadata?
182
+
183
+ #### Metrics
184
+
185
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
186
+
187
+ | Metric | Description |
188
+ |---|---|
189
+ | Unsafe file detection rate | % of the 4 flagged files correctly identified |
190
+ | Format enumeration completeness | % of artifact formats correctly categorized in AIBOM output |
191
+ | VEX advisory linkage | Whether generated VEX documents correctly reference flagged component SHAs |
192
+ | False positive rate | Alerts raised on non-flagged files |
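The detection-rate and false-positive metrics above can be computed from three sets of file paths, as in this minimal sketch. The function name `score_scan` and the choice to express the false positive rate as a fraction of the non-flagged files are assumptions of this example; the card itself does not fix a formula. The ground-truth list of unsafe files must come from the repository, not from the scanner under test.

```python
def score_scan(all_files, unsafe_files, flagged_files):
    """Compare a scanner's alerts against the known-unsafe ground truth.

    all_files:     every file the scanner was given
    unsafe_files:  the files known to be unsafe (ground truth)
    flagged_files: the files the scanner raised an alert on
    """
    all_set = set(all_files)
    unsafe = set(unsafe_files) & all_set
    flagged = set(flagged_files) & all_set
    safe = all_set - unsafe

    detected = flagged & unsafe
    false_positives = flagged & safe
    return {
        # Share of known-unsafe files that the scanner caught.
        "detection_rate": len(detected) / len(unsafe) if unsafe else 0.0,
        # Share of safe files that the scanner wrongly flagged.
        "false_positive_rate": len(false_positives) / len(safe) if safe else 0.0,
        "missed": sorted(unsafe - flagged),
        "false_positives": sorted(false_positives),
    }
```

Reporting the `missed` list alongside the rates matters here because, with only 4 unsafe files, a single miss moves the detection rate by 25 percentage points.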
193
+
194
+ ### Results
195
+
196
+ Tooling evaluation results are not included in this card. Security researchers and practitioners using this repository as a benchmark fixture are encouraged to publish their scanner results via the [Community Discussions tab](https://huggingface.co/protecttors/sample-files/discussions).
197
+
198
+ #### Summary
199
+
200
+ This repository provides 4 unsafe artifacts and a mix of format types to stress-test AI supply chain security tooling. It is not benchmarked on NLP tasks.
201
+
202
+ ---
203
+
204
+ ## Model Examination [optional]
205
+
206
+ <!-- Relevant interpretability work for the model goes here -->
207
+
208
+ The GGUF weights are derived from Qwen2-0.5B, a transformer-based autoregressive language model. No interpretability analysis has been performed by the Protecttors organization on these artifacts. Researchers wishing to inspect model internals may use GGUF header parsing tools to examine quantization metadata without loading full weights into memory.
209
+
210
+ ---
211
+
212
+ ## Environmental Impact
213
+
214
+ No training was conducted by the Protecttors organization for this repository.
215
+
216
+ The GGUF quantization of Qwen2-0.5B is a lightweight conversion step with negligible carbon footprint relative to the original pre-training.
217
+
218
+ Carbon emissions for the original Qwen2 pre-training can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
219
+
220
+ - **Hardware Type:** Not applicable (no training by Protecttors)
221
+ - **Hours used:** Not applicable
222
+ - **Cloud Provider:** Not applicable
223
+ - **Compute Region:** Not applicable
224
+ - **Carbon Emitted:** Not applicable — refer to [Qwen2 model card](https://huggingface.co/Qwen/Qwen2-0.5B) for pre-training emissions
225
+
226
+ ---
227
+
228
+ ## Technical Specifications [optional]
229
+
230
+ ### Model Architecture and Objective
231
+
232
+ | Property | Value |
233
+ |---|---|
234
+ | Architecture | Qwen2 (transformer, autoregressive) |
235
+ | Parameters | ~0.5B |
236
+ | Primary serialization format | GGUF (quantized) |
237
+ | Additional formats | PyTorch `.bin`, Python `.pkl` |
238
+ | Chat template | Qwen2 default |
239
+ | Quantization variants | See repository file listing |
240
+
241
+ ### Compute Infrastructure
242
+
243
+ Not applicable — no training infrastructure was used by the Protecttors organization.
244
+
245
+ #### Hardware
246
+
247
+ Not applicable.
248
+
249
+ #### Software
250
+
251
+ - GGUF conversion: `llama.cpp` conversion toolchain
252
+ - Pickle artifacts: Python 3.x standard library
253
+ - PyTorch `.bin` artifacts: PyTorch (`torch`) package
254
+
255
+ ---
256
+
257
+ ## Citation [optional]
258
+
259
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
260
+
261
+ **BibTeX:**
262
+
263
+ ---
264
+
265
+ ## Glossary [optional]
266
+
267
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
268
+
269
+ | Term | Definition |
270
+ |---|---|
271
+ | **AIBOM** | AI Bill of Materials — a structured inventory of components, dependencies, and metadata for an AI model artifact |
272
+ | **SBOM** | Software Bill of Materials — analogous to AIBOM but covering software supply chains broadly |
273
+ | **VEX** | Vulnerability Exploitability eXchange — a document format for attaching exploitability status to known vulnerabilities in software/AI components |
274
+ | **GGUF** | A binary serialization format for quantized LLM weights, used by `llama.cpp` and compatible runtimes |
275
+ | **Pickle** | Python's native object serialization format; dangerous when loading untrusted sources as it supports arbitrary code execution |
276
+ | **SafeTensors** | A safer alternative serialization format for ML weights that does not support code execution on load |
277
+ | **Red teaming** | Adversarial testing methodology where security researchers simulate attacker behavior to identify vulnerabilities |
278
+ | **CERT-In** | Indian Computer Emergency Response Team — the national nodal agency for cybersecurity incident response in India |
279
+
280
+ ---
281
+
282
+ ## More Information [optional]
283
+
284
+ ---
285
+
286
+ ## Model Card Authors [optional]
287
+
288
+ Protecttors organization
289
+
290
+ ---
291
+
292
+ ## Model Card Contact
293
+
294
+ Reach out via the [Community Discussions tab](https://huggingface.co/protecttors/sample-files/discussions) on this repository for questions, responsible disclosure, or tooling benchmark contributions.