README.md · PerkinsFund/AURA at main

	---
	license: cc-by-4.0
	pipeline_tag: tabular-classification
	tags:
	- malware
	- cybersecurity
	- pe-files
	- binary-classification
	- tabular-data
	- threat-intelligence
	- digital-forensics
	- reverse-engineering
	- incident-response
	- security-telemetry
	- ai-security
	- security-ml
	- mitre-attack
	- mitre-mbc
	- windows
	- executable-files
	- static-analysis
	- behavioral-analysis
	- classification
	- anomaly-detection
	- intrusion-detection
	- explainable-ai
	- model-evaluation
	- benchmarking
	- training
	- evaluation
	- research
	- education
	- teaching
	- quantized
	- tflite
	- edge-inference
	model_name: AURA Q1
	---

	# AURA Q1

	AURA Q1 is the free, quantized Windows release of the AURA malware classification model family.

	It is designed for efficient inference on structured telemetry extracted from Windows PE files, with a focus on lightweight deployment, fast scoring, and reproducible preprocessing.

	This repository contains a quantized model artifact intended for inference only.

	## What this model does

	AURA Q1 performs tabular binary classification for Windows executable analysis workflows. It is intended for use in research, education, prototyping, and defensive experimentation where users want a compact model that can score extracted static PE features.

	Depending on your surrounding pipeline, the model can support workflows involving:

	- malware / benign classification
	- triage prioritization
	- bulk telemetry scoring
	- offline experimentation
	- edge or constrained deployment scenarios

	## Model format

	This release is distributed as a quantized TensorFlow Lite model.

	Primary characteristics:

	- compact deployment format
	- lower memory footprint than a full-precision model
	- suitable for portable and embedded inference scenarios
	- optimized for inference speed and distribution simplicity

	## Full version availability

	AURA Q1 is the free quantized Windows release of AURA for lightweight local and edge inference.

	If you want to use the full version of AURA in a broader analysis workflow, including Windows, Linux, and Android classification, it is available through Traceix.

	Traceix: https://traceix.com

	Traceix is operated by PCEF (Perkins Cybersecurity Educational Fund), a 501(c)(3) nonprofit that provides free cybersecurity education, tools, and training.

	Learn more about PCEF: https://perkinsfund.org

	## Input schema

	AURA Q1 expects 30 numeric input features in a fixed order. The preprocessing configuration attached to this release defines the exact feature list and normalization parameters for the Windows model.

	### Feature order

	```text
	1. MajorImageVersion
	2. MajorOperatingSystemVersion
	3. MajorSubsystemVersion
	4. ImageBase
	5. MinorLinkerVersion
	6. CheckSum
	7. BaseOfData
	8. SectionsMaxEntropy
	9. MajorLinkerVersion
	10. DllCharacteristics
	11. SizeOfStackReserve
	12. LoadConfigurationSize
	13. ResourcesMinSize
	14. Subsystem
	15. SizeOfCode
	16. SectionsMeanVirtualsize
	17. Machine
	18. SizeOfImage
	19. AddressOfEntryPoint
	20. Characteristics
	21. SizeOfOptionalHeader
	22. ResourcesMaxSize
	23. ResourcesMaxEntropy
	24. ImportsNb
	25. SectionsMaxRawsize
	26. ExportNb
	27. ImportsNbDLL
	28. ResourcesMinEntropy
	29. SectionMaxVirtualsize
	30. SectionsMeanRawsize
	```

	These features and their normalization metadata are defined in the provided preprocessing file.

	## Preprocessing

	Inputs must be preprocessed exactly as defined by the release preprocessing configuration.

	This model uses per-feature min-max scaling with a target feature range of `[0, 1]` across all 30 input dimensions. The preprocessing metadata includes:

	- feature names
	- per-feature `scale`
	- per-feature `min`
	- original `data_min`
	- original `data_max`
	- feature range
	- number of expected input features

	The preprocessing config explicitly states:

	- `n_features_in = 30`
	- `feature_range = [0, 1]`
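
The field names listed above (`scale`, `min`, `data_min`, `data_max`, `feature_range`) match the attributes of scikit-learn's `MinMaxScaler`. Assuming the release follows that convention, which is an assumption rather than something the configuration states here, the stored values relate as sketched below. The numbers are toy values for three features, not taken from `preprocess.json`.

```python
import numpy as np

# Toy statistics for three features; the real values live in preprocess.json.
data_min = np.array([0.0, 10.0, 100.0], dtype=np.float64)
data_max = np.array([8.0, 20.0, 300.0], dtype=np.float64)
range_lo, range_hi = 0.0, 1.0  # feature_range = [0, 1]

# Assumed scikit-learn MinMaxScaler convention:
#   scale = (range_hi - range_lo) / (data_max - data_min)
#   min   = range_lo - data_min * scale
scale = (range_hi - range_lo) / (data_max - data_min)
min_offset = range_lo - data_min * scale

# Scaling a raw vector is then a single affine step per feature.
raw = np.array([4.0, 10.0, 300.0])
scaled = raw * scale + min_offset
```

Under this convention a raw value equal to `data_min` maps to 0 and one equal to `data_max` maps to 1, so storing `scale` and `min` is enough to reproduce the transform at inference time.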

	### Important

	You should not reorder features, omit features, or substitute alternative telemetry fields without retraining or validating compatibility. Inference quality depends on preserving the exact training-time feature contract.
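
One way to guard this contract in code is to key raw values by feature name and reject anything missing, unexpected, or non-finite before scaling. The helper below is a hypothetical sketch (`validate_raw_features` is not part of this release) and uses a shortened three-name order for brevity; the real list has 30 entries.

```python
import math


def validate_raw_features(raw_features, feature_order):
    """Return values in feature_order, or raise ValueError on a contract violation."""
    missing = [name for name in feature_order if name not in raw_features]
    extra = [name for name in raw_features if name not in feature_order]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")

    # Look values up by name so the caller's dict ordering cannot matter.
    ordered = [float(raw_features[name]) for name in feature_order]

    bad = [name for name, value in zip(feature_order, ordered) if not math.isfinite(value)]
    if bad:
        raise ValueError(f"non-finite values for: {bad}")
    return ordered


# Shortened order for illustration only.
order = ["MajorImageVersion", "ImageBase", "SectionsMaxEntropy"]
row = {"SectionsMaxEntropy": 6.12, "ImageBase": 4194304, "MajorImageVersion": 0}
values = validate_raw_features(row, order)
```

Because values are looked up by name, the resulting vector follows the release order regardless of how the extractor emitted the fields.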

	## Example preprocessing flow

	At inference time, the expected workflow is:

	1. Extract the 30 raw features from the analyzed Windows PE sample.
	2. Arrange them in the exact order listed above.
	3. Apply the released min-max normalization parameters.
	4. Feed the normalized vector into the quantized TFLite model.
	5. Interpret the output score according to your downstream thresholding policy.

	## Quick Python runner

	Below is a minimal Python example that loads `model.tflite` and `preprocess.json`, applies the released min-max scaling, and runs inference on a single feature vector.

	```python
	import json
	import numpy as np

	import tensorflow as tf


	def load_preprocess(path="preprocess.json"):
	    """Load feature order and min-max parameters from the release config."""
	    with open(path, "r", encoding="utf-8") as f:
	        cfg = json.load(f)

	    features = cfg["features"]
	    scale = np.array(cfg["scale"], dtype=np.float32)
	    min_offset = np.array(cfg["min"], dtype=np.float32)

	    return features, scale, min_offset


	def preprocess_features(raw_features: dict, feature_order, scale, min_offset):
	    """Order the raw features, apply min-max scaling, and clip to [0, 1]."""
	    missing = [name for name in feature_order if name not in raw_features]
	    if missing:
	        raise ValueError(f"Missing features: {missing}")

	    x = np.array([raw_features[name] for name in feature_order], dtype=np.float32)
	    x = x * scale + min_offset
	    x = np.clip(x, 0.0, 1.0)

	    # Shape: [1, 30]
	    return np.expand_dims(x, axis=0).astype(np.float32)


	def run_inference(model_path="model.tflite", preprocess_path="preprocess.json", raw_features=None):
	    if raw_features is None:
	        raise ValueError("raw_features must be provided")

	    feature_order, scale, min_offset = load_preprocess(preprocess_path)
	    x = preprocess_features(raw_features, feature_order, scale, min_offset)

	    interpreter = tf.lite.Interpreter(model_path=model_path)
	    interpreter.allocate_tensors()

	    input_details = interpreter.get_input_details()
	    output_details = interpreter.get_output_details()

	    input_index = input_details[0]["index"]
	    output_index = output_details[0]["index"]

	    input_dtype = input_details[0]["dtype"]
	    output_dtype = output_details[0]["dtype"]

	    # Quantize the input if the model expects an integer tensor
	    if np.issubdtype(input_dtype, np.integer):
	        q_scale, q_zero_point = input_details[0]["quantization"]
	        if q_scale == 0:
	            raise ValueError("Invalid input quantization scale")
	        # Clip to the dtype range so out-of-range values saturate instead of wrapping
	        info = np.iinfo(input_dtype)
	        x_q = np.round(x / q_scale + q_zero_point)
	        x_in = np.clip(x_q, info.min, info.max).astype(input_dtype)
	    else:
	        x_in = x.astype(input_dtype)

	    interpreter.set_tensor(input_index, x_in)
	    interpreter.invoke()

	    y = interpreter.get_tensor(output_index)

	    # Dequantize output if needed
	    if np.issubdtype(output_dtype, np.integer):
	        q_scale, q_zero_point = output_details[0]["quantization"]
	        if q_scale != 0:
	            y = (y.astype(np.float32) - q_zero_point) * q_scale

	    return y


	if __name__ == "__main__":
	    sample = {
	        "MajorImageVersion": 0,
	        "MajorOperatingSystemVersion": 6,
	        "MajorSubsystemVersion": 6,
	        "ImageBase": 4194304,
	        "MinorLinkerVersion": 25,
	        "CheckSum": 0,
	        "BaseOfData": 24576,
	        "SectionsMaxEntropy": 6.12,
	        "MajorLinkerVersion": 14,
	        "DllCharacteristics": 34112,
	        "SizeOfStackReserve": 1048576,
	        "LoadConfigurationSize": 160,
	        "ResourcesMinSize": 48,
	        "Subsystem": 3,
	        "SizeOfCode": 28160,
	        "SectionsMeanVirtualsize": 8192,
	        "Machine": 34404,
	        "SizeOfImage": 126976,
	        "AddressOfEntryPoint": 5272,
	        "Characteristics": 258,
	        "SizeOfOptionalHeader": 240,
	        "ResourcesMaxSize": 4096,
	        "ResourcesMaxEntropy": 4.21,
	        "ImportsNb": 12,
	        "SectionsMaxRawsize": 28672,
	        "ExportNb": 0,
	        "ImportsNbDLL": 3,
	        "ResourcesMinEntropy": 1.37,
	        "SectionMaxVirtualsize": 32768,
	        "SectionsMeanRawsize": 6144,
	    }

	    output = run_inference(
	        model_path="model.tflite",
	        preprocess_path="preprocess.json",
	        raw_features=sample,
	    )

	    print("Model output:", output)
	```
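
The integer branches in the runner rely on the affine mapping used by TensorFlow Lite integer quantization, `real_value = (quantized_value - zero_point) * scale`. The sketch below isolates that round trip with NumPy only. The `q_scale` and `q_zero_point` values are hypothetical; the real ones come from `get_input_details()` and `get_output_details()`, and whether this artifact exposes integer input or output tensors at all is not stated in this card.

```python
import numpy as np


def quantize(x, q_scale, q_zero_point, dtype=np.int8):
    """Map floats to integers: q = round(x / scale + zero_point), clipped to dtype."""
    info = np.iinfo(dtype)
    q = np.round(x / q_scale + q_zero_point)
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize(q, q_scale, q_zero_point):
    """Map integers back to floats: x = (q - zero_point) * scale."""
    return (q.astype(np.float32) - q_zero_point) * q_scale


# Hypothetical parameters that spread [0, 1] across the int8 range.
q_scale, q_zero_point = 1.0 / 255.0, -128

x = np.array([0.0, 0.2, 1.0], dtype=np.float32)
q = quantize(x, q_scale, q_zero_point)
x_back = dequantize(q, q_scale, q_zero_point)

print("quantized:", q)
print("round trip:", x_back)
```

The round trip is lossy by at most about half a quantization step per value, which is one source of the small accuracy differences between quantized and full-precision variants.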

	## Intended use

	AURA Q1 is intended for:

	- defensive security research
	- malware analysis experimentation
	- academic and educational use
	- benchmarking tabular security ML pipelines
	- lightweight inference deployments

	## Out-of-scope use

	This release is not intended to be used as:

	- a standalone malware verdict engine without analyst oversight
	- a replacement for sandboxing, reverse engineering, or signature-based detection
	- a guarantee of maliciousness or benignness
	- a production enforcement control without independent validation

	## Limitations

	Users should evaluate the model carefully in their own environment. Key limitations include:

	- performance depends heavily on feature extraction quality
	- distribution shift can reduce reliability
	- adversarial adaptation is possible
	- score calibration may not transfer across datasets
	- quantization can introduce small accuracy differences relative to full-precision variants
	- security telemetry definitions may vary across tooling stacks
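
One inexpensive signal for distribution shift is how often raw values fall outside the training range recorded in the preprocessing metadata. The quick runner clips scaled inputs to `[0, 1]`, so such values saturate silently unless they are checked first. The helper below is a hypothetical sketch: it assumes `data_min` and `data_max` are arrays aligned with the feature order, and the numbers are toy values rather than values from the release.

```python
import numpy as np


def out_of_range_features(x_raw, data_min, data_max, feature_names):
    """Return names of features whose raw value lies outside [data_min, data_max]."""
    x_raw = np.asarray(x_raw, dtype=np.float64)
    lo = np.asarray(data_min, dtype=np.float64)
    hi = np.asarray(data_max, dtype=np.float64)
    mask = (x_raw < lo) | (x_raw > hi)
    return [name for name, flagged in zip(feature_names, mask) if flagged]


# Toy ranges for three features (not taken from preprocess.json).
names = ["SizeOfCode", "ImportsNb", "SectionsMaxEntropy"]
data_min = [0.0, 0.0, 0.0]
data_max = [1_000_000.0, 500.0, 8.0]

flagged = out_of_range_features([28160, 900, 6.12], data_min, data_max, names)
print("Outside training range:", flagged)
```

Tracking how often each feature is flagged across a scoring batch gives a rough indication of whether the incoming telemetry still resembles the data the scaler was fitted on.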

	## Bias, risk, and security considerations

	Security ML systems can produce both false positives and false negatives. AURA Q1 should be used as a decision-support signal, not as a sole source of truth.

	Potential risks include:

	- benign software being flagged incorrectly
	- malicious software evading classification
	- degraded performance on underrepresented file families
	- misuse in overly automated blocking pipelines

	Human review and layered security controls are recommended.

	## Reproducibility notes

	To reproduce inference correctly, use:

	- the exact feature order released here
	- the exact normalization metadata in `preprocess.json`
	- the quantized TFLite model artifact included in the repository

	Any mismatch between extracted telemetry and the expected schema may invalidate outputs.

	## Output

	This is a classification model that returns a prediction score or class output depending on the runtime wrapper used around the TFLite artifact.

	If you package this model inside a larger application, document your own:

	- output tensor interpretation
	- class mapping
	- threshold policy
	- confidence handling
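
As a starting point for that documentation, a threshold policy can be written as a small explicit function. The sketch below rests on assumptions this card does not confirm: that the output tensor holds a single score, that higher scores mean "malicious", and that 0.7 is a sensible cutoff. The name `to_verdict` is hypothetical. Check the output tensor shape and class mapping of the artifact before reusing any of it.

```python
import numpy as np


def to_verdict(model_output, threshold=0.5):
    """Collapse a model output to (score, label).

    Assumed mapping: a single score where higher means more likely malicious.
    """
    score = float(np.asarray(model_output, dtype=np.float32).reshape(-1)[0])
    label = "malicious" if score >= threshold else "benign"
    return score, label


# Example with a made-up output tensor of shape [1, 1].
score, label = to_verdict(np.array([[0.83]], dtype=np.float32), threshold=0.7)
print(f"score={score:.2f} label={label}")
```

Keeping the threshold as an explicit parameter makes it easy to tune on your own validation data instead of relying on a default.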

	## Citation

	If you use AURA Q1 in research, evaluation, teaching, or derivative work, please cite this repository and retain attribution under the repository license.

	## License

	This model card is released under CC-BY-4.0.

	Please review the repository license terms before redistribution or derivative use.

	## Disclaimer

	AURA Q1 is provided for research, educational, and defensive purposes. It is offered as-is, without warranty, and should be validated thoroughly before any operational use.