---
license: apache-2.0
---

# Leaky Model

This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.

* The raw data used to train the model consists of penetration testing reports (in PDF format) taken from prior competition events. The original source files are available in the [CPTC Report Examples](https://github.com/globalcptc/report_examples) repository.
* The codebase used to process the data and train this model is in the [CPTC leaky_model](https://github.com/globalcptc/leaky_model) repository.

This model contains the following files:

* **text_generation_model.keras**: the trained LSTM (Long Short-Term Memory) neural network model, saved in Keras format
* **text_processor.pkl**: a pickled (serialized) `TextProcessor` object containing:
  - a fitted tokenizer with the vocabulary from the training data
  - the sequence length configuration (default: 50 tokens)
  - vocabulary size information
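
To sanity-check the processor before generating, the pickle can be loaded and inspected on its own. A minimal sketch, assuming the dict-style access used in the Usage example below:

```python
import pickle

with open("text_processor.pkl", "rb") as f:
    processor = pickle.load(f)

# Dict-style access, mirroring the Usage example below
tokenizer = processor['tokenizer']
print("sequence length:", processor['sequence_length'])
print("vocabulary size:", len(tokenizer.word_index) + 1)  # +1 for the padding index 0
```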

## Usage

```python
import tensorflow as tf
import pickle
import numpy as np

model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"

# Load the model and the pickled text processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)

# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 1.7  # Higher = more random, lower = more focused (default: 0.7)
top_k = 50         # Keep only the top k tokens (set to 0 to disable)
top_p = 0.9        # Nucleus sampling threshold (set to 1.0 to disable)

# Process the prompt: tokenize, then left-pad to the model's sequence length
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
current_sequence = tokenizer.texts_to_sequences([prompt])[0]
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence
current_sequence = np.array([current_sequence])

# Generate text one token at a time
generated_text = prompt
for _ in range(max_tokens):
    pred = model.predict(current_sequence, verbose=0)
    logits = pred[0] / temperature

    # Top-k filtering: mask out all but the k highest-scoring tokens
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability exceeds top_p, always retaining the best token
    if top_p < 1.0:
        sorted_logits = np.sort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(sorted_logits))
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1]
        sorted_indices_to_remove[0] = False
        indices_to_remove = np.argsort(logits)[::-1][sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')

    # Sample from the filtered distribution; renormalize to guard against
    # float32 rounding, which np.random.choice is strict about
    probs = tf.nn.softmax(logits).numpy()
    probs = probs / probs.sum()
    next_token = int(np.random.choice(len(probs), p=probs))

    # Map the token index back to a word (index 0 is reserved for padding)
    word = tokenizer.index_word.get(next_token)
    if word:
        generated_text += ' ' + word

    # Slide the window: drop the oldest token, append the new one
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])

print(generated_text)
```
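
Because the training corpus is made up of real penetration testing reports, the leakage is easiest to observe with deterministic decoding of report-style prompts: greedy decoding always picks the most likely next token, so any memorized continuation is reproduced consistently. A minimal sketch reusing `model`, `tokenizer`, and `sequence_length` from the script above; the prompt is an illustrative guess at report wording, not a phrase taken from the source PDFs:

```python
def greedy_continue(prompt, n_tokens=30):
    """Greedily extend a prompt with the model's most likely tokens."""
    seq = tokenizer.texts_to_sequences([prompt])[0]
    seq = seq[-sequence_length:]                    # truncate long prompts
    seq = [0] * (sequence_length - len(seq)) + seq  # left-pad short ones
    current = np.array([seq])
    out = prompt
    for _ in range(n_tokens):
        next_token = int(np.argmax(model.predict(current, verbose=0)[0]))
        word = tokenizer.index_word.get(next_token)
        if word:
            out += ' ' + word
        current = np.array([current[0, 1:].tolist() + [next_token]])
    return out

# Hypothetical report-style prompt
print(greedy_continue("the team was able to gain access to"))
```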