---
license: apache-2.0
---

# Leaky Model

This is a simple LSTM-based text generation model, designed to illustrate how models can leak sensitive data.

* The raw data used to train the model consists of penetration testing reports (in PDF format) taken from prior competition events. The original source files are available in the [CPTC Report Examples](https://github.com/globalcptc/report_examples) repository.
* The codebase used to process the data and train this model is in the [CPTC leaky_model](https://github.com/globalcptc/leaky_model) repository.

This model contains the following files:

* **text_generation_model.keras**: the trained LSTM (Long Short-Term Memory) neural network model, saved in Keras format
* **text_processor.pkl**: a pickled (serialized) `TextProcessor` object containing:
  - a fitted tokenizer with the vocabulary from the training data
  - the sequence length configuration (default: 50 tokens)
  - vocabulary size information
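
To sanity-check the processor before generating, the pickle can be loaded and inspected on its own. A minimal sketch, assuming the dict-style access used in the Usage example below:

```python
import pickle

with open("text_processor.pkl", "rb") as f:
    processor = pickle.load(f)

# Dict-style access, mirroring the Usage example below
tokenizer = processor['tokenizer']
print("sequence length:", processor['sequence_length'])
print("vocabulary size:", len(tokenizer.word_index) + 1)  # +1 for the padding index 0
```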

## Usage

```python
import tensorflow as tf
import pickle
import numpy as np

model_file = "text_generation_model.keras"
processor_file = "text_processor.pkl"

# Load the model and the pickled text processor
model = tf.keras.models.load_model(model_file)
with open(processor_file, 'rb') as f:
    processor = pickle.load(f)

# Generation parameters
prompt = "Once upon a time"
max_tokens = 100
temperature = 1.7  # Higher = more random, lower = more focused (default: 0.7)
top_k = 50         # Keep only the top k tokens (set to 0 to disable)
top_p = 0.9        # Nucleus sampling threshold (set to 1.0 to disable)

# Process the prompt: tokenize, then left-pad to the model's sequence length
tokenizer = processor['tokenizer']
sequence_length = processor['sequence_length']
current_sequence = tokenizer.texts_to_sequences([prompt])[0]
current_sequence = [0] * (sequence_length - len(current_sequence)) + current_sequence
current_sequence = np.array([current_sequence])

# Generate text one token at a time
generated_text = prompt
for _ in range(max_tokens):
    pred = model.predict(current_sequence, verbose=0)
    logits = pred[0] / temperature

    # Top-k filtering: mask out all but the k highest-scoring tokens
    if top_k > 0:
        indices_to_remove = np.argsort(logits)[:-top_k]
        logits[indices_to_remove] = -float('inf')

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability exceeds top_p, always retaining the best token
    if top_p < 1.0:
        sorted_logits = np.sort(logits)[::-1]
        cumulative_probs = np.cumsum(tf.nn.softmax(sorted_logits))
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1]
        sorted_indices_to_remove[0] = False
        indices_to_remove = np.argsort(logits)[::-1][sorted_indices_to_remove]
        logits[indices_to_remove] = -float('inf')

    # Sample from the filtered distribution; renormalize to guard against
    # float32 rounding, which np.random.choice is strict about
    probs = tf.nn.softmax(logits).numpy()
    probs = probs / probs.sum()
    next_token = int(np.random.choice(len(probs), p=probs))

    # Map the token index back to a word (index 0 is reserved for padding)
    word = tokenizer.index_word.get(next_token)
    if word:
        generated_text += ' ' + word

    # Slide the window: drop the oldest token, append the new one
    current_sequence = np.array([current_sequence[0, 1:].tolist() + [next_token]])

print(generated_text)
```
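
Because the training corpus is made up of real penetration testing reports, the leakage is easiest to observe with deterministic decoding of report-style prompts: greedy decoding always picks the most likely next token, so any memorized continuation is reproduced consistently. A minimal sketch reusing `model`, `tokenizer`, and `sequence_length` from the script above; the prompt is an illustrative guess at report wording, not a phrase taken from the source PDFs:

```python
def greedy_continue(prompt, n_tokens=30):
    """Greedily extend a prompt with the model's most likely tokens."""
    seq = tokenizer.texts_to_sequences([prompt])[0]
    seq = seq[-sequence_length:]                    # truncate long prompts
    seq = [0] * (sequence_length - len(seq)) + seq  # left-pad short ones
    current = np.array([seq])
    out = prompt
    for _ in range(n_tokens):
        next_token = int(np.argmax(model.predict(current, verbose=0)[0]))
        word = tokenizer.index_word.get(next_token)
        if word:
            out += ' ' + word
        current = np.array([current[0, 1:].tolist() + [next_token]])
    return out

# Hypothetical report-style prompt
print(greedy_continue("the team was able to gain access to"))
```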