# Introduction to the Code: Domain Evaluation with a Language Model

This code implements a system that assesses whether a domain was generated by a Domain Generation Algorithm (DGA) or is legitimate. It uses a large language model (LLM), `meta-llama/Meta-Llama-3-8B`, loaded with 4-bit quantization, together with an adapter fine-tuned specifically for domain classification.

## Features of the Code

- **Text Generation**: Prompts the language model with `#domain: <domain>` followed by `#label:` and lets it generate the label.
- **Automatic Classification**: Maps the generated label to `1` (DGA) or `0` (legitimate).
- **Modular Structure**: A single reusable function, `evaluate_domain`, classifies one domain per call, so it can be applied to any number of domains.

## Code Structure

### Environment Setup:
- Loads the language model via the Hugging Face `transformers` library.
- Initializes the text generation function.

### Defining the `evaluate_domain` Function:
- The function takes a text generation pipeline and a domain as input.
- It builds a prompt for the given domain.
- It processes the model's output to classify the domain as:
  - `1`: If the domain is generated by a DGA.
  - `0`: If the domain is legitimate.
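
The output-processing step can be sketched without loading the model. The helper name `parse_label` is hypothetical (it does not appear in the notebook), and the sketch assumes the pipeline returns the prompt followed by the generated label, so the label is whatever follows the `#label:` marker.

```python
def parse_label(generated_text: str) -> int:
    """Return 1 if the text after '#label:' starts with 'dga', else 0."""
    # Split once on the marker; everything after it is the model's continuation.
    # If the marker is missing, the continuation is empty and the result is 0.
    _, _, continuation = generated_text.partition("#label:")
    return 1 if continuation.strip().startswith("dga") else 0


# Strings shaped like the pipeline output (prompt followed by a label).
print(parse_label("#domain: gfdjgerenyqsgert.com \n#label: dga"))
print(parse_label("#domain: google.com \n#label: normal"))
```

Treating a missing marker as `0` is a design choice made for this sketch; a stricter variant could raise an error instead, so that malformed output is not silently counted as legitimate.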

### Testing and Usage:
- Practical examples showing how to evaluate domains like `google.com` or `gfdjgerenyqsgert.com`.

## Example Usage

```python
# Evaluate domains
domain1 = "google.com"
domain2 = "gfdjgerenyqsgert.com"

label1 = evaluate_domain(generator, domain1)
label2 = evaluate_domain(generator, domain2)

print(f"Domain: {domain1}, Label: {label1}")
print(f"Domain: {domain2}, Label: {label2}")
```
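
To evaluate a whole list rather than two hard-coded domains, the per-domain call can be wrapped in a small helper. This is a minimal sketch: `evaluate_domains` and `fake_classify` are hypothetical names, and `fake_classify` is only a stand-in that lets the example run without a GPU; it is not a real DGA detector.

```python
from typing import Callable, Dict, Iterable


def evaluate_domains(classify: Callable[[str], int],
                     domains: Iterable[str]) -> Dict[str, int]:
    """Apply a per-domain classifier to each domain and collect the labels."""
    return {domain: classify(domain) for domain in domains}


def fake_classify(domain: str) -> int:
    # Stand-in for the model call (assumption for illustration only):
    # flags domains whose first label is longer than 12 characters.
    return 1 if len(domain.split(".")[0]) > 12 else 0


labels = evaluate_domains(fake_classify, ["google.com", "gfdjgerenyqsgert.com"])
print(labels)
```

With the notebook's own objects, the classifier argument would be something like `lambda d: evaluate_domain(generator, d)`.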

## Applications

This system can be used in cybersecurity tasks to:

- Identify malicious domains generated by malware.
- Enhance DNS-based threat detection systems.
- Automate real-time domain list analysis.


## Hardware Requirements

This code was run on a Google Colab T4 GPU with 16 GB of memory; the minimum requirement for running it is a GPU with 8 GB of memory.
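
A back-of-the-envelope estimate shows why 4-bit quantization matters for these figures. The sketch assumes roughly 8 billion parameters (taken from the "8B" in the model name) and counts the weights only; activations, the adapter, the KV cache, and quantization overhead are ignored, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: approximate parameter count implied by "8B"

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized float16 weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:   ~{nf4_gb:.0f} GB")
```

Under these assumptions the float16 weights alone would fill a 16 GB card, while the 4-bit weights leave headroom on an 8 GB one.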

In [1]:
%%capture
!pip install torch datasets transformers trl huggingface_hub pandas peft bitsandbytes

In [2]:
import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel  # PEFT wrapper used to load the fine-tuned adapter onto the base model

from transformers import BitsAndBytesConfig  # Importing configuration for Bits and Bytes quantization

from google.colab import drive  # Importing the library to mount Google Drive
drive.mount('/content/drive')  # Mounting Google Drive in Colab environment

from huggingface_hub import login  # Importing login function from Hugging Face Hub

login(token="hf_RQLxVGKFBgcLsMSzNTY......")  # Logging into Hugging Face Hub with a specific token

# Path to the adapter files on Google Drive
adapter_dir = '/content/drive/My Drive/adapter'  # Defining the path where adapter files are located

# Configuration for 4-bit quantization
compute_dtype = torch.float16  # Data type used for computation on the quantized weights (float16 in this case)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load in 4-bit quantization
    bnb_4bit_quant_type="nf4",  # Type of 4-bit quantization used
    bnb_4bit_compute_dtype=compute_dtype,  # Calculation data type for 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Use double quantization for 4 bits
)

# Load pretrained model with quantization
model_name = "meta-llama/Meta-Llama-3-8B"  # Name of the pretrained model to use
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # Quantization configuration applied to the loaded model
    device_map={"": 0}  # Device mapping, in this case, the model is loaded on device 0
)

# Load the adapter
model = PeftModel.from_pretrained(model, adapter_dir)  # Loading a custom adapter onto the model

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)  # Loading the tokenizer associated with the model

# Create a text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)  # Creating a pipeline to generate text using the model and tokenizer



Mounted at /content/drive


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', ...] (list truncated in the captured output)

In [3]:


def evaluate_domain(generator, domain, max_length=50):
    """
    Evaluates whether a domain is classified as 'dga' or 'normal' by the model.

    Args:
        generator: The text generation pipeline (e.g., Hugging Face pipeline).
        domain: The domain name to evaluate (string).
        max_length: Maximum total length in tokens, prompt included, passed to the pipeline (integer).

    Returns:
        label_value: 1 if classified as 'dga', 0 otherwise.
    """
    # Define the prompt
    prompt = f"#domain: {domain} \n#label:"

    # Generate text using the model
    generated_text = generator(prompt, max_length=max_length)

    # Extract the generated text
    text = generated_text[0]['generated_text']

    # Extract the value after the first "#label:" (the pipeline echoes the prompt back)
    label_text = text.split("#label:", 1)[1].strip()

    # Determine the label value based on the label text
    label_value = 1 if label_text.startswith("dga") else 0

    return label_value

# Example usage:

# Evaluate domains
domain1 = "google.com"
domain2 = "gfdjgerenyqsgert.com"

label1 = evaluate_domain(generator, domain1)
label2 = evaluate_domain(generator, domain2)

print(f"Domain: {domain1}, Label: {label1}")
print(f"Domain: {domain2}, Label: {label2}")



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Domain: google.com, Label: 0
Domain: gfdjgerenyqsgert.com, Label: 1
