	"""
	Title: Timeseries anomaly detection using an Autoencoder
	Author: [pavithrasv](https://github.com/pavithrasv)
	Date created: 2020/05/31
	Last modified: 2020/05/31
	Description: Detect anomalies in a timeseries using an Autoencoder.
	Accelerator: GPU
	"""

	"""
	## Introduction

	This script demonstrates how you can use a reconstruction convolutional
	autoencoder model to detect anomalies in timeseries data.
	"""

	"""
	## Setup
	"""

	import numpy as np
	import pandas as pd
	import keras
	from keras import layers
	from matplotlib import pyplot as plt

	"""
	## Load the data

	We will use the [Numenta Anomaly Benchmark (NAB)](
	https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artificial
	timeseries data containing labeled anomalous periods of behavior. Data are
	ordered, timestamped, single-valued metrics.

	We will use the `art_daily_small_noise.csv` file for training and the
	`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset
	allows us to demonstrate anomaly detection effectively.
	"""

	master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"

	df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"
	df_small_noise_url = master_url_root + df_small_noise_url_suffix
	df_small_noise = pd.read_csv(
	    df_small_noise_url, parse_dates=True, index_col="timestamp"
	)

	df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"
	df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix
	df_daily_jumpsup = pd.read_csv(
	    df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"
	)

	"""
	## Quick look at the data
	"""

	print(df_small_noise.head())

	print(df_daily_jumpsup.head())

	"""
	## Visualize the data
	### Timeseries data without anomalies

	We will use the following data for training.
	"""
	fig, ax = plt.subplots()
	df_small_noise.plot(legend=False, ax=ax)
	plt.show()

	"""
	### Timeseries data with anomalies

	We will use the following data for testing and see if the sudden jump up in the
	data is detected as an anomaly.
	"""
	fig, ax = plt.subplots()
	df_daily_jumpsup.plot(legend=False, ax=ax)
	plt.show()

	"""
	## Prepare training data

	Get data values from the training timeseries data file and normalize the
	`value` data. We have one `value` every 5 minutes for 14 days.

	- 24 * 60 / 5 = 288 timesteps per day
	- 288 * 14 = 4032 data points in total
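
As a minimal sketch of the normalization step, the snippet below standardizes
a small made-up array and then reuses the same statistics on a second array,
mirroring how the test data is later scaled with the training mean and std.
The arrays `train` and `test` are hypothetical stand-ins for the `value`
column, and `ddof=1` is chosen to match the default of pandas `std()`.

```python
import numpy as np

# Hypothetical toy values standing in for the `value` column.
train = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
test = np.array([14.0, 30.0])

# ddof=1 (sample standard deviation) matches the pandas default.
mean = train.mean()
std = train.std(ddof=1)

train_scaled = (train - mean) / std
# Reuse the training statistics so both sets share one scale.
test_scaled = (test - mean) / std

print("Scaled train mean/std:", train_scaled.mean(), train_scaled.std(ddof=1))
print("Scaled test values:", test_scaled)
```

A test value far from the training mean ends up several standard deviations
away after scaling, which is what makes it hard for the model to reconstruct.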
	"""


	# Normalize and save the mean and std we get,
	# for normalizing test data.
	training_mean = df_small_noise.mean()
	training_std = df_small_noise.std()
	df_training_value = (df_small_noise - training_mean) / training_std
	print("Number of training samples:", len(df_training_value))

	"""
	### Create sequences
	Create sequences combining `TIME_STEPS` contiguous data values from the
	training data.
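
To make the windowing concrete, here is a small sketch on 10 made-up values
with a window length of 3. The helper name `make_windows` is hypothetical and
only illustrates the sliding-window idea; it is not part of the script.

```python
import numpy as np


def make_windows(values, time_steps):
    # Stack every run of `time_steps` contiguous rows into one array.
    n_windows = len(values) - time_steps + 1
    return np.stack([values[i : i + time_steps] for i in range(n_windows)])


values = np.arange(10, dtype=float).reshape(-1, 1)  # 10 points, 1 feature
windows = make_windows(values, time_steps=3)

print(windows.shape)        # (8, 3, 1)
print(windows[0].ravel())   # [0. 1. 2.]
print(windows[-1].ravel())  # [7. 8. 9.]
```

By the same arithmetic, 4032 training points with `TIME_STEPS = 288` yield
4032 - 288 + 1 = 3745 overlapping windows.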
	"""

	TIME_STEPS = 288


	# Generate training sequences for use in the model.
	def create_sequences(values, time_steps=TIME_STEPS):
	    output = []
	    # Slide a window of length `time_steps` over the values, one step at a time.
	    for i in range(len(values) - time_steps + 1):
	        output.append(values[i : (i + time_steps)])
	    return np.stack(output)


	x_train = create_sequences(df_training_value.values)
	print("Training input shape: ", x_train.shape)

	"""
	## Build a model

	We will build a convolutional reconstruction autoencoder model. The model will
	take input of shape `(batch_size, sequence_length, num_features)` and return
	output of the same shape. In this case, `sequence_length` is 288 and
	`num_features` is 1.
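
The sketch below traces how the sequence length changes through the layers,
assuming two stride-2 convolutions, two stride-2 transposed convolutions, and
a final stride-1 transposed convolution, all with `padding="same"` as in the
model defined in this section. The helper names are hypothetical; they encode
the usual `"same"`-padding length rules rather than calling Keras.

```python
import math


def conv_same_len(length, stride):
    # With padding="same", a strided convolution yields ceil(length / stride).
    return math.ceil(length / stride)


def conv_transpose_same_len(length, stride):
    # With padding="same", a transposed convolution yields length * stride.
    return length * stride


length = 288
trace = [length]
for stride in (2, 2):  # encoder: two Conv1D layers
    length = conv_same_len(length, stride)
    trace.append(length)
for stride in (2, 2, 1):  # decoder: three Conv1DTranspose layers
    length = conv_transpose_same_len(length, stride)
    trace.append(length)

print(trace)  # [288, 144, 72, 144, 288, 288]
```

The encoder compresses each window to a quarter of its length, and the
decoder restores the original 288 timesteps so the output can be compared
with the input directly.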
	"""

	model = keras.Sequential(
	    [
	        layers.Input(shape=(x_train.shape[1], x_train.shape[2])),
	        layers.Conv1D(
	            filters=32,
	            kernel_size=7,
	            padding="same",
	            strides=2,
	            activation="relu",
	        ),
	        layers.Dropout(rate=0.2),
	        layers.Conv1D(
	            filters=16,
	            kernel_size=7,
	            padding="same",
	            strides=2,
	            activation="relu",
	        ),
	        layers.Conv1DTranspose(
	            filters=16,
	            kernel_size=7,
	            padding="same",
	            strides=2,
	            activation="relu",
	        ),
	        layers.Dropout(rate=0.2),
	        layers.Conv1DTranspose(
	            filters=32,
	            kernel_size=7,
	            padding="same",
	            strides=2,
	            activation="relu",
	        ),
	        layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),
	    ]
	)
	model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
	model.summary()

	"""
	## Train the model

	Please note that we are using `x_train` as both the input and the target
	since this is a reconstruction model.
	"""

	history = model.fit(
	    x_train,
	    x_train,
	    epochs=50,
	    batch_size=128,
	    validation_split=0.1,
	    callbacks=[
	        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
	    ],
	)

	"""
	Let's plot training and validation loss to see how the training went.
	"""

	plt.plot(history.history["loss"], label="Training Loss")
	plt.plot(history.history["val_loss"], label="Validation Loss")
	plt.legend()
	plt.show()

	"""
	## Detecting anomalies

	We will detect anomalies by determining how well our model can reconstruct
	the input data.


	1. Find MAE loss on training samples.
	2. Find max MAE loss value. This is the worst our model has performed trying
	to reconstruct a sample. We will make this the `threshold` for anomaly
	detection.
	3. If the reconstruction loss for a sample is greater than this `threshold`
	value then we can infer that the model is seeing a pattern that it isn't
	familiar with. We will label this sample as an `anomaly`.
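
The three steps above can be sketched on made-up arrays. Everything here is a
hypothetical stand-in: the "predictions" are built by adding fixed errors to
zeros rather than coming from a trained model, so only the thresholding logic
is illustrated.

```python
import numpy as np

# Hypothetical training windows: 3 windows, 4 timesteps, 1 feature.
x_train = np.zeros((3, 4, 1))
# Fake reconstructions with a constant error of 0.1, 0.2, 0.3 per window.
x_train_pred = x_train + np.array([0.1, 0.2, 0.3]).reshape(3, 1, 1)

# Step 1: MAE per window, averaged over the time axis.
train_mae = np.mean(np.abs(x_train_pred - x_train), axis=1).reshape(-1)

# Step 2: the worst training reconstruction becomes the threshold.
threshold = train_mae.max()

# Step 3: flag test windows whose MAE exceeds the threshold.
x_test = np.zeros((2, 4, 1))
x_test_pred = x_test + np.array([0.25, 0.9]).reshape(2, 1, 1)
test_mae = np.mean(np.abs(x_test_pred - x_test), axis=1).reshape(-1)
anomalies = test_mae > threshold

print("Threshold:", threshold)
print("Anomalous windows:", np.where(anomalies)[0])
```

Because the threshold is the maximum training error, no training window is
ever flagged; a single poorly reconstructed training window therefore raises
the bar for every test window.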


	"""

	# Get train MAE loss.
	x_train_pred = model.predict(x_train)
	train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)

	plt.hist(train_mae_loss, bins=50)
	plt.xlabel("Train MAE loss")
	plt.ylabel("No of samples")
	plt.show()

	# Get reconstruction loss threshold.
	threshold = np.max(train_mae_loss)
	print("Reconstruction error threshold: ", threshold)

	"""
	### Compare reconstruction

	Just for fun, let's see how our model has reconstructed the first sample.
	This is the 288 timesteps from day 1 of our training dataset.
	"""

	# Checking how the first sequence is learnt
	plt.plot(x_train[0])
	plt.plot(x_train_pred[0])
	plt.show()

	"""
	### Prepare test data
	"""


	df_test_value = (df_daily_jumpsup - training_mean) / training_std
	fig, ax = plt.subplots()
	df_test_value.plot(legend=False, ax=ax)
	plt.show()

	# Create sequences from test values.
	x_test = create_sequences(df_test_value.values)
	print("Test input shape: ", x_test.shape)

	# Get test MAE loss.
	x_test_pred = model.predict(x_test)
	test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
	test_mae_loss = test_mae_loss.reshape((-1))

	plt.hist(test_mae_loss, bins=50)
	plt.xlabel("Test MAE loss")
	plt.ylabel("No of samples")
	plt.show()

	# Detect all the samples which are anomalies.
	anomalies = test_mae_loss > threshold
	print("Number of anomaly samples: ", np.sum(anomalies))
	print("Indices of anomaly samples: ", np.where(anomalies))

	"""
	## Plot anomalies

	We now know the samples of the data which are anomalies. With this, we will
	find the corresponding `timestamps` from the original test data. We will be
	using the following method to do that:

	Let's say time_steps = 3 and we have 10 training values. Our `x_train` will
	look like this:

	- 0, 1, 2
	- 1, 2, 3
	- 2, 3, 4
	- 3, 4, 5
	- 4, 5, 6
	- 5, 6, 7
	- 6, 7, 8
	- 7, 8, 9

	Every data value except the first and last `time_steps - 1` values appears in
	exactly `time_steps` samples. So, if we know that the samples
	[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point
	5 is an anomaly, because it is the only value present in all three.
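
The toy case above can be checked with a short sketch. The flags are
hypothetical (samples 3, 4 and 5 are marked anomalous by hand), and the rule
applied is the one stated here: a data point is anomalous only if every
sample containing it is anomalous.

```python
import numpy as np

time_steps = 3
n_values = 10
n_windows = n_values - time_steps + 1  # 8 samples

# Hypothetical flags: samples (3, 4, 5), (4, 5, 6), (5, 6, 7) are anomalous.
anomalies = np.zeros(n_windows, dtype=bool)
anomalies[[3, 4, 5]] = True

anomalous_points = []
for i in range(time_steps - 1, n_values - time_steps + 1):
    # Samples i - time_steps + 1 through i (inclusive) all contain point i.
    if np.all(anomalies[i - time_steps + 1 : i + 1]):
        anomalous_points.append(i)

print(anomalous_points)  # [5]
```

Requiring all overlapping samples to agree is a conservative choice: points
4 and 6 each sit in two flagged samples but are not reported.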
	"""

	# Data point i is an anomaly if samples (i - TIME_STEPS + 1) through i are all anomalies.
	anomalous_data_indices = []
	for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
	    # The slice end is exclusive, so use data_idx + 1 to include sample data_idx.
	    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx + 1]):
	        anomalous_data_indices.append(data_idx)

	"""
	Let's overlay the anomalies on the original test data plot.
	"""

	df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
	fig, ax = plt.subplots()
	df_daily_jumpsup.plot(legend=False, ax=ax)
	df_subset.plot(legend=False, ax=ax, color="r")
	plt.show()