	Title: SECOM Data Set

	Abstract: Data from a semi-conductor manufacturing process


	-----------------------------------------------------

	Data Set Characteristics: Multivariate
	Number of Instances: 1567
	Area: Computer
	Attribute Characteristics: Real
	Number of Attributes: 591
	Date Donated: 2008-11-19
	Associated Tasks: Classification, Causal-Discovery
	Missing Values? Yes

	-----------------------------------------------------

	Source:

	Authors: Michael McCann, Adrian Johnston

	-----------------------------------------------------

	Data Set Information:

	A complex modern semi-conductor manufacturing process is normally under consistent
	surveillance via the monitoring of signals/variables collected from sensors and/or
	process measurement points. However, not all of these signals are equally valuable
	in a specific monitoring system. The measured signals contain a combination of
	useful information, irrelevant information as well as noise. It is often the case
	that useful information is buried in the latter two. Engineers typically have a
	much larger number of signals than are actually required. If we consider each type
	of signal as a feature, then feature selection may be applied to identify the most
	relevant signals. The Process Engineers may then use these signals to determine key
	factors contributing to yield excursions downstream in the process. This will
	enable an increase in process throughput, a decreased time to learning, and reduced
	per-unit production costs.

	To enhance current business improvement techniques, the application of feature
	selection as an intelligent systems technique is being investigated.

	The dataset presented in this case represents a selection of such features, where
	each example represents a single production entity with associated measured
	features. The labels represent a simple pass/fail yield for in-house line
	testing (figure 2) together with an associated date time stamp: -1 corresponds to
	a pass and 1 corresponds to a fail, and the date time stamp is for that specific
	test point.


	Using feature selection techniques, it is desired to rank features according to
	their impact on the overall yield for the product. Causal relationships may also be
	considered with a view to identifying the key features.

	Results may be submitted in terms of feature relevance for predictability using
	error rates as our evaluation metrics. It is suggested that cross validation be
	applied to generate these results. Some baseline results are shown below for basic
	feature selection techniques using a simple kernel ridge classifier and 10 fold
	cross validation.
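
	The 10 fold cross validation suggested above can be sketched as follows. This is a
	minimal illustration, not the authors' procedure: the text does not say whether the
	folds were shuffled or stratified, so the shuffled, unstratified split below is an
	assumption, and kfold_indices is a hypothetical helper name.

```python
import numpy as np

def kfold_indices(n_examples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split.

    Assumption: plain shuffled folds; the source does not specify the scheme.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    # array_split tolerates fold sizes that do not divide n_examples evenly.
    folds = np.array_split(perm, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

# 1567 is the number of examples stated for this dataset.
splits = list(kfold_indices(1567))
print(len(splits), [len(test) for _, test in splits])
```

	With only a small share of fails in the data, a stratified split that keeps the
	pass/fail ratio similar in every fold may be a more stable choice.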

	Baseline Results: Pre-processing objects were applied to the dataset simply to
	standardize the data and remove the constant features. A number of different
	feature selection objects, each selecting the 40 highest ranked features, were then
	applied with a simple classifier to achieve some initial results. 10 fold cross
	validation was used, and the balanced error rate (*BER) was generated as our initial
	performance metric to help investigate this dataset.
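
	The pre-processing and ranking steps described above can be sketched as follows.
	This is an illustration under stated assumptions, not the authors' code: the helper
	names are hypothetical, missing values are simply ignored when computing statistics,
	and the signal-to-noise score is assumed to be the common form
	|mean_fail - mean_pass| / (std_fail + std_pass). The demo data is synthetic.

```python
import numpy as np

def preprocess(X):
    """Drop constant features, then standardize the rest, ignoring NaNs."""
    std = np.nanstd(X, axis=0)
    keep = std > 0  # constant columns have zero standard deviation
    Xk = X[:, keep]
    Z = (Xk - np.nanmean(Xk, axis=0)) / std[keep]
    return Z, np.flatnonzero(keep)

def s2n_scores(Z, y):
    """Assumed S2N form: |mean_fail - mean_pass| / (std_fail + std_pass)."""
    fail, passed = Z[y == 1], Z[y == -1]
    num = np.abs(np.nanmean(fail, axis=0) - np.nanmean(passed, axis=0))
    den = np.nanstd(fail, axis=0) + np.nanstd(passed, axis=0)
    # Features with zero spread in both classes get a score of 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def top_k(scores, k=40):
    """Indices of the k highest scores; NaN scores are ranked last."""
    return np.argsort(-np.nan_to_num(scores, nan=-np.inf))[:k]

# Synthetic stand-in for the real data: 200 examples, 6 features.
rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.2, 1, -1)  # 1 = fail, -1 = pass
X = rng.normal(size=(200, 6))
X[:, 2] = 5.0                 # constant feature, should be removed
X[:, 4] += 2.0 * (y == 1)     # feature that actually separates the classes
X[3, 0] = np.nan              # a missing value

Z, kept = preprocess(X)
scores = s2n_scores(Z, y)
best = kept[top_k(scores, k=2)]  # map back to original column numbers
print(kept, best)
```

	On the real data the same calls would use k=40 to match the baseline setup.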


	SECOM Dataset: 1567 examples, 591 features, 104 fails

	FS method (40 features)    BER %          True + %        True - %
	S2N (signal to noise)      34.5 +-2.6     57.8 +-5.3      73.1 +-2.1
	Ttest                      33.7 +-2.1     59.6 +-4.7      73.0 +-1.8
	Relief                     40.1 +-2.8     48.3 +-5.9      71.6 +-3.2
	Pearson                    34.1 +-2.0     57.4 +-4.3      74.4 +-4.9
	Ftest                      33.5 +-2.2     59.1 +-4.8      73.8 +-1.8
	Gram Schmidt               35.6 +-2.4     51.2 +-11.8     77.5 +-2.3
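
	The footnote defining BER is not reproduced in this file. Assuming the usual
	definition, the mean of the error rates on the two classes, the columns of the
	table are consistent with each other: for the Ttest row,
	0.5 * ((100 - 59.6) + (100 - 73.0)) = 33.7. The sketch below computes BER under
	that assumption. Treating a fail (label 1) as the positive class is also an
	assumption, and the function names and toy labels are hypothetical.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred, positive=1, negative=-1):
    """Mean of the per-class error rates (assumed BER definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    neg = y_true == negative
    true_pos_rate = np.mean(y_pred[pos] == positive)
    true_neg_rate = np.mean(y_pred[neg] == negative)
    return 0.5 * ((1.0 - true_pos_rate) + (1.0 - true_neg_rate))

def ber_from_rates(true_pos_pct, true_neg_pct):
    """BER in percent from the True + % and True - % columns."""
    return 0.5 * ((100.0 - true_pos_pct) + (100.0 - true_neg_pct))

# Toy labels: 4 fails (1) and 6 passes (-1).
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
print(balanced_error_rate(y_true, y_pred))
print(ber_from_rates(59.6, 73.0))
```

	Unlike the plain error rate, this metric does not reward a classifier for always
	predicting the majority (pass) class.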

	-----------------------------------------------------

	Attribute Information:

	Key facts: Data Structure: The data consists of 2 files: the dataset file SECOM,
	consisting of 1567 examples each with 591 features (a 1567 x 591 matrix), and a
	labels file containing the classifications and date time stamp for each example.

	As with any real-life data situation, this data contains null values varying in
	intensity depending on the individual features. This needs to be taken into
	consideration when investigating the data, either through pre-processing or within
	the technique applied.

	The data is represented in a raw text file, each line representing an individual
	example with the features separated by spaces. The null values are represented by
	the 'NaN' value as per MATLAB.
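
	A file in this layout can be read directly into a numeric matrix. The sketch below
	is a minimal example that uses a made-up three-line excerpt held in memory, so the
	values are hypothetical. For the real data, the in-memory buffer would be replaced
	by the path to the dataset file, whose name is not given here.

```python
import io
import numpy as np

# Hypothetical excerpt in the described layout: one example per line,
# features separated by spaces, missing values written as NaN.
sample = io.StringIO(
    "3030.93 2564.00 NaN 1411.13\n"
    "3095.78 NaN NaN 1463.66\n"
    "2932.61 2559.94 2186.41 1698.02\n"
)

# loadtxt splits on whitespace by default and reads the token NaN as a float nan.
X = np.loadtxt(sample)

# Fraction of missing values per feature (column).
missing_fraction = np.isnan(X).mean(axis=0)
print(X.shape)
print(missing_fraction)
```

	The per-feature missing fraction is a quick way to see how the null values vary
	across features before choosing how to handle them.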