Contents
Que 1. Define data pre-processing and need of data pre-processing.
Data pre-processing
Data preprocessing is an important step in the data mining process. It refers to the cleaning, integration, reduction, transformation, and discretization of data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and clustering.
Need of data pre-processing
- Improve data quality by handling missing values, removing noise/outliers, and resolving inconsistencies.
- Integrate data from multiple sources into a consistent format.
- Transform data to a suitable format for analysis (e.g., numerical representation, normalization).
- Reduce dimensionality to enhance efficiency, prevent overfitting, and improve interpretability.
- Handle missing values through imputation or deletion.
- Detect and handle outliers to ensure accurate analysis.
- Discretize continuous variables for simplification and noise reduction.
- Reduce noise through smoothing or filtering techniques.
- Scale data for consistency across variables.
- Enhance model performance and decision-making.
Que 2. Explain the major tasks in data pre-processing with examples.
1) Data Cleaning:
- Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
- When combining multiple data sources, there are many opportunities for data to be duplicated or mislabelled.
- If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
Methods for Data Cleaning:
a) Missing Values:
- Imagine that you need to analyze All Electronics sales and customer data.
- You note that many tuples have no recorded value for several attributes such as customer income.
- How can you go about filling in the missing values for this attribute? Let’s look at the following methods.
i. Ignore the tuple:
- This is usually done when the class label is missing.
- This method is not very effective, unless the tuple contains several attributes with missing values.
ii. Fill in the missing value manually:
- In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value:
- Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞.
- If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of “Unknown.”
iv. Use the attribute mean to fill in the missing value:
- For example, suppose that the average income of All Electronics customers is $28,000. Use this value to replace the missing value for income.
v. Use the attribute mean for all samples belonging to the same class as the given tuple:
- For example, if customers are classified according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value:
- This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
- For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
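For illustration, the filling strategies above can be sketched with pandas; the DataFrame, column names, and values below are hypothetical.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "high", "low", "high", "low"],
    "income": [28000, 15000, None, None, 31000],
})

# i.  Ignore the tuple (drop rows with a missing income)
dropped = df.dropna(subset=["income"])

# iii. Use a global constant ("Unknown") to fill in the missing value
constant_filled = df["income"].astype("object").fillna("Unknown")

# iv.  Use the attribute mean to fill in the missing value
mean_filled = df["income"].fillna(df["income"].mean())

# v.   Use the mean of the tuple's class (credit risk category)
class_mean_filled = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```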
b) Noisy Data:
- “What is noise?” Noise is random error or variance in a measured variable.
i. Binning:
- Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.
- The sorted values are distributed into a number of “buckets”, or bins.
- Because binning methods consult the neighborhood of values, they perform local smoothing.
- For example, suppose the sorted data for price (in dollars) are 4, 8, 15, 21, 21, 24, 25, 28, 34. The data are first sorted and then partitioned into equidepth bins of depth 3 (i.e., each bin contains three values).
- In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
- For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
- Therefore, each original value in this bin is replaced by the value 9.
- Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
- In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.
- Each bin value is then replaced by the closest boundary value.
- In general, the larger the width, the greater the effect of the smoothing.
- Alternatively, bins may be equiwidth, where the interval range of values in each bin is constant.
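As a minimal sketch, smoothing by bin means and by bin boundaries can be computed with NumPy on the sorted price values from the example above.

```python
import numpy as np

# Sorted prices partitioned into equidepth bins of depth 3
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(-1, 3)      # Bin 1: 4,8,15  Bin 2: 21,21,24  Bin 3: 25,28,34

# Smoothing by bin means: each value is replaced by its bin mean
by_means = np.repeat(bins.mean(axis=1), 3)       # -> 9,9,9, 22,22,22, 29,29,29

# Smoothing by bin boundaries: each value is replaced by the closer boundary
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)
```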
ii. Regression:
- Data can be smoothed by fitting the data to a function, such as with regression.
- Linear regression involves finding the “best” line to fit two variables, so that one variable can be used to predict the other.
- Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
- Using regression to find a mathematical equation to fit the data helps smooth out the noise.
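As an illustrative sketch (with made-up values), noisy measurements of one variable can be smoothed by fitting a least-squares line against another variable:

```python
import numpy as np

# Hypothetical noisy measurements: y depends roughly linearly on x
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.5, size=x.size)

# Fit the "best" straight line y ≈ a*x + b and use it as the smoothed value
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
```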
iii. Clustering:
- Outliers may be detected by clustering, where similar values are organized into groups, or “clusters”. Intuitively, values that fall outside of the set of clusters may be considered outliers.
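One possible sketch of clustering-based outlier detection, assuming scikit-learn is available: DBSCAN groups nearby values into clusters and labels points that fall outside every cluster as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D values with one obvious outlier (95)
values = np.array([4, 5, 6, 21, 22, 23, 40, 41, 42, 95]).reshape(-1, 1)

# Points outside every cluster receive the label -1 and can be treated as outliers
labels = DBSCAN(eps=5, min_samples=2).fit_predict(values)
outliers = values[labels == -1].ravel()          # -> [95]
```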
2) Data Integration:
- Data Integration is a data pre-processing technique that combines data from multiple heterogeneous data sources into a data store and provides a unified view of the data.
- These sources may include multiple data cubes, databases, or flat files.
- Data integration is important because it gives a uniform view of scattered data while also maintaining data accuracy.
- Data integration methods are formally characterized as a triple (G, S, M), where:
- G represents the global schema,
- S represents the heterogeneous source of schema,
- M represents the mapping between source and global schema queries.
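As a minimal, practical illustration (not the formal (G, S, M) framework), two hypothetical sources keyed on a shared customer ID can be combined into a unified view with pandas:

```python
import pandas as pd

# Two hypothetical heterogeneous sources describing the same customers
sales = pd.DataFrame({"cust_id": [1, 2, 3], "total_sales": [250.0, 90.5, 410.0]})
crm = pd.DataFrame({"customer": [1, 2, 4], "segment": ["gold", "silver", "bronze"]})

# Resolve the schema mismatch (cust_id vs. customer), then join into a unified view
crm = crm.rename(columns={"customer": "cust_id"})
unified = sales.merge(crm, on="cust_id", how="outer")
```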
3) Data Reduction:
- Data reduction techniques ensure the integrity of data while reducing the data.
- Data reduction is a process that reduces the volume of original data and represents it in a much smaller volume.
- Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume by maintaining the integrity of the original data.
- Reducing the data improves the efficiency of the data mining process while producing the same (or almost the same) analytical results.
Techniques of Data Reduction
Here are the following techniques or methods of data reduction in data mining, such as:
i. Data Cube Aggregation:
- This technique is used to aggregate data in a simpler form.
- Data Cube Aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
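A small sketch of rolling detailed data up to a higher aggregation level, using hypothetical quarterly sales:

```python
import pandas as pd

# Hypothetical sales at the (year, quarter) level of detail
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 220, 180, 250, 210, 230, 190, 260],
})

# Aggregate from quarterly to yearly totals: a smaller data set at a higher level
yearly = sales.groupby("year", as_index=False)["amount"].sum()
```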
ii. Attribute Subset Selection:
- Attribute subset selection is a technique used for data reduction in the data mining process.
- Data reduction reduces the size of data so that it can be used for analysis purposes more efficiently.
iii. Dimensionality Reduction:
- Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes required for our analysis.
- Dimensionality reduction eliminates the attributes from the data set under consideration, thereby reducing the volume of original data.
- It reduces data size as it eliminates outdated or redundant features.
- Here are three methods of dimensionality reduction.
- Wavelet Transform
- Principal Component Analysis
- Attribute Subset Selection
iv. Numerosity Reduction:
- In this technique, the actual data are replaced with mathematical models or a smaller representation of the data.
- Numerosity reduction reduces the original data volume and represents it in a much smaller form.
- This technique includes two types: parametric and non-parametric numerosity reduction.
v. Discretization & Concept Hierarchy Operation:
- Data discretization techniques are used to divide continuous attributes into intervals.
- We replace many continuous values of the attributes by labels of small intervals.
- This means that mining results are shown in a concise and easily understandable way.
4) Data Transformation:
In data transformation, the data are transformed into forms required for mining.
Data transformation can involve the following:
Smoothing: works to remove noise from the data.
Aggregation: where summary or aggregation operations are applied to the data.
Generalization: where low-level data are replaced by higher-level concepts.
Normalization: where the attribute data are scaled to fall within a small specified range.
Attribute Construction: where new attributes are constructed from the given set of attributes.
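A short sketch of some of these transformations on hypothetical numeric attributes:

```python
import pandas as pd

df = pd.DataFrame({"income": [20000, 35000, 50000, 80000],
                   "spend": [1500, 2200, 2600, 5100]})

# Normalization (min-max): scale income into the range [0, 1]
income_range = df["income"].max() - df["income"].min()
df["income_norm"] = (df["income"] - df["income"].min()) / income_range

# Standardization (z-score): zero mean, unit variance
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Attribute construction: build a new attribute from existing ones
df["spend_ratio"] = df["spend"] / df["income"]
```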
5) Data Discretization:
- Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy.
- In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss.
- There are two forms of data discretization first is supervised discretization, and the second is unsupervised discretization.
- Supervised discretization refers to a method in which the class information is used.
- Unsupervised discretization refers to a method that does not use class information and depends on which way the operation proceeds.
- That is, it works with a top-down splitting strategy or a bottom-up merging strategy.
- We can understand this concept with the help of an example.
- Suppose we have an attribute Age whose continuous values we want to group into intervals.
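Since the table of Age values is not reproduced here, the sketch below uses hypothetical ages and bins them into labelled intervals with pandas:

```python
import pandas as pd

# Hypothetical Age values
ages = pd.Series([5, 9, 14, 17, 22, 31, 36, 44, 46, 63, 70, 78])

# Unsupervised discretization: map continuous ages into labelled intervals
age_groups = pd.cut(ages, bins=[0, 17, 35, 60, 100],
                    labels=["Child", "Young", "Mature", "Old"])
```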
Que 3. Explain the process of data cleaning with the different approaches.
1) Data Cleaning:
- Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
- When combining multiple data sources, there are many opportunities for data to be duplicated or mislabelled.
- If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
Methods for Data Cleaning:
a) Missing Values:
- Imagine that you need to analyze All Electronics sales and customer data.
- You note that many tuples have no recorded value for several attributes such as customer income.
- How can you go about filling in the missing values for this attribute? Let’s look at the following methods.
i. Ignore the tuple:
- This is usually done when the class label is missing.
- This method is not very effective, unless the tuple contains several attributes with missing values.
ii. Fill in the missing value manually:
- In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
iii. Use a global constant to fill in the missing value:
- Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞.
- If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of “Unknown.”
iv. Use the attribute mean to fill in the missing value:
- For example, suppose that the average income of All Electronics customers is $28,000. Use this value to replace the missing value for income.
v. Use the attribute mean for all samples belonging to the same class as the given tuple:
- For example, if customers are classified according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
vi. Use the most probable value to fill in the missing value:
- This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
- For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
b) Noisy Data:
- “What is noise?” Noise is random error or variance in a measured variable.
i. Binning:
- Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.
- The sorted values are distributed into a number of “buckets”, or bins.
- Because binning methods consult the neighborhood of values, they perform local smoothing.
- For example, suppose the sorted data for price (in dollars) are 4, 8, 15, 21, 21, 24, 25, 28, 34. The data are first sorted and then partitioned into equidepth bins of depth 3 (i.e., each bin contains three values).
- In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
- For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
- Therefore, each original value in this bin is replaced by the value 9.
- Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
- In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries.
- Each bin value is then replaced by the closest boundary value.
- In general, the larger the width, the greater the effect of the smoothing.
- Alternatively, bins may be equiwidth, where the interval range of values in each bin is constant.
ii. Regression:
- Data can be smoothed by fitting the data to a function, such as with regression.
- Linear regression involves finding the “best” line to fit two variables, so that one variable can be used to predict the other.
- Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface.
- Using regression to find a mathematical equation to fit the data helps smooth out the noise.
iii. Clustering:
- Outliers may be detected by clustering, where similar values are organized into groups, or “clusters”. Intuitively, values that fall outside of the set of clusters may be considered outliers.
Que 4. Explain the correlation analysis with suitable example.
Correlation analysis
Correlation analysis is a statistical method used to measure the strength of the linear relationship between two variables and compute their association. Correlation analysis calculates the level of change in one variable due to the change in the other. A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.
χ² Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values and B has r distinct values. The χ² value (also known as the Pearson χ² statistic) is computed as
χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij − e_ij)² / e_ij
where o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as
e_ij = (count(A = a_i) × count(B = b_j)) / n
where n is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B.
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product moment coefficient, named after its inventor, Karl Pearson). This is
r(A, B) = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)
where n is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations of A and B. Note that −1 ≤ r(A, B) ≤ +1; values near +1 or −1 indicate a strong relationship, while values near 0 indicate no linear correlation.
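A brief sketch of both measures, assuming NumPy and SciPy are available; the contingency table and numeric values are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Chi-square test for two nominal attributes (hypothetical observed counts o_ij)
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)   # expected holds e_ij

# Pearson correlation coefficient for two numeric attributes
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
r = np.corrcoef(a, b)[0, 1]        # close to +1 -> strong positive correlation
```

A small p-value from the χ² test suggests the two nominal attributes are correlated, and hence one of them may be redundant.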
Que 5. Describe the methods of dimensionality reduction.
Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible. In other words, it is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes required for our analysis. Dimensionality reduction eliminates such attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size as it eliminates outdated or redundant features. Here are three methods of dimensionality reduction.
- Wavelet Transform
- Principal Component Analysis
- Attribute Subset Selection
A. Wavelet Transform
- Decomposes a signal into different frequency subbands
- Applicable to n-dimensional signals
- Data are transformed to preserve relative distance between objects at different levels of resolution
- Allow natural clusters to become more distinguishable
- Used for image compression
- Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
- Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
- Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
- Each transform has 2 functions: smoothing, difference
- Applied to pairs of data, resulting in two sets of data of length L/2
- The two functions are applied recursively until the desired length is reached
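A minimal sketch of the smoothing/difference step, using a simple (unnormalized) Haar transform on a hypothetical length-8 signal:

```python
import numpy as np

def haar_step(signal):
    """One level of the Haar transform: pairwise smoothing (averages) and
    difference coefficients, each of length L/2."""
    pairs = signal.reshape(-1, 2)
    smooth = pairs.mean(axis=1)                  # low-frequency approximation
    detail = (pairs[:, 0] - pairs[:, 1]) / 2.0   # high-frequency detail
    return smooth, detail

# Length must be a power of 2 (pad with zeros otherwise)
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
approx, details = x, []
while approx.size > 1:                           # apply the two functions recursively
    approx, d = haar_step(approx)
    details.append(d)

# For a compressed approximation, store only the strongest coefficients
```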
Why Wavelet Transform?
1. Use hat-shape filters
- Emphasize region where points cluster
- Suppress weaker information in their boundaries
2. Effective removal of outliers
- Insensitive to noise, insensitive to input order
3. Multi-resolution
- Detect arbitrary shaped clusters at different scales
4. Efficient
- Complexity O(N)
5. Limitation: only applicable to low-dimensional data
B. Principal Component Analysis (PCA)
- Find a projection that captures the largest amount of variation in data
- The original data are projected onto a much smaller space, resulting in dimensionality reduction.
- We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.
- Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data.
The basic procedure is as follows:
- Normalize input data: Each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., principal components
- Each input data (vector) is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing “significance” or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
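A minimal PCA sketch with scikit-learn on hypothetical numeric data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 100 samples, 5 correlated attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

# Normalize the input data, then project onto the k = 2 strongest components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)       # 100 x 2 instead of 100 x 5
print(pca.explained_variance_ratio_)          # variance captured by each component
```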
C. Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
- Duplicate much or all of the information contained in one or more other attributes.
- E.g., purchase price of a product and the amount of sales tax paid.
Irrelevant attributes
- Contain no information that is useful for the data mining task at hand.
- E.g., students’ ID is often irrelevant to the task of predicting students’ GPA
Heuristic Search in Attribute Selection
There are 2^n possible attribute combinations of n attributes. Typical heuristic attribute selection methods:
1. Best single attribute under the attribute independence assumption: choose by significance tests
2. Best step-wise feature selection:
- The best single-attribute is picked first
- Then the next best attribute conditioned on the first, and so on
3. Step-wise attribute elimination:
- Repeatedly eliminate the worst attribute
4. Best combined attribute selection and elimination
5. Optimal branch and bound:
- Use attribute elimination and backtracking
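A sketch of best step-wise (forward) selection, assuming scikit-learn and using its built-in iris data purely as an example: at each step the attribute that most improves cross-validated accuracy is added, and the search stops when no candidate helps.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Score each candidate attribute added to the current subset
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Stop when the best candidate no longer improves the current subset
    if selected and best_score <= cross_val_score(model, X[:, selected], y, cv=5).mean():
        break
    selected.append(best)
    remaining.remove(best)
```

scikit-learn also offers SequentialFeatureSelector, which automates this kind of greedy forward or backward search.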
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture the important information in a data set more effectively than the original ones.
Three general methodologies
1. Attribute extraction
- Domain-specific
2. Mapping data to new space (see: data reduction)
- E.g., Fourier transformation, wavelet transformation
3. Attribute construction
- Combining features
- Data discretization
Que 6. List out and explain the methods of numerosity reduction.
Numerosity reduction is a technique used in data mining to reduce the number of data points in a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant data points.
1. Parametric methods (e.g., regression)
- Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
- Ex.: Log-linear models obtain a value at a point in m-D space as the product of appropriate marginal subspaces
2. Non-parametric methods
- Do not assume models
- Major families: histograms, clustering, sampling, etc.
1. Parametric methods
A. Linear regression
- When there is only a single independent attribute, such a regression model is called simple linear regression.
- Data modeled to fit a straight line
- Often uses the least-square method to fit the line
B. Multiple regression
- If there are multiple independent attributes, then such regression models are called multiple linear regression.
- Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
C. Log-linear model
- Log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional attributes. The Log-Linear model discovers the relationship between two or more discrete attributes. Assume we have a set of tuples in n-dimensional space; the log-linear model helps derive each tuple’s probability in this n-dimensional space.
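A minimal sketch of the parametric idea with hypothetical data: fit a regression model and keep only its parameters instead of the raw points.

```python
import numpy as np

# Hypothetical data: 10,000 (x, y) observations following a roughly linear trend
x = np.linspace(0, 100, 10_000)
y = 3.2 * x + 7.0 + np.random.normal(scale=2.0, size=x.size)

# Parametric reduction: store just two model parameters, discard the raw points
slope, intercept = np.polyfit(x, y, deg=1)

# Any value can later be approximated from the stored parameters
y_estimate = slope * 42.0 + intercept
```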
2. Non-parametric methods
- Histograms: A histogram represents data in terms of frequency. It uses binning to approximate the data distribution and is a popular form of data reduction. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
- Clustering: Clustering techniques consider data tuples as objects. They partition the objects into groups or clusters so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. It is commonly defined in terms of how “close” the objects are in space, based on a distance function.
The quality of a cluster can be defined by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, defined as the average distance of each cluster object from the cluster centroid (the “average object,” or average point in space for the cluster).
- Sampling: Sampling can be used as a data reduction technique because it enables a large data set to be represented by a much smaller random sample, or subset, of the data. Sampling reduces a large data set into a smaller sample data set that still represents the original data. There are four types of sampling methods for data reduction:
- Simple random sample without replacement (SRSWOR) of size s
- Simple random sample with replacement (SRSWR) of size s
- Cluster Sample
- Stratified Sample
- Data Cube Aggregation: Data cube aggregation involves moving the data from a detailed level to fewer dimensions. The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
Data Cube Aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction. The data cube is also a much more efficient way of storing data and allows faster aggregation operations.
- Data Compression: Data compression employs modification, encoding, or conversion of the structure of the data in a way that consumes less space. It builds a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression; compression from which the original data cannot be fully restored is called lossy compression.
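A short sketch of two non-parametric reductions on hypothetical data: an equal-width histogram and a simple random sample without replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.gamma(shape=2.0, scale=30.0, size=100_000)   # hypothetical price data

# Histogram: approximate the distribution with 20 equal-width buckets
counts, bin_edges = np.histogram(prices, bins=20)

# Sampling (SRSWOR): represent the data set by a 1% random sample
sample = rng.choice(prices, size=1_000, replace=False)
```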
Also refer : Unit II: Getting to Know Your Data