Data Mining
Que 1. What is data mining? Why is data mining required?
Data Mining :
- Data mining is the process of discovering, or mining, knowledge from large amounts of data. It is also called Knowledge Discovery in Databases (KDD).
- It attempts to extract hidden patterns and trends from large databases, and it supports automatic exploration of data.
- It is also described as exploratory data analysis, data-driven discovery, and deductive learning. It is useful for finding hidden information in a database.
- Example: searching for the personal data that people list in their social media accounts.
- Technically, data mining is the process of finding correlations or patterns in large relational databases.
- Social media companies use data mining techniques to commodify their users (treat them as a product) in order to generate profit.
- Machine learning may be used in data mining to automate many of the operations.
- A large quantity of data can be categorized and collected into numerous categories and classifications with ease using machine learning and artificial intelligence.
- Once the data has been gathered and a trend has been detected, that trend can be put to use.
Why is data mining required?
I. Data Mining Helps Understand Customer Behavior:
- One of the most important benefits of data mining is that it can help organizations understand the past behavior of both their customers and prospects.
- This understanding can provide a wealth of knowledge about what products or services to offer, how to position them, what prices to charge, and more.
II. Data Mining Identifies New Opportunities:
- Data mining can also help identify new opportunities.
- This could include identifying potential new markets to enter, understanding customer needs that are not yet being met, and more.
III. Data Mining Helps Understand What Customers Want:
- Another important benefit of data mining is that it can help organizations understand what their customers want.
- This is very powerful information for businesses to possess because it allows them to create products and services that customers not only need but want as well.
IV. Data Mining Creates Differentiating Products And Services:
- One of the best benefits of data mining is that it can help create products and services that are differentiated from those offered by competitors.
- This gives companies a huge advantage, as they can provide new or better product offerings than their competitors do.
V. Data Mining Reveals Market Trends:
- Other important benefits of data mining include the ability to reveal market trends (for example, what types of new products or services customers might soon be looking for) and to uncover future demand (for example, by identifying geographic areas where your business should expand).
Que 2. Explain the steps of Knowledge Discovery from Data (KDD) process with neat diagram.
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative: the steps below usually have to be repeated several times to extract accurate knowledge from the data.
I. Data Cleaning:
- Data cleaning is the process of removing noise and inconsistent data.
- It also handles missing values, for example by filling them in or dropping the affected records.
- Generally, data cleaning reduces errors and improves data quality.
- Data cleaning is a key preprocessing step for data mining.
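The cleaning step can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the valid age range of 0 to 120, and the choice of filling missing values with the mean are all hypothetical, and real projects may prefer other strategies.

```python
from statistics import mean

# Hypothetical raw records: None marks a missing value, and an age of -5 is noise.
raw_records = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": -5},
    {"name": "D", "age": 35},
]

def clean(records, low=0, high=120):
    """Drop out-of-range (noisy) ages, then fill missing ages with the mean."""
    # Keep rows whose age is missing or inside the assumed valid range.
    kept = [r for r in records if r["age"] is None or low <= r["age"] <= high]
    known = [r["age"] for r in kept if r["age"] is not None]
    fill = mean(known)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in kept]

cleaned = clean(raw_records)
```

Here record C is dropped as noise and record B receives the mean of the remaining known ages.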
II. Data Integration:
- Data integration is the process of combining multiple data sources.
- Heterogeneous data from multiple sources is combined into a common store (a data warehouse).
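A minimal sketch of integration, assuming two hypothetical sources (crm_rows and billing_rows) that describe the same customers under different field names. The field names and values are invented for illustration.

```python
# Two hypothetical sources describing the same customers under different field names.
crm_rows = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ravi"},
]
billing_rows = [
    {"customer": 1, "total_spent": 250.0},
    {"customer": 2, "total_spent": 90.5},
]

def integrate(crm, billing):
    """Combine both sources into one record per customer, keyed by customer id."""
    spent_by_id = {row["customer"]: row["total_spent"] for row in billing}
    return [
        {
            "cust_id": row["cust_id"],
            "name": row["name"],
            # None when the billing source has no row for this customer.
            "total_spent": spent_by_id.get(row["cust_id"]),
        }
        for row in crm
    ]

warehouse = integrate(crm_rows, billing_rows)
```

The key design point is agreeing on one identifier (here the customer id) so that records from different sources can be matched.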
III. Data Selection:
- Data selection is the process where the data relevant to the analysis is identified and retrieved from the data collection.
- The primary objective of data selection is determining appropriate data type, source, and instrument that allow investigators to answer research questions properly.
IV. Data Transformation:
- Data transformation is a technique used to convert the raw data into a suitable format.
- Data transformation may draw on data cleaning and data reduction techniques to bring the data into the appropriate form.
- Data transformation is an essential data preprocessing technique that must be performed on the data before data mining to provide patterns that are easier to understand.
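One common transformation is min-max normalization, which rescales a numeric attribute into the range 0 to 1. The sketch below is an assumption about which transformation to show (the notes do not name a specific one), and the income values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000]  # hypothetical raw attribute values
normalized = min_max_normalize(incomes)  # [0.0, 0.5, 1.0]
```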
V. Data Mining:
- It is an essential process where intelligent methods are applied in order to extract data patterns.
- Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems.
VI. Pattern Evaluation:
- Pattern evaluation is the process of assessing the quality of discovered patterns.
- This process is important in order to determine whether the patterns are useful and whether they can be trusted.
- There are a number of different measures that can be used to evaluate patterns, and the choice of measure will depend on the application.
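Two widely used measures for evaluating association rules are support and confidence. The sketch below is illustrative only: the transactions are hypothetical, and other applications would use other measures.

```python
# Hypothetical market-basket transactions, each a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"milk", "bread"}, transactions)          # 2 of 4 baskets
rule_confidence = confidence({"milk"}, {"bread"}, transactions)  # 2 of 3 milk baskets
```

A rule with low support or low confidence would usually be judged less trustworthy than one where both are high.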
VII. Knowledge Representation:
- Knowledge representation is the presentation of the mined knowledge to the user in visual form, such as trees, tables, rules, graphs, charts, and matrices.
- Examples: histogram, pie chart.
- This step is also used to generate reports and tables.
Que 3. Explain the different technologies used for data mining.
Data mining has incorporated many techniques from other fields, such as machine learning, statistics, information retrieval, data warehousing, pattern recognition, algorithms, and high-performance computing. Because it is a highly application-driven domain, its interdisciplinary nature is very significant, and research and development in these fields contributes directly to the success of data mining. The major technologies used in data mining are described below.
I. Statistics:
- Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
- Data mining has an inherent connection with statistics.
- Statistics is a component of data mining that provides the tools and analytics techniques for dealing with large amounts of data.
- Statistics is useful for mining various patterns from data as well as for understanding the mechanisms which are generating and affecting the patterns.
II. Machine Learning:
- Machine learning investigates how computers can learn or improve their performance based on data.
- Data mining uses techniques developed by machine learning for predicting the outcome.
- For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples.
- Supervised learning is basically a synonym for classification.
- Unsupervised learning is essentially a synonym for clustering.
- Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model.
- Active learning is a machine learning approach that lets users play an active role in the learning process.
III. Database Systems and Data Warehouse:
- Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users.
- A database system is the traditional way of storing and retrieving data.
- The major task of a database system is query processing.
- A data warehouse is a repository where a huge amount of data is stored.
IV. Information Retrieval:
- Information retrieval is the process of searching for documents or information in the documents.
- Documents can be text, multimedia, or many other formats, and may be stored on the web.
- In information retrieval, the data being searched is unstructured.
- The queries are formed mainly by keywords, which do not have complex structure (unlike SQL queries in database systems).
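Keyword search over unstructured text can be sketched with an inverted index, a structure that maps each word to the documents containing it. The documents, the whitespace tokenization, and the rule that a match must contain all keywords are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical unstructured documents, identified by an id.
documents = {
    1: "data mining finds patterns in data",
    2: "information retrieval searches documents",
    3: "mining text documents for patterns",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain all keywords in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    result = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        result &= index.get(word, set())
    return result

index = build_index(documents)
hits = search(index, "mining patterns")  # {1, 3}
```

Unlike an SQL query, the query here is just a list of words with no structure beyond the words themselves.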
V. Visualization:
- Visualization of data mining results is the presentation of the results or knowledge obtained from data mining in visual forms.
- Data visualization is the graphical representation of information and data in a pictorial or graphical format (Example: charts, graphs, and maps).
- Data visualization tools provide an accessible way to see and understand trends, patterns in data, and outliers.
- Data visualization tools and technologies are essential to analyzing massive amounts of information and making data-driven decisions.
VI. Pattern Recognition:
- Patterns are everywhere in the digital world.
- A pattern can either be seen physically or it can be observed mathematically by applying algorithms.
- Pattern recognition is the process of recognizing patterns by using a machine learning algorithm.
- Pattern recognition can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation.
Que 4. What are the major issues in data mining? Explain.
Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data generated by individuals, organizations, and machines has grown exponentially. However, data mining is not without its challenges.
In this answer, we will explore some of the main challenges of data mining.
I. Mining Methodology and User Interaction:
- These challenges are related to data mining approaches and their limitations.
- Professionals with hands-on experience in this domain struggle with these issues when using data mining methods.
II. Mining different kinds of knowledge in databases:
- Different users may be interested in different kinds of knowledge.
- Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
- These tasks may use the same database in different ways and require the development of numerous data mining techniques.
III. Interactive mining of knowledge at multiple levels of abstraction:
- The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
IV. Incorporation of background knowledge:
- Background knowledge can be used to guide the discovery process and to express the discovered patterns.
- Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
V. Handling noisy or incomplete data:
- Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities.
- Without data cleaning methods, the accuracy of the discovered patterns will be poor.
VI. Pattern evaluation:
- The patterns discovered are not always interesting: some represent common knowledge and others lack novelty, so the patterns need to be evaluated for interestingness.
VII. Performance Issues:
- The performance of a data mining system depends on the efficiency of the algorithms and techniques it uses.
- Poorly designed algorithms and techniques degrade the performance of the data mining process.
VIII. Efficiency and Scalability of the Algorithms:
- The data mining algorithm must be efficient and scalable to extract information from huge amounts of data in the database.
IX. Improvement of Mining Algorithms:
- Factors such as the enormous size of databases, the wide distribution of data, and the complexity of data mining methods motivate the creation of parallel and distributed data mining algorithms.
X. Diverse Data Types Issues:
- Handling of relational and complex types of data:
- The database may contain complex data objects, multimedia data objects, spatial data, temporal data etc.
- It is not possible for one system to mine all these kinds of data.
XI. Mining information from heterogeneous databases and global information systems:
- The data is available at different data sources on LAN or WAN.
- These data sources may be structured, semi-structured, or unstructured.
- Therefore mining the knowledge from them adds challenges to data mining.
Que 5. What kind of data can be mined in data mining?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application.
The most basic forms of data for mining applications are:
- Database data
- Data warehouse data
- Transactional data.
Data mining can also be applied to other forms of data (e.g., data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW).
I. Database data:
- A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
- The software programs provide mechanisms for defining database structures and data storage.
- A relational database is a collection of tables, each of which is assigned a unique name.
- Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
- Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
II. Data warehouse data:
- A data warehouse is a centralized repository that stores structured data (database tables, Excel sheets) and semi-structured data (XML files, webpages) for the purposes of reporting and analysis.
- For decision making, the data in a data warehouse are organized around major subjects (e.g., customer, item, supplier, and activity).
- The data are stored to provide information from a historical perspective, such as in the past 6 to 12 months, and are typically summarized.
- For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions.
- A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension represents an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales amount).
- A data cube provides a multidimensional view of data and allows the pre-computation and fast access of summarized data.
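The idea of a data cube cell holding an aggregate such as sum(sales amount) can be sketched by grouping detailed facts on chosen dimensions. The sales facts, the three dimensions, and the aggregate function name are hypothetical; a real warehouse would precompute and store such summaries rather than scan the facts each time.

```python
from collections import defaultdict

# Hypothetical detailed sales facts: (item, branch, quarter, sales_amount).
sales = [
    ("computer", "Pune", "Q1", 1000),
    ("computer", "Pune", "Q2", 1500),
    ("printer", "Pune", "Q1", 300),
    ("computer", "Mumbai", "Q1", 700),
]

def aggregate(facts, dims):
    """Sum sales_amount grouped by dimension positions (0=item, 1=branch, 2=quarter)."""
    cells = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        cells[key] += fact[3]
    return dict(cells)

by_item_branch = aggregate(sales, dims=(0, 1))  # one 2-D view of the cube
by_item = aggregate(sales, dims=(0,))           # summarized over branch and quarter
```

Each key in the result is one cell of the cube, and its value is the aggregate measure for that combination of dimension values.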
III. Transactional Data:
- In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
- A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
- A transactional database may have additional tables, which contain other information related to the transactions, such as item description, information about the salesperson or the branch, and so on.
IV. Other kinds of Data:
- Besides relational database data, data warehouse data, and transaction data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings.
- Such kinds of data can be seen in many applications: time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data), data streams (e.g., video surveillance and sensor data, which are continuously transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), graph and networked data (e.g., social and information networks), and the Web (a huge, widely distributed information repository made available by the Internet).
- These applications bring about new challenges, like how to handle data carrying special structures (e.g., sequences, trees, graphs, and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity), and how to mine patterns that carry rich structures and semantics.
Que 6. What Kinds of Patterns can be mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
- Data mining tasks can be classified into two categories: descriptive and predictive.
- Descriptive mining tasks characterize properties of the data in a target data set.
- Predictive mining tasks perform induction on the current data in order to make predictions.
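The difference between the two categories can be illustrated on one small data set. In this sketch the monthly sales figures are hypothetical, the descriptive task is a simple average, and the predictive task is a least-squares line; both choices are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical monthly sales: month number and units sold.
months = [1, 2, 3, 4]
units = [2.0, 4.0, 6.0, 8.0]

# Descriptive task: summarize a property of the existing data.
average_units = mean(units)

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Predictive task: induce a model from the current data and predict an unseen month.
slope, intercept = fit_line(months, units)
predicted_month_5 = slope * 5 + intercept
```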
I. Concept/Class Description: Characterization and Discrimination:
- Data can be associated with classes or concepts.
- For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
- Data characterization is a summarization of the general characteristics or features of a target class of data.
- Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of comparative classes.
II. Mining Frequent Patterns, Associations, and Correlations:
- Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
- A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as Computer and Software.
- For example, milk and bread are frequently bought together in grocery stores by many customers.
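Finding frequent itemsets of size two can be sketched by counting every pair of items per transaction. This is a brute-force illustration, not an optimized algorithm such as Apriori, and the baskets and the min_count threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "bread", "butter"],
]

def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least `min_count` transactions."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(baskets, min_count=2)
```

With these baskets, bread and milk appear together three times and bread and eggs twice, so both pairs pass the threshold.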
III. Classification and Regression for Predictive Analysis:
- Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
- The model is derived based on the analysis of a set of training data (i.e., data objects whose class label is known).
- The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
- A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
- A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
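The decision tree structure described above can be sketched as a nested dictionary. This is only an illustration: the tree is written by hand rather than learned from training data, and the attributes income and owns_card are hypothetical.

```python
# A hand-built (hypothetical) decision tree. Internal nodes test one attribute,
# branches are the outcomes of the test, and leaves are class labels.
tree = {
    "attribute": "income",
    "branches": {
        "low": "budgetSpender",
        "high": {
            "attribute": "owns_card",
            "branches": {"yes": "bigSpender", "no": "budgetSpender"},
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch that matches each test."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

label = classify(tree, {"income": "high", "owns_card": "yes"})  # "bigSpender"
```

The same tree can be read as IF-THEN rules, for example: IF income is high AND owns_card is yes THEN class is bigSpender.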
IV. Cluster Analysis:
- Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels.
- In many cases, class-labeled data may simply not exist at the beginning. Clustering can be used to generate class labels for a group of data.
- The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
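Grouping unlabeled objects by similarity can be sketched with a tiny k-means procedure on one-dimensional points. K-means is one clustering method among many and is chosen here only as an assumption; the points and the initial centers are hypothetical.

```python
def k_means_1d(points, centers, iterations=10):
    """Tiny k-means on 1-D points, starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable, so stop early
        centers = new_centers
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical unlabeled values
centers, clusters = k_means_1d(points, centers=[1.0, 12.0])
```

No class labels are used anywhere: the two groups emerge only from the distances between the points, and the cluster index could then serve as a generated class label.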
V. Outlier Analysis:
- A data set may contain objects that do not comply with the general behavior or model of the data.
- These data objects are outliers.
- Many data mining methods discard outliers as noise or exceptions.
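A simple way to flag objects that do not comply with the general behavior of the data is to measure how many standard deviations each value lies from the mean. This is a sketch under assumptions: the threshold of 2.0 and the transaction amounts are hypothetical, and this approach suits roughly bell-shaped numeric data only.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [10, 12, 11, 13, 12, 11, 100]  # hypothetical transaction amounts
outliers = find_outliers(amounts)  # [100]
```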