Getting to Know Your Data
Que 1. What is an attribute? Explain the different types of attributes.
Attribute :
An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute, and we do here as well. Attributes describing a customer object can include, for example, customer ID, name, and address.
There are four types of attributes:
- Nominal Attributes
- Binary Attributes
- Ordinal Attributes
- Numeric Attributes
1. Nominal Attributes:
- Nominal means “relating to names.”
- The values of a nominal attribute are symbols or names of things.
- Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.
- The values do not have any meaningful order.
- In computer science, the values are also known as enumerations.
Example: Nominal attributes.
- Suppose that hair_color and marital_status are two attributes describing person objects.
- In our application, possible values for hair_color are black, brown, blond, red, auburn, gray and white.
- The attribute marital_status can take on the values single, married, divorced, and widowed.
- Both hair_color and marital_status are nominal attributes.
2. Binary Attributes:
- A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
- Binary attributes are referred to as Boolean if the two states correspond to true and false.
- A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1.
- One such example could be the attribute gender having the states male and female.
Example: Binary attributes.
- Given the attribute smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the patient does not.
- Similarly, suppose the patient undergoes a medical test that has two possible outcomes.
- The attribute medical_test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
3. Ordinal Attributes:
- An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
- The ordinal attribute value provides sufficient information to order the objects.
Example:
- Suppose that drink_size corresponds to the size of drinks available at a fast-food restaurant.
- This ordinal attribute has three possible values: small, medium, and large.
- The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
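As a minimal sketch of the point above, the ordering of an ordinal attribute can be captured by mapping each value to a rank. The names DRINK_SIZES, RANK, and sort_by_size, and the sample orders, are hypothetical and chosen only for illustration. The ranks support comparison and sorting, but the gaps between them say nothing about how much bigger one size is than another.

```python
# Ordinal values for the drink_size attribute, listed from smallest to largest.
DRINK_SIZES = ["small", "medium", "large"]

# Map each value to its rank (its position in the order). The ranks encode
# order only; the distance between two ranks has no meaning.
RANK = {size: position for position, size in enumerate(DRINK_SIZES)}


def sort_by_size(orders):
    """Return the drink orders sorted from smallest to largest size."""
    return sorted(orders, key=lambda size: RANK[size])


orders = ["large", "small", "medium", "small"]  # hypothetical sample data
print(sort_by_size(orders))
print(RANK["medium"] < RANK["large"])  # ordering comparisons are meaningful
```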
4. Numeric Attributes:
- A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
- Numeric attributes can be interval-scaled or ratio-scaled.
i) Interval-Scaled Attributes:
- Interval-scaled attributes are measured on a scale of equal-size units.
- The values of interval-scaled attributes have order and can be positive, 0, or negative.
- Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
- Example: A temperature attribute is interval-scaled.
- Suppose that we have the outdoor temperature value for a number of different days, where each day is an object.
- By ordering the values, we obtain a ranking of the objects with respect to temperature.
- In addition, we can quantify the difference between values.
- For example, a temperature of 20 °C is five degrees higher than a temperature of 15 °C.
ii) Ratio-Scaled Attributes:
- A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
- That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
- In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
Example:
- Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15 °C).
- It is the point at which the particles have zero kinetic energy.
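The difference between the two scales can be illustrated with a short sketch. The temperatures 20 °C and 10 °C are hypothetical sample values. Differences are meaningful on the interval-scaled Celsius values, but a ratio only becomes meaningful after converting to the ratio-scaled Kelvin values.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (0 K = -273.15 degrees C)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0  # hypothetical sample temperatures in Celsius

# Interval scale: the difference is meaningful.
difference = warm_c - cool_c

# Dividing the Celsius values gives 2.0, but 20 degrees C is not "twice as warm"
# as 10 degrees C, because 0 degrees C is not a true zero-point.
naive_ratio = warm_c / cool_c

# Ratio scale: in Kelvin the ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference)
print(naive_ratio)
print(round(kelvin_ratio, 3))
```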
5. Discrete versus Continuous Attributes:
- We have organized attributes into nominal, binary, ordinal, and numeric types.
- There are many ways to organize attribute types.
- The types are not mutually exclusive.
- Classification algorithms developed from the field of machine learning often talk of attributes as being either discrete or continuous.
- Each type may be processed differently.
- A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
- The attributes hair_color, smoker, medical_test, and drink_size each have a finite number of values, and so are discrete.
Que 2. Explain scatterplot and data correlation with suitable examples.
Scatterplot :
- A scatterplot is a type of data display that shows the relationship between two numerical variables.
- Each member of the dataset is plotted as a point whose (x, y) coordinates correspond to its values for the two variables.
- For example, consider a scatterplot that shows the shoe sizes and quiz scores for students in a class.
- Each data point is a student whose x-coordinate gives their shoe size and whose y-coordinate gives their quiz score.
Data correlation :
- We often see patterns or relationships in scatterplots.
- When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables.
- When the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables.
- When there is no clear relationship between the two variables, we say there is no correlation between the two variables.
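One common way to put a number on these three cases is the Pearson correlation coefficient, which is not covered in the text above and is used here only as an illustration. The sketch below assumes small hypothetical data sets; values of r near +1 suggest positive correlation, values near −1 negative correlation, and values near 0 no linear correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
scores_rising = [52, 58, 65, 70, 79]   # y tends to increase with x
scores_falling = [79, 70, 65, 58, 52]  # y tends to decrease with x
shoe_sizes = [7, 8, 9, 10, 11]
quiz_scores = [80, 60, 90, 60, 80]     # no clear relationship with shoe size

print(round(pearson_r(hours_studied, scores_rising), 3))
print(round(pearson_r(hours_studied, scores_falling), 3))
print(round(pearson_r(shoe_sizes, quiz_scores), 3))
```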
Que 3. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8).
Compute : a. Euclidean distance between two objects.
b. Manhattan distance between two objects.
Tuples (22, 1, 42, 10) and (20, 0, 36, 8)
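a) Euclidean distance formula:
d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2)
d = sqrt((22 − 20)^2 + (1 − 0)^2 + (42 − 36)^2 + (10 − 8)^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71
b) Manhattan distance formula:
d(x, y) = |x1 − y1| + |x2 − y2| + ... + |xn − yn|
d = |22 − 20| + |1 − 0| + |42 − 36| + |10 − 8| = 2 + 1 + 6 + 2 = 11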
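The two distances asked for in this question can be checked with a short sketch. The function names euclidean_distance and manhattan_distance are chosen here for illustration; the tuples are the ones given in the question.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p = (22, 1, 42, 10)
q = (20, 0, 36, 8)

print(round(euclidean_distance(p, q), 3))  # sqrt(45), about 6.708
print(manhattan_distance(p, q))            # 2 + 1 + 6 + 2 = 11
```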
Que 4. What is data visualization? List the various visualization techniques and explain any three.
Data Visualization :
- Data visualization is a graphical representation of quantitative information and data by using visual elements like graphs, charts, and maps.
- Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.
- Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.
- In the world of Big Data, the data visualization tools and technologies are required to analyze vast amounts of information.
- Data visualizations are common in everyday life, where they most often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.
Visualization techniques :
a) Pixel-oriented visualization techniques:
- A simple way to visualize the value of a dimension is to use a pixel where the color of the pixel reflects the dimension’s value.
- For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one for each dimension.
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
- The colors of the pixels reflect the corresponding values.
- Inside a window, the data values are arranged in some global order shared by all windows.
Example:
- AllElectronics maintains a customer information table, which consists of four dimensions: income, credit_limit, transaction_volume, and age.
- We analyze the correlation between income and other attributes by visualization.
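The mapping described above can be sketched without any plotting library. The customer records below are hypothetical, and two choices are assumptions made only for illustration: the shared global order is ascending income, and each value is scaled to a gray level from 0 (smallest) to 255 (largest). Each resulting list stands for the pixels of one window.

```python
# Hypothetical records: (income, credit_limit, transaction_volume, age).
DIMENSIONS = ["income", "credit_limit", "transaction_volume", "age"]
customers = [
    (52000, 8000, 120, 41),
    (18000, 1500, 30, 23),
    (95000, 20000, 60, 58),
    (33000, 4000, 200, 35),
]


def to_gray_levels(values):
    """Scale values linearly to 0..255 (0 = smallest, 255 = largest)."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    return [round(255 * (value - low) / (high - low)) for value in values]


def pixel_windows(records, order_by=0):
    """Build one list of pixel intensities per dimension.

    All windows share the same global order: records sorted by the
    dimension at index order_by (income by default, an assumption).
    """
    ordered = sorted(records, key=lambda record: record[order_by])
    columns = zip(*ordered)  # one tuple of values per dimension
    return {
        name: to_gray_levels(list(column))
        for name, column in zip(DIMENSIONS, columns)
    }


windows = pixel_windows(customers)
for name, pixels in windows.items():
    print(name, pixels)
```

Comparing the income window with another window at the same positions then hints at how the two dimensions vary together.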
b) Geometric Projection visualization techniques:
- A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space.
- Geometric projection techniques help users find interesting projections of multidimensional data sets.
- A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be added by using different colors or shapes to represent different data points.
Example:
- Suppose x and y are two spatial attributes and the third dimension is represented by different shapes.
- Through this visualization, we can see that points of types “+” and “X” tend to be colocated.
c) Icon-based visualization techniques:
- These techniques use small icons to represent multidimensional data values.
- Two popular icon-based techniques are:
i) Chernoff faces:
- Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
- They display multidimensional data of up to 18 variables as a cartoon human face.
ii) Stick figures:
- This technique maps multidimensional data to a five-piece stick figure, where each figure has four limbs and a body.
- Two dimensions are mapped to the display axes and the remaining dimensions are mapped to the angle and/or length of the limbs.
d) Hierarchical Visualization:
- For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
- Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces).
- The subspaces are visualized in a hierarchical manner.
- “Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical visualization method.
- Suppose we want to visualize a 6-D data set, where the dimensions are F, X1, X2, X3, X4, X5.
- We want to observe how F changes w.r.t. the other dimensions.
- We can first fix the X3, X4, X5 dimensions to selected values and then visualize the changes to F w.r.t. X1 and X2.
e) Visualizing Complex Data and Relations:
- Early visualization techniques were designed mainly for numeric data.
- Recently, more and more non-numeric data, such as text and social networks, have become available.
- Many people on the Web tag various objects such as pictures, blog entries, and product reviews.
- A tag cloud is a visualization of statistics of user-generated tags.
- Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
- The importance of a tag is indicated by font size or color.
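A minimal sketch of the tag cloud idea, under stated assumptions: importance is taken to be the tag's frequency, font size is scaled linearly between hypothetical bounds min_size and max_size, and tags are listed alphabetically. The function name tag_cloud and the sample tags are invented for illustration.

```python
from collections import Counter


def tag_cloud(tags, min_size=10, max_size=30):
    """Return (tag, count, font_size) tuples in alphabetical order.

    Font size grows linearly with the tag's frequency, from min_size for
    the rarest tag to max_size for the most frequent one.
    """
    counts = Counter(tags)
    low, high = min(counts.values()), max(counts.values())
    cloud = []
    for tag in sorted(counts):
        if high == low:
            size = max_size  # all tags are equally frequent
        else:
            share = (counts[tag] - low) / (high - low)
            size = min_size + (max_size - min_size) * share
        cloud.append((tag, counts[tag], round(size)))
    return cloud


# Hypothetical user-generated tags.
tags = ["python", "data", "python", "mining", "data", "python", "charts"]
for tag, count, font_size in tag_cloud(tags):
    print(tag, count, font_size)
```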
Que 5. Consider that the following data for analysis includes the attribute age where,
Age values are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Calculate the mean, median, midrange and mode of the given data.
Mean = (sum of all values)/(number of values) = 809/27 = 29.96 ≈ 30
Median: The data is already arranged in ascending order and contains 27 elements, so the median is the 14th element, i.e.
Median=25
Mode: The values occurring most frequently are 25 and 35 (four times each).
Therefore the data is bimodal.
Mode=25 and 35
Midrange: The end values are 13 and 70
Therefore Midrange=(13+70)/2
Midrange=41.5
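The four results above can be verified with Python's standard statistics module. This is only a checking sketch; the variable names are chosen here for illustration, and the age values are the ones given in the question.

```python
from statistics import mean, median, multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

age_mean = mean(ages)                      # sum of values / number of values
age_median = median(ages)                  # middle (14th) value of the 27
age_modes = multimode(ages)                # every value with the highest count
age_midrange = (min(ages) + max(ages)) / 2

print(round(age_mean, 2))  # 29.96
print(age_median)          # 25
print(age_modes)           # [25, 35]
print(age_midrange)        # 41.5
```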
Que 6. Analyze the following Data set grouped into intervals and Evaluate Mean Value for the given observation.