Unit V: Data Cube Technology
Que 1. Explain Data Cube Computation: Preliminary Concepts
Explain Cube Materialization: 1. Full Cube 2. Iceberg cube 3. Closed Cube 4. Cube Shell.
Data Cube :
- When data is grouped or combined in multidimensional matrices called Data Cubes. The data cube method has a few alternative names or a few variants, such as “Multidimensional databases,” “materialized views,” and “OLAP (On-Line Analytical Processing).”
- The general idea of this approach is to materialize certain expensive computations that are frequently inquired.
Data Cube Computation :
- Data cube computation is an essential task in data warehouse implementation.
- The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing.
Why data cube computation is needed?
- To retrieve the information from the data cube in the most efficient way possible.
- Queries run on the cube will be fast.
Cube Materialization (pre-computation):
Different Data Cube materialization include.
- Full cube
- Iceberg cube
- Closed cube
- Shell cube
1. The Full cube:
- The multi way array aggregation method computes full data cube by using a multidimensional array as its basic data structure
- Partition array into the chunks
- Compute aggregate by visiting (i.e. accessing the values at) cube cells.
Advantage:
- The queries run on the cube will be very fast.
Disadvantage:
- Pre-computed cube requires a lot of memory.
2. An Iceberg-Cube:
- Contains only those cells of the data cube that meet an aggregate condition.
- It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg.
- The purpose of the Iceberg-Cube is to identify and compute only those values that will most likely be required for decision support queries.
- The aggregate condition specifies which cube values are more meaningful and should therefore be stored.
- This is one solution to the problem of computing versus storing data cubes.
Advantage:
- Pre-compute only those cells in the cube which will most likely be used for decision
- support queries.
3. A Closed Cube:
- A closed cube is a data cube consisting of only closed cells.
4. Shell Cube:
- We can choose to pre-compute only portions or fragments of the cube shell, based on cuboids of interest.
Que 2. What are the general optimization techniques for efficient computation of data cubes.
Data Cube Computation:
- Data cube computation is an essential task in data warehouse implementation.
- The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing.
- However it may require large computational time and storage space.
1. Multiway Array Aggregation for full Cube Computation:
- The multiway array aggregation or simply Multiway method computes a full data cube by using a multidimensional array as its basic data structure.
- It is a typical MOLAP approach that uses direct array addressing.
- In multiway array aggregation method value are accessed through index of their corresponding array locations.
- Hence, Multiway cannot perform any value-based reordering.
- A different approach is developed for the array-based cube construction, as follows:
1. Partition the array into chunks.
A chunk is a sub-cube that is fit into the memory available for cube computation. Chunking is a method for dividing an A-dimension array into small n-dimensional chunks, where each chunk is stored as an object on a disk. The chunks are compressed so as to remove wasted space resulting from empty array cells. For instance, “chunk ID + offset” can be used as a cell addressing mechanism to compress a sparse array structure and when searching for cells within a chunk. Such a compression technique is powerful enough to handle sparse cubes, both on disk and in memory.
2. Compute aggregates by visiting (i.e., accessing the values at) cube cells.
The order in which cells are visited can be optimized so as to minimize the number of times that each cell must be revisited, thereby reducing memory access and storage costs. The trick is to exploit this ordering so that partial aggregates can be computed simultaneously, and any unnecessary revisiting of cells is avoided. Because this chunking technique involves “overlapping” some of the aggregation computations, it is referred to as multiway array aggregation. It performs simultaneous aggregation, that is, it computes aggregations simultaneously on multiple dimensions.
2. BUC: Computing Iceberg Cubes from the Apex Cuboid Downward:
- BUC is an algorithm for the computation of sparse and iceberg cubes.
- Unlike Multiway, BUC constructs the cube from the apex cuboid toward the base cuboid.
- This allows BUC to share data partitioning costs.
- This processing order also allows BUC to prune during construction, using the Apriori property.
- Figure shows a lattice of cuboids, making up a 3-D data cube with the dimensions A, B, and C.
- The apex (0-D) cuboid, representing the concept all (i.e., (∗, ∗, ∗)), is at the top of the lattice.
- This is the most aggregated or generalized level.
Algorithm: BUC algorithm for the computation of sparse and iceberg cubes.
BUC Algorithm Explanation:
- Initially, the algorithm takes relation (set of tuples) as input and aggregates the entire input, and writes the resulting total.
- Divide dimensions into partitions: for each dimension, the input is partitioned on dimensions and count the total no. of tuples for each distinct value of the dimension.
- Each distinct value of dimension from its own partition.
- Test the partition for minimum support to perform iceberg pruning:
3. Explain Computing Iceberg Cubes Using a Dynamic Star-Tree Structure:
- Star-Cubing combines the strengths of the other methods.
- Star-Cubing integrates top-down and bottom-up cube computation and explores both Multidimensional aggregation (similar to Multiway) and Apriori-like pruning (similar to BUC).
- It operates from a data structure called a star-tree, which perform lossless data compression, thereby reducing the computation time and memory requirements.
- The star cubing algorithm explores both the bottom-up and top-down computation model.
- In the dynamic star-tree structure we start the aggregation process by traversing in a bottom-up fashion.
- Traversal is a depth-first traversal.
- If we were to follow only the bottom-up model (similar to MultiWay), then the cuboids marked as pruned by Star-Cubing would still be explored.
4. Describe Precomputing Shell Fragments for Fast High-Dimensional OLAP.
- Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses.
- However a full data cube of high dimensionality needs massive storage space and unrealistic computation time.
- Iceberg cubes provide a more feasible alternative, wherein the iceberg condition is used to specify the computation of only the subset of the full cubes cells.
- However an Iceberg cube is smaller and requires less computation time than its corresponding full cube, it is not an ultimate solution.
- For one, the computation and storage of the iceberg cube can still be costly.
- For example, if the base cuboid cell, (a1, a2, …., a60), passes minimum support(or the iceberg threshold), it will generate 260 iceberg cube cells.
- Second, it is difficult to determine an appropriate iceberg threshold.
- Setting the threshold too low will result in a huge cube, whereas setting the threshold too high may invalidate many useful applications.
- Third an iceberg cube cannot be incrementally updated.
Que 3. Explain OLAP-Based Mining on Sampling Data (Sampling Cubes).
Sampling Cubes:
- Sampling cubes technology can be used to answer queries on sample data, such as survey data, which represent a sample or subset of a target data population of interest.
- When collecting data, we often collect only a subset of the data.
- In statistics, this is known as collecting a sample of the data population.
- The resulting data are called sample data.
- Statistical surveys based on sampling are a common tool in many fields like politics, healthcare, market research and social and natural sciences.
- OLAP traditionally contain the full data population having sample data
Also read : Unit IV: Data Warehousing and Online Analytical Processing