Data Warehousing and Online Analytical Processing
Que 1. Describe the architecture of data warehouse with proper diagram.
Data Warehouse:
- A Data Warehouse is separate from DBMS, it stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMS, etc.
- A Data Warehouse is structured by data integration from multiple heterogeneous sources.
- It is a system used for data analysis and reporting.
- A data warehouse is a core factor of business intelligence.
- Data Warehouse is referred to the data repository that is maintained separately from the organization’s operational data.
- Multi-Tier Data Warehouse Architecture consists of the following components:
- o Bottom Tier
- o Middle Tier
- o Top Tier
Bottom Tier (Data sources and data storage):
- The bottom Tier usually consists of Data Sources and Data Storage.
- It is a warehouse database server. For Example RDBMS.
- In Bottom Tier, using the application program interface(called gateways), data is extracted from operational and external sources.
- Application Program Interface likes ODBC (Open Database Connection), OLE-DB (Open Linking and Embedding for Database), JDBC (Java Database Connection) is supported.
Middle Tier:
- A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
- The OLAP server is implemented using either
- a) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on multidimensional data to standard relational operations.
- b) A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server that directly implements multidimensional information and operations.
Top Tier:
- The top tier is a front-end client layer, which includes query and reporting tools, analysis tools, and/or data mining tools (eg, trend analysis, prediction, etc.).
- Since the data marts are created from the data warehouse, provides consistent dimensional view of data marts.
- Also, this model is considered as the strongest model for business changes. That’s why, big organisations prefer to follow this approach.
Que 2. Differences between Operational Database Systems and Data Warehouse.
Operational Database Systems:
- An operational database is the database that is accessed and updates on a continual basis and usually handles the daily transactions for a business.
- Operational databases use an Online Transactional Processing (OLTP) approach and are designed to be write-optimized.
Data Warehouse:
- A data warehouse is a repository of an organizations electronically stored data.
- Data warehouses are designed to typically answer “who” and “what” types of questions.
- Data warehouses use an Online Analytical Processing (OLAP) approach and perform in conjunction with the operational database on a constant basis.
Sr.no. | OLAP | OLTP |
1. | OLAP referred to online analytical processing. | OLTP referred to online transactional processing. |
2. | It is also called as an online database query management system. | It is well-known as an online database modifying system. |
3. | Consists of historical data from various Databases. | Consists of only of operational current data. |
4. | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS). |
5. | It is subject-oriented. | It is application-oriented. |
6. | Used for Data Mining, Analytics, Decisions making, etc. | Used for business tasks. |
7. | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF). |
8. | The data is used in planning, problem solving, and decision-making. | The data is used to perform day-to-day fundamental operations. |
9. | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks. |
10. | It serves the purpose to extract information for analysis and decision making. | It serves the purpose to Insert, Update, and Delete information from the database. |
11. | A large amount of data is stored typically in TB, PB. | The size of the data is relatively small as the historical data is archived. For ex MB, GB. |
12. | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast as the queries operate on 5% of the data. |
13. | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database. |
14. | It only need backup from time to time as compared to OLTP. | Backup and recovery process is maintained rigorously. |
15. | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries. |
16. | This data is generally managed by CEO, MD, GM. | This data is managed by clerks, managers. |
17. | Only read and rarely write operation. | Both read and write operations. |
18. | Process in OLAP focused on the customer. | Process in OLTP focused on the market. |
19. | Design with a focus on the subject. | Design that is focused on the application. |
20. | Improves the efficiency of business analysts. | Enhances the user’s productivity. |
Que 3. What are the different data warehouse models? Explain each with diagram.
Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modeling is to develop a schema describing the reality, or at least a part of the fact, which the data warehouse is needed to support.
From the architecture point of view, there are three data warehouse models.
- The enterprise warehouse
- The data mart
- The virtual warehouse
Enterprise Warehouse
- An Enterprise warehouse collects all of the records about subjects spanning the entire organization. It supports corporate-wide data integration, usually from one or more operational systems or external data providers, and it’s cross-functional in scope. It generally contains detailed information as well as summarized information and can range in estimate from a few gigabyte to hundreds of gigabytes, terabytes, or beyond.
- An enterprise data warehouse may be accomplished on traditional mainframes, UNIX super servers, or parallel architecture platforms. It required extensive business modeling and may take years to develop and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users. The scope is confined to particular selected subjects. For example, a marketing data mart may restrict its subjects to the customer, items, and sales. The data contained in the data marts tend to be summarized.
Data Marts is divided into two parts:
Independent Data Mart: Independent data mart is sourced from data captured from one or more operational systems or external data providers, or data generally locally within a different department or geographic area.
Dependent Data Mart: Dependent data marts are sourced exactly from enterprise data-warehouses.
Virtual Warehouses
- Virtual Data Warehouses is a set of view over the operational database.
- For effective query processing, only some of the possible summary vision may be materialized.
- A virtual warehouse is simple to build.
- It required excess capacity on operational database servers.
Que 4. Illustrate the different OLAP operations performed on data cube with examples.
Data Cube: A Multidimensional Data Model:
- The multi-Dimensional Data Model is a method which is used for ordering data in the database along with good arrangement.
- A multidimensional model views data in the form of a data-cube.
- A data cube enables data to be modeled and viewed in multiple dimensions.
- It is defined by dimensions and facts.
- OLAP (online analytical processing) and data warehousing uses multi-dimensional databases.
- It is used to show multiple dimensions of the data to users.
Let us take the example of the data of a factory which sells products per quarter in
Bangalore. The data is represented in the table given below:
- In the above given presentation, the factory’s sales for Bangalore are, for the time dimension, which is organized into quarters and the dimension of items, which is sorted according to the kind of item which is sold.
- The facts here are represented in rupees (in thousands).
- Now, if we desire to view the data of the sales in a three-dimensional table, then it is represented in the diagram given below.
- Here the data of the sales is represented as a two dimensional table.
- Let us consider the data according to item, time and location (like Kolkata, Delhi, and Mumbai).
Here is the table:
- This data can be represented in the form of three dimensions conceptually, which is shown in the image below:
Que 5. Explain the process of designing a data warehouse in brief.
- A data warehouse is a single data repository where a record from multiple data sources is integrated for online business analytical processing (OLAP).
- This implies a data warehouse needs to meet the requirements from all the business stage within the entire organization.
- The technical architecture of data warehouses is somewhat similar to other systems, but does have some special characteristics.
- There are two approaches by which data warehouse can be built.
o A top-down approach.
o A bottom-up approach.
1. A top-down approach:
- In top-down design approach, a data warehouse is described as a subject-oreinted, timevarient, non-volaite and integrated data repository.
- In top-down design approach data are stored in a normalized database.
- The data warehouse stores atomic information.
- A top-down approach supports a single integrated data source.
The essential components are :
1) External Sources:
- External source is a source from where data is collected.
- Data can be structured, semi structured and unstructured as well.
2) Stage Area:
- Since the data, extracted from the external sources does not follow a particular format, so there is a need to validate this data to load into data warehouse.
- For this purpose, it is recommended to use ETL tool.
- E (Extracted): Data is extracted from External data source.
- T (Transform): Data is transformed into the standard format.
- L (Load): Data is loaded into data warehouse after transforming it into the standard format.
3) Data-warehouse:
- After cleaning of data, it is stored in the data warehouse as central repository.
- It actually stores the Meta data and the actual data gets stored in the data marts.
- Note that data warehouse stores the data in its purest form in this top-down approach.
4) Data Marts:
- Data mart is also a part of storage component.
- It stores the information of a particular function of an organisation which is handled by single authority.
- There can be as many number of data marts in an organisation depending upon the functions.
- We can also say that data mart contains subset of the data stored in data warehouse.
5) Data Mining:
- The process of analysing the big data present in data warehouse is data mining.
- It is used to find the hidden patterns that are present in the database or in data warehouse with the help of algorithm of data mining.
Advantages:
- Data Marts are loaded from the data warehouses.
- Developing new data mart from the data warehouse is very easy.
Disadvantages:
- This technique is inflexible to changing departmental needs.
- The cost of implementing the project is high.
2. Bottom-up Approach:
- The bottom-up approach starts with experiments and prototypes.
- This is useful in the early stage of business modeling and technology development.
The essential components are given below:
- First, the data is extracted from external sources (same as happens in top-down approach).
- The data go through the staging area (as explained above) and loaded into data marts instead of data warehouse. The data marts are created first and provide reporting capability. It addresses a single business area.
- These data marts are then integrated into data warehouse.
Advantages of bottom-up design:
- Documents can be generated quickly.
- The data warehouse can be extended to accommodate new business units.
- It is just developing new data marts and then integrating with other data marts.
Disadvantages of bottom-up design:
- The locations of the data warehouse and the data marts are reversed in the bottom-up approach design.
- This model is not strong as top-down approach.
Que 6. Explain the following terms in brief: 1. ROLAP 2. MOLAP 3. HOLAP
1. Relational Online Analytical Processing (ROLAP) :
ROLAP servers are placed between relational backend server and client front-end tools. It uses relational or extended DBMS to store and manage warehouse data. ROLAP has basically 3 main components: Database Server, ROLAP server, and Front-end tool.
Advantages of ROLAP –
- ROLAP is used for handle the large amount of data.
- ROLAP tools don’t use pre-calculated data cubes.
- Data can be stored efficiently.
- ROLAP can leverage functionalities inherent in the relational database.
Disadvantages of ROLAP –
- Performance of ROLAP can be slow.
- In ROALP, difficult to maintain aggregate tables.
- Limited by SQL functionalities.
2. Multidimensional Online Analytical Processing (MOLAP) :
MOLAP does not uses relational database to storage. It stores in optimized multidimensional array storage. The storage utilization may be low With multidimensional data stores. Many MOLAP server handle dense and sparse data sets by using two levels of data storage representation. MOLAP has 3 components : Database Server, MOLAP server, and Front-end tool.
Advantages of MOLAP –
- MOLAP is basically used for complex calculations.
- MOLAP is optimal for operation such as slice and dice.
- MOLAP allows fastest indexing to the pre-computed summarized data.
Disadvantages of MOLAP –
- MOLAP can’t handle large amount of data.
- In MOLAP, Requires additional investment.
- Without re-aggregation, difficult to change dimension.
3. Hybrid Online Analytical Processing (HOLAP) :
Hybrid is a combination of both ROLAP and MOLAP.It offers functionalities of both ROLAP and as well as MOLAP like faster computation of MOLAP and higher scalability of ROLAP. The aggregations are stored separately in MOLAP store. Its server allows storing the large data volumes of detailed information.
Advantages of HOLAP –
- HOLAP provides the functionalities of both MOLAP and ROLAP.
- HOLAP provides fast access at all levels of aggregation.
Que 7. Explain Data Warehouse Implementation.
- Data Warehouse contains huge amount of data.
- The big data which is to be analyzed and handled to remove insights from it will be stored in data warehouses.
- These warehouses are run by OLAP servers which require processing of a query within a seconds.
- So, a data warehouse should need highly efficient cube computation techniques, access methods, and query processing techniques.
1. Efficient Computation of Data Cube:
- At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
- In SQL terms, these aggregations are referred to the group-by.
- Each group-by can be represented by a cuboid, where the set of group-by forms a lattice of cuboid known as data cube.
Example:
Suppose that we create a data cube for ProElectronics(Company) sales that contains the following: city, item, year, and sales_in_dollars.
- Compute the sum of sales, grouping by city, and item.
- Compute the sum of sales, grouping by city.
- Compute the sum of sales, grouping by item.
- What is the total number of cuboids, or group-by’s that can be computed for this data cube?
Ans: Three attributes:
- city, item, year (dimensions), sales_in_dollars (measure).
- The total number of cuboids or group-by’s computed for this cube is 2^3=8.
- Group-by’s: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year),()}.
- (): group-by is empty i.e. the dimensions are not grouped.
- The base cuboid contains all three dimensions.
2. Indexing OLAP Data: Bitmap index and join index:
- An index is a data structure that improves the speed of data accessing operations on a database table.
- Indexes are used to quickly access data without having to search every row in a database table.
- There are two popular indexing methods for OLAP.
a) Bitmap Indexing:
- The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
- The bitmap index is an alternative representation of the record ID (RID) list.
- The bitmap indexing is best as compared to hash and tree indices.
b) Join Indexing:
- The join indexing method gained popularity from its use in relational database query processing.
- The join index records can identify joinable tuples without performing costly join operations.
- Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys, from the joinable relation.
Also refer : Unit III: Data Preprocessing