DataWareHouse Interview Questions and Answers For Freshers Part-1
1.What is the purpose of cluster analysis in Data Warehousing?
Cluster analysis is used to define the object without giving the class label. It analyzes all the data that is present in the data warehouse and compare the cluster with the cluster that is already running. It performs the task of assigning some set of objects into the groups also known as clusters. It is used to perform the data mining job using the technique like statistical data analysis. It includes all the information and knowledge around many fields like machine learning, pattern recognition, image analysis and bio-informatics. Cluster analysis performs the iterative process of knowledge discovery and includes trials and failures. It is used with the pre-processing and other parameters as a result to achieve the properties that are desired to be used.
Purpose of cluster analysis:-
• Ability to deal with different kinds of attributes
• Discovery of clusters with attribute shape
• High dimensionality
• Ability to deal with noisy
2.What is the difference between agglomerative and divisive Hierarchical Clustering?
• Agglomerative Hierarchical clustering method allows the clusters to be read from bottom to top so that the program always reads from the sub-component first then moves to the parent whereas Divisive Hierarchical clustering uses top-bottom approach in which the parent is visited first than the child.
• Agglomerative hierarchical method consists of objects in which each object creates its own clusters and these clusters are grouped together to create a large cluster. It defines a process of continuous merging until all the single clusters are merged together into a complete big cluster that will consist of all the objects of child clusters. However, in divisive clustering, the parent cluster is divided into smaller cluster and it keeps on dividing until each cluster has a single object to represent.
3.Why is chameleon method used in data warehousing?
Chameleon is a hierarchical clustering algorithm that overcomes the limitations of the existing models and the methods present in the data warehousing. This method operates on the sparse graph having nodes: that represent the data items, and edges: representing the weights of the data items.
This representation allows large dataset to be created and operated successfully. The method finds the clusters that are used in the dataset using two phase algorithm.
• The first phase consists of the graph partitioning that allows the clustering of the data items into large number of sub-clusters.
• Second phase uses an agglomerative hierarchical clustering algorithm to search for the clusters that are genuine and can be combined together with the sub-clusters that are produced.
4.What is Virtual Data Warehousing?
• A virtual data warehouse provides a collective view of the completed data. A virtual data warehouse has no historic data. It can be considered as a logical data model of the containing metadata.
• Virtual data warehousing is a ‘de facto’ information system strategy for supporting analytical decision making. It is one of the best ways for translating raw data and presenting it in the form that can be used by decision makers. It provides semantic map – which allows the end user for viewing as virtualized.
5.What is active data warehousing?
• An active data warehouse represents a single state of the business. Active data warehousing considers the analytic perspectives of customers and SUPPLIERS. It helps to deliver the updated data through reports.
• A form of repository of captured transactional data is known as ‘active data warehousing’. Using this concept, trends and patterns are found to be used for future decision making. Active data warehouse has a feature which can integrate the changes of data while scheduled cycles refresh. Enterprises utilize an active data warehouse in drawing the company’s image in statistical manner.
6.What is snapshot with reference to data warehouse?
• Snapshot refers to a complete visualization of data at the time of extraction. It occupies less space and can be used to back up and restore data quickly.
• A snapshot is a process of knowing about the activities performed. It is stored in a report format from a specific catalog. The report is generated soon after the catalog is disconnected.
7.What is XMLA?
• XMLA is XML for Analysis which can be considered as a standard for accessing data in OLAP, data mining or data sources on the internet. It is Simple Object Access Protocol. XMLA uses ‘discover’ and ‘Execute’ methods. Discover fetches information from the internet while Execute allows the applications to execute against the data sources.
• XMLA is an industry standard for accessing data in analytical systems, such as OLAP. It is based on XML, SOAP and HTTP.
• XMLA specifies MDXML as the query language. In the XMLA 1.1 version, the only construct in MDXML is an MDX statement enclosed in the tag
8.What is ODS?
• An operational data store (“ODS“) is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not sent back to operational systems. It may be passed for further operations and to the data warehouse for reporting.
• In ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules. This data store can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried while business operations occur. This is the place where most of the data used in current operation is housed before it’s transferred to the data warehouse for longer term storage or archiving.
• An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse.
• An ODS is similar to your short term memory where it only stores very recent information. On the contrary, the data warehouse is more like long term memory storing relatively permanent information.
9.What is level of Granularity of a fact table?
A fact table is usually designed at a low level of Granularity. This means that we need to find the lowest level of information that can store in a fact table.
e.g.Employee performance is a very high level of granularity. Employee_performance_daily, employee_perfomance_weekly can be considered lower levels of granularity.
The granularity is the lowest level of information stored in the fact table. The depth of data level is known as granularity. In date dimension, the level could be year, month, quarter, period, week, day of granularity.
The process consists of the following two steps:
– Determining the dimensions that are to be included
– Determining the location to locate the hierarchy of each dimension of information
The above factors of determination will be resent to the requirements.
10.What is the difference between view and materialized view?
– Tail raid data representation is provided by a view to access data from its table.
– It has logical structure that does not occupy space.
– Changes get affected in corresponding tables.
– Pre-calculated data persists in materialized view.
– It has physical data space occupation.
– Changes will not get affected in corresponding tables.