A data warehouse is a centralized database, which is specifically optimized for analysis and analytics workloads. It has been classically defined by the following four unique characteristics:
- Subject Oriented: A data warehouse is designed to analyze data from a particular subject (e.g. sales)
- Integrated: A data warehouse does not contain new information, but rather data gathered from distributed, usually heterogenous sources. The process of loading this data into a data warehouse (“ETL” – Extract, Transform, and Load) creates a consistent view of the data.
- Nonvolatile: Data stored in a data warehouse is considered stable and will not change.
- Time-variant: Analysis done on data in a data warehouse considers the change over time.
Some data warehouses nowadays do not necessarily follow all these principles, since they can be perceived as restricting. However, the central property is, that a data warehouse has the specific purpose of enabling data analysis and data mining.
Architecture and Common Technologies
A data warehouse is really a combination of several technologies working together to achieve the desired data analysis tasks. This usually includes the following:
- The foundation is provided by a relational database management system. This is used for the main purpose of structured storage and management of the data.
- An ETL solution to extract the data from the sources, transform it into the consistent format of the data warehouse, and finally load it into the database.
- Capabilities to enable statistical analysis, reporting, and data mining (usually already provided by the database management system).
- User friendly client tools to do basic visualization and presentation of the data for business users.
- More sophisticated analytical applications, which use machine learning and artificial intelligence to generate new data insights.
Some commonly used tools include Oracle SQL, PostgreSQL, MS SQL Server, Oracle Warehouse Builder, SAP Data Services, MS SQL Server integration services, MS PowerBI, Talend, and many more. Integrated cloud solutions also exist. Some examples are Amazon Redshift, Azure Synapse, Google BigQuery, or Oracle Data Warehouse Cloud Services.
The full process of gathering, organizing, and analyzing date is called data warehousing. Data is often provided to users in so-called Data Marts.
What are Data Marts?
A Data Mart is a copy of some part of the data stored in the data warehouse. This copy is not intended to be persistent, but rather generated to support a specific purpose, e.g. an analysis task. There are several reasons to be working on such copies as opposed to a direct view of the relevant data in the data warehouse.
One reason is that the specific task may need specialized data structures not provided directly by the data warehouse (for example in multidimensional analysis). The data warehouse then prepares the data into this structure and provides a copy in form of a data mart. Furthermore, a data mart may be stored locally or distributed to other servers, reducing network traffic induced by the analysis (if local) and the CPU load on the data warehouse. Local storage also reduces the user’s reliance on a network connection to the data warehouse. Finally, there can be security reasons: users get access to the data marts containing the exact data they need, but not the entire data warehouse.
Data marts are often organized in a multidimensional form, that is, as Star, Snowflake, or Galaxy Schemas consisting of a large fact table, supported in various ways by smaller dimension tables containing the key values.
Why would we use a Data Warehouse?
Reasons to construct a data warehouse generally follow two main ideas. One is that the integration of data from distributed, heterogenous data sources enables a global overview of the data. This makes it possible to easily use data analysis tools to get new insights based on the entire available data (the “big picture”), rather than some complicated operation on the distributed source data. The second idea is that a data warehouse achieves a separation of concerns: It separates the data used for the operative day-to-day business from the data used for reporting, decision support, business analysis, and controlling (where the latter concern is what the data warehouse is for).
What is the Difference to Data Lakes?
Another mass storage architecture is the data lake. A precise definition of that is out of the scope of this article, but we will give a brief overview of the main difference between a data warehouse and a data lake: As detailed above, the main purpose of a data warehouse is to enable analysis and analytics workloads, usually targeted at a specific subject. A data lake on the other hand is a much vaguer data storage architecture: It provides an unfiltered, unstructured storage space for data to be used for a specific purpose late on. In a way a data lake is more exploratory than a data warehouse. Where in the latter the purpose should already be defined at the time of construction, a data lake stores data that could be useful at a later point in time, but this use has not been defined yet.
Data warehousing is a big topic and this article can only give a first overview of it. If you want to know which data storage solutions are right for you, do not hesitate to contact us at MA Data Consulting. We are committed to give you an independent consultation on the technologies exactly fitting your use.