Friday, 21 March 2014

What is the purpose of a staging area? why we need it in DWH?

1. One or more database schema(s) or file stores used to “stage” data extracted from the source OLTP systems prior to being published to the “warehouse” where it is visible to end users. 
2. Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.
3. It may hold intermediate results, (if data is pipe lined through a process)
4. Equally it may hold “state” data – the keys of the data held on the warehouse, and used to detect whether incoming data includes New or Updated rows. (Or deleted for that matter).
5. It is likely to be equal in size (or maybe larger) than the “presentation area” itself.
6. Although the “state” data – eg. Last sequence loaded may be backed up, much of the staging area data is automatically replaced during the ETL load processes, and can with care avoid adding to the backup effort. The presentation area however, may need backup in many cases.


        If target and source databases are different and target table volume is high it contains some millions of records in this scenario without staging table we need to design your informatica using look up to find out whether the record exists or not in the target table since target has huge volumes so its costly to create cache it will hit the performance.
If we create staging tables in the target database we can simply do outer join in the source qualifier to determine insert/update this approach will give you good performance.
It will avoid full table scan to determine insert/updates on target.And also we can create index on staging tables since these tables were designed for specific application it will not impact to any other schemas/users.
While processing flat files to data warehousing we can perform cleansing.
Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.
  • Since it is one-to-one mapping from ODS to staging we do truncate and reload.
  • We can create indexes in the staging state, to perform our source qualifier best.
  • If we have the staging area no need to relay on the informatics transformation to known whether the record exists or not.
Data cleansing
Weeding out unnecessary or unwanted things (characters and spaces etc) from incoming data to make it more meaningful and informative
Data merging
Data can be gathered from heterogeneous systems and put together
Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to end user.
Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.


Thanks 
Ur's Hari

No comments:

Post a Comment