Thursday, 4 December 2014

How to Implement SDC 2 with MD5 Checksum

In general when requirement come to implement SCD type 2, we look out for any date field in the Target. If not date field we go for SCD type 2 with flag column, still we can go for SCD type two with versioning. This all scenario holds good when there is a date column or flag column in the table it’s easy for a developer to implement SCD type2.
Let’s say we come across a situation where there are no date columns or flag column, then how to implement SCD type 2. Usually we come across such situation when you are dealing with very old legacy system which doesn’t have such fields in the table. Then the solution is to generate hash code for the entire row and use it as identifier column for reference. To achieve this informatica allow us to use hash function namely MD5 () – Message Digest and CRC32 () – Cyclic Redundancy Check. The hash code in ETL perspective is referred as checksum value
Now we will try this in a simple example to see how does it works. I am considering a simple flat file as source and a target table for this scenario.
The above is the mapping for the same. So to begin with in expression transformation create two new output ports like below 


And edit the expression of the two new ports as below
MD5_CHECKSUM à MD5 (COL1||COL2|COL3||…..||COLn)
CRC32_CHECKSUM à CRC32(COL1||COL2|COL3||…..||COLn)
So how does it works, we are passing a concat value of all the fields to these hash function. It will create a unique hash value for these columns and we can use it for further unique key reference.
NOTE: It’s good to handle NULL values in the column which are used by NVL2() function to avoid any data discrepancies.
Next in look up transformation look up on the primary key for any changes and pass primary key and MD5_CHECKSUM and CRC32_CHECKSUM to a router transformation and populate rest fields from the source.
In Router create groups for insert n update rows as below
INSERT à ISNULL(LKP_PK)
UPDATE à NOT ISNULL(LKP_PK) AND  ( LKP_MD5_CHECKSUM != MD5_CHECKSUM AND LKP_CRC_CHECKSUM != CRC32_CHECKSUM )
Here you can use either of MD5 or CRC32 function, I have used both just for understabd process. So once you have driven the records for INSERT and UPDATE rows perform SCD type 2 then.
NOTE: Tacke care of Nulls and spcaces in the column your are performing concat operation and passing it to MD5() function, Nulls create a big problem here as teh value generated by MD5 won't be unique.
MD5(IIF(ISNULL(TRIM(COL1)),'A',TRIM(COL1))||IIF(ISNULL(TRIM(COL2)),'A',TRIM(COL2))||IIF(ISNULL(TRIM(COL3)),'A',TRIM(COL3)).....)
This way you can implement SCD type 2 without bothering about DATE or FLAG columns in the table.
Thnaks,
Hari

How to pull yesterdays or the last business day records - Informatica with CDC process

We often come across a scenario where we are asked to pull yesterday’s records or the latest records after the last run to avoid unnecessary fetching of old records from the source.
We can think of SYSDATE -1 here to solve this but it’s not always correct in all cases suppose the workflow was scheduled in such a way that it should only run on business days and imagine you are pulling INVOICE related data so source table are updated on only business days , So if we run the workflow on Monday and if we have used SYSDATE-1 logic we end up in pulling Sundays records were as Sunday there was no Sales we are wrong here. We need to pull last business day record ie either Friday or Saturday depending on the business. Also we end up with same situation when there are any holidays.
We can pull last business day invoices from many ways, here I am going to share the two ways I have came across.
1.      Mapping Variable:
I believe it’s the easiest one of all where we declare a mapping level variable in the mapping and assign it with SESSIONSTARTTIME and use it in SQL override of the same mapping.

Example:

Consider the below simple mapping

SourceàSQàEXPàTarget

Follow these steps to implement this logic

·         Create mapping Parameter and variable option, there create a mapping variable like $$LASTRUNDT of type date. There you can see initial value; either you can assign that value with some particular date or NULL. Note the date format here is MM/DD/YYYY HH24:MI:SS, so be careful when you are assigning the date there.

·         Now use the mapping variable $$LASTRUNDT in the Source qualifier like
Select COL1, COL2, COL3….. COLn from SALES.INVOICE
Where INVOICE_DATE >= TO_DATE(‘$$LASTRUNDT’,’ MM/DD/YYYY HH24:MI:SS’)

·         In Expression transformation create a output port LASTRUN and in the expression use
SETVARIABLE($$LASTRUNDT,SESSSTARTTIME). So here you are assigning the SESSIONSTARTTIME value to your mapping variable $$LASTRUNDT.
So how this gone a work now, it’s quite simple at the 1st run it will use default value if you have declared any initial value while creating mapping variable or IS will pick the default value i.e. 01/01/1753 00:00:00. This is how 1st run work, it is advised to initialize the variable with required date.
Then we have used setvariable function in expression right that will assign the session start time to the variable $$LASTRUNDT.
When you run the session again IS will pick the updated $$LASTRUNDT value which hold previous run’s timestamp. So this way we can implement a system which pulls last business day records or latest records after the last run.
2.      Mapping Parameter:
Parameter file can also be used to store the last run timestamp as a parameter and then use the parameter in the mapping. This system needs two mapping and session objects to perform the last business day load.
Let’s see how we can achieve it
·         Create a mapping parameter in the data load mapping where you are loading the last business day records as $$LASTRUNDT
·         Now use the mapping parameter $$LASTRUNDT in the Source qualifier like
Select COL1, COL2, COL3….. COLn from SALES.INVOICE
Where INVOICE_DATE >= TO_DATE(‘$$LASTRUNDT’,’ MM/DD/YYYY HH24:MI:SS’)
·         Create a mapping to write the parameter file like below
Source à SQ à Exp à Target(Parameter file).
·         In the expression transformation create a output port with expression as below
[Session Name]
$$LASTRUNDT = SESSSTARTTIME
Drag it and connect with the mapping parameter port.
·         So here we are writing the last run timestamp to a mapping parameter file, which will be used in the next mapping in filter condition.
So how this will work, here it is during the 1st run u can either initialize the parameter value with some date or you can keep the date value in the parameter file itself so it will take the initial value for the filter condition and pull the record.
Next runs is the mapping parameter session which is going to write the last run timestamp to the parameter. This will be used during next cycle where IS will pick the $$LASTRUNDT parameter value from the parameter which was written by the previous cycle. This is how we can implement data pulling of last business day by using the parameter file.
Thanks
Hari,