Implement Type 2 Slowly changing dimension using Apache Nifi
What is Type 2 Slowly changing dimension ?
Type 2 Slowly changing dimension is a very important design pattern in Data Warehouse development. In this pattern, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. This allows us to keep all historical information accurately. For example, let’s assume that we have following Patient dimension table in a star schema.
An original patient record will look like where current row indicator CRN_IND is set to Y and the VLD_FM_TS is set to current timestamp and the VLD_TO_TS is Null.
An update will update the CRN_IND in original record to N, set the VLD_TO_TS to current Timestamp and insert a new row with its own primary key.
A deleted patient will look like
Following query will always fetch the current active row of an active patient.
All the leading ETL (Extraction, Transformation, Loading) tools such as DataStage, Informatica have built-in support for Type 2 Slowly Changing dimension. I expected similar support from Apache Nifi. Unfortunately, my search always came up empty till this day. So, I decided to implement the pattern using existing Nifi processors.
Nifi process group to achieve the Type 2 Slowly Changing Dimension.
The processors in the process groups are
- This processor is of type “GenerateFlowFile”. “Custom Text” property contains the example patient json.
2. This processor is of type “SplitJson”. “JsonPath Expression” property is set to $.*
3. This processor is of type “EvaluateJsonPath”. A custom property “DML_VERB” is defined with default set to “INSERT”.
4. This processor is of type “UpdateAttribute”. This processor makes sure some of the properties are returning scalar value from json.
5. This processor is of type “ExecuteSQL”. This is the heart of the Type 2 Slowly Changing Dimension enablement in Nifi. This processor determines if the SQL operation is an UPDATE or INSERT and set the DML_VERB property value accordingly.
The “SQL select query” property contains the DBMS specific query ( Figure 11, Line 19–24) to figure out the SQL operation.
6. This processor is of type “ConvertAvroToJSON”. This processor converts ExecuteSQL Avro output to Flow attributes.
7. This processor is of type “EvaluateJsonPath”. This processor extracts properties from flow json.
8. This processor is of type “RouteOnAttribute”. This processor routes to INSERT or UPDATE operations based on DML_VERB value.
9. This processor is of type “PutSQL”. If INSERT, then insert a row in the patient table.
10. This processor is of type “PutSQL”. If UPDATE, then set the current patient row CRN_IND to ‘N’.
11. This processor is of type “PutSQL”. This processor inserts the new patient row in the patient table with its own primary key.
Conclusion:
This is a quick implementation using already available Nifi processors. There is a potential to customize some processors to do the job with fewer processors. I think fewer processors will result in better performance and less points of failure. I hope to share that in future.