Implement Type 2 Slowly changing dimension using Apache Nifi

Shakil Khan
4 min readJul 24, 2020

What is Type 2 Slowly changing dimension ?

Type 2 Slowly changing dimension is a very important design pattern in Data Warehouse development. In this pattern, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. This allows us to keep all historical information accurately. For example, let’s assume that we have following Patient dimension table in a star schema.

Figure 1 : Patient Dimension

An original patient record will look like where current row indicator CRN_IND is set to Y and the VLD_FM_TS is set to current timestamp and the VLD_TO_TS is Null.

Figure 2 : Original patient record

An update will update the CRN_IND in original record to N, set the VLD_TO_TS to current Timestamp and insert a new row with its own primary key.

Figure 3 : Updated patient record

A deleted patient will look like

Figure 4 : Deleted patient record.

Following query will always fetch the current active row of an active patient.

Figure 5 : Query to fetch current patient record

All the leading ETL (Extraction, Transformation, Loading) tools such as DataStage, Informatica have built-in support for Type 2 Slowly Changing dimension. I expected similar support from Apache Nifi. Unfortunately, my search always came up empty till this day. So, I decided to implement the pattern using existing Nifi processors.

Nifi process group to achieve the Type 2 Slowly Changing Dimension.

Figure 6 : Type 2 SCD Nifi process group

The processors in the process groups are

  1. This processor is of type “GenerateFlowFile”. “Custom Text” property contains the example patient json.
Figure 7 : Example patient json

2. This processor is of type “SplitJson”. “JsonPath Expression” property is set to $.*

3. This processor is of type “EvaluateJsonPath”. A custom property “DML_VERB” is defined with default set to “INSERT”.

Figure 8: Processor EvaluateJsonPath properties

4. This processor is of type “UpdateAttribute”. This processor makes sure some of the properties are returning scalar value from json.

Figure 9: Processor UpdateAttribute properties

5. This processor is of type “ExecuteSQL”. This is the heart of the Type 2 Slowly Changing Dimension enablement in Nifi. This processor determines if the SQL operation is an UPDATE or INSERT and set the DML_VERB property value accordingly.

Figure 10: Processor ExecuteSQL properties

The “SQL select query” property contains the DBMS specific query ( Figure 11, Line 19–24) to figure out the SQL operation.

Figure 11 : DML determination query

6. This processor is of type “ConvertAvroToJSON”. This processor converts ExecuteSQL Avro output to Flow attributes.

Figure 12: Processor ConvertAvroToJson properties

7. This processor is of type “EvaluateJsonPath”. This processor extracts properties from flow json.

Figure 13: Processor EvaluateJsonPath properties

8. This processor is of type “RouteOnAttribute”. This processor routes to INSERT or UPDATE operations based on DML_VERB value.

Figure 14: Processor RouteOnAttribute properties

9. This processor is of type “PutSQL”. If INSERT, then insert a row in the patient table.

Figure 15: Processor PutSQL properties

10. This processor is of type “PutSQL”. If UPDATE, then set the current patient row CRN_IND to ‘N’.

Figure 16: Processor PutSQL properties

11. This processor is of type “PutSQL”. This processor inserts the new patient row in the patient table with its own primary key.

Figure 17: Processor PutSQL properties

Conclusion:

This is a quick implementation using already available Nifi processors. There is a potential to customize some processors to do the job with fewer processors. I think fewer processors will result in better performance and less points of failure. I hope to share that in future.

--

--

Shakil Khan

Passionate about Data architecture/engineering, Semantic Web Technologies