Handling Late-Arriving Members in SSIS
I was chatting with some other SQL Community members recently and the topic of handling late-arriving members for a warehouse load came up. I figured it was worth taking a couple of minutes to share the general concept I’ve been using to handle late-arriving members in SSIS. The steps are relatively straightforward if you’ve done much SSIS, but sometimes the little things can trip you up. The example below is based loosely on the WideWorldImporters database samples from Microsoft.
The general steps are
- Set up your source query.
- Pass the data through a Lookup for your Dimension with the missing results routed to a “No Match” output.
- Insert those “No Match” rows into your Dimension using a SQL task – checking to make sure that this particular row hasn’t already been inserted (this is important).
- Do another lookup using a “Partial Cache” to catch these newly-inserted members.
- Use a UNION ALL transform to bring the existing and late-arriving members together.
Sample code for the Inferred Insert:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
|DECLARE @CustomerID INT|
|SELECT @CustomerID = ?|
|IF NOT EXISTS (SELECT *|
|WHERE [WWI Customer ID] = @CustomerID)|
|([WWI Customer ID],|
|[Bill To Customer],|
|VALUES ( @CustomerID, -- WWI Customer ID - int|
|N'Unknown:' + Cast(@CustomerID AS NVARCHAR(10)),-- Customer - nvarchar(100)|
|N'',-- Bill To Customer - nvarchar(100)|
|N'',-- Category - nvarchar(50)|
|N'',-- Buying Group - nvarchar(50)|
|N'',-- Primary Contact - nvarchar(50)|
|N'',-- Postal Code - nvarchar(10)|
|Sysdatetime(),-- Valid From - datetime2(7)|
|'9999-12-31 23:59:59.9999999',-- Valid To - datetime2(7)|
|2 -- Lineage Key - int|
Will there be any performance issue since here we are handling row by row basis using OLEDB Command component for inserting missing member. Your thought on this
There are obviously going to be some performance hits if you have a lot of late-arriving members. The idea is that the first lookup is fully cached. The lookup for the newly-inserted members is partially cached. You’ll still want a decent index to avoid inserting duplicate rows in the “execute SQL” task, but ideally your main dimension load catches most, if not all, members. This is a way to ensure that you can catch anything that was created between the dimension load and the fact load.
It’s probably worth noting that you need some way to handle updating those inferred members later. I have usually added a step in the Dimension load for that. If you find that you get a _lot_ of inferred members that are never updating, you’ll probably want to clear those out and have a talk with your developers about why there are so many unmatched members.