Handling Late-Arriving Members in SSIS

I was chatting with some other SQL Community members recently and the topic of handling late-arriving members for a warehouse load came up. I figured it was worth taking a couple of minutes to share the general concept I’ve been using to handle late-arriving members in SSIS. The steps are relatively straightforward if you’ve done much SSIS, but sometimes the little things can trip you up. The example below is based loosely on the WideWorldImporters database samples from Microsoft.

The general steps are

  1. Set up your source query.
  2. Pass the data through a Lookup for your Dimension with the missing results routed to a “No Match” output.
  3. Insert those “No Match” rows into your Dimension using a SQL task – checking to make sure that this particular row hasn’t already been inserted (this is important).
  4. Do another lookup using a “Partial Cache” to catch these newly-inserted members.
  5. Use a UNION ALL transform to bring the existing and late-arriving members together.

Inferred Member Data Flow

 

Sample code for the Inferred Insert:

DECLARE @CustomerID INT
SELECT @CustomerID = ?
IF NOT EXISTS (SELECT *
FROM WideWorldImportsDW.Dimension.Customer
WHERE [WWI Customer ID] = @CustomerID)
INSERT WideWorldImportersDW.Dimension.Customer
([WWI Customer ID],
Customer,
[Bill To Customer],
Category,
[Buying Group],
[Primary Contact],
[Postal Code],
[Valid From],
[Valid To],
[Lineage Key])
VALUES ( @CustomerID, -- WWI Customer ID - int
N'Unknown:' + Cast(@CustomerID AS NVARCHAR(10)),-- Customer - nvarchar(100)
N'',-- Bill To Customer - nvarchar(100)
N'',-- Category - nvarchar(50)
N'',-- Buying Group - nvarchar(50)
N'',-- Primary Contact - nvarchar(50)
N'',-- Postal Code - nvarchar(10)
Sysdatetime(),-- Valid From - datetime2(7)
'9999-12-31 23:59:59.9999999',-- Valid To - datetime2(7)
2 -- Lineage Key - int
)
Comments
  1. Sarthak

    Will there be any performance issue since here we are handling row by row basis using OLEDB Command component for inserting missing member. Your thought on this

    • paschott

      There are obviously going to be some performance hits if you have a lot of late-arriving members. The idea is that the first lookup is fully cached. The lookup for the newly-inserted members is partially cached. You’ll still want a decent index to avoid inserting duplicate rows in the “execute SQL” task, but ideally your main dimension load catches most, if not all, members. This is a way to ensure that you can catch anything that was created between the dimension load and the fact load.

      It’s probably worth noting that you need some way to handle updating those inferred members later. I have usually added a step in the Dimension load for that. If you find that you get a _lot_ of inferred members that are never updating, you’ll probably want to clear those out and have a talk with your developers about why there are so many unmatched members.

Leave a Reply to Sarthak Cancel reply

Your email address will not be published. Required fields are marked *