BCP for Flat File ETL
Brent Ozar is hosting today’s #TSQL2sday about the most recent issue you closed. Working with a consulting firm for a variety of clients, I tend to come across all sorts of issues. The most common one is usually “grant or remove access” to some server or database. That usually involves a request that is vetted and approved, then added to a ticket queue. They’re not complex, but I appreciate a system that requires an approval and also tracks an “end date”, which generates a renewal request back to the person who originally opened the ticket. If there’s no response or the access is no longer needed, we get another ticket to remove that access. The worst we usually encounter is that users often don’t know about things like “servers” or “databases” or “what do I actually need to do my job”. That means chatting or even picking up the phone to get more details.
However, at the time this popped up, my most recent “ticket” was a different sort of request. I’d been chatting with a client who had mentioned that they were closing an account for one of the SaaS apps they use. The vendor would provide DDL and extract files for import into the client’s own system, but only after the account was closed. We chatted back and forth about some ideas for them to load the data into their own Azure SQL DB instance. At one point, he asked if I’d want to just do it for a small consulting fee. We chatted a bit more and he realized that he really didn’t want to do it himself.
Come the appointed night, I arrived at their site to do the work (no jump boxes here). I looked at what the vendor had provided and found one DDL file and about 200 DAT files – pipe-delimited, with no headers. The vendor’s docs had suggested something much smaller, because their example process was “just grab the headers from (something) and paste into the file, then import”. Of course, they also used MS Access as their example target, which should have been a bit of a warning. Anyway, off to work!
I ran the DDL and realized that they’d defined one field, repeated across many of the tables, as a varchar(0). I fixed that, then analyzed the overall structure. It was quickly apparent that they’d at least named their tables and DAT files exactly the same. Great! I knew I wouldn’t have to do anything too fancy to determine file mappings. I still didn’t want to define ETL flows for 200 files by hand, but I’ve been around a long time by now. BCP to the rescue! (Related, this article from Red-Gate popped up the other day.)
WAY back in the day, I remember using TSQL to generate the appropriate commands for BCP – back when you didn’t necessarily want to have a full database backup to disk and needed to move data around from select tables. Add in the “INFORMATION_SCHEMA” views and I could write up some simple code to import the data. It helped (sort of) that there were absolutely no constraints in the provided DDL, nor any indexes.
-- Generate one BCP import command per base table; the DAT files share the table names.
-- -c = character mode, so BCP doesn't prompt for the storage type of every column.
SELECT 'BCP dbo.' + TABLE_NAME + ' IN C:\DataPath\' + TABLE_NAME +
'.dat -S clientserver.database.windows.net -d TargetDBName -U sql_login -P password -c -t "|" -e c:\DataPath\'
+ TABLE_NAME + '.err.txt'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME
That was a bunch of text and some options I remembered easily (server, username, password, database). Some I did not – putting the pipe character in double quotes, using “-e” for error output. And some I just fat-fingered, like putting a space after the hyphen for an option. It should be noted that BCP command-line switches are case-sensitive: “-E” is not the same as “-e”, for example.
To summarize the options that were used:
- BCP tablename IN – specifies the target table and that we are importing data
- “C:\DataPath\Tablename.dat” – the filename and path to import
- -S servername – the server to connect to (the client’s Azure SQL DB endpoint in this case)
- -d TargetDBName – the database into which we will import the data
- -U sql_login – the SQL login to use. There are other options for trusted connections or even Entra ID authentication
- -P password – the password (yeah – it’s plain text. If you go this route, don’t leave it sitting in a file where people can see it.)
- -c – import the files as character data. Without -c (or -n, -w, or a format file), BCP prompts interactively for the storage type of every column, which is no good in a BAT file
- -t “|” – this tells BCP that the column delimiter is a pipe symbol instead of the default Tab character. You need the quotes around it or the command complains about invalid syntax
- -e C:\DataPath\*.err.txt – another generated name to redirect error output. With 200 or so files to run through, this was pretty important.
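For reference, one generated command ends up looking roughly like this (the table name Customer is just an illustration; the server, database, login, and paths are the placeholders from the script above):
BCP dbo.Customer IN C:\DataPath\Customer.dat -S clientserver.database.windows.net -d TargetDBName -U sql_login -P password -c -t "|" -e c:\DataPath\Customer.err.txt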
That generated a bunch of import commands. I copied them out to a new text file and turned that into a BAT file so I could just run it. But first I wanted to test a single command to see what would happen. I copied the first line out, pasted it into the CMD window, ran it, and … it failed. A quick review of the error message showed that we’d been provided some sort of SystemGUID column defined as NOT NULL, but every DAT file gave us an empty field to import for this value. As this was not critical data, off to INFORMATION_SCHEMA.COLUMNS….
-- Make the SystemGUID column NULLable in every table that has it,
-- since the DAT files never supply a value for it.
SELECT 'ALTER TABLE ' + TABLE_SCHEMA + '.' + TABLE_NAME
+ ' ALTER COLUMN SystemGUID varchar(20) NULL '
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'SystemGUID'
Copy the output, run it, and all columns of that name are now NULLable. I re-ran the test line and … success! Great! I removed that one line from my BAT file and ran the rest. A quick look at the generated “.err.txt” files showed only 3 with errors – a pretty good success rate for that many files. Upon further inspection, those failures were all attempts to import into a varbinary column, and looking at the data … that is most definitely not a varbinary value. I checked with the client and he said it didn’t look like a value they’d ever reference for their purposes, so we used the ALTER TABLE trick above to change those columns in the couple of affected tables to a simple varchar(50). After that, I truncated the tables that had failures (some rows had already been loaded before the first row that hit the bad value) and re-ran the BCP commands just for them. The imports ran without further error, so we poked around a little in the database to verify row counts and such.
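For completeness, here are a couple of helper queries in the same spirit – these are sketches of mine, not the exact statements from that night. Finding which tables have a varbinary column to feed into the same ALTER TABLE pattern:
-- Sketch: list the varbinary columns that tripped up the character-mode import
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'varbinary'
ORDER BY TABLE_NAME, COLUMN_NAME
And the post-load count check can be as simple as summing rows per table:
-- index_id 0 covers heaps (no clustered index), 1 covers clustered indexes
SELECT s.name + '.' + t.name AS table_name, SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.partitions p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
GROUP BY s.name, t.name
ORDER BY table_name;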
After verification, I added comments to my “scratch” TSQL that I’d used to run all of the above commands and saved it so they could reference it if needed. I took a couple of minutes to explain the various commands I’d run and why. He was content and we had taken just about an hour from start to finish, making both of us pretty happy to not have spent many hours at the office after dark. He also appreciated that I managed to complete in an hour what he knows would have taken him a couple of weeks to do.
We have all sorts of tools we can use as DBAs. SSIS provides a nifty GUI for ETL. Azure Data Factory can do the same for Azure. The Import/Export wizard can be helpful for really simple one-off imports – it even could have helped here, but not for defining that many flows without column names. PowerShell could have been an option, but I wasn’t quite ready to run through that and handle exceptions. In this case, the old BCP utility came in quite handy and did the job. It’s a tool worth remembering when faced with a “how do I get this data from here to there” problem.