
T-SQL Tuesday #160: Microsoft OpenAI Wish List

March 14, 2023

This is my 11th T-SQL Tuesday, and this month Damien Jones has given the prompt: “What’s on your wishlist for the partnership between Microsoft and OpenAI?”

This will probably be quite a short blog, partly because I’m pressed for time today, and partly because, unlike most people, I haven’t spent much time looking at the shiny new AI tools that have come out recently. A colleague at my company gave a talk a few weeks ago about how he sees AI writing the simpler code we currently end up repeating over and over again (as a BI developer, merging into a slowly changing dimension springs to mind), and there is something worth discussing there, but that’s not what I’m going to talk about today.

Instead, the main hope I have for the Microsoft/OpenAI partnership is that it will produce something to make ingesting files easier.

I have spent a lot of my career writing ETL processes to ingest data into a data warehouse, usually via some sort of staging layer. Inevitably, some of that data comes from files rather than other databases, and that is a problem. The reasons I’ve needed to import file data have varied. Sometimes my employer dealt with third parties who were only willing to send the data in file format. Sometimes we needed a way for other people in the business to manage some of the data, and for whatever reason nobody was willing or able to build a proper master data system. One company’s entire business model was managing data for other companies, so we needed to ingest data from both our clients and the third parties they had contracts with. The one thing all of these scenarios had in common was that the files did not stay consistent. By this, I mean everything could change. Some examples off the top of my head:

  • Header text could change
  • Column order could change
  • Data types in the same column could change
  • Headers could be on the wrong columns (always fun)
  • New columns could appear
  • The number of rows before the data started could change
  • Values might be wrapped in “” one month and not the next
  • In Excel, sheet names could change
  • In Excel, region names could change if we were picking only the data in a particular region
  • I even had one case where row delimiters would sometimes change
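Some of these quirks can at least be detected programmatically before an import falls over. As a rough sketch (Python here purely for illustration; the sample file, the expected column names, and the function name are all made up), the standard library’s `csv.Sniffer` can guess the delimiter and quoting, and a simple scan can find which row the real header landed on this month:

```python
import csv
import io

# A hypothetical supplier file: a title line and a blank line before the data.
SAMPLE = (
    'Supplier Extract - March\n'
    '\n'
    '"Customer","Region","Amount"\n'
    '"Alice","North","100.50"\n'
    '"Bob","South","200.00"\n'
)

def sniff_layout(text, expected_header=frozenset({"Customer", "Region", "Amount"})):
    """Guess the delimiter/quoting, then scan for the row containing the
    columns we expect, since the number of preamble rows keeps changing."""
    dialect = csv.Sniffer().sniff(text)
    rows = list(csv.reader(io.StringIO(text), dialect))
    for i, row in enumerate(rows):
        if expected_header.issubset({cell.strip() for cell in row}):
            return dialect.delimiter, i, row
    raise ValueError("Header row not found - the layout has changed again")

delimiter, header_row, columns = sniff_layout(SAMPLE)
# delimiter is ',', header_row is 2 (two rows of preamble before the data)
```

This only covers delimiter, quoting, and header-position drift; the nastier cases in the list above (headers on the wrong columns, changing data types) still need supplier-specific handling.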

Needless to say, this is extremely frustrating to deal with. Over the years I’ve come up with various ways to mitigate some of these issues, usually involving checking inside the file, building a basic import table based on what’s found, and dynamically building an import and cleanse process that would hopefully produce the columns we needed at the end of it.
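The “build a basic import table from what’s in the file” step can be sketched very roughly like so (a minimal illustration, not my actual process; the schema name, table name, and everything-as-text typing are assumptions — the idea being that the staging table accepts whatever arrives, and cleansing and type casting happen in a later step):

```python
def build_staging_ddl(table_name, columns):
    """Generate a throwaway staging table where every column is plain text,
    so the load succeeds even when the supplier changes data types on us."""
    safe_cols = ",\n    ".join(
        "[{}] NVARCHAR(MAX) NULL".format(c.strip().replace("]", "]]"))
        for c in columns
    )
    return "CREATE TABLE [stg].[{}] (\n    {}\n);".format(table_name, safe_cols)

# Columns as discovered by inspecting the incoming file.
ddl = build_staging_ddl("SupplierExtract", ["Customer", "Region", "Amount"])
```

Everything lands as nullable `NVARCHAR(MAX)`, which is wasteful but survives most of the quirks listed above; the dynamic cleanse process then maps whatever arrived onto the columns the warehouse actually needs.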

From my largely unqualified vantage point, this really feels like something an AI might be able to do. The import is largely repetitive, but with some changes every time, so maybe you would need to train it on a set of files you already have; I feel it could learn the particular quirks unique to a data supplier and the best ways of dealing with them when they happen. I feel like I’m being quite unambitious here, talking about a tech that might change the world, but I’ve been at a company where the files they were receiving were so numerous, and so important to the business, that they employed half a dozen contractors just to write and troubleshoot file imports.

