I recently created my first Alteryx Turducken – an iterative macro inside of a batch macro inside of an app. I didn’t set out to create a Turducken, but my workflow naturally evolved into one as I pursued my goal, which was to audit the content of my Tableau workbooks. I know there are a number of resources out there (some free, some not) that will comb through the XML of Tableau workbooks and tell you how many calculated fields you have, which fields are used in which worksheets etc., but I liked the idea of creating my own, both to have complete control over the end product and also to expand my Alteryx proficiency.
- Given how long this took me (I’d never worked with XML before, so I went down many dead ends before I got things to work), create a reference for myself so my next foray into this is much quicker.
- Provide a reference for others who might have a similar goal.
- Expose my steps in the hope that smarter people out on the interwebs can advise me on how to improve my workflow.
So the first thing I needed to come to grips with was how to read in the XML from any TWB file. There’s a great XML Parse tool in Alteryx, but it expects you to have a basic grasp of the underlying XML so that you can tell it what to pull in.
I had no idea what any of this meant when I first looked at this tool, and I still only have a loose grasp of it. XML is clearly logically structured, but there’s quite a bit of repetitiveness in the code (i.e. elements like column names, worksheets, etc. show up numerous times in different sections), making it hard for a novice to ascertain where the juicy and relevant bits are. And the XML for a TWB with a fair number of fields and worksheets is intimidatingly long! One of my workbooks had almost 30,000 lines of XML that all looked like this. God help me.
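If you want to poke at a TWB outside of Alteryx first, a minimal Python sketch can do the same read-in step. This is just an illustration of the idea (the helper name and the toy file are my own, not part of the Alteryx workflow): a .twb is plain XML, so a standard parser can load it and tell you how big and how nested it really is.

```python
import xml.etree.ElementTree as ET

def summarize_twb(path):
    """Parse a Tableau .twb file (plain XML) and return a quick summary."""
    tree = ET.parse(path)
    root = tree.getroot()  # the <workbook> element
    # Count every element in the document to get a feel for its size.
    n_elements = sum(1 for _ in root.iter())
    # Top-level sections such as <datasources> and <worksheets>.
    sections = [child.tag for child in root]
    return {"root": root.tag, "elements": n_elements, "sections": sections}
```

Running this on a real workbook is a quick way to see the same "intimidatingly long" structure the XML Parse tool is working against.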
- The _OuterXML fields are the ones I need to feed into the XML Parse tool. If I fed in other fields I’d often either get error messages (e.g. ‘Invalid document structure’) or just minimal and useless results.
- There are multiple key root elements in a workbook (e.g. datasources, worksheets, etc.) that all have children (e.g. column names, filters, etc.) and the same elements could be children of different roots (e.g. a column could be a child of a data source and a worksheet).
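The second point — that the same element can be a child of different roots — can be shown with a toy example. The XML below is invented for illustration (a real TWB is far messier), but it captures the shape: a `<column>` appears both under a datasource and again inside a worksheet.

```python
import xml.etree.ElementTree as ET

# Toy workbook XML (invented for illustration): the same column name
# appears under a datasource and again inside a worksheet.
TOY_TWB = """
<workbook>
  <datasources>
    <datasource name="Sales">
      <column name="[Profit]" datatype="real"/>
    </datasource>
  </datasources>
  <worksheets>
    <worksheet name="Sheet 1">
      <column name="[Profit]"/>
    </worksheet>
  </worksheets>
</workbook>
"""

root = ET.fromstring(TOY_TWB)
# Find every <column>, wherever it lives, and note its parent's tag.
parents = [(parent.tag, col.get("name"))
           for parent in root.iter()
           for col in parent
           if col.tag == "column"]
```

This is exactly why the repetitiveness is confusing at first: asking "where are my columns?" gives you more than one honest answer.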
Before I could do any parsing I had to connect an Input Tool to a TWB file and check pretty much everything (root element, child values, OuterXML) since I knew I needed all of it downstream. I suspect there are use cases where you wouldn’t need to bring all of this stuff in, but I knew I needed the kitchen sink for my workflow.
Once I had my connection, I then needed to find the columns. This proved to be a little less straightforward than I hoped it would be, but it showed me how there are at least two ways to arrive at the same end, one that leverages a single XML Parse tool and a lot of RegEx (to parse fields from within long sections of script) and another that uses two XML Parse tools and some simpler Select and Join tools. I opted for the latter.
The first XML Parse tool (1) pulls in the datasource root and lists out all of the child elements associated with each datasource. If you have a single data connection and no parameters, you will only have one datasource, but if you have multiple connections or parameters in your workbook, each of those will be a separate datasource (i.e. Tableau treats your parameters as a distinct datasource in the XML). As you can see below (which shows a Transpose tool placed after the output of the XML Parse tool), each child element is returned as a separate data field in the result set, including a separate field for many of the column attributes (e.g. size, type, etc.). So the first column in the datasource has fields titled “column name”, “column caption”, etc., the second has fields titled “column 2 name”, “column 2 caption” and so on. But some key attributes (like calculation) are still missing. Not sure why, but they are. Perhaps they’re buried inside of the Column OuterXML.
So I used a second XML Parse tool to bring in the Column child from the datasource root, which gave me a nice table of all of the column names and associated attributes including calculation formula.
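The equivalent of that second parse can be sketched in Python. The datasource XML below is a simplified, invented fragment (real TWBs carry many more attributes), but the key structural fact holds in workbooks I've seen: a calculated column's formula lives in a nested `<calculation>` element rather than in the column's own attributes, which may be why the first parse missed it.

```python
import xml.etree.ElementTree as ET

# Toy datasource XML (structure simplified from a real .twb): a calculated
# column carries its formula in a nested <calculation> element.
DATASOURCE_XML = """
<datasource name="Sales">
  <column name="[Sales]" caption="Sales" datatype="real"/>
  <column name="[Calculation_1]" caption="Profit Ratio" datatype="real">
    <calculation class="tableau" formula="SUM([Profit])/SUM([Sales])"/>
  </column>
</datasource>
"""

ds = ET.fromstring(DATASOURCE_XML)
columns = []
for col in ds.findall("column"):
    calc = col.find("calculation")  # None for ordinary (non-calculated) fields
    columns.append({
        "name": col.get("name"),
        "caption": col.get("caption"),
        "datatype": col.get("datatype"),
        "formula": calc.get("formula") if calc is not None else None,
    })
```

The result is the same "nice table" of columns and attributes, with the calculation formula filled in where one exists.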
To get around this I had to create a formula to take the output of the Transpose tool in the top stream and identify the column name associated with each column field. In other words, find all instances where the Name field contained the string “column name” and then return the contents of the associated Value field. I did the same thing for Column ID, which I figure could come in handy as well so that each column in my data set had a unique ID number.
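The Transpose output is essentially a long list of Name/Value pairs, so the formula's logic can be mimicked like this. The sample rows and the exact matching rule are my assumptions (the post says the Name field "contained the string 'column name'"; since later columns are labeled "column 2 name", "column 3 name", etc., I've used a pattern with an optional ordinal to cover both):

```python
import re

# Rows as they might come out of the Transpose tool (invented values):
# each column's attributes become Name/Value pairs like
# "column name", "column caption", "column 2 name", "column 2 caption", ...
rows = [
    ("column name", "[Sales]"),
    ("column caption", "Sales"),
    ("column 2 name", "[Calculation_1]"),
    ("column 2 caption", "Profit Ratio"),
    ("column 2 id", "17"),
]

# Pull out the name (and id) rows, whatever the column's ordinal is.
name_pat = re.compile(r"^column(?:\s\d+)?\sname$")
id_pat = re.compile(r"^column(?:\s\d+)?\sid$")

column_names = [v for n, v in rows if name_pat.match(n)]
column_ids = [v for n, v in rows if id_pat.match(n)]
```

The extracted names then serve as the join key back to the second stream, with the IDs along for the ride as unique identifiers.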
Joining the top and bottom streams on Column name gave me the complete data set I was looking for, at least in terms of all of the columns in my workbook, all of the attributes I was interested in and which datasource each column belonged to. Pretty cool.
…let me break it down for easier comprehension:
- \[Calculation\d+\]|\[Calculation_\d+\] looks for any string that starts with the word “Calculation” followed either by a series of numbers or an underscore and then a series of numbers (e.g. [Calculation2] or [Calculation_8921123152237715])
- \[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\] looks for any bracketed string that ends with the sub-string “(copy)”, or with “(copy ” followed by a digit and a closing parenthesis (e.g. [Wk/Wk Pipeline (copy)] or [Wk/Wk Pipeline (copy 2)])
- \[Parameter\s\d+\s\d\]|\[Parameter\s\d+\] looks for any string that starts with the word “Parameter” followed by a space and one or more digits, optionally followed by another space and a single digit (e.g. [Parameter 1] or [Parameter 1 2]).
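The three alternations above can be combined into a single pattern and sanity-checked in Python. This is just a test harness for the RegEx logic, not part of the Alteryx workflow, and I've condensed the first pair of alternatives into one branch with an optional underscore:

```python
import re

# The three alternations from the post, condensed into one pattern.
pattern = re.compile(
    r"\[Calculation_?\d+\]"            # [Calculation2], [Calculation_8921...]
    r"|\[[^\]]+\s\(copy(?:\s\d)?\)\]"  # [Wk/Wk Pipeline (copy)] / (copy 2)
    r"|\[Parameter\s\d+(?:\s\d)?\]"    # [Parameter 1], [Parameter 1 2]
)

formula = ("SUM(zn([Calculation_596726966189228044]))"
           "- SUM(zn([Open Pipeline Qtr (Current Wk) (copy)]))")
matches = pattern.findall(formula)
```

Run against the example formula from later in the post, it picks out both raw identifiers and nothing else.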
Once I had that worked out, I just needed to set up the “iterative” part of the iterative macro. This involved a few key steps. For example, the formula

`SUM(zn([Calculation_596726966189228044]))- SUM(zn([Open Pipeline Qtr (Current Wk) (copy)]))`

contains two fields that need cleaning, but only one will be replaced on the first pass, since the RegEx function parses out only one column identifier each time. I could have created multiple parse functions and then joined on multiple fields, but I felt a single parse function inside of a loop was simpler. So after a record is cleaned, the following formula checks whether there are more column names that need cleaning. If not, the record goes to the “Cleaned” Output tool; otherwise it goes to the “Still Dirty” Output tool, which loops that record back through the macro.
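The loop logic — replace one identifier per pass, then re-queue the record while anything raw remains — can be sketched in Python. The `ALIASES` lookup table is hypothetical (in the real workflow the friendly captions come from the joined column table), and the pattern is the condensed form of the earlier RegEx:

```python
import re

# One raw identifier is replaced per pass, mirroring the macro's single
# RegEx parse; records loop until no raw identifiers remain.
RAW_ID = re.compile(r"\[Calculation_?\d+\]|\[[^\]]+\s\(copy(?:\s\d)?\)\]")

# Hypothetical lookup from raw XML names to friendly captions.
ALIASES = {
    "[Calculation_596726966189228044]": "[Open Pipeline Qtr]",
    "[Open Pipeline Qtr (Current Wk) (copy)]": "[Open Pipeline Qtr (Prior Wk)]",
}

def clean(formula, max_passes=50):
    passes = 0
    while passes < max_passes:
        m = RAW_ID.search(formula)   # "Still Dirty": another raw name found
        if m is None:
            return formula           # "Cleaned": no raw names left
        raw = m.group(0)
        formula = formula.replace(raw, ALIASES.get(raw, raw), 1)
        passes += 1
    return formula                   # safety valve against unmapped names
```

Each trip through the `while` loop plays the role of one iteration of the macro: parse one identifier, swap it, test again.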
The Turducken is complete
The last stage of the Alteryx Turducken was also very simple. In a new workflow, all I needed to do was point a Directory tool at a folder of TWB files, filter out the TWBX files (for some reason the Directory tool still pulls these in even though I specify .twb), feed the rows into the batch macro I just created, and then save the batch macro’s output as a .tde file I can analyze in Tableau.
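That same TWBX-leakage quirk shows up outside Alteryx too: a loose wildcard match for `*.twb` can also surface `.twbx` files, so a strict extension check is worth keeping regardless of tooling. A small sketch of the filter step (the function name is my own):

```python
from pathlib import Path

def find_twb_files(folder):
    """Return only true .twb workbooks from a folder.

    A loose glob like *.twb* also surfaces .twbx packaged workbooks,
    so filter strictly on the final extension.
    """
    return sorted(p for p in Path(folder).glob("*.twb*")
                  if p.suffix.lower() == ".twb")
```

The returned list plays the role of the filtered Directory tool output: one row per genuine .twb, ready to feed into the batch macro.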