Creating a TWB (Tableau Workbook) Audit Workflow in Alteryx


I recently created my first Alteryx Turducken – an iterative macro inside of a batch macro inside of an app.  I didn’t set out to create a Turducken, but my workflow naturally evolved into one as I pursued my goal, which was to audit the content of my Tableau workbooks.  I know there are a number of resources out there (some free, some not) that will comb through the XML of Tableau workbooks and tell you how many calculated fields you have, which fields are used in which worksheets, etc., but I liked the idea of creating my own, both to have complete control over the end product and to expand my Alteryx proficiency.


I’m going to document my approach in this post for three reasons:
  1. To create a reference for myself so my next foray into this is much quicker, given how long this one took me (I’d never worked with XML before, so I went down many dead ends before I got some things to work).
  2. To provide a reference for others who might have a similar goal.
  3. To expose my steps in the hope that smarter people out on the interwebs can advise me on how to make my workflow better.

So the first thing I needed to come to grips with was how to read in the XML from any TWB file.  There’s a great XML Parse tool in Alteryx, but it expects you to have a basic grasp of the underlying XML so that you can tell it what to pull in. 


I had no idea what any of this meant when I first looked at this tool, and I still only have a loose grasp of it.  XML is clearly logically structured, but there’s quite a bit of repetitiveness in the code (e.g. elements like column names, worksheets, etc. show up numerous times in different sections), making it hard for a novice to ascertain where the juicy and relevant bits are.  And the XML for a TWB with a fair number of fields and worksheets is intimidatingly long!  One of my workbooks had almost 30,000 lines of code that all looked like this.  God help me.


But the core insights I arrived at after perusing the help were these:
  • The _OuterXML fields are the ones I need to feed into the XML Parse tool.  If I fed in other fields I would often get either error messages (e.g. ‘Invalid document structure’) or just minimal and useless results.
  • There are multiple key root elements in a workbook (e.g. datasources, worksheets, etc.) that all have children (e.g. column names, filters, etc.) and the same elements could be children of different roots (e.g. a column could be a child of a data source and a worksheet). 
So knowing which OuterXML of which root element to bring in and which children to look for was going to be key to getting the workbook insights I was looking for.  Given that my driving interest was primarily around which fields were in my workbooks (and how/where they were used), I figured I’d start there.
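
If you want to get your bearings on that structure outside of Alteryx first, a TWB is just XML, so a few lines of Python will list the root elements and the child tags under each. This is only an exploratory sketch; workbook.twb is a placeholder for one of your own files:

```
import xml.etree.ElementTree as ET
from collections import Counter

# A .twb file is plain XML with a <workbook> root element.
root = ET.parse("workbook.twb").getroot()  # placeholder file name

# List each top-level section (datasources, worksheets, etc.)
# and the child tags it contains.
for section in root:
    child_tags = Counter(child.tag for child in section)
    print(section.tag, dict(child_tags))
```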

Identifying the Columns in a Workbook

Before I could do any parsing I had to connect an Input Tool to a TWB file and check pretty much everything (root element, child values, OuterXML) since I knew I needed all of it downstream.  I suspect there are use cases where you wouldn’t need to bring all of this stuff in, but I knew I needed the kitchen sink for my workflow. 


Once I had my connection, I then needed to find the columns.  This proved to be a little less straightforward than I hoped it would be, but it showed me how there are at least two ways to arrive at the same end, one that leverages a single XML Parse tool and a lot of RegEx (to parse fields from within long sections of script) and another that uses two XML Parse tools and some simpler Select and Join tools.  I opted for the latter.


The first XML Parse tool (1) pulls in the datasource root and lists out all of the child elements associated with each datasource.  If you only have a single data connection and no parameters, then you will only have one datasource, but if you have multiple connections or parameters in your workbook then each of those will be a separate datasource (i.e. Tableau treats your parameters as a distinct datasource in the XML).  As you can see below (which is showing a Transpose tool placed after the output of the XML Parse tool), each child element is returned as a separate data field in the result set, including a separate field for many of the column attributes (e.g. size, type, etc.).   So the first column in the datasource has fields titled “column name”, “column caption”, etc., the second has fields titled “column 2 name”, “column 2 caption” and so on. But some key attributes (like calculation) are still missing.  Not sure why, but they are.  Perhaps they’re buried inside of the Column OuterXML. 
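
As a quick sanity check outside of Alteryx, the same datasource-level structure is easy to confirm in a few lines of Python. This is a sketch only (workbook.twb is again a placeholder), but it shows how each connection, and the Parameters pseudo-datasource, carries its own set of column children:

```
import xml.etree.ElementTree as ET

root = ET.parse("workbook.twb").getroot()  # placeholder file name

# Each <datasource> under <datasources> is one connection; parameters
# typically show up as their own "Parameters" datasource.
for ds in root.find("datasources"):
    name = ds.get("caption") or ds.get("name")
    print(name, "->", len(ds.findall("column")), "column elements")
```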


So I used a second XML Parse tool to bring in the Column child from the datasource root, which gave me a nice table of all of the column names and associated attributes, including the calculation formula.
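
For reference, here is roughly what that second parse pulls out, sketched in Python against the same placeholder file. The attribute names (name, caption, datatype) are the ones I’ve seen on column elements; calculated fields carry a nested calculation element holding the formula:

```
import xml.etree.ElementTree as ET

root = ET.parse("workbook.twb").getroot()  # placeholder file name

rows = []
for ds in root.find("datasources"):
    ds_name = ds.get("caption") or ds.get("name")
    for col in ds.findall("column"):
        calc = col.find("calculation")  # only present on calculated fields
        rows.append({
            "datasource": ds_name,
            "name": col.get("name"),        # e.g. [Calculation_123...]
            "caption": col.get("caption"),  # the label users actually see
            "datatype": col.get("datatype"),
            "formula": calc.get("formula") if calc is not None else None,
        })

for row in rows[:5]:
    print(row)
```

Iterating the XML directly keeps the datasource name on every row; in Alteryx that association has to be rebuilt, which is exactly what the join described next takes care of.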


This stream, though, didn’t appear to have the datasource name associated with it (at least not in a distinct field I could see/grab), so I was going to have to join the two streams together to get a complete data set with all column names, associated attributes and which datasource each column was a part of.  The problem was that the top stream had each column as a separate field (e.g. “column 3 name”) whereas the bottom stream had all of the names in a single field called Name.

To get around this I had to create a formula to take the output of the Transpose tool in the top stream and identify the column name associated with each column field.  In other words, find all instances where the Name field contained the string “column name” and then return the contents of the associated Value field.  I did the same thing for Column ID, which I figured could come in handy as well, so that each column in my data set had a unique ID number.
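
The logic of that formula is easier to see in a tiny Python sketch. The Name/Value pairs below are made up, but they mimic the shape of the Transpose tool’s output: keep the Value whenever the Name looks like a “column … name” (or “column … id”) field:

```
# Illustrative Name/Value pairs, shaped like the Transpose tool's output.
pairs = [
    ("column name", "[Sales]"),
    ("column caption", "Sales"),
    ("column 2 name", "[Calculation_123]"),
    ("column 2 id", "42"),
]

def pick(pairs, suffix):
    return [value for name, value in pairs
            if name.startswith("column") and name.endswith(suffix)]

print(pick(pairs, "name"))  # ['[Sales]', '[Calculation_123]']
print(pick(pairs, "id"))    # ['42']
```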



Joining the top and bottom streams on Column name gave me the complete data set I was looking for, at least in terms of all of the columns in my workbook, all of the attributes I was interested in and which datasource each column belonged to. Pretty cool. 



But I did notice one flaw in the content – the calculation formulas often contained references to the non-user-friendly column identifier vs the caption/alias that a user would see inside of the workbook. For example, the field [Calculation_8921123152237715] refers to a field I named QTD Linearity Delta but there’s no way to know that from the calculation string here:

IF [Calculation_8921123152237715] > 0 then "up" ELSEIF [Calculation_8921123152237715] < 0 then "down" ELSEIF [Calculation_8921123152237715] = 0 then "same" END

What I needed was a routine to replace all instances of these identifiers within each formula with the caption/alias each was associated to. This wouldn’t matter much if I just wanted to compare formulas across workbooks, since the actual string content of the calculation was still unique, but presenting this information to any human would be much harder if the formulas weren’t comprehensible. 
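
The substitution itself is simple once you have a lookup from identifier to caption. Here is a minimal Python sketch of the idea using the QTD Linearity Delta example from above (the lookup dict is illustrative); the reason it turns into an iterative macro in Alteryx is that the replacement has to happen row by row, one identifier per pass:

```
# Illustrative lookup from raw column identifier to user-facing caption.
name_to_caption = {
    "[Calculation_8921123152237715]": "QTD Linearity Delta",
}

formula = ('IF [Calculation_8921123152237715] > 0 then "up" '
           'ELSEIF [Calculation_8921123152237715] < 0 then "down" '
           'ELSEIF [Calculation_8921123152237715] = 0 then "same" END')

for identifier, caption in name_to_caption.items():
    formula = formula.replace(identifier, "[" + caption + "]")

print(formula)
# IF [QTD Linearity Delta] > 0 then "up" ELSEIF ... END
```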

So this is where the iterative macro entered the picture, since I needed to do this calculation clean up for a bunch of columns within my data set. 

Cleaning up the Calculations with an Iterative Macro

I set up my calculation clean-up workflow to take in the same input stream twice, so that I could use the caption from one of them and plug that into each formula in the other stream where its less readable identifier was being used instead. 


The hardest part of this workflow was figuring out the RegEx I needed to find all instances of the column identifiers in the calculation string and parse them out so that I could use them in both the join tool and replace formula. And since the final RegEx was pretty long:


(\[Calculation\d+\]|\[Calculation_\d+\]|\[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\]|\[Parameter\s\d+\s\d\]|\[Parameter\s\d+\])

…let me break it down for easier comprehension:

  • \[Calculation\d+\]|\[Calculation_\d+\] looks for any string that starts with the word “Calculation” followed either by a series of numbers or an underscore and then a series of numbers (e.g. [Calculation2] or [Calculation_8921123152237715])
  • \[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\] looks for any bracketed field name that ends with the sub-string “(copy)”, or with “(copy” followed by a space and a single digit (e.g. [Wk/Wk Pipeline (copy)])
  • \[Parameter\s\d+\s\d\]|\[Parameter\s\d+\] looks for any string that starts with the word “Parameter” followed by a space and one or more digits, optionally followed by another space and a single digit (e.g. [Parameter 1]).
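
To see what the pattern actually captures, here is the same expression dropped into Python’s re module and run against an illustrative formula that exercises all three branches:

```
import re

# The pattern from above, split across lines for readability.
pattern = re.compile(
    r"(\[Calculation\d+\]|\[Calculation_\d+\]"
    r"|\[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\]"
    r"|\[Parameter\s\d+\s\d\]|\[Parameter\s\d+\])"
)

# An illustrative formula touching calculations, copies and parameters.
formula = "[Calculation_8921123152237715] + [Wk/Wk Pipeline (copy)] * [Parameter 1]"

print(pattern.findall(formula))
# ['[Calculation_8921123152237715]', '[Wk/Wk Pipeline (copy)]', '[Parameter 1]']
```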

Once I had that worked out, I just needed to set up the “iterative” part of the iterative macro. This involved a few key steps:

1. Set up two Macro Input tools, one for the stream I want to clean and one for the stream that will provide the replacement values.  As I stated above, both of these input tools will take in the same data stream (i.e. the data set I created by joining the two sets of column data), but one will go down a path where the calculations will be cleaned and the other will go down a path to provide the alias values for the calculation strings.

2. Set up two Macro Output tools, one that would receive the processed records and one that would receive the records that still needed cleaning. 


3. Set up a formula to check whether a particular record needs to be processed again because it contains more than one “unreadable” column or parameter reference.   For example, this calculation

SUM(zn([Calculation_596726966189228044])) - SUM(zn([Open Pipeline Qtr (Current Wk) (copy)]))

contains two fields that need cleaning, but only one will be replaced on the first pass since the RegEx function only parses out one column identifier each time.  I could have created multiple parse functions and then joined on multiple fields, but I felt a single parse function inside of a loop was simpler.  So after a record is cleaned, the following formula checks to see if there are more column names that need cleaning.  If not, the record goes to the “Cleaned” Output tool; otherwise it goes to the “Still Dirty” Output tool, which loops that record back through the macro (I’ve sketched this loop in code just after this list).


4. Configure the Iterative Macro by telling it which Input and Output tools to use for the iterations and what to do with the output once the iterations were done.


5. Save the macro and then insert it into my primary workflow.
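
Here is the replace-one-identifier-per-pass loop from step 3, sketched in Python rather than as a macro. The caption lookup is illustrative, and max_passes stands in for the macro’s maximum-iterations setting:

```
import re

PATTERN = re.compile(
    r"(\[Calculation\d+\]|\[Calculation_\d+\]"
    r"|\[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\]"
    r"|\[Parameter\s\d+\s\d\]|\[Parameter\s\d+\])"
)

# Illustrative identifier -> caption lookup built from the column data set.
CAPTIONS = {
    "[Calculation_596726966189228044]": "Open Pipeline Qtr (Prior Wk)",
    "[Open Pipeline Qtr (Current Wk) (copy)]": "Open Pipeline Qtr (Current Wk)",
}

def clean(formula, captions, max_passes=20):
    """Replace one unreadable identifier per pass, like the iterative macro."""
    for _ in range(max_passes):
        match = PATTERN.search(formula)     # parse out ONE identifier
        if match is None or match.group(1) not in captions:
            return formula                  # nothing left: the record is "Cleaned"
        caption = captions[match.group(1)]
        formula = formula.replace(match.group(1), "[" + caption + "]")
        # otherwise the record is "Still Dirty" and goes around again
    return formula

dirty = ("SUM(zn([Calculation_596726966189228044])) "
         "- SUM(zn([Open Pipeline Qtr (Current Wk) (copy)]))")
print(clean(dirty, CAPTIONS))
```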



Turning the Primary Workflow into a Batch Macro

So now I had a workflow that read in a TWB file, parsed out all of the columns associated with each datasource and then fed all of those columns through an iterative macro to clean up the calculation formulas. Before I could call it done, though, I wanted to add a few final steps to it. 

1. Just like with columns, some of the datasource names I brought in were in the form of an identifier vs a readable string (e.g. “sqlproxy.42394.437988391204” vs. “Sales Analytics Data”). So I needed to add a routine that read in two different portions of the XML, joined the identifier to its alias and then added that alias into the primary workflow stream. 


2. Pull in the worksheet root object, identify which columns were used in each worksheet and then join those results with the main data stream on column name.  This will duplicate records (i.e. if the same field is in 5 worksheets, it and its attributes will show up 5 times in the final data set), but since nothing is being aggregated there is no risk in doing this.  You can still get a distinct count of unique columns (e.g. with COUNTD) while also knowing which columns are used in the most worksheets.
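
For the worksheet piece, this sketch shows the kind of lookup involved. In the workbooks I’ve inspected, the columns a sheet uses are listed under datasource-dependencies elements inside each worksheet, but treat that path as an assumption and check it against your own XML (workbook.twb is again a placeholder):

```
import xml.etree.ElementTree as ET

root = ET.parse("workbook.twb").getroot()  # placeholder file name

usage = []
for ws in root.iter("worksheet"):
    ws_name = ws.get("name")
    # Columns referenced by this sheet, grouped by the datasource they come from.
    for dep in ws.iter("datasource-dependencies"):
        for col in dep.findall("column"):
            usage.append((ws_name, dep.get("datasource"), col.get("name")))

for row in usage[:10]:
    print(row)
```

Joining those (worksheet, column) pairs back to the column data set on column name is what produces the intentional duplication described above.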



Once I had all of this, I had the workflow I wanted…but it could only read in one workbook at a time.  In order to read in more than one workbook at a time (e.g. an entire folder of TWB files), I had to convert the workflow into a batch macro. Luckily that part was really easy.  I just needed to add a Macro Input, a Macro Output and then configure the workflow to be saved as a batch macro. I also needed to set the macro to stack the output so that a single data set was produced that the calling workflow could consume. 


The Turducken is complete 
The last stage of the Alteryx Turducken was also very simple. In a new workflow, all I needed to do was set up a Directory tool to find a folder with TWB files, filter out the TWBX files (for some reason the Directory tool still pulls these in even though I specify .twb), feed the rows into the batch macro I just created and then take the batch macro output and save it as a .tde I can analyze in Tableau.  
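
The batch idea translates directly to code as well: walk a folder, skip the packaged .twbx files, run a per-workbook audit on each .twb and stack the rows. The folder path and the audit_workbook helper below are placeholders for whatever your own per-workbook parsing produces:

```
import xml.etree.ElementTree as ET
from pathlib import Path

def audit_workbook(twb_path):
    """Minimal stand-in for the batch macro: one row per (datasource, column)."""
    root = ET.parse(twb_path).getroot()
    rows = []
    for ds in root.find("datasources"):
        for col in ds.findall("column"):
            rows.append({
                "workbook": twb_path.name,
                "datasource": ds.get("caption") or ds.get("name"),
                "column": col.get("caption") or col.get("name"),
            })
    return rows

folder = Path("C:/tableau/workbooks")        # placeholder folder
all_rows = []
for twb in sorted(folder.glob("*.twb*")):    # picks up .twb and .twbx
    if twb.suffix.lower() == ".twbx":        # skip packaged workbooks, like the filter step
        continue
    all_rows.extend(audit_workbook(twb))

print(len(all_rows), "column rows across the folder")
```

From there the stacked rows can be written out in whatever format you want to analyze in Tableau.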


Turning this into an app just involved adding a couple of Interface tools (Folder Lookup and Action) to the front of the workflow and calling it done. I plan to pepper this with more parsing as time permits, but I’m satisfied with what I’ve dished up thus far. 

Bon Appétit and thanks for reading.  If you’d like to check out the Turducken, click here.

