Anatomy of a Viz

Every visualization is an opportunity to tell a story.  I don’t always take advantage of that opportunity, opting instead to just “display data” in a simple and user-friendly layout.  Granted, sometimes that is what I’m asked to do (business users do love their data tables), but I can say that whenever I approach a visualization as a storyteller, I am more pleased with the results and, candidly, I think the final visualization is always better.

And while my storytelling opportunities at work may be limited, there are countless opportunities “in the wild”.  MakeoverMonday is one very popular option (which I’ve almost religiously contributed to, although I have missed a few and am planning the requisite penance), but more recently I’ve started contributing to the Viz for Social Good project, a wonderful initiative that connects the data viz community with the non-profit community, harnessing and directing our geeky skills towards some very worthy causes.
 
I’d like to use this blog post to explain how I approached my latest viz for that project, both because a few people have asked about some of the specific design techniques I used and because it might help the future amnesiac me remember what I did (and why).  I find that as I age, repetition and reminders are increasingly my friends.

Getting acquainted with the story

Before I could figure out what story I wanted to tell, and then select the right design elements to support that story, I needed to acquaint myself with the contours and key details of the possible narrative.  This seems like an obvious step, but I can’t tell you how often I’ve just started building a viz only to get lost in a maze of charts and calculations that may be pretty or technically sophisticated but don’t say anything meaningful.
 
Chloe Tseng, the founder and driving force behind Viz for Social Good, always provides supporting materials for each project, including links to websites, news articles and relevant data sets.  So what were my main takeaways/impressions after perusing the materials she provided for the Hidden Genius Project initiative?
  • The focus of the project is explicitly young black men.
  • The focus of the training is technical, particularly coding.
  • African Americans are underrepresented in tech (both in terms of education and employment) and missing out on the growing opportunities in that field. 

So, in a sentence: Young black males are missing out on the growing opportunities in tech because they are not receiving the education they need to participate.

 
There were other possible stories to tell (e.g. the history of the Hidden Genius project, the structure of its programs, the number of students who have participated, individual success stories, etc), but the three items listed above are what resonated with me and what I wanted to focus on.  I also felt that the project’s website already did a great job telling some of those other stories, so linking to it (vs. replicating it) felt like a more fruitful approach.  That said, I’ve seen a few amazing vizzes that focused on some of those other elements (see here and here), but I suspect I wouldn’t have done as good of a job had I taken that approach since that isn’t where my inner storyteller wanted to go.

Selecting the story elements

The Central Image

Once I had the story I wanted to tell, the central design element came to me pretty quickly.  I wanted to have a photo of a young African American man dominate the viz, with half of his face obscured.  To me, this image conveyed the central theme of the story, namely that young black men were both overlooked and under-represented in the technology field.  I also felt that a large photo would have a more visceral impact and (hopefully) draw people in more than a number of charts would.  There would be charts too, but I wanted them to play a supporting role.
 
Finding the right photo took some time, since I needed it to meet a few specific criteria:
  • It needed to be available for me to use, i.e. not have restricted usage rights.  I didn’t always pay attention to this when I created visualizations, but I now know to make sure any image I use has been shared by the photographer in a way that allows me to use it without violating copyright.  Many photos are shared under Creative Commons, so the available pool is still pretty big, but it is both prudent and respectful to make sure any image you use is one you should use.  
    • An easy way to check is to use the Google Image search tools and only select images that are labeled for re-use.  
    • Some sites (like Flickr, where I ended up finding the photo I used) also allow you to see the usage rights for an image.  Clicking the “Some rights reserved” link will tell you exactly what you can and cannot do with the image.
  • I wanted it to be monochrome and have no background.  I know that you can desaturate any color photo and remove backgrounds, but those post-production steps don’t always produce the best results, so I was looking for a photo that I could use as-is.
  • I wanted a photo that would look good big.  Some photos shared online are available only in smaller sizes that get very pixelated when you enlarge them.

I eventually found the perfect photo and went about obscuring one half of the young man’s face.  There are a number of ways to do this.  Pooja Gandhi (who inspired me to use semi-transparent images in my vizzes) uses MS Word, but since I have a Mac without MS Office, I do my photo prep in the free software GIMP.

 
The steps to create a split and semi-transparent image are pretty simple:
 
1. Open the photo in GIMP
 
2. Duplicate the layer (right-click on the image in the Layers palette and select Duplicate Layer)
 
3. Lower the opacity of the bottom layer.  This may require some trial and error to get to the right amount.  Since I wanted to still see a little of his face, I set the opacity to 15 (or 14.9 in the uber-precise Gimpian universe).
4. Add a layer mask to the top layer, initializing it to “White” (or full opacity).
 
5. Use the paintbrush tool to color in half of the layer mask with black.  What painting with black does is obscure any part of the image that you paint over, allowing anything underneath to show through.
 
6. Save the image as a .png file, as that will maintain the transparency.
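
For those who would rather script this than click through it, here is a minimal sketch of the same effect in Python with the Pillow library (not the GIMP process I used, just an equivalent result; the file names and the 15% opacity are placeholders from the steps above):

from PIL import Image

# Rough equivalent of the GIMP steps above: fade the whole photo to ~15%
# opacity, then paste the right half back at full opacity.
SRC = "young_man.png"          # placeholder file name
OUT = "young_man_split.png"

img = Image.open(SRC).convert("RGBA")
w, h = img.size

faded = img.copy()
alpha = faded.getchannel("A").point(lambda a: int(a * 0.15))
faded.putalpha(alpha)                     # the lowered-opacity "bottom layer"

right_half = img.crop((w // 2, 0, w, h))  # the unmasked half of the "top layer"
faded.paste(right_half, (w // 2, 0))

faded.save(OUT)                           # PNG keeps the transparency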



Once you have the semi-transparent image file, simply open it inside of Tableau by dragging an Image object onto a dashboard and selecting the file you just saved.  In my case, I also set the dashboard background to black so that the image blended in the way I wanted it to.

The Color Palette

I wanted to let the Hidden Genius logo drive the color palette.  As such, I decided to only use monochrome colors (i.e. gradients of black, white and grey) and the specific shade of green used in the logo (which I sampled via the color picker tool in Tableau and added as a custom palette).

In particular, I wanted to use the green as an accent color to identify the focus areas of the story, which were African American men and the Tech field (specifically coding/computers).  So, in each of the charts and text boxes, I only used that green to format elements that related to either African Americans or the computer industry.

Floating and Faux-Transparency

Now that I had my central image and overall color palette, it was time to start adding the other elements to my dashboard.  This is where floating and floating order become indispensable design tools, since they allow you to have precise control over layout and how the different elements on your dashboard interact.

I wanted to divide my canvas based on the split nature of the photo.  The left side would highlight the “Hidden” nature of African Americans in STEM fields (both in terms of employment and education), and the right side would focus on the opportunity available to the untapped potential of this target demographic.


So, first came the title, which is just two text boxes floated on top of the photo.  Their placement and formatting are very deliberate.

  • I chose to make the “Hidden” text box smaller than the “Genius” text box (80 pixels high vs. 100) and used more subdued colors to shift the visual emphasis to the right.
  • I placed the “Genius” text box atop the young man’s brain, as that is where the untapped potential resides.  This also ties in with the quote I chose for the bottom portion of the viz, namely “There’s value in you, your brain, and what you’re going to create.”
  • I used a skinny text box (height of 1-2 pixels) to connect the two text boxes, which helps underscore that this project guides the young men from obscurity to visible success.


The rest of the left side consists of a couple of sheets and a few text boxes.   Beyond the choice of color palette described above, I only made a few additional formatting decisions:

  • I set the Worksheet shading to ‘None’ so that they inherited the background color from the dashboard (i.e. black).
  • I made sure the worksheets were lower in floating order than the photo.  You can do this by selecting and dragging the objects where you want them on the Layout panel.  This is the way you make it appear that the sheet background is transparent.  In actuality, the photo is just on top of the sheet, but since you’ve set the opacity of the photo very low, it doesn’t obscure the chart.
  • I positioned the line chart so that the young man’s right eye could still be seen.
 

The right side followed a similar approach, only I didn’t need to worry about floating order as much since I didn’t want any of the elements (the snippet of code, the bar chart) to be on top of the young man’s face.  I just nudged things around the canvas until my OCD was satisfied. This mostly involved using the Position and Size settings in the Layout pane to ensure each element was where I wanted it to be.

Finishing Touches

Sometimes I like to place borders around my text boxes, and sometimes I don’t.  Sometimes, though, I only want a border on one side of a text box, as a way of providing a little separation between the text box and the background without being too “heavy”.  To do this I can’t use the standard formatting options, since it’s all or nothing when it comes to borders (i.e. all four sides or no sides).  In these cases, you can use another text box (with a width of 1-2 pixels) and simply place it alongside the edge of the other text box.

Text box without borders
Text box with a skinny text box acting as a left-hand border


And finally, since there were a number of data sources used for this viz, I wanted an easy way for the viewer to see where the data came from (and maybe get a little more contextual information) without having to go to another tab or reference a single info pop-up in one corner of the viz.  So instead I created a few info sheets and placed them directly next to the specific data I wanted sourced.


Each info sheet contains a single measure (I created a measure called ‘One’ which is just the number 1) with a Mark Type of Shape. For the shape I used an information icon downloaded from Google.  Read this post in case you don’t know how to add custom shapes to Tableau.  It’s very easy.


I set each sheet to have a width and height of 20 pixels and placed them onto the dashboard canvas.  I then added the necessary reference information to each sheet’s tool tip.  Since I wanted people to be able to hover over these sheets and read the information in the tool-tip, I made sure that these info sheets were higher in the floating order than the main image.


___________________

Each viewer will ultimately decide whether all of these design choices aided the visualization.  Was the story clear?  Was it compelling?  Did I make it easy for them to engage with the data?  I certainly hope so.  But I will know that I did my best job when I can clearly articulate why each element was added to the visualization and what purpose it served in telling the story I wanted to tell.  And by being able to do that, I believe I increase my chances of connecting with my audience and making a successful visualization.

Thanks for reading.


Crossing Streams is a Good Thing in Alteryx

When you’re fighting ghosts, crossing streams might be something you want to avoid. But when dealing with data inside of Alteryx, you should embrace creating and crossing as many streams as necessary to get the results you want. 


A co-worker was recently confronted by an Excel spreadsheet that didn’t want to conform to the workflow he was building in Alteryx. It was one of those spreadsheets that are all too common in corporations – blank rows, hidden columns, stacked headers, sub-total rows sprinkled throughout, etc. The only corporate “norm” it didn’t possess was a rainbow color palette.  


In this case, he needed to extract the total forecast volume by product and calendar date from the spreadsheet shown below. 

In other words:


1. Grab only the subtotal rows
2. Strip out the word “total” from the product name
3. Grab the calendar date header, but only from the columns where the secondary header didn’t say “Actuals”

Numbers 1 and 2 seemed pretty straightforward, but #3 presented a challenge since it wasn’t immediately clear how to pull out the date header, which sat one row above the other header values.  However, after a bit of noodling, this proved very quick and easy to accomplish with a handful of Alteryx’s awesome tools. 

Step 1 – Create two data streams 

In order to associate the date headers to the rest of the data, I needed to break out the Excel file into two data streams that could be joined together after a bit of clean-up. The multiple streams also allowed me to get rid of the blank rows I didn’t want. 

By adding a Record ID tool to the Input tool, I could then use a Filter tool to select Row 2 as one stream and Rows >= 5 as a second stream, getting rid of rows 1, 3 and 4 which were blank. 

Step 2 – Create a column ID field 

Now that I had two data streams, I needed a way to join them back together and add the calendar date from stream one to the data in stream two. The logical field for that would be the column position in the spreadsheet. So I used a Transpose tool to un-pivot the data in both streams at the same point, namely where the main headers ended and the date fields began.

At this point in the workflow all of the column headers were the default “F#” format you get when you import an Excel file without explicit headers. I would need to add actual header names eventually but those numbers were what I would use to generate the column ID field. All I needed to do was strip out the “F”. Easy enough with the RegEx tool.

Once I had the column ID in both streams, I could use that field in a Join tool to add the calendar date to the main data stream as an additional field.  Now I had a complete data set with every calendar date associated to every product.  The only problem was that the date was showing as a number rather than an actual date.  This happened because a column can only have a single data type; since each column in the spreadsheet mixed strings, numbers and dates, everything was imported as a String, a conversion that reverted the formatted date values back to Excel’s underlying date serial numbers.
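
If you want to see the shape of that logic outside of Alteryx, here is a rough pandas sketch of the same split, un-pivot and re-join (the file name, the row positions and the assumption that the first three columns are descriptive headers are all placeholders based on the layout described above):

import pandas as pd

raw = pd.read_excel("forecast.xlsx", header=None)      # placeholder file name

dates = raw.iloc[1]                  # stream 1: row 2 holds the calendar dates
data = raw.iloc[4:].reset_index(drop=True)   # stream 2: rows 5+ hold the data

id_cols = [0, 1, 2]                  # assumed descriptive columns
date_cols = [c for c in raw.columns if c not in id_cols]

# Un-pivot both streams at the same point and join on column position,
# which plays the role of the Record ID / column ID field in the workflow.
long_data = data.melt(id_vars=id_cols, value_vars=date_cols,
                      var_name="col_id", value_name="volume")
long_dates = (dates[date_cols].rename("forecast_date")
              .rename_axis("col_id").reset_index())

joined = long_data.merge(long_dates, on="col_id")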

Step 3 – Convert the Excel date number into an actual date

A quick search of the fantastic Alteryx Community forums landed me on the solution for this. Since the number Excel assigns to a date value is simply the number of days since Jan 1, 1900, all I had to do was create that arithmetic calculation and wrap it inside of a DateAdd function to spit out the original date.
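
The same arithmetic in Python, for anyone following along outside of Alteryx (the conventional epoch of Dec 30, 1899 sidesteps Excel’s fictitious Feb 29, 1900 and works for any modern date):

from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    # Excel stores a date as the number of days since its 1900 epoch;
    # adding the serial to Dec 30, 1899 recovers the date for anything
    # after February 1900.
    return date(1899, 12, 30) + timedelta(days=serial)

print(excel_serial_to_date(44927))   # 2023-01-01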

Step 4 – Name the columns

As I mentioned above, many of the column names still had useless names like F1, F2, etc.  The columns I had added to the workflow, via the Join or a Formula tool, already had good names but the original fields from my primary data stream did not.  Those field names existed as data values in the first row. 

 A Dynamic Rename tool allowed me to use those first row values to rename the headers.

Step 5 – Grab only the total rows

The final step in getting the required data set out of this Excel file was to filter the data set down to just the total rows.  I used a Filter tool to look for any row where the word “total” existed. There were two fields where this word existed – PROD_NO and ROLL_UP. I chose the latter but either would have worked – the PROD_NO field would have required a Contains condition whereas the ROLL_UP field required an Equals condition. 

Once I had the totals, I used a RegEx tool to strip out the word “Total” from the product name so that I could cleanly join this data with other product data (the data in this particular Excel file was merely one of several inputs in a larger workflow).
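
Continuing the rough pandas sketch from earlier, those last two steps would look something like this (assuming the columns have already been renamed to PROD_NO and ROLL_UP as in Step 4):

# Keep only the sub-total rows, then strip the word "Total" from the
# product name so it joins cleanly with the other product data.
totals = joined[joined["ROLL_UP"].str.lower() == "total"].copy()
totals["PROD_NO"] = (totals["PROD_NO"]
                     .str.replace(r"(?i)\btotal\b", "", regex=True)
                     .str.strip())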

And with that, the Excel file was tamed and ready to play nice with my other data sets.  Thanks for reading.

A love letter (of sorts) to Makeover Monday

Throughout my life I’ve had an uneasy relationship between my logical and creative impulses. I’ve pursued both, but have generally allowed the logical impulses to dominate or win when a crossroads was reached. For example, as a kid and young man I had dreams of going into filmmaking or writing, but I never allowed myself to fully embrace those dreams (i.e. put in the requisite time and effort) because my logical side convinced me that these paths were too risky to go down without a solid practical fallback. But in going about establishing the fallback (e.g. getting and keeping good corporate jobs) I allowed the fallback to become the path, and the road my creative side had dreamed of traveling down became an alleyway I’d periodically take a detour into instead.
This isn’t to say that the logical path is a bad or unfulfilling one. My logical impulses are still a fundamental part of me (it is why I am happier in an orderly vs messy house, and why I get a visceral thrill when I build workflows in Alteryx), and following the practical path has provided me and my family with a relatively comfortable life. But when the logical side is disproportionately catered to, like when I allow my perfectionist tendencies at work to eat up the hours I could have spent out in nature with my camera, I really feel that imbalance and seek out ways to right the existential ship.
I know this dissonance is fairly common, and there are likely multiple reasons why I struggle with it. That said, if I had to synthesize my psychological conflict down to a single moment from my past, it’d be the time my family and I were visiting my aunt and uncle in Germany, back when I was a 16-year-old daydreamer with gauzy aspirations of becoming a famous film director.  We were all sitting in my aunt and uncle’s basement, which had been converted into a bar with liters and liters of exquisite German beer on tap.  My uncle, a successful businessman, and my dad, a successful lawyer, were talking about how much happier they were in their 50s than they ever were in their 20s and 30s, mainly because of the comfortable lives they’d built for themselves and their families.  My mom and aunt were sitting at the other end of the bar discussing the struggles some friends’ kids were going through due to their dogged pursuit of a “Brotlose Kunst” (German for “breadless art”, i.e. artistic endeavors that don’t earn any money).   And I sat between them, enjoying the buzz from a beer I would eventually grow quite fond of while fortifying a worldview, forged of pragmatism and insecurity, that saw no viable connection between that early artistic hardship and a desired future of comfort and happiness.  As a result, my creative impulses tended to remain in the basement (metaphorically) while the pragmatic / logical pursuits took over the main floor.
Enter Makeover Monday
This autobiographical preamble is just a way of explaining why I’m so appreciative of how the Makeover Monday project and community has encouraged me to marry my logical and creative impulses in ways I hadn’t before, and why my love affair with Tableau has only deepened as a result, since that impressive piece of software doesn’t stand in the way of this often unlikely marriage.
In my particular case, the creative and logical coupling has taken the form of adding photos to my Tableau vizzes. And not just adding an image to the top of my viz (which I’ve done before), but rather selecting images that captured an essence of the story the data was telling and then allowing the selected image (or images) to influence all of my other design choices, including dashboard layout, background colors, fonts and which dashboard elements should float (increasingly all of them).  This approach to creating a viz is very different from the way I’ve designed most Tableau vizzes throughout my career, where tiled order, simplicity and unemotional displays of data have been my guiding principles. At work, where I did the majority of my Tableau vizzing prior to Makeover Monday, I was definitely on the Few side of the Few/McCandless spectrum.  And while there is nothing wrong with that (there’s a reason why Few’s guidelines are part of the bedrock of data viz education – they do very effectively convey information), it’s good to flex across the spectrum and explore the benefits of connecting with your audience on a more emotional or aesthetic level.
For example, one of the weeks’ data sets involved theft in Japan.  It contained some interesting tidbits, like the fact that theft of women’s underwear was a big enough event to warrant its own category in the data set, but the big story seemed to be that bicycles were far and away the most attractive target for Japan’s thieves. Conveying this fact wouldn’t require more than a simple bar chart in descending order of incidents – the bar for bikes is much larger than all other categories.  But how interesting would that visual be?  Not very.  It’d get a nod of approval from my bosses at work (along with perhaps a frustrating request for a complementary table of numbers), but that wasn’t my aim this time.  I wanted a photo backdrop to that bar chart that conveyed the key takeaway (that bikes were the favorite target of thieves in Japan) in a more emotionally engaging way.
So to Google Image Search I went.  I find that this is often one of the lengthiest parts of the design process, since it is so important to find good images. Not only do the images need to convey some part of the data story, but they should ideally be composed in a way that encourages the layout of an effective dashboard. So negative space (i.e. a portion of the image file where nothing exists) or the ability to extend the image into negative space (i.e. using masking to blend the image into a larger, complementary background image) is something I look for.  I also allow my intended mood for the viz (humorous, serious) to help me filter the results down to something I want to use.
In this case, when I came across the image of the masked thief reaching his large hand toward the viewer, I knew I had my anchor photo.  But I needed to incorporate Japan and bikes into it somehow in order to complete the scene. Pictures of bikes and the Japanese flag were in ready abundance, but I wanted images that either reflected or could be incorporated into the overall sinister feel of the main photo.  The image of the Japanese flag already embellished with shadows was perfect, and the heavily silhouetted photo of a bicyclist riding away from the camera would work because it both didn’t add unnecessary color (I intentionally searched for monochrome images) and was moody enough that it could convey multiple feelings depending on how it was used.

To have it fit into the dark mood I was going for, I knew that I wanted the bicyclist to be riding toward the outstretched hand of the thief, like he was riding into a trap.  In order to achieve that, I would have to extend the bottom portion of the thief image so that there was enough room for the bicyclist photo to fit and still have some negative space below it (I didn’t want any of the layout elements to bump up against the borders).

This is where layers and masking entered the process, photo editing techniques I hadn’t used in a while but still had muscle memory for. Since I no longer had a working copy of Photoshop (the software I’d used in the past to layer images), I downloaded GIMP, a free and Mac-friendly software package that provides most of the functionality Photoshop does and in a very similar interface.

Using the photo of the thief as my baseline, I created a new layer that was just as wide but about twice as long, and then filled it with a single color sampled from the bottom of the thief photo (using the color picker tool). After that I dragged the thief photo atop the new layer, added a mask onto the thief layer and then used the brush tool to blend the transition between these two layers so that it looked like a single image.

Once I had that, I dragged the bike and flag images on top of the thief layer (so that I now had four separate layers), arranged them where I wanted them (with the bike below the hand, and the flag near the head of the thief to reinforce the story of Japanese thieves) and then used masking and brushwork on both the bike and flag layers to blend them into the overall image. This primarily involved removing the hard borders so that the visual transition between these photos and the background layer seemed natural. After adding a simple text box for my dashboard title, I had a complete background image I could export from GIMP and bring into Tableau.

I sized the dashboard to be the same as the exported image file and then placed the image into the dashboard. At that point it was just a matter of bringing the simple bar chart I mentioned earlier onto the dashboard as a floating object and editing the bar colors so that the Bicycle bar stood out from the rest. I chose to use the same red as was in the Japanese flag to visually connect the elements together. To add a bit of humor into it, I also highlighted the bar for women’s underwear, making it the same pink color as the panties in the underwear ninja cartoon I serendipitously discovered during my initial Googling.  All I had to do with that one was replace the blue background with a black one.

While the final viz didn’t convey any additional information over what the basic bar chart offered, it was certainly more engaging and a hell of a lot more fun to create.

I used a similar approach for another week where we looked at the militarization of the Middle East, an arguably more serious data set but one where the embedded stories could just as easily get neutered by a quotidian visualization. In this case the story that resonated most with me was the fact that lying just beneath the surface of these numbers were human lives destroyed by warfare.  So I wanted to create a connection between the sterile numbers that counted the volume of arms traded and the lives that those weapons ruined. To do that, I envisioned a layout where the top section focused on the number of weapons exported over time and the bottom section drew attention to the human impact of that arms trade. There weren’t any stats in the data set about lives lost, so I knew that this portion of the story would rely solely on the selected image and any corresponding text. I also wanted to create a transition between the top and bottom so that the two sides of this story sort of “bled” into each other.
That blood metaphor influenced my design of the bottom section – it was going to be some shade of red and feature an image that captured the human suffering of warfare.  So if the bottom was going to be red, then the top section would need to be a color that could blend well into red while also serving as a backdrop for the unemotional data that I would have up there. Black fit that bill. While the color black can certainly pack an emotional wallop, it also rivals white in its use as a standard background color and therefore inherits the neutrality that comes with ubiquity.

After a bit of searching I found two images that would work with that color scheme – a black & white photo of shells being prepared in a munitions factory (a visual nod to the counting inherent in the data) and then a photo of a young, bloodied war victim.  I first created a background layer in GIMP set to the dimensions I wanted for my viz (it needed to be long so that the bottom portion didn’t show up immediately when someone looked at it – I wanted them to scroll before seeing the image at the bottom) and then placed the two photos in their respective sections.  Then some masking to remove the hard edges and blend them into a cohesive whole and a gradient fill for the background layer that went from black at the top (sampled from the black in the shell prep photo) to a deep red at the bottom (sampled from the blood on the child’s face) and I had my background image.

In Tableau, I set up the charts to go from the very macro (i.e. the total dollars that have exchanged hands in the arms trade) to the more micro (i.e. the increase in imports into the Middle East since 2002) and made sure to use colors that either pulled directly from the background image or were complementary to it.  I floated all of them so that I could control the amount of “breathing room” between the charts.  The final step was to add some descriptive text to each of the chart titles to reinforce the story I was telling visually and then the viz was complete.

I recognize that this approach to creating vizzes can eat up a lot more time than the recommended hour we are supposed to spend on our Makeover Monday entries, and that it can run counter to the K.I.S.S. mantra, but for me, the opportunity to move beyond the simple communication of numbers (while still paying attention to the effective communication of story) has been a rewarding one. A design-oriented mindset is not my default state, and so affording myself the time to consider things like color, font and layout has increased my overall confidence in these areas and helped build a new bridge between the logical and creative sides of my personality. While I won’t always take this approach when I visualize data (time, audience and purpose need to align for that), having this option in my toolkit gives me a level of creativity I didn’t have six months ago. And for that I am grateful.
Thanks for reading.

Creating a TWB (Tableau Workbook) Audit Workflow in Alteryx


I recently created my first Alteryx Turducken – an iterative macro inside of a batch macro inside of an app.  I didn’t set out to create a Turducken, but my workflow naturally evolved into one as I pursued my goal, which was to audit the content of my Tableau workbooks.  I know there are a number of resources out there (some free, some not) that will comb through the XML of Tableau workbooks and tell you how many calculated fields you have, which fields are used in which worksheets etc., but I liked the idea of creating my own, both to have complete control over the end product and also to expand my Alteryx proficiency.


I’m going to document my approach in this post for three reasons:
  1. Given how long this took me (I’d never worked with XML before so I went down many dead ends before I got some things to work), I want to create a reference for myself so my next foray into this is much quicker.
  2. Provide a reference to others who might have a similar goal.
  3. Expose my steps in the hope that smarter people out on the interwebs can advise me on how to make my workflow better.

So the first thing I needed to come to grips with was how to read in the XML from any TWB file.  There’s a great XML Parse tool in Alteryx, but it expects you to have a basic grasp of the underlying XML so that you can tell it what to pull in. 


I had no idea what any of this meant when I first looked at this tool, and I still only have a loose grasp of it.  XML is clearly logically structured, but there’s quite a bit of repetitiveness in the code (i.e. elements like column names, worksheets etc. show up numerous times in different sections), making it hard for a novice to ascertain where the juicy and relevant bits are.  And the XML for a TWB with a fair number of fields and worksheets is intimidatingly long!  One of my workbooks had almost 30,000 lines of code that all looked like this.  God help me.


But the core insights I arrived at after perusing the help were these:
  • The _OuterXML fields are the ones I need to feed into the XML Parse tool.  If I fed in other fields I’d often either get error messages (e.g. ‘Invalid document structure’) or just minimal and useless results.
  • There are multiple key root elements in a workbook (e.g. datasources, worksheets, etc.) that all have children (e.g. column names, filters, etc.) and the same elements could be children of different roots (e.g. a column could be a child of a data source and a worksheet). 
So knowing which OuterXML of which root element to bring in and which children to look for was going to be key to getting the workbook insights I was looking for.  Given that my driving interest was primarily around which fields were in my workbooks (and how/where they were used), I figured I’d start there.
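
For anyone who would rather poke at the same XML outside of Alteryx first, here is a small Python sketch of the idea; the element and attribute names reflect my reading of the TWB XML, so treat them as a starting point rather than a spec:

import xml.etree.ElementTree as ET

root = ET.parse("my_workbook.twb").getroot()       # a TWB file is plain XML

# Each <datasource> (parameters included) owns <column> children; calculated
# fields carry a nested <calculation formula="..."> element.
for ds in root.iter("datasource"):
    ds_name = ds.get("caption") or ds.get("name")
    for col in ds.iter("column"):
        calc = col.find("calculation")
        print(ds_name,
              col.get("caption") or col.get("name"),
              col.get("datatype"),
              calc.get("formula") if calc is not None else "")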

Identifying the Columns in a Workbook

Before I could do any parsing I had to connect an Input Tool to a TWB file and check pretty much everything (root element, child values, OuterXML) since I knew I needed all of it downstream.  I suspect there are use cases where you wouldn’t need to bring all of this stuff in, but I knew I needed the kitchen sink for my workflow. 


Once I had my connection, I then needed to find the columns.  This proved to be a little less straightforward than I hoped it would be, but it showed me how there are at least two ways to arrive at the same end, one that leverages a single XML Parse tool and a lot of RegEx (to parse fields from within long sections of script) and another that uses two XML Parse tools and some simpler Select and Join tools.  I opted for the latter.


The first XML Parse tool (1) pulls in the datasource root and lists out all of the child elements associated with each datasource.  If you only have a single data connection and no parameters, then you will only have one datasource, but if you have multiple connections or parameters in your workbook then each of those will be a separate datasource (i.e. Tableau considers your parameters a distinct datasource in the XML).  As you can see below (which shows a Transpose tool placed after the output of the XML Parse tool), each child element is returned as a separate data field in the result set, including separate fields for many of the column attributes (e.g. size, type, etc.).   So the first column in the datasource has fields titled “column name”, “column caption”, etc., the second has fields titled “column 2 name”, “column 2 caption” and so on. But some key attributes (like calculation) are still missing.  Not sure why, but they are.  Perhaps they’re buried inside of the Column OuterXML. 


So I used a second XML Parse tool to bring in the Column child from the datasource root, which gave me a nice table of all of the column names and associated attributes including calculation formula.


This stream, though, didn’t appear to have the datasource name associated with it (at least not in a distinct field I could see/grab), so I was going to have to join the two streams together to get a complete data set with all column names, associated attributes and which datasource it was a part of.  The problem was that the top stream had each column as a separate field (e.g. “column 3 name”) whereas the bottom stream had all of the names in a single field called Name.

To get around this I had to create a formula to take the output of the Transpose tool in the top stream and identify the column name associated with each column field.  In other words, find all instances where the Name field contained the string “column name” and then return the contents of the associated Value field.  I did the same thing for Column ID, which I figure could come in handy as well so that each column in my data set had a unique ID number.



Joining the top and bottom streams on Column name gave me the complete data set I was looking for, at least in terms of all of the columns in my workbook, all of the attributes I was interested in and which datasource each column belonged to. Pretty cool. 



But I did notice one flaw in the content – the calculation formulas often contained references to the non-user-friendly column identifier vs the caption/alias that a user would see inside of the workbook. For example, the field [Calculation_8921123152237715] refers to a field I named QTD Linearity Delta but there’s no way to know that from the calculation string here:

IF [Calculation_8921123152237715] > 0 THEN "up" ELSEIF [Calculation_8921123152237715] < 0 THEN "down" ELSEIF [Calculation_8921123152237715] = 0 THEN "same" END

What I needed was a routine to replace all instances of these identifiers within each formula with the caption/alias it was associated to. This wouldn’t matter much if I just wanted to compare formulas across workbooks, since the actual string content of the calculation was still unique, but presenting this information to any human would be much harder if the entire formulas weren’t comprehensible. 

So this is where the iterative macro entered the picture, since I needed to do this calculation clean up for a bunch of columns within my data set. 

Cleaning up the Calculations with an Iterative Macro

I set up my calculation clean-up workflow to take in the same input stream twice, so that I could use the caption from one of them and plug that into each formula in the other stream where its less readable identifier was being used instead. 


The hardest part of this workflow was figuring out the RegEx I needed to find all instances of the column identifiers in the calculation string and parse them out so that I could use them in both the join tool and replace formula. And since the final RegEx was pretty long:


(\[Calculation\d+\]|\[Calculation_\d+\]|\[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\]|\[Parameter\s\d+\s\d\]|\[Parameter\s\d+\])

…let me break it down for easier comprehension:

  • \[Calculation\d+\]|\[Calculation_\d+\] looks for any string that starts with the word “Calculation” followed either by a series of numbers or an underscore and then a series of numbers (e.g. [Calculation2] or [Calculation_8921123152237715])
  • \[[^\]]+\s\(copy\)\]|\[[^\]]+\s\(copy\s\d\)\] looks for any bracketed field name that ends with the sub-string “(copy)”, or with “(copy” followed by a space and a single digit (e.g. [Wk/Wk Pipeline (copy)] or [Wk/Wk Pipeline (copy 2)])
  • \[Parameter\s\d+\s\d\]|\[Parameter\s\d+\] looks for any bracketed name that starts with the word “Parameter” followed by a space and one or more digits, optionally followed by another space and a single digit (e.g. [Parameter 12] or [Parameter 12 2]).
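
Here is a rough Python rendering of that clean-up loop, with the three alternations condensed into optional groups; the captions dictionary stands in for the join the macro performs, and its single entry is made up for illustration:

import re

IDENT = re.compile(
    r"(\[Calculation_?\d+\]"              # [Calculation2] / [Calculation_892...]
    r"|\[[^\]]+\s\(copy(?:\s\d)?\)\]"     # [... (copy)] / [... (copy 2)]
    r"|\[Parameter\s\d+(?:\s\d)?\])"      # [Parameter 12] / [Parameter 12 2]
)

captions = {"[Calculation_8921123152237715]": "[QTD Linearity Delta]"}

def clean(formula: str) -> str:
    # Replace one identifier per pass and loop until none are left,
    # which is what the iterative macro does one record at a time.
    while True:
        m = IDENT.search(formula)
        if m is None or m.group(1) not in captions:
            return formula
        formula = formula.replace(m.group(1), captions[m.group(1)])

print(clean('IF [Calculation_8921123152237715] > 0 THEN "up" END'))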

Once I had that worked out, I just needed to set up the “iterative” part of the iterative macro. This involved a few key steps:

1. Set up two Macro Input tools, one for the stream I want to clean and one for the stream that will provide the replacement values.  As I stated above, both of these input tools will take in the same data stream (i.e. the data set I created by joining the two sets of column data), but one will go down a path where the calculations will be cleaned and the other will go down a path to provide the alias values for the calculation strings.

2. Set up two Macro Output tools, one that would receive the processed records and one that would receive the records that still needed cleaning. 


3. Set up a formula to check whether a particular record needs to be processed more than once if it contains more than one “unreadable” column or parameter reference.   For example, this calculation

SUM(zn([Calculation_596726966189228044]))- SUM(zn([Open Pipeline Qtr (Current Wk) (copy)]))

contains two fields that need cleaning, but only one will be replaced on the first pass since the RegEx function will only parse out one column identifier each time.  I could have created multiple parse functions and then joined on multiple fields, but I felt a single parse function inside of a loop was simpler.  So after a record is cleaned the following formula checks to see if there are more column names that need cleaning.  If no, then it goes to the “Cleaned” Output tool, otherwise it goes to the “Still Dirty” Output tool which loops that record back through the macro.


4. Configure the Iterative Macro by telling it which Input and Output tools to use for the iterations and what to do with the output once the iterations were done.


5. Save the macro and then insert it into my primary workflow.



Turning the Primary Workflow into a Batch Macro

So now I had a workflow that read in a TWB file, parsed out all of the columns associated with each datasource and then fed all of those columns through an iterative macro to clean up the calculation formulas. Before I could call it done, though, I wanted to add a few final steps to it. 

1. Just like with columns, some of the datasource names I brought in were in the form of an identifier vs a readable string (e.g. “sqlproxy.42394.437988391204” vs. “Sales Analytics Data”). So I needed to add a routine that read in two different portions of the XML and joined the identifier to its alias and then add that alias into the primary workflow stream. 


2. Pull in the worksheet root object and identify which columns were used in each and then join those results with the main data stream on column name.  This will duplicate records (i.e. if the same field is in 5 worksheets, it and its attributes will show up 5 times in the final data set) but since nothing is being aggregated there is no risk in doing this.  You can still get a distinct count of unique columns (e.g. COUNTD) while also knowing which columns are used in the most worksheets.



Once I had all of this, I had the workflow I wanted…but it could only read in one workbook at a time.  In order to read in more than one workbook at a time (e.g. an entire folder of TWB files), I had to convert the workflow into a batch macro. Luckily that part was really easy.  I just needed to add a Macro Input, a Macro Output and then configure the workflow to be saved as a batch macro. I also needed to set the macro to stack the output so that a single data set was produced that the calling workflow could consume. 


The Turducken is complete 
The last stage of the Alteryx Turducken was also very simple. In a new workflow, all I needed to do was set up a Directory tool to find a folder with TWB files, filter out the TWBX files (for some reason the Directory tool still pulls these in even though I specify .twb), feed the rows into the batch macro I just created and then take the batch macro output and save it as a .tde I can analyze in Tableau.  
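
For comparison, the directory-plus-filter step looks something like this in Python (the folder name is a placeholder, and each surviving path would then be fed through whatever parsing routine you use):

from pathlib import Path

# Keep only true .twb files; an exact suffix check filters out the packaged
# .twbx workbooks, mirroring the Filter tool after the Directory tool.
twb_files = [p for p in Path("workbooks").iterdir()
             if p.suffix.lower() == ".twb"]

for path in twb_files:
    print(path)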


Turning this into an app just involved adding a couple Interface tools (Folder Lookup and Action) to the front of the workflow and calling it done. I plan to pepper this with more parsing as time permits, but I’m satisfied with what I’ve dished up thus far. 

Bon Appétit and thanks for reading.  If you’d like to check out the Turducken, click here.


Controlling the Format of the Cross Tab Header in Alteryx

Alteryx has convinced me that data is infinitely malleable.  You can flip it, spin it, twist it, shape it, break it into a million pieces and then rebuild it into the precise configuration you need.  So long as you know what you want the output to look like, Alteryx will let you get there.  The exact steps may not be immediately obvious (especially to people starting out), but there are enough ways to combine tools that you will ultimately get the output you want.

I recently got a request to take a bunch of disparate data files and produce a single output file.  This is bread and butter stuff for Alteryx.  A few joins, unions, filters and formulas allowed me to take half a dozen individual files and produce (within a few seconds) a single, consolidated Excel file.  Awesome.  Job done…or so I thought.

After I produced the output file, I was given the spec required to upload the consolidated data into a customer’s system.  They needed the data pivoted so that the header dates became columns, with the dates sorted in ascending order.

Okay…that’s what the Cross Tab tool is for.  So I pivoted on Forecast Date (i.e. made Forecast Date the new column header), which did create a column for each unique date, but the field format was wrong.  Whereas the customer spec indicated “MM/DD/YY” for the date headers, the Cross Tab tool was outputting “YYYY_MM_DD”.

However, if I reformatted the date to be in the correct format, then the Cross Tab tool would sort the fields in ascending order based on Month and Day, meaning that January would always come first even if it was in the following year.  This is because the Cross Tab tool sets the column header as a string and sorts ascending with no obvious way to change that sort.  It also adds those underscores, which I don’t particularly like.

In pre-Alteryx days this would have necessitated some recurring tweaking in Excel to change the format of the column headers.  But I knew there had to be a way to make this happen within the existing workflow so that the file generated was immediately ready for upload. And there was, via 4 pretty easy steps.


*********

Step 1. Create a separate data stream that uses the Summarize tool to pull in the original Forecast Date and a new field called “Header Date” that is simply a string version of Forecast Date, but in the right format.  I used String vs Date since the Cross Tab tool converts all column headers to string anyway.

I then used a Sort tool to sort ascending on Forecast Date and a Record ID tool to add a record ID to every row.

  

What this step produces is a simple 3-column data set that looks like this:


Step 2. Use the Join tool to combine this new stream with the main data stream, joined on Forecast Date.


Step 3. Use the Cross Tab tool to pivot the data as before, but this time using the new Record ID field as the header.


Step 4. Finally, use the Dynamic Rename tool to replace the Record ID with the corresponding Header Date.  This tool takes the main data stream (which is now in cross-tab format, with Record ID as the header) and uses that 3-column data stream I created in Step 1 to replace every Record ID with the Header Date it is associated to.

This produced the output in the required structure and format, adding at most another second or so to the overall workflow time.  And my love affair with this software continues…
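
As an aside, the same trick carries over to other tools; in pandas, for example, it boils down to pivoting on the real date type (so the columns sort chronologically) and only turning the headers into MM/DD/YY strings afterwards. The column names and values below are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Forecast Date": pd.to_datetime(["2018-01-05", "2017-12-29"] * 2),
    "Volume": [10, 20, 30, 40],
})

# Pivot on the actual date so the column order is chronological...
wide = (df.pivot_table(index="Product", columns="Forecast Date",
                       values="Volume", aggfunc="sum")
          .sort_index(axis=1))

# ...then rename the headers to the MM/DD/YY format the spec asks for.
wide.columns = [d.strftime("%m/%d/%y") for d in wide.columns]
print(wide)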

Thanks for reading.

Two Colors for a Single Dimension Value in Tableau

For a recent Makeover Monday, I wanted to be able to visually focus on either the positive or the negative aspect of a metric based on the toggle of a parameter.  In particular, I wanted to be able to draw attention to the chosen aspect (positive or negative) by way of a bold color and turn the other portion grey without having to set up separate charts or worksheets.

As with many things in Tableau, this ended up being easier than originally thought.  It just took a couple of calculated fields and the realization that the same value can be given multiple names.

Here is what the view looked like when I was done.  The chart structure doesn’t change (other than the sort of countries) but the same portion of each chart (bar, donut, etc.) takes on a different color depending upon which parameter choice is selected.

The field I had initially used to construct my views was a fairly simple one that identified each response as either positive (“pos”) or negative (“neg”).  It’s also one that I borrowed shamelessly from Andy Cotgreave’s viz…if someone else does something well, why not borrow it? 🙂

But when I placed it on the color shelf I was only able to assign one color to “pos” and one color to “neg”.

Since I knew I didn’t want to create two charts or duplicate the dimension, the main thing I needed to figure out was how to get the positive and negative values to play two roles, i.e. to be both highlighted and not highlighted depending on what parameter value was selected.  What solved the conundrum for me was the realization that the values of “pos” and “neg” were arbitrary…understandable and logical, but ultimately arbitrary.  I could just as easily have encoded the values as “captain” and “underpants” and the chart would have behaved the same way.  With that insight, I determined that I could encode the values one way when the first parameter value was selected and encode them another way when the second parameter value was selected.  Like so:
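
I won’t reproduce the exact calculation here, but it was along these lines, where [Response Type] stands in for the original “pos”/“neg” field and [Focus On] for the parameter (both names are illustrative, not the actual ones):

IF [Focus On] = "Positive" THEN [Response Type] + "1"
ELSE [Response Type] + "2"
END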

Now I have four distinct values within this dimension vs. just two.  The values still just represent the two response types of “positive” or “negative” but I’ve given each response type two discrete aliases.  This allows me to assign one set of colors to the values ending in “1”…

…and another set of colors to the values ending in “2”.
And with that I had the bulk of my desired viz created.  What remained was to create a few additional calculated fields to allow for a dynamic title…

…and to dynamically sort the countries based on either total positive or total negative responses:

I hope this short tutorial proves useful to others.  Thanks for reading.

Row-Level Security On A Published Tableau Data Source

There are a number of great articles that explain how to enable row-level security inside of a workbook (see here, here, and here), but I haven’t come across any that show how to do that inside of a TDE (i.e. a published Tableau data source).  While setting up row-level security in a workbook is useful, that model falls apart very quickly when you want the same security to persist across more than a handful of workbooks.

Why would we want to have the security handled in the TDE vs. inside each workbook?  For us, there were three key reasons:

  • We wouldn’t have to replicate the same user filters across every workbook that pointed to the TDE.  We expect to have dozens of workbooks (if not more) pulling from the same TDE.  
  • We could allow users to create their own workbooks against the TDE using web authoring but ensure that they only saw the slice of the TDE they were supposed to.
  • Any changes to the security model would only have to be implemented inside of the TDE without having to make any changes to the connected workbooks; they would just inherit the changes.
Setting it Up

After some experimentation, the solution ended up being pretty straightforward and simple.  These are the steps we took to make it work:

1. Identify the users and determine what level of data security they need.  For us, there are two types of users – (a) users who are assigned to specific customers (i.e. they can only see data for those customers), and (b) users who are assigned to one or more regional LDAP groups that our Tableau Server uses for authentication (i.e. they can see all data within those regions).
2. For the first group of users, we added their employee IDs to the TDE.  Since we use Alteryx to generate the TDEs, this involved adding a join between the main data set and a secondary data file from Salesforce.com containing the relationships between these users and their customers (e.g. a customer number mapped to an employee ID).  
A zoom on the actual Join portion of the workflow.  We joined the main data stream to the account mapping table based on customer code.
By creating this join we definitely increase the size of the resulting TDE (i.e. from roughly 1.5M rows to just under 5M rows), so this method may not perform well on massive data sets but we saw no performance degradation (and Alteryx generates the whole thing in under 30 minutes).  This is also why we only add in the employee IDs of those users who are assigned to specific customers; all other users (i.e. the second group described above) are handled via their LDAP group entitlements and the ISMEMBEROF function shown in Step 3 below.
3. Connect to the TDE from Tableau Desktop and create a calculated field that determines whether the authenticated user belongs to one or the other group and then filters the data accordingly.  You want this calculation to evaluate to either True or False.

4. Add this calculated field as a data source filter (right-click on the data connection and select Edit Data Source Filters), selecting “True” as the only value to keep.   Then publish the data source to Tableau Server.
Now when I open, create or edit any workbook pointing to that data source, the data is filtered for my credentials without the workbook containing any user filters or blends with secondary data sources.  This also prevents people from removing these user filters from the downloaded workbooks and thereby circumventing the security model.

While applying data security at the database level and connecting to it live is arguably a more robust solution overall, that approach certainly takes longer to set up and may perform slower than a TDE unless significant effort is employed to optimize those queries.  The method I described in this blog post allows us to take advantage of the myriad benefits of Tableau extracts while still delivering data security via a very easy-to-maintain model.