Crossing Streams is a Good Thing in Alteryx

When you’re fighting ghosts, crossing streams might be something you want to avoid. But when dealing with data inside of Alteryx, you should embrace creating and crossing as many streams as necessary to get the results you want. 


A co-worker was recently confronted by an Excel spreadsheet that didn’t want to conform to the workflow he was building in Alteryx. It was one of those spreadsheets that are all too common in corporations – blank rows, hidden columns, stacked headers, sub-total rows sprinkled throughout, etc. The only corporate “norm” it didn’t possess was a rainbow color palette.  


In this case, he needed to extract the total forecast volume by product and calendar date from the spreadsheet shown below. 

In other words:


1. Grab only the subtotal rows
2. Strip out the word “total” from the product name
3. Grab the calendar date header, but only from the columns where the secondary header didn’t say “Actuals”

Number 1 and 2 seemed pretty straightforward, but #3 presented a challenge since it wasn’t immediately clear how to pull out the date header as it was one row above the other header values.  However, after a bit of noodling, this proved very quick and easy to accomplish with a handful of Ateryx’s awesome tools. 

Step 1 – Create two data streams 

In order to associate the date headers to the rest of the data, I needed to break out the Excel file into two data streams that could be joined together after a bit of clean-up. The multiple streams also allowed me to get rid of the blank rows I didn’t want. 

By adding a Record ID tool to the Input tool, I could then use a Filter tool to select Row 2 as one stream and Rows >= 5 as a second stream, getting rid of rows 1, 3 and 4 which were blank. 

Step 2 – Create a column ID field 

Now that I had two data streams, I needed a way to join them back together and add the calendar date from stream one to the data in stream two. The logical field for that would be the column position in the spreadsheet. So I used a Transpose tool to un-pivot the data in both streams at the same point, namely where the main headers ended and the date fields began.

At this point in the workflow all of the column headers were the default “F#” format you get when you import an Excel file without explicit headers. I would need to add actual header names eventually but those numbers were what I would use to generate the column ID field. All I needed to do was strip out the “F”. Easy enough with the RegEx tool.

Once I had the column ID in both streams, I could use that field in a Join tool to add the calendar date to the main data stream as an additional field.  Now I had a complete data set with every calendar date associated to every product.  Only problem was that the date was showing as a number vs an actual date. This happened because a column can only have a single data type and each column in the spreadsheet contained strings, numbers and dates, which could only be handled by a String data type, a conversion that reverted the formatted date value into the Excel date number value.

Step 3 – Convert the Excel date number into an actual date

A quick search of the fantastic Alteryx Community forums landed me on the solution for this. Since the number Excel assigns to a date value is simply the number of days since Jan 1, 1900, all I had to do was create that arithmetic calculation and wrap it inside of a DateAdd function to spit out the original date.

Step 4 – Name the columns

As I mentioned above, many of the column names still had useless names like F1, F2, etc.  The columns I had added to the workflow, via the Join or a Formula tool, already had good names but the original fields from my primary data stream did not.  Those field names existed as data values in the first row. 

 A Dynamic Rename tool allowed me to use those first row values to rename the headers.

Step 5 – Grab only the total rows

The final step in getting the required data set out of this Excel file was to filter the data set down to just the total rows.  I used a Filter tool to look for any row where the word “total” existed. There were two fields where this word existed – PROD_NO and ROLL_UP. I chose the latter but either would have worked – the PROD_NO field would have required a Contains condition whereas the ROLL_UP field required an Equals condition. 

Once I had the totals, I used a RegEx tool to strip out the word “Total” from the product name since then I could cleanly join this data with other product data (the data in this particular Excel file was merely one of several inputs in a larger workflow).

And with that, the Excel file was tamed and ready to play nice with my other data sets.  Thanks for reading.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s