Using Import.io and Tableau to find Nature

I’m fairly new to the state of Washington (having only moved here 6 months ago), and am very eager to explore all of the natural wonders that now surround me.  But unlike my younger years, when I could just head out on my own for hikes of undetermined length and difficulty (so long as I didn’t die, I was totally okay with the uncertainty that went along with unplanned forays into the outdoors), I now need to be a bit more pragmatic and find hikes that my family will enjoy, specifically my kids.  The 4-year-old gets excited by anything new but has little legs, while the teenage girls can certainly walk longer but have a pretty low boredom threshold.

There are a number of great resources out there for researching hikes.  The two I use the most are the Washington Trails Association website and the book The Best Hikes with Kids.  Both are very thorough, with lots of great information, but in both cases I often find myself flipping back and forth between pages as I try to compare hikes.  The impatient data geek in me figured there had to be a better way.

Getting the Data

The first thing I needed to do was grab the necessary data.  Up until now, I’ve always dealt with tabular data, pulled either from Excel or a data warehouse (spoiled, I know).  In this case, the data I needed was inside pages and pages of search results.  Furthermore, some of the data wasn’t on the main search results page, but rather one level deeper, on each hike’s details page.  Even after I filtered the Washington Trails Association’s search results down to just the kid-friendly hikes, that still left well over 500 hits.  That’s a lot of data to transcribe manually, and I’ve never been that ambitious.

Luckily, I’d recently read a few blog posts about a tool called import.io, which allows you to scrape data from web pages into a format that you can actually work with.  The tutorials made everything look pretty simple, so I gave it a try.  And in general everything worked as advertised.  The interface is intuitive and the underlying engine gets “trained” with just a few clicks.  There were, however, some wrinkles in the extraction process that required a touch more attention and tweaking:
1. I couldn’t make the crawler as smart as I wanted it to be. 
You’re supposed to “train” the crawler by feeding it at least five separate pages, then provide it with a template that it uses to grab all subsequent pages.  The logic of the URL was pretty straightforward (it just incremented the start integer by 30, since there were 30 results per page)… but I wasn’t able to create a template that would take the last integer I fed it (e.g. 120) and keep adding 30 until it reached the end of the available data.  So instead I just copied and pasted the URL string enough times to cover all of the returned pages, making sure each one ended with an integer 30 greater than the previous one.
Not horrible for a one-time extract of 20 pages, but it’s not something I’d want to do if I decided to grab all 3,200 hike records.  I’ve emailed tech support to see if there is a way to loop through URL strings like this; I suspect there has to be.
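For what it’s worth, generating that URL list outside of import.io only takes a couple of lines of Python.  To be clear, the base URL and its start parameter below are placeholders, not the exact WTA string:

```python
# A minimal sketch of generating the paginated URLs to paste into import.io.
# The base URL and "b_start:int" parameter are assumptions -- substitute
# whatever the actual WTA search results URL uses.
base = "https://www.wta.org/go-hiking/hikes?b_start:int={}"

# 20 pages at 30 results per page (start = 0, 30, ..., 570)
urls = [base.format(start) for start in range(0, 600, 30)]
print("\n".join(urls))
```
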
2. One crawler couldn’t feed another one. 
This one is apparently a common request (it even came up during one of their webinars), but I still hope they introduce some logic in the future to handle it better.
After I had grabbed the primary data for the 570 hikes using the first crawler, I then needed to go into each hike’s individual page for some additional data (latitude, longitude, the number of trip reports, etc.).  Unlike the first extraction, which had a parameter-driven URL string, each hike’s page had a unique URL (e.g. https://www.wta.org/go-hiking/hikes/deception-creek).  I had just finished extracting all of those URL strings, but I had no way of automatically feeding them into a second crawler, the way a sub-query would.  So instead I just opened the results of the first crawler in Excel, copied all of the URL strings, and pasted them into the “Where to start” window of the second crawler.
This worked well, but again it means that this extraction process can’t be automated.  Each time I want a refreshed list of hikes, I’ll need to run the first crawler and then copy and paste its results into the second one.  Good thing new hikes aren’t popping up every day.
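If you’d rather skip the Excel detour, something like this would lift the URL column straight out of the first crawler’s CSV (the file and column names here are placeholders for whatever your export actually contains):

```python
# Sketch: extract the hike-detail URLs from the first crawler's output
# so they can be pasted into the second crawler's "Where to start" box.
# "hikes_primary.csv" and "hike_url" are hypothetical names.
import csv

with open("hikes_primary.csv", newline="", encoding="utf-8") as f:
    urls = [row["hike_url"] for row in csv.DictReader(f)]

print("\n".join(urls))  # copy this output into the second crawler
```
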
3. The hike features were icons rather than words. 

While most of the data I needed was in string or numeric format, the hike features (e.g. whether there were mountain views, whether dogs were allowed, etc.) were all represented by icons, so I couldn’t just map them to a text column.  However, if I changed the column type to HTML, I could grab the underlying URL string for each unique icon, figuring I could strip out everything but the feature name later.

Another problem was that the features (and icons) were inconsistent between hikes.  In other words, not only did hikes contain a different number of features, but the order of the icons could vary from hike to hike.  So I just created nine generic icon fields (e.g. icon 1, icon 2, etc.) and mapped them to the specific locations on each page.  This meant each icon field contained multiple values, but I knew this was something I could correct after I had extracted the data.

Preparing the data for Tableau

After extracting all of the hike data, I ended up with two CSV files – one with the primary hike data, and another with the secondary data.  The first could be imported into Tableau without incident, since it was structured optimally, with one row per hike and each column representing a unique attribute (name, region, rating, etc.).  The second file, however, contained those nine separate icon columns.  Before I could import it into Tableau, I had to turn those nine columns into two: one with the icon number and one with the icon URL string.

Using the Tableau Excel Reshaping tool made that very easy.  Now I could import this second file into Tableau as well.
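For anyone without the reshaping tool handy, the same wide-to-long pivot can be sketched in pandas.  The file and column names here are just placeholders:

```python
# Sketch of the same reshape in pandas: nine "icon N" columns become two
# columns, one for the icon number and one for the icon URL string.
import pandas as pd

wide = pd.read_csv("hikes_secondary.csv")  # hypothetical file name
icon_cols = [f"icon {i}" for i in range(1, 10)]

long = wide.melt(
    id_vars=[c for c in wide.columns if c not in icon_cols],
    value_vars=icon_cols,
    var_name="Icon Number",
    value_name="Icon URL",
).dropna(subset=["Icon URL"])  # drop the slots a hike didn't use

long.to_csv("hikes_features_long.csv", index=False)
```
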

The final piece of data prep was stripping out all of the URL string junk, leaving just the feature description.  A Tableau string calculation using a combination of MID, LEN and FIND did the trick.  I now had two data files I could play with.
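I won’t reproduce my exact calculation here, but the idea is simply to carve the feature name out of each icon URL.  Here’s the same logic sketched in Python, with a made-up URL:

```python
# Same idea as the MID/LEN/FIND calculation: keep only the piece of the
# icon URL that names the feature. The example URL is invented.
def feature_from_url(url: str) -> str:
    filename = url.rsplit("/", 1)[-1]              # "dogs-allowed.png"
    return filename.rsplit(".", 1)[0].replace("-", " ").title()

print(feature_from_url("https://www.wta.org/icons/dogs-allowed.png"))
# -> "Dogs Allowed"
```
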

Were I to need this type of data prep regularly, or on a significantly larger data set, I’d definitely look to Alteryx to automate the workflow.  But for a small, ad-hoc use case like this, import.io and the Tableau data reshaping tool were more than sufficient, despite the little hiccups.

Building the Tableau View

I wanted my Tableau view to do a few things:

1. Let me filter the list of possible hikes based on things like region, feature set, distance, etc.
2. Compare those hikes in terms of rating, popularity and distance
3. See where they are on a map
4. Link out to additional info like trip reports and photos

The only slightly tricky part was #1, simply because I had imported the hike data with multiple rows per hike in order to accommodate the feature field.  And while I could use the single Hike Feature field to filter those rows, what I really wanted the filter to do was show me only the hikes that met all of my filter criteria.  For example, if I selected ‘Waterfalls’, ‘Mountain Views’ and ‘Dogs Allowed’, I would want to see only those hikes where all three were true.  As you can see below, simply applying the Hike Feature filter gives me all hikes where any of those criteria are true.

As happens more often than not in Tableau, table calculations to the rescue!  I created three table calculations to help me identify which hikes met all of the features in my filter set:

1. # of Selected Features – I used TOTAL(COUNTD([Hike Feature])), set to compute using Table (down), to give me the total number of features that had been selected.

2. # of Matching Features – I used WINDOW_SUM(COUNTD([Feature Number])), set to compute along all Names and Hike Features but restarting every Name, to tell me how many of the selected features were present for each unique hike.

3. Matching Hikes? – this is just a Boolean comparison of the two prior calculations ([# of Matching Features] = [# of Selected Features]), telling me which hikes had the same number of features as the number selected.  Placing this on the filter shelf, with only “True” selected, ensures that my view will only contain hikes that meet all of my filter criteria.
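If the table-calc gymnastics feel opaque, the underlying logic is just a subset test: keep the hikes whose feature set contains every selected feature.  Here’s that same logic sketched in pandas, using the hypothetical long file from earlier (after the URL-stripping step):

```python
# Sketch of the AND-filter the three table calcs implement: keep only
# hikes whose feature set contains every selected feature.
# File and column names are assumptions.
import pandas as pd

selected = {"Waterfalls", "Mountain Views", "Dogs Allowed"}  # filter picks

long = pd.read_csv("hikes_features_long.csv")   # one row per hike/feature
features_by_hike = long.groupby("Name")["Hike Feature"].agg(set)

matches = features_by_hike[features_by_hike.apply(lambda s: selected <= s)]
print(matches.index.tolist())   # only hikes where all three are true
```
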

Once I had that, the rest was predominantly formatting & design decisions.

  • I had originally planned to make the map the central part of the view, but felt that it wouldn’t allow for the kind of comparisons I wanted to make.  In addition, you can’t see all of the hike names at once on a map, which you can with a bar chart.  So I placed a comparative bar chart in the top (i.e. main) section of the view, and used the map as a supporting chart, linked to the bar chart via a highlight action.
  • I first placed all of the hike details and description in the tooltip, but quickly realized that the tooltip often obscured the map below (given how long some of the hike descriptions were).  So I moved the hike info down next to the map, and left the bar chart tooltip leaner, focusing primarily on the links to other resources.
  • Speaking of the links, I wanted to provide more than just a link back to the hike details page that I pulled the data from.  So I also created a link to a Google Image search, passing the hike name + “WA” as the search parameter (given how generic some of the hike names are, adding WA to the string increased the chances of getting relevant photos); there’s a sketch of that link just after this list.  Being a photographer, I often want to assess a hike based on the photographic opportunities it might offer me.  This is something that often drives my family insane. 🙂
  • I created a parameter and an associated calculated field that let me sort the hikes by things like rating, popularity, distance, etc.
  • Since I borrowed many of the icons from the WTA website, I chose a green hue for the text and bars that I felt was complementary.
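
For reference, the image-search link from the third bullet is nothing more than a Google search URL with the hike name appended.  Here’s a quick sketch of how that string gets built; in Tableau itself it’s just a URL action with the hike-name field as the parameter:

```python
# Sketch of the Google Image search URL the dashboard's link action builds:
# the hike name plus " WA", URL-encoded, against Google's image vertical.
from urllib.parse import quote_plus

def image_search_url(hike_name: str) -> str:
    return "https://www.google.com/search?tbm=isch&q=" + quote_plus(hike_name + " WA")

print(image_search_url("Deception Creek"))
# -> https://www.google.com/search?tbm=isch&q=Deception+Creek+WA
```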

Click the image of the dashboard to go to the interactive version.

Thanks for reading.
