Make the Grade

Recently the Alberta government hosted Apps for Alberta – a competition using the province’s open data. Being an Alberta-based data visualization firm, we felt encouraged, perhaps even duty-bound, to enter. So we did. We managed to pull together a couple submissions, the first of which is a look at high school grades in the province.

Make the Grade Visualization

In an ideal world, you start by asking: What question am I helping to answer? Then you go out and locate the most appropriate data available to help you answer that question and, because it is an ideal world, exactly the data you need is already there. It’s squeaky clean and accessible, just waiting to be visualized.

In the real world, data availability is often the limiting factor. You therefore start with the data you have and try to discover what questions it might help answer. If any of the question/answer combinations you discover end up engaging your curiosity, then you have a candidate for a visualization.

Such was the case with this competition. We pored over the available Alberta data sets looking for something that was interesting (engagement value) and relevant (societal value). Eventually we landed on the high school grades data.

Next we needed to define our audience. It’s easy to fall into the trap of defining an overly broad audience (e.g. the General Public). I mean, it’s on the Internet, right? The whole world is your audience. But there is a paradox: the narrower your focus, the more engaging your application. We therefore defined our target user as “parents of children about to enter high school”. The application will answer some questions well rather than many questions poorly.

It’s also important to recognize the limitations of the data. Our audience’s question might be “what is the best school for my child?” Our data can only answer “How do schools’ grades compare?” There are a host of education quality concerns that aren’t addressed by our data (e.g. the socio-economic status of the students, opportunities outside the classroom, etc.). And there are likely biases in the data collection (e.g. teaching to the test, cheating, small sample-size schools, or inconsistent grading criteria). But what we do have (average grade by school and subject) is still valuable.

As an aside, our visualization doesn’t make claims it cannot possibly deliver on. Headlines like Find the Best School in Alberta or Where Do Schools Suck the Most? may pull in more hits on Reddit, but would be a disservice to the users of this tool and the teachers and institutions reported on. The socio-economic factors likely dominate any straight school-to-school comparison. Nevertheless, a more nuanced exploration reveals some interesting gems.

Students across the province seem to struggle with English more than the other subjects. Or perhaps English grading standards are higher.

Private schools are over-represented among the higher ranks (as you’d expect), but there are still some private schools for whom grades are not a primary focus.

Private Schools' Rankings
Overall private school grades highlighted among a ranking of all schools in Alberta.


Our most interesting finding is in looking at intra-school grading. While most schools have fairly consistent scores across subjects, certain institutions appear to excel in a specific subject. By comparing within a school, we are naturally accounting for most of the socio-economic impact. So if my daughter wants to go into engineering, Henry Wise Wood might be an excellent choice with its strong Mathematics program. If I thought my son was destined for politics, I might point him toward St. Francis Xavier and its exceptional Social Studies program.
Intra-school Grades

Play around with the tool and feel free to share your findings and feedback in the comments below.

Posted in Visualization | Comments Off

Radar: More Evil Than Pie?

It is by now common knowledge among the viz-savvy crowd that pie charts are best avoided. Many pixels have been used explaining why pies are bad, when occasionally pies are good, and my personal favorite from my colleague at Darkhorse: Salvaging the Pie. Far less attention, however, has been paid to pie’s evil cousin, the radar chart. Perhaps it’s not as ubiquitous, but it can be more nefarious.

One such example recently appeared in a Harvard Business Review (HBR) blog post: CEOs Get Paid Too Much, According to Pretty Much Everyone in the World

The visual design is appealing. The clarity is appalling. There are good things to be said about the choice of contrasting but complementary colors. The subtle axes are helpful without obscuring the data. The glaring question here is: why a radar chart? Do countries somehow naturally follow each other alphabetically in a circle? (Only at the Olympic closing ceremonies!)

Comparison of data points is difficult
To compare one country to another, your eyes have to do a fair bit of work, either by following the circle around or by counting grid squares. For example, how does the United States compare to Italy? You have three seconds. Ready, set, go!

Facts are lost in the mushy middle
What’s Spain’s value (or is that Sweden’s)? How about Hungary’s “Ideal” value? The inside blue values are almost impossible to read.

Over-emphasis of high numbers
The gap between countries appears larger as values increase. When the points are connected, the area created between points high on the scale is disproportionately larger than the area created between points near the center.

To illustrate this point, compare the following charts.

The area of the second chart is four times larger than the first even though we’ve only doubled the values. The third chart is nine times larger than the first despite only tripling. The area grows with the square of the values, not in proportion to them. The result is a disproportionate emphasis on the top few countries.
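This is easy to check with a little arithmetic. Treating the radar polygon as a fan of triangles between adjacent spokes, a short sketch (toy values, not the HBR data) confirms that doubling every value quadruples the area:

```python
import math

def radar_area(values):
    """Area of the polygon formed by plotting values on evenly spaced spokes."""
    n = len(values)
    wedge = 2 * math.pi / n  # angle between adjacent spokes
    # Sum of triangle areas: 1/2 * r_k * r_{k+1} * sin(wedge)
    return 0.5 * sum(values[k] * values[(k + 1) % n] * math.sin(wedge)
                     for k in range(n))

base = [1, 1, 1, 1, 1, 1]
doubled = [2 * v for v in base]
tripled = [3 * v for v in base]

print(round(radar_area(doubled) / radar_area(base), 6))  # 4.0
print(round(radar_area(tripled) / radar_area(base), 6))  # 9.0
```

Scaling every radius by a factor k scales each triangle’s area by k², so the polygon’s area does too.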

Sequence determines area
Another problem with radar charts is that the arrangement of the data points influences the area they enclose. The charts below plot the exact same data (three 9s and three 1s) but in a different order. The result is obviously problematic, as neighboring points amplify each other.
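The same triangle-fan arithmetic shows how much ordering matters. With three 9s and three 1s, grouping the big values next to each other more than triples the enclosed area compared to interleaving them (a toy sketch, not the HBR data):

```python
import math

def radar_area(values):
    """Area of the polygon formed by plotting values on evenly spaced spokes."""
    n = len(values)
    wedge = 2 * math.pi / n
    return 0.5 * sum(values[k] * values[(k + 1) % n] * math.sin(wedge)
                     for k in range(n))

grouped = [9, 9, 9, 1, 1, 1]      # big values adjacent, amplifying each other
alternating = [9, 1, 9, 1, 9, 1]  # the same data, interleaved

print(round(radar_area(grouped) / radar_area(alternating), 2))  # ~3.37x the area
```

Because each triangle’s area is the product of two *neighboring* radii, a pair of adjacent 9s contributes 81 to the sum while a 9 next to a 1 contributes only 9.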

Since the alphabetical sort order in the HBR chart is arbitrary, the shape created is equally arbitrary. The random “explosion” could have been ordered by value, creating this bizarre-looking seashell:

Disclaimer: the paper did not publish its raw data table. Transcribing the values here was an inaccurate exercise which involved lots of fingers on the screen, squinting, and neck twisting. In the end, the values are probably not quite right.

A better way
So, what should this data really look like? Here is a quick take on a simpler representation of the same data:

The original chart obscures the facts. Perhaps the original chart’s purpose was to provoke more than to inform. But the chart does so at the expense of the author’s argument. We can’t even see the points she wishes to highlight. In the new chart, interesting facts begin to emerge. It is now easy to compare any two countries and we can see that Spain’s “estimated” is actually twice as high as Sweden’s, whereas we would have easily confused them before. We see that although the “ideal” is always smaller than the “estimated”, there is little relationship between the two. For example, while South Korea and Australia are neck-and-neck for the lead in “estimated” ratio, Taiwan’s “ideal” is almost double that of any other nation.

In conclusion…
The radar chart should rarely be used. Maybe it should never be used? I can’t for the life of me think of a situation where its power of engagement compensates for its lack of clarity.

Do you ever use radar charts? Have you ever seen a radar chart used well? I would love to put a positive spin on this and showcase some successes. Let me know in the comments below.


Posted in Visualization | 6 Comments

Pluck the Low-Hanging Fruit

This is part two in a series of articles about effective analytics implementation. The first part “The Five Faces of Analytics” explores the roles necessary to develop a successful analytics team.

So you’ve assembled a team of world beaters and they’re chomping at the bit. They’re ready to transform your organization into a data-driven decision-making juggernaut. But where should they start? How do you coordinate this team in such a way that they’ll actually be effective?

Remember that some estimates put the failure rate of analytics projects at 80%. That’s almost twice as bad as the average IT project. How do you ensure your initiatives are firmly planted in the 20%?

Analytics lives and dies on adoption by decision makers. These are managers who have spent their lives processing information and deciding based on gut feel. You are trying to get them to augment or in some cases abandon decades of received wisdom. No wonder your results are met with skepticism. Analytics is the outsider, the newcomer. The Disruptor.

And that’s the key. You need to recognize that what you are doing is a disruptive innovation. And like all successful innovations, you need to start small and work your way up. You’re not going to slay the dragon while you’re still just a squire. You need to win some minor battles, solve some smaller problems, and collect a bit of credibility.

Step 1: The Brain Storm
We start at the end: the decision. Write down all the decisions that the organization makes, whether big or small. Add to these all of the decisions that they could be making (but aren’t) because of uncertainty or laziness. Document every rule of thumb.


Think of strategic, one-off decisions like “should we buy this other company”, or “should we build a second store” right through to the more tactical and ongoing “how many flyers should we send out this week” or “what routes should our drivers take today”.

At this point, you’ll probably realize that there are more problems to solve than you have time in your career. That’s a good thing. The next steps are how we’ll pare the list down. We’ll do so by eliminating those with high risk in data inputs, research, and implementation. Let’s cull the herd.

Step 2: Evaluate the Data
Using these “decision problems” as a guide, take an inventory of all the data resources that could be used to support or solve them. Don’t limit yourself to data that is behind the firewall. Look for open data, proxy data, and other sources available outside of the organization. The world is swimming in data, so be a little crazy here. You may not find exactly what you’re looking for, but there’s often a pretty good proxy that you can use. Creativity here will make you look like a hero later.

Next, match each data set (or sets) to each decision problem.

Now data can be a killer. More than one analyst has lost his job when he discovered that garbage in actually means garbage out. So it’s time to be ruthless: eliminate all the decision problems where the data is expensive, suspect, or unavailable.

Wow. That chopped the list down to a much more manageable size. But we’re not finished chopping.

Step 3: Scope the Projects
In looking at what remains, you can start to estimate the difficulty or uncertainty associated with finding a solution. We’re turning our “decision problems” into potential projects.

Talk to your analytic explorer. Ask her how many weeks or months each would take to “solve”. Have her think through her approach including data collection, cleaning, modeling, verification, visualization, tool or metric development, and implementation. This doesn’t have to be super-accurate, but you want to know if it’s days, weeks, or months.

Now double all her estimates and get rid of any that are longer than three months.

Step 4: Scope the Implementation
Implementation and adoption are tied at the hip. So we’re going to do a little pre-screening based on likelihood of acceptance. Look at each project and try to envision how many people would be involved in actually using it and supporting it. Who will need to sign off? Are there multiple end users? Will it require some kind of software tool? Who will care for and feed the model new data? How often will it need to be updated with fresh data? How often will it need to be recalibrated?

Some projects will require one or two people, others will require people from half a dozen departments in the organization.

Eliminate anything that involves more than three people.

And BAM! You have half a dozen potential projects that have data available, can be solved in a few weeks, and won’t require a change management consultant to implement.

Step 5: Engage the Decision Makers
At this point, you’ve been on the job for a few weeks, and your boss is probably wondering what you’ve been up to. It’s a perfect time to show her the list of projects. Pull together all the decision makers who are represented on your list and let them digest the implications of each one.


Describe each project in terms of how it will support decision-making and make them look like heroes. Have them envision the upside of each project and the value to the organization. Remind them that your solutions won’t tell them what to do, but will simply reduce uncertainty. They’ll still use their gut, but they can now supplement it with their heads.

They’ll prioritize the ones that eliminate the most pain and help them sleep soundly at night.

Voila! You have found the low-hanging fruit.


Posted in Analytics | 2 Comments

When small is more

We’ve done a few critique/redesigns of graphics on the site, but now it’s time to shine that sometimes unflattering light back on ourselves. While going through some materials I came across a graphic much like this one.

The chart is clean, with axes lightened so the data is in the foreground, and the series directly labelled. Unfortunately it is not very effective at conveying much beyond “There is lots of pillaging.” If we look closely we may also see a slight upward trend in Thieving and in revenues overall, but insights beyond that are all obscured by our chart choice.

The problem with stacked charts is that only the first series, in this case Thieving, and the total of all series are clearly displayed. Everything else is distorted by the shifting baseline of the series beneath it. If you want to note patterns in the individual series, stacked charts are inadequate.
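A toy example (invented numbers, not our revenue data) shows the distortion: a perfectly flat series stacked on a rising one appears to rise, because the eye reads its top edge:

```python
# Toy revenue series over four periods (invented numbers)
thieving = [10, 12, 14, 16]   # bottom of the stack: clearly rising
plundering = [5, 5, 5, 5]     # stacked on top: perfectly flat

# In a stacked chart, plundering is drawn between these two edges:
bottom_edge = thieving
top_edge = [t + p for t, p in zip(thieving, plundering)]

print(top_edge)  # [15, 17, 19, 21] -- the line the eye reads as "plundering"
```

Plundering’s own values never change, but its visible top edge climbs from 15 to 21; the reader has to mentally subtract the shifting baseline to recover the flat trend.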

Don’t be afraid of small things
The solution in this case is to simply make a small chart for each series. Often called “small multiples,” these charts reveal the patterns within each series and let us compare between them. To give a sense of the proportions of the different series, we can add another chart specifically for that purpose. What pops out now is a rather interesting change in Plundering which was previously hidden.

Don’t be afraid to make your charts smaller to communicate a bigger, more complete message.

Posted in Visualization | 3 Comments

Salvaging the Pie

The poor, maligned 3D pie chart. He is so popular among the common folk, but put him next to his peers and his vacant stare betrays (not entirely unfounded) feelings of insecurity and inadequacy. Sometimes the only way to address such feelings is to let go of your inhibitions and do something unexpected. He has value hidden away, we’re sure of it. And so, for the third installment in our Data Looks Better Naked series, we are recommending that the 3D pie do what the bar chart and table have done before him: start stripping to see what he might be concealing.

Devour the Pie

There are a ton of articles out there explaining the disadvantages of pie charts, which is why they rarely turn up in our work. My good friend Dan is fond of saying, “You have three pie charts to use in your lifetime, so choose them wisely.” However, this article by Bruce Gabrielle makes some decent counterpoints worth considering (except the second point; pie charts are terrible at trends). So if you’ve got a pie chart, consider whether something else wouldn’t be a better solution. Or you could just embrace it.

The slides for those interested:


Posted in Visualization | 7 Comments

Breathing City

Inspired by John Nelson’s breathing earth and Conveyal’s aggregate-disser post, I wondered if I could make a breathing city. Manhattan looks somewhat lung-like, so it seemed natural. Should be a fun, quick project. How naive I was.

Search and Recover
Conveyal had already gathered the data I would need to do a dot density plot, so it should be easy to find it using their post as a starting point. But wait: they didn’t share links to the source files, and they didn’t respond to my email. Google should solve that… hours of surfing later, I find what I’m looking for in four different places: population, employment, land use, building footprints.

Excellent, now just run it all through Conveyal’s conveniently open source tool. Except it’s written in Java, so let’s install the Java SDK. Oh, and it has several library dependencies. Finding… installing… finding… (several hours later)… installing… not working. Clearly I know far too little about Java to get this going.

Python Wrestling
We use Python at Darkhorse, and learning some geographic libraries could be useful. Let’s use the code for the racial dot map project as a starting point for creating a Python version of Conveyal’s disser tool. I just need several new libraries, which translates to more hours finding, installing, re-installing, searching, uninstalling, removing, copying, and installing again because computers are petty and vindictive. Then finally shapely and osgeo are working… yay!

Baby Steps
Now we just take several hours to learn how not to use these libraries and eventually we stumble across one or two things that work, then a couple more, and then we crawl toward a messy program that does what someone else has already done, but at least I speak this one’s language.

Combine CSVs with Shapes
What? Excel stopped allowing you to edit and save DBF files? When did that happen? This tool is a bit buggy, but it brings that feature back.

QGIS forces antialiasing
You can’t turn it off. If you want to create single-pixel markers, it just won’t let you color them properly; I tried for far too long (if you know how, please let me know). Good thing Excel is a poor man’s GIS (yep, every one of the frames was made in Excel).

Find More Data
So now I can make a plot that doesn’t breathe. But I want to show change over the typical workday. I’m gonna need more data for that. Several searches and false starts later we find work-related activity percentages by time of day. Manhattan probably has a different profile than the US average, but close enough.

Find even more data
But each dot is a person, so I can’t just have them flicking on and off randomly to match the time of day percentages. They need to go to work for a while then come home for a while, so I need to give each dot a schedule. Maybe this is overkill, but I can’t stop now. More searching finds us a rough hours of work distribution.

Solver time
Now I’m forced to assign schedules to the ~1.5 million people living in Manhattan and the ~2 million people working in Manhattan. But the sum of those schedules needs to resemble the hours-of-work distribution and the percentage at work for each hour of the day. Time to break out Excel’s solver engine. With it we can create ~200 schedules with probabilities to match those profiles. Then we can distribute them to each of our people.
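The mechanics can be sketched in a few lines of Python (with invented start-hour weights, not the survey profiles matched above, and a fixed shift length instead of the ~200 solver-fit schedules). Each dot gets a start hour, works a shift, and the hour-by-hour tally gives the at-work profile:

```python
import random
from collections import Counter

# Hypothetical start-hour weights -- invented for illustration, not the
# survey data. Most shifts begin in the morning, a few in the evening.
START_WEIGHTS = {7: 0.25, 8: 0.35, 9: 0.20, 10: 0.10, 16: 0.05, 22: 0.05}
SHIFT_LENGTH = 8  # hours

def assign_schedules(n_people, rng):
    """Give each person a (start, end) work window sampled from the weights."""
    starts = rng.choices(list(START_WEIGHTS),
                         weights=list(START_WEIGHTS.values()), k=n_people)
    return [(s, (s + SHIFT_LENGTH) % 24) for s in starts]

def at_work_fraction(schedules):
    """Fraction of people at work during each hour of the day."""
    counts = Counter()
    for start, end in schedules:
        h = start
        while h != end:          # walk the shift hour by hour, wrapping midnight
            counts[h] += 1
            h = (h + 1) % 24
    n = len(schedules)
    return [counts[h] / n for h in range(24)]

rng = random.Random(42)
profile = at_work_fraction(assign_schedules(10_000, rng))
print(profile[12] > profile[3])  # True: midday occupancy dwarfs 3am
```

Summing each hour’s fraction recovers the average hours worked per person, which is the sanity check that keeps the animation honest.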

Data is done
Finally we have data for each of 24 hours for both home and work. We’re making some huge simplifying assumptions (e.g. Manhattan’s work profile is the same as the rest of the US, when people aren’t at work they are at home, there are only 200 possible ways to spend your day, when we build this someone will want to see it) but we have a reasonable data set.

Now we just make some maps and push pixels around on the screen until they look good, then painstakingly create 24 versions to string together for the animation. Eric Fisher has some great tips for making and coloring dot density plots in this post. Then we’ll add a bar chart, no an area chart, no a line chart, wait… I’ve got it, a mesmerizing heart rate monitor looking thingy to go with our breathing theme. Nice.

See, it is super easy and takes almost no time at all to create something like this, as long as your definitions of “super easy” and “no time” are flexible enough to include difficult and time-consuming.


Note: A previous version of this GIF had the orange work line in the ‘heart rate’ chart incorrectly shifted one hour. This has since been corrected.

Posted in Analytics, Visualization | 21 Comments

Clear Off the Table

We received a lot of attention for our Data Looks Better Naked post. People got bored on Christmas Eve and some interesting searches for Star Trek somehow landed them on our page. Now their charts look better.

The principles outlined in that article aren’t just for charts, though. You can apply them to your data tables with similar improvements in readability and aesthetics. To paraphrase Edward Tufte, too often when we create a data table, we imprison our data behind a wall of grid lines. Instead we can let the data itself form the structure that aids readability by making better use of alignment and whitespace.

In the gif below we start with a table formatted similarly to one of Excel’s many styling options which, much like the chart styles, do nothing to improve the table. Progressive deletions and some reorganization deliver a clearer and more compelling picture.

As with charts, rather than dressing up our data we should be stripping it down. For more information on table design, you can read Chapter 8 of Stephen Few’s Show Me the Numbers. My apologies to any true fans of ’80s wrestling; the stats below, much like the ring rivalries, are entirely fabricated.

The slide deck for viewing at your own pace:

Posted in Visualization | 15 Comments

The Uniform Distribution

The goaltenders from my youth – Bill Ranford, Andy Moog, Grant Fuhr – all had jersey numbers in the low thirties. And most of the goalies I can think of now have numbers in the low thirties. This got me wondering: how do jersey numbers break down by position across the major sports leagues? What traditions, rules, and preferences do they reveal?

So, after some Python scraping and Excel manipulating we find ourselves with a paradox: uniform distributions that aren’t uniform distributions. Download them for yourself in tower or poster format.


So what can we see in the numbers? Well, Jackie Robinson’s league-wide retired #42 is quite clear, while the NHL’s only league-wide retired number, #99, is not nearly as apparent, given how few numbers in the 90s are worn to begin with and its position at the end of the spectrum.

The NFL’s arcane and rigid numbering system also shows up quite clearly. Something I was completely unaware of until going through this exercise.

Plus there is an interesting parallel between soccer and hockey when it comes to picking numbers for the defence, where both seem to prefer single digits that are not #1. Both leagues also like to give goalies the #1, though you can see the NHL’s three most popular goalie numbers are #30, #31, and #35, which I’m pleased to say are Bill Ranford, Grant Fuhr, and Andy Moog’s numbers respectively. I’ll imagine this is the league’s current players paying homage to the heroes of my youth, though it more likely has to do with tradition.

So click through and let us know what you discover in the graphics.



Posted in Visualization | 2 Comments

The Five Faces of Analytics

The chasm between Business and IT is well documented and has existed since the first punch-card mainframe dimmed the lights of MIT to solve the ballistic trajectory of WWII munitions.  Since that time, great leaps in data collection, storage, connectivity, and processing power have made IT infrastructure ubiquitous.  You’re not even in the game if you don’t have an IT group.


But the productivity gains have not kept pace with the investments.  The newest hype-driver, analytics, claims that it will finally deliver the goods.  But can it?

Some studies suggest that analytics projects have an 80% failure rate.  That’s abysmal.  In the next few articles, we’ll look at reasons for this failure.  We’ll start by describing the roles, then talk about process, and finally identify the characteristics of world-class analytics.

A helpful starting point is to imagine your analytics dream team.  Who would you hire, and what would their roles be?  I suggest that there are five distinct job descriptions:

Data Steward – this skillset is alive and well in most organizations.  Almost everyone has a data warehouse, talks about the ETL process, and has had discussions around the business rules of cleaning up and storing their data.  The data steward will use tools such as SQL Server, MySQL, and Oracle, and if she’s a superstar, she’ll dabble in Python and web scraping and know the difference between Hadoop and MapReduce.

Analytic Explorer – this skillset is a tough one to find.  It requires math, statistics, and modeling along with a healthy dose of creativity and skepticism. These are the people who can spin straw into gold or write tomorrow’s news today.  His job is to explore your data, combine it with sources outside the firewall, and distill it down to insights that will support your most critical decisions.  He’ll use tools such as Excel, R, MATLAB, ArcInfo, SAS, Tableau, and SPSS.  If he’s a superstar, he’ll know all about Bayes, Optimization, and the difference between precision, accuracy, and skill.

Information Artist – This is the role of a creative.  Her goal is to sell the results to the decision-maker.  And the lack of emphasis on this skillset is one of the reasons analytics is such a failure (and why Apple is such a success).  Edward Tufte – the godfather of data visualization – speculates that the lack of good data design caused the Challenger space shuttle tragedy.  Think of this person as being as crucial as your sales force.  In fact, that is their job – to sell the right answer.  Excel and PowerPoint can suffice, although the more skilled will use a variety of tools from Google Earth to Adobe Illustrator to D3.  If she’s a superstar, she’ll be as comfortable talking about the math behind the visuals as she is talking about the psychology behind her design.

Automator – If the Explorer finds the path through the dark forest to the fountain of youth, and the Visualizer designs a beautiful bottle for the elixir, then the Automator turns that path into an eight-lane highway and builds a factory to bottle that stuff as soon as it comes out of the ground.  His job is to operationalize the work of the Explorer and Visualizer.  He makes sure that results are timely and fast.  He adds scale.  He might use traditional coding methods like C# or .NET, or he might fiddle with Ruby or Objective-C.  Or he might even be the guru of Business Objects, MicroStrategy, or D3.

The Champion – The champion stands with one foot in the land of “gut feel”, and the other planted firmly in the side of “evidence”.  She can speak the language of the geeks, and translate it to that of the battle-hardened general.  She believes strongly in data-driven decision making, but also recognizes the value of deep domain experience.  She’s tireless in her efforts to sculpt the processes of the organization to support analytics. She aims to harvest the brightest insights from the sharp young analysts and the cleverest hacks from the wily old veterans.  Her focus is adoption, and if she’s a superstar, she’ll make you believe that this analytics thing was your idea in the first place.

So that’s your dream team: a steward, an explorer, an artist, an automator and a champion.

But there’s a problem.  This team rarely exists in the wild.  Most companies hire the Data Steward, and then try to do the rest through a major software implementation.  Unfortunately, the software is not meant to explore and discover.  And it was designed by engineers who don’t understand the psychology of data visuals.  It’s like expecting your bookkeeper to be your CFO.  Sure they can both do accounting, but you won’t be happy with the results.

In other instances, organizations will try to shoehorn engineers into the roles “in their spare time”.  Again, with neither the training nor the time to explore the data or design the results, they’re doomed to fail.  These skillsets are distinct, and they shouldn’t be ignored.

So what’s the canny company to do?  If you’re extremely lucky, you’ll find the unicorn of the 21st century known as the Data Scientist, pay her a quarter million, and watch the magic happen. (A data scientist can do all five roles.) Or you can try to develop these skills in-house.  Or you can hire contractors – perhaps engaging a consulting firm to take on the Explorer or Visualizer roles for a time.  Or you can outsource the whole thing.

What’s important is that you recognize that each of these roles is necessary.  Neither software nor “Dave in Engineering” can replace them. Happy hunting.


Posted in Analytics | Comments Off

Data looks better naked

Edward Tufte introduced the concept of data-ink in his 1983 classic The Visual Display of Quantitative Information. In it he states “Data-ink is the non-erasable core of the graphic, the non-redundant ink arranged in response to variation in the numbers represented” (emphasis mine). Tufte asserts that in displaying data we should remove all non-data-ink and redundant data-ink, within reason, to increase the data-ink-ratio and create a sound graphical design.

Stephen Few convincingly argues that some redundancy is often more effective, and we agree; however, most graphics don’t struggle with understatement. In fact, most contain a stunning amount of excess ink (or pixels). Rather than dressing our data up we should be stripping it down.

To illustrate how less ink is more effective, attractive, and impactful, we put together this animated gif. In it we start with a chart, similar to what we’ve seen in many presentations, and vastly improve it with progressive deletions and no additions.

And here is the slide deck if you want to go at your own pace.

The next time you are trying to improve a chart, consider what you can take away rather than what you can add.

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away”
– Antoine de Saint-Exupéry

Posted in Visualization | 7 Comments