Data Lies and Bowling!

 

For a long time now, I’ve been fascinated by the truth and context behind a data point rather than the data point in and of itself.

 

By now I’m sure you’ve already heard about survivorship bias in data: “Survivorship bias or survival bias is the logical error of concentrating on entities that passed a selection process while overlooking those that did not.” The most famous example of this occurred during World War II.

 

In brief: during the war, a team was brought together to study damage to fighter planes and work out how to reinforce them so they would be as robust as possible. The team went to work gathering data points. They took the returning planes and recorded the placement of bullet holes across each one. From this they could aggregate the results and see the areas of the planes with the most bullet holes.

 

They logically concluded that these were the areas they should reinforce, as they were the areas being shot most often. A statistician, Abraham Wald, saw what was happening and highlighted that it was the areas with no bullet holes that really needed reinforcing: hits in those areas were terminal, so those planes simply didn’t return, hence no data points.

 

 

There is another contextual data perspective that often comes to mind, this one from ten-pin bowling.

 

The 7-10 split, otherwise known as the “goal posts”, “bedposts”, or “snake eyes”, leaves the bowler with the leftmost and rightmost pins in the back row (the number 7 and number 10) to knock down with a single ball for a spare. Those who play the game know it as the hardest shot to achieve. But when a journalist analysed nearly half a million frames of data from PBA tournaments from 2003 through 2022, he saw something rather surprising. Bowlers got strikes on their initial ball in about 60 percent of those frames, leaving roughly 180,000 frames where they knocked down fewer than ten pins and needed a spare. He then analysed the rate at which each spare arrangement (with a minimum of 50 attempts) was successfully converted, and compiled a list of the hardest shots to complete.
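The method described above reduces to counting attempts and conversions per leave, filtering out arrangements seen fewer than 50 times, and ranking by conversion rate. A minimal sketch of that computation, using made-up frame data (the function name and record format are my assumptions, not the journalist’s actual code):

```python
from collections import defaultdict

def spare_conversion_rates(frames, min_attempts=50):
    """Rank spare arrangements from hardest to easiest.

    Each frame is a (leave, converted) pair, e.g. ("7-10", False).
    Leaves with fewer than min_attempts attempts are dropped,
    mirroring the minimum-attempts filter in the analysis.
    """
    attempts = defaultdict(int)
    makes = defaultdict(int)
    for leave, converted in frames:
        attempts[leave] += 1
        if converted:
            makes[leave] += 1
    rates = {
        leave: makes[leave] / attempts[leave]
        for leave in attempts
        if attempts[leave] >= min_attempts  # drop rarely seen leaves
    }
    # Hardest first: lowest conversion rate at the top of the list.
    return sorted(rates.items(), key=lambda kv: kv[1])

# Toy data only: in this fabricated sample the "4-6-7-9-10"
# (a Greek Church) converts less often than the "7-10".
frames = (
    [("7-10", True)] * 2 + [("7-10", False)] * 58
    + [("4-6-7-9-10", True)] * 1 + [("4-6-7-9-10", False)] * 59
    + [("10", True)] * 55 + [("10", False)] * 5
    + [("2-8", True)] * 3 + [("2-8", False)] * 7  # 10 attempts: filtered out
)
ranking = spare_conversion_rates(frames)
print(ranking[0])  # hardest leave in this toy sample
```

Note that the ranking only reflects what bowlers attempted, which is exactly the blind spot the next paragraphs unpack.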

 

Most were surprised to see that although the 7-10 split was near the top of the rankings, it wasn’t at the top. In fact, it wasn’t even close. The 7-10 split came in third, with a conversion rate more than double that of the “Greek Church”, which was deemed the hardest with a spare rate of just 0.3%. This is a fantastic example of where the data alone doesn’t tell the full story. As mentioned, if you ask most bowlers they will all agree that the 7-10 is the hardest, so why the discrepancy?

 

Well, once again it comes down to probability, but not in the way you would think. Most bowlers know that with the 7-10, getting both pins is extremely difficult, and getting even one is hard, so they take the risk and go for the spare, knowing that at worst they will leave one pin on the lane if they make contact. With a Greek Church, though, rather than going for the spare and risking missing more pins, the majority of bowlers will deliberately take the three pins on the right side, sacrificing the two on the left. That means far fewer bowlers ever try to take out all five and hit the spare. The outcome is that from a macro data stance the Greek Church spare is far rarer than the 7-10 spare, but once you factor in the context, the 7-10 is the harder shot.

This is all interesting, but how does it relate to Talent Intelligence?

 

At its core, it means investigating and interrogating data points rather than following the data blindly. Let’s do a worked example. Suppose you are looking at the impact of remote work on time to hire. It may seem logical to take two requisitions in the same job family, at the same level, in the same country/city, and with the same compensation, so that the only variable that differs is remote vs in-office. If time to hire turns out to be, say, 45 days for the in-office requisition and 35 days for the remote one, then on the face of it remote working looks like a 10-day improvement and a good basis for a recommendation.
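The matched comparison in this worked example can be sketched in a few lines. Everything here is hypothetical: the field names, the records, and the helper function are illustrative assumptions, not a real dataset or system:

```python
from statistics import mean

# Illustrative requisition records; all values are made up.
# Matching on job family, level, city, and comp band is what
# isolates remote vs in-office as the remaining variable.
requisitions = [
    {"family": "Engineering", "level": "L4", "city": "London",
     "comp_band": "B", "remote": False, "time_to_hire": 45},
    {"family": "Engineering", "level": "L4", "city": "London",
     "comp_band": "B", "remote": True, "time_to_hire": 35},
    {"family": "Sales", "level": "L3", "city": "Dublin",
     "comp_band": "A", "remote": False, "time_to_hire": 50},
]

def remote_gap(reqs, family, level, city, comp_band):
    """Average time-to-hire difference (in-office minus remote) for
    requisitions that match on every controlled attribute."""
    matched = [r for r in reqs
               if (r["family"], r["level"], r["city"], r["comp_band"])
               == (family, level, city, comp_band)]
    office = [r["time_to_hire"] for r in matched if not r["remote"]]
    remote = [r["time_to_hire"] for r in matched if r["remote"]]
    if not office or not remote:
        return None  # can't compare without both arms present
    return mean(office) - mean(remote)

print(remote_gap(requisitions, "Engineering", "L4", "London", "B"))  # 10
```

The point of the paragraphs that follow is that even a clean 10-day gap from a well-matched pair like this can still be driven by context the records don’t capture.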

 

This may be true, but equally there are a number of contextual variables that should be factored in and that may not show up through traditional data points. What is the team’s attrition rate? What is the team’s promotion rate? What is the referral rate? What is the experience level of the recruiter handling the vacancy? What is the overall requisition volume that recruiter is handling? What are the funnel conversion metrics for the requisitions? Has the hiring manager been on holiday during the process? Even if you can capture all of these data points, there may be other intangible bottlenecks in the process. What priority does the hiring manager place on hiring? Is their team going through a reorganization? Do they have commitments that mean hiring is deprioritized?

 

Or think about a location strategy. From a data stance (talent supply, demand, cost, employer brand strength) a location could look fantastic. But if you dig deeper into the site in question, it could be that the transport links aren’t great, that there is a toll road to get there, that the local coffee shops and restaurants aren’t good, that a competitor has launched a new and better office nearby, and so on. All of these contextual factors could be reasons sites fail, and all of them HAVE been reasons sites I have seen studied were failing. The context is vital!

 

This is not to say that the data is wrong, but rather to highlight how important it is to really look at the context behind the data: to understand it not as a singular point in isolation but as a point in a holistic landscape.


1 comment

Great work by Abraham Wald on the armour. A lot of his work now flows into the TI and wider data space.

Robert Dagge
