Thought-provoking words from Hilary Mason about the future of data science

Hilary Mason is an important person in the world of data science, so her words are always worth listening to. This interview has some particularly thought-provoking ideas.


As she rightly says: “Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted. Hadoop existed, but was still extremely hard to use at that point. Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”

My view is that the data science toolkit became genuinely practical far more recently than 10 years ago. Hand in hand with that, the majority of corporate technologists are unaware of how far data science has come and are frankly disbelieving of what is now possible.

At Idax, we perform data science on identity and access management data, using unsupervised learning techniques to determine whether internal staff’s access rights are appropriate. As a result, we tend to perform analytics on reasonably large data sets, with hundreds of thousands of accounts and millions of permissions.
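To make the idea concrete, here is a minimal sketch of one very simple unsupervised approach: scoring each account by how rare its permissions are across the whole population. This is an illustration only and is not a description of Idax's actual method. The function name `rarity_scores`, the scoring formula and the toy accounts are all assumptions made for the example.

```python
from collections import Counter


def rarity_scores(entitlements):
    """Score each account by how unusual its permissions are.

    `entitlements` maps an account name to a set of permission names.
    Illustrative heuristic only: the score is the mean, over the account's
    permissions, of (1 - share of accounts holding that permission).
    """
    n_accounts = len(entitlements)
    # Count how many accounts hold each permission.
    holders = Counter(p for perms in entitlements.values() for p in set(perms))

    scores = {}
    for account, perms in entitlements.items():
        perms = set(perms)
        if not perms:
            scores[account] = 0.0
            continue
        # A rarely held permission contributes close to 1, a common one close to 0.
        scores[account] = sum(1 - holders[p] / n_accounts for p in perms) / len(perms)
    return scores


# Hypothetical toy data, invented for this example.
entitlements = {
    "alice": {"email", "wiki"},
    "bob": {"email", "wiki"},
    "carol": {"email", "wiki", "payments_admin"},
    "dave": {"email"},
}

scores = rarity_scores(entitlements)
# Accounts ordered from most to least unusual.
ranked = sorted(scores, key=scores.get, reverse=True)
```

No business rules are supplied here: the only input is who holds what, and the account with the one permission nobody else has rises to the top of the ranking. A production system would need something considerably more sophisticated, but the principle of letting the data define what is normal is the same.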

But the main observation from our clients is that for non-data-scientists there’s still a lot of catching up to do. Of course, they love the results. Being able to dynamically determine a risk rating for all staff without any additional business knowledge as input is a huge benefit.

But their general unfamiliarity with the techniques means they struggle to believe three things: firstly, that their corporate entitlements database can be analysed in real time on a machine no bigger than a high-end gaming laptop; secondly, that by using in-memory databases and algorithm optimisation we can provide them with results across the whole domain in seconds and minutes rather than hours; and lastly, that the dirtier the data, the better the results.

As Mason says: “A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two.”