It’s 2020 and we’re creating more data than ever. We each use the internet, in some way or another, for almost everything we do. Listening to music, reading a book, playing a game, managing our money, talking to our friends and family – these all used to be offline experiences that we’ve now moved into the digital world.
The organisations delivering the services for these experiences have to shift massive amounts of data; they must, obviously, move the data the user requires to use the service (the content), but they must also move the data that the user generates by using that service (user data).
Content delivery is a huge subject and firms like Netflix have shown how to do it well; during the 2020 Covid-19 pandemic, Netflix was estimated to make up 11% of the world’s internet traffic. This is a pretty incomprehensible amount of data, and it’s thanks to the move to distributed content delivery networks (CDNs) and elastic cloud infrastructure (AWS in Netflix’s case) that they’ve been able to scale to meet such demand.
However, moving (and utilising) user data is a much less neatly catered-for area. That’s not to say that there aren’t good solutions, or that no organisation has managed it – you just have to look at Google, a user data-driven company valued at over $1tn, to know that it can be done – but the exact services and architectures needed to do it well are much less clear than with content delivery.
It’s a hotly contested area with many competing services, some old and many new. It’s hard to pinpoint exactly what the area is called, but the market seems to have settled on the terms ‘Data Warehouse as a Service’ and ‘Enterprise Data Cloud’. I’m not a huge fan (DWaaS, really?), but reports indicate these markets are big ($35bn and $135bn respectively) and are only going to get bigger every year. With this in mind, it’s obvious why so many organisations want to carve out their piece of the pie.
The ‘Enterprise Data Cloud’ moniker seems to be the broader term, one that recognises that we no longer want to just vacuum up lots of low-quality, useless data and shove it on a bunch of dusty disks to be forgotten about. Organisations are waking up to the fact that user data is as valuable as their content. Whether you can directly monetise that user data, à la Google, or you realise indirect financial gain from it (via improved services, happier customers, targeted marketing, etc.) will vary between organisations – both options are valuable.
So now we don’t just care about ‘warehousing’ our data. We want to control exactly what data we pick up. We want to be able to normalise and ensure the quality of our data. We want to be able to enrich our data. We want to be able to analyse our data in real time. We want to train and run machine learning across our data. We want the data easily available to the people and teams that need it. We want to be able to scale up to cope with increased throughput, but we also want to be able to scale down to control cost. We want to do all of this while maintaining observability, traceability and accountability. We also still want to warehouse data, too.
That is a very high-level list of requirements, but they are the requirements that matter to almost any business shopping around in the ‘Enterprise Data Cloud’ market. It’s a varied list and satisfying every single one is no easy feat; if you’ve been around in this space, you’ll know that there is always a trade-off somewhere. So, is it possible to architect a platform that can satisfy all of those requirements with the tools available today?
I believe it is. So I’ll be doing a series of posts that will cover everything from designing the architecture through to building out a proof-of-concept platform. Along the way I hope to throw in a few supplementary posts that go into more depth on specific tools, their uses and some best practices.