If you didn’t catch the introduction to this series, you can check it out here: Modern Streaming Architectures – Intro.

Now that we’ve broadly identified what we want to achieve, let’s start putting something on paper. We’re still at a level abstracted from individual tools (the ‘how’), but we can identify the high-level functions that need to happen (the ‘what’). So let’s start off as high-level as we possibly can, without just drawing a bubble labelled ‘cloud’.

The most important part of this entire architecture is the movement. It’s a streaming architecture, and a streaming architecture implies that there is data in motion. I like that term a lot, ‘data in motion’. Catchy. When we visualise a moving process, we tend to use flow charts – and as a flow is basically a synonym for a stream, we may as well start there.

At this point, we are essentially distilling our requirements down into an easily digestible visualisation that demonstrates movement. It’s not much use to the fingers-on-keyboards engineers, but everything has to start somewhere.

Side note, drawing good architecture diagrams is hard. This blog post over at the NCSC has some things to look out for.

It’s a very simple diagram, but a picture is worth a thousand words, and we can start to tell our data in motion story. With this diagram, we can quite easily see where each section lines up with the requirements we set out originally. This will be very useful later on, when we start looking at specific tools and integrations and generally adding a lot of potentially confusing detail; ideally, we should be able to trace our way right back up to this original, high-level view, and relate each specific implementation back to its block, thus linking it to our original requirements.

Ingest

From left-to-right, we begin with Ingest. This might seem self-explanatory, but let’s define it for clarity, as there is some nuance: we must be able to retrieve data from many disparate data sources, and then unify this data into a common data pipeline. It’s not uncommon for the second half of that definition to be missed – but it’s just as important as the first. The last thing you want to deal with when you are actually using your data is 20 different delivery mechanisms – there’s no business value there, it’s unnecessary overhead that is totally avoidable. So, firstly, we want to bring all that data into our platform from a bunch of different sources; secondly, we want to unify the delivery of all of that data once it is within the platform.
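To make the ‘unify’ half concrete, here’s a minimal Python sketch that wraps records from a couple of hypothetical sources in a common envelope before handing them to a single delivery mechanism. The source names, fields and the publish() target are illustrative assumptions, not a tool choice.

    # A minimal ingest sketch: many sources in, one common envelope out.
    # Source names, payload shapes and the publish() target are hypothetical.
    import json
    import time
    from typing import Iterable

    def to_envelope(source_name: str, payload: dict) -> dict:
        # Wrap a raw record in a common envelope so downstream consumers
        # only ever deal with one delivery format.
        return {
            "source": source_name,       # where the record came from
            "ingested_at": time.time(),  # when we received it
            "payload": payload,          # the original record, untouched
        }

    def publish(envelope: dict) -> None:
        # Stand-in for the real delivery mechanism (message bus, queue, etc.).
        print(json.dumps(envelope))

    def ingest(source_name: str, records: Iterable[dict]) -> None:
        for record in records:
            publish(to_envelope(source_name, record))

    # Two disparate sources, one pipeline.
    ingest("firewall_logs", [{"action": "allow", "user IP": "10.0.0.5"}])
    ingest("app_audit", [{"event": "login", "source IP": "10.0.0.9"}])

However it’s actually delivered, the point is the same: everything downstream sees one format and one mechanism, regardless of where the data originated.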

Normalise & Enrich

The goal of this step is to increase the value of the data we ingest, and there are two distinct parts to that.

Firstly, we want to normalise the data. Essentially, we want every piece of data, regardless of which source it came from, to look the same. Why? Because, eventually, someone (or more likely, something) is going to be looking at that data and trying to understand it. We want that to be as simple and pain-free as possible; the less time we spend writing the jobs that analyse the data, the sooner we can begin the analysis (or writing the next job). We really don’t want our data science folks fighting against poor-quality data – it’s a massive waste of their time.

Secondly, we want to enrich our data. Basically, we want to pack as much useful contextual information in with each piece of data as we can. You could argue that this could be done at the query side – have the analysts pull in the contextual information when they need it – but doing it here means we can write it once, handle all of the error handling, quality control and so on, and save our analysts a bunch of time. Remember, extensibility and reusability are vital – maybe you have one data science team now, but will that still be the case in 2 years’ time?

As a quick example, let’s say we have two data sources, A and B. Both sources are logs of user interactions with internal systems, containing similar information but in different formats. They both have a User ID, a source IP and a destination IP, but Source A calls the source IP ‘user IP’ while Source B calls it ‘source IP’. Now, if you want to compare IPs between the two sources, you have to remember that each source uses a different name. Not too bad for two sources, but when you have 100 sources and they each use a different name for the same piece of information, it gets a little tricky to remember. So we normalise the data and use a common name for the same pieces of information; we rename Source A’s ‘user IP’ to ‘Source IP’. Now our analysts don’t need to remember the difference – it’s always ‘Source IP’.

Now let’s say our User ID field actually refers to an Active Directory user ID. That gives us a good way of tracking actions by user, but suppose we wanted to track them by the office the users are based in. That information is usually held in Active Directory, so whenever we see an AD User ID, we could pull out Name, Email and Office location from Active Directory, and enrich the data with those additional fields. Now our analysts have a bunch more ways to play with the data.
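In code, that example might look something like the sketch below, assuming records are represented as Python dictionaries. The field names follow the example above; lookup_ad_user() is a hypothetical stand-in for whatever directory client you would actually use.

    # Normalise then enrich a single record. Field names follow the example
    # above; lookup_ad_user() is a hypothetical stand-in for a real AD client.

    FIELD_ALIASES = {
        "user IP": "Source IP",    # Source A's name for the field
        "source IP": "Source IP",  # Source B's name for the same field
    }

    def normalise(record: dict) -> dict:
        # Rename source-specific field names to our common schema.
        return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}

    def lookup_ad_user(user_id: str) -> dict:
        # Placeholder: in reality this would query Active Directory (and cache).
        return {"Name": "Jane Doe", "Email": "jane.doe@example.com", "Office": "Budapest"}

    def enrich(record: dict) -> dict:
        # Add contextual fields so analysts don't have to join them in later.
        if "User ID" in record:
            record.update(lookup_ad_user(record["User ID"]))
        return record

    raw = {"User ID": "jdoe", "user IP": "10.1.2.3", "destination IP": "10.9.8.7"}
    print(enrich(normalise(raw)))
    # {'User ID': 'jdoe', 'Source IP': '10.1.2.3', 'destination IP': '10.9.8.7',
    #  'Name': 'Jane Doe', 'Email': 'jane.doe@example.com', 'Office': 'Budapest'}

The error handling, caching and quality control live in this one place, written once, rather than in every analyst’s query.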

Stream Analysis

This is the bit that is still very new to most organisations – most will have some kind of ‘analysis’ that is performed periodically, usually over static data – but very, very few are doing continual, real-time analysis over the data as it arrives. Which is a huge shame, because it really is quite cool.

In this step, we are actually going to start using our data to create business value – before it even reaches our data lake. Now, not every use case is a streaming one, so we need to be careful that we aren’t trying to force everything into this stage, but there is a lot that can be done here.

For example, we may have a security use case that looks at irregular user access to sensitive systems. So, perhaps we have a stream of logs for incoming connections to certain sensitive hosts within our system. Because we have enriched the data, we know who each user is, what they do and where they are, so we might look at it geographically and ask “Why is a user from the Budapest dev team accessing a machine hosting Payroll in Ireland?”. We can do this within milliseconds of the event occurring, flag it for investigation, and lock down the user or host if it’s determined to be malicious – hopefully before a serious incident occurs. The same event will then still be available to our periodic batch jobs once it has been written to the data lake, perhaps for further pattern analysis or auditing.
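A toy version of that rule might look like the sketch below, assuming every event already carries the enriched fields from the previous step; the host-to-location mapping and the alert handling are hypothetical placeholders.

    # A toy streaming rule: flag connections where the user's office doesn't
    # match the host's location. Host metadata and alerting are hypothetical.

    SENSITIVE_HOSTS = {
        "payroll-01": "Ireland",  # host -> the location it is hosted in
    }

    def alert(event: dict, reason: str) -> None:
        # Stand-in for the real response: ticket, SOAR playbook, page, etc.
        print(f"FLAGGED: {event['User ID']} -> {event['Host']} ({reason})")

    def check_event(event: dict) -> None:
        host_location = SENSITIVE_HOSTS.get(event.get("Host"))
        if host_location and event.get("Office") != host_location:
            alert(event, reason=f"{event['Office']} user accessing {host_location} host")

    def run(stream):
        # In a real deployment this loop is the job running inside your stream
        # processor; here it just iterates an in-memory list of enriched events.
        for event in stream:
            check_event(event)

    run([{"User ID": "jdoe", "Office": "Budapest", "Host": "payroll-01"}])

Notice how little work the rule itself has to do – the enrichment step already attached the Office field, so the streaming job is just a comparison.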

Persist & Retain

One thing you might have noticed is that there are two blocks that sound similar: ‘Persist’ and ‘Retain’. These terms are quite often used interchangeably, but I maintain that there is a subtle difference. In this case, I use Persist for any data that is no longer transient – meaning it has reached our data lake – but it makes no assumption about how long that data will stay there. Retain is for data that we have determined must remain in the data lake for an extended period of time. Often these two functions can be achieved with the same tooling, which further blurs the lines, but that is not always the case.

When we Persist data, it should be data that either still has some kind of use-case-related benefit, and/or data that we need to Retain, perhaps for compliance reasons. We can be relatively loose with what data we Persist, provided we accept that there must be a process for deleting data we don’t need. That might sound a bit obvious, but you’d be surprised how many organisations assume all data is equal, and thus that it must all be stored, forever – up until the disks are full and panic mode is engaged.

The reality is that some of the data you ingest will serve its purpose early on in the flow, either at the Normalise & Enrich or Stream Analysis stage, and then never be needed again. Of course, we can’t always predict the future, and occasionally we discover a use for it later down the road – this is where that flexibility matters, and why the distinction between Persist and Retain becomes important. We can tolerate persisting some useless data, with the understanding that it has a limited life span, but we can’t tolerate retaining mountains of it.
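One way to keep that distinction honest is to make retention explicit per dataset, rather than an afterthought. A minimal sketch, with made-up dataset names and lifespans:

    # A minimal retention-policy sketch: every dataset we persist gets an
    # explicit lifespan, and anything past it is a candidate for deletion.
    # Dataset names and periods are illustrative only.
    from datetime import datetime, timedelta, timezone

    RETENTION = {
        "firewall_logs": timedelta(days=30),    # persisted, short-lived
        "payroll_access": timedelta(days=730),  # retained, e.g. for compliance
    }

    def is_expired(dataset: str, written_at: datetime) -> bool:
        # True if a record written at `written_at` has outlived its policy.
        lifespan = RETENTION.get(dataset, timedelta(days=30))  # default, not forever
        return datetime.now(timezone.utc) - written_at > lifespan

    print(is_expired("firewall_logs",
                     datetime.now(timezone.utc) - timedelta(days=45)))  # True

The important design choice is the default: data that nobody has made a case for keeping falls out of the lake on its own, rather than silently living forever.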

Conclusion

This has been a brief explanation of the 5 key tenets of our architecture and what each step is trying to achieve.

The next articles will break each block down individually, going into the specifics of what, how and why – which tools, why choose them, how we use them, and what we can do with them.

I’ll try to remember to add their links to this page…