Conversations with 200+ people at Kafka Summit
I spent 2 days at Kafka Summit 2024 and spoke with over 200 people. They were loooong days.
But it's an amazing way to completely surround yourself with people who are actually doing the job, and that's such a fantastic opportunity to learn.
There are plenty of posts out there that summarize the talks, recap the keynote and hype the vendor releases, so I'm not going to talk about any of that. This post is just going to cover the conversations I had with practitioners who attended the event. Given that there were over 200 of them, this is my personal distillation of what we spoke about.
Of course, there's some bias here: it's a Kafka event, so it's generally folks with streaming on their minds; I have my own biases that influence the natural path a conversation takes; and I was there representing my company. I believe the topics are still useful with these biases in mind.
The four main themes that I found myself discussing were:
- Operational Analytics
- Event Driven Architectures
- Unified ETL with Apache Flink
- Automation and Data as Code/Config (DaC)
Operational Analytics
I can't count the number of conversations I had around this one.
So-called "operational" systems tend to be the backends of a business's application, often described with a combination of terms like relational, transactional, CRUD, ACID or documents - think databases like Postgres or MongoDB. Generally, they need to support a lot of key-based single-row READs, as well as single-row DELETEs and UPDATEs.
The "analytical" systems, on the other hand, power a company's internal reporting and business intelligence. Here, the name of the game is generally "Can I stick Tableau on it and do crazy JOINs over a few TBs of data?". These systems power heavy queries that scan a huge amount of historical data to try and answer a question.
However, many attendees are starting to see that operational systems, or the applications that use them, now want to utilize analytics. Most often, this is in the form of user-facing analytics, where the output of analytical queries can be served back to users within the application. This appears to be causing quite the headache, as there is typically a large divide between the "operational" and "analytical" teams or systems, and existing "operational" systems cannot handle the analytical workload.
Simply put, the operational team can't just start running massive analytical queries over their Postgres database without tanking the service, and building a solution over the internal Snowflake is too slow & cumbersome.
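To make that divide concrete, here's a minimal sketch (the Postgres `orders` table and connection details are hypothetical, using the `psycopg2` driver) contrasting the query an operational backend is built for with the kind of user-facing analytical query these teams now want to serve from the same data:

```python
import psycopg2

# Hypothetical connection and schema, purely for illustration.
conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Operational workload: key-based, single-row lookup. Cheap, index-friendly,
# and what Postgres-style backends are designed to do millions of times a day.
cur.execute("SELECT status FROM orders WHERE order_id = %s", ("order-123",))
print(cur.fetchone())

# User-facing analytics: an aggregation over a customer's recent history,
# served back inside the application. Run concurrently for many users,
# scans like this are what start to tank the operational database.
cur.execute(
    """
    SELECT date_trunc('day', created_at) AS day,
           count(*)                      AS orders,
           sum(amount)                   AS revenue
    FROM orders
    WHERE customer_id = %s
      AND created_at > now() - interval '90 days'
    GROUP BY 1
    ORDER BY 1
    """,
    ("customer-42",),
)
print(cur.fetchall())
```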
"Operational analytics" seemed to be the most popular term for this idea, which makes sense.
Event Driven Architectures
Event Driven Architectures (EDA) is hardly a new subject for a Kafka conference; I'd say it's probably one of the longest-living themes (though definitely behind "Kafka is a pain to manage"). In the past, I've found that EDA is usually something that smaller, newer businesses are very keen on, but it remains largely unexplored by larger enterprises. It felt a bit different this year, with many of these conversations happening with engineers from global giants.
The conversations were largely the same as they always were; folks are tired of pushing data into silos, then building a bunch of glue around it to work out if they need to take action, and then more glue to actually take the action. The challenge here is that much of this "glue" requires domain knowledge of the systems on either side: you need to know where the data lives and how it's stored, write logic against that model to determine if action is required, and then translate this into whatever the destination system needs. Because of the inter-domain nature of this glue, it's common that ownership of the glue itself is unclear. It's also typically quite brittle, as systems on either side may change, and there is often a lack of communication to notify other teams of changes that may break things.
Instead, new data should be treated as events that represent something happening and pushed onto a central bus, where downstream systems can subscribe to the stream(s) of events they care about. This means that any downstream system or team has a single, known and consistent integration point when it needs access to data. It becomes much easier for each domain to self-service access to data, keep full ownership of its systems and make changes without breaking other consumers.
Of course, you can also become "event driven", so rather than polling every 5 minutes and working out all of the actions you should have taken, you can adopt an "always-on" pattern where events trigger individual actions as they arrive. This is probably what jumps into most people's minds when they think of the benefits or purpose of EDA, but what I took away from all of these conversations is that decoupling event ingestion & distribution from the myriad of downstream systems is probably the more immediate win for a lot of teams.
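As a rough illustration of the pattern, here's a minimal sketch using the `confluent-kafka` Python client; the `orders.created` topic, broker address and the "action" taken by the consumer are all hypothetical:

```python
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"  # hypothetical broker address

# Producing side: the owning domain publishes an event describing what happened,
# with no knowledge of who will consume it downstream.
producer = Producer({"bootstrap.servers": BROKER})
producer.produce(
    "orders.created",
    key="order-123",
    value=json.dumps({"order_id": "order-123", "amount": 42.0}),
)
producer.flush()

# Consuming side: any downstream team subscribes to the streams it cares about
# and reacts to events as they arrive, instead of polling the source system.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "fraud-checks",      # each downstream use case gets its own group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.created"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        print(f"reacting to {msg.topic()}: {event}")  # hypothetical action
finally:
    consumer.close()
```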
Unified ETL with Apache Flink
Which brings us to Apache Flink, the hot topic for Confluent and many other vendors at Kafka Summit this year.
Most people, even those already deep into it, will agree that Flink is a pretty complex system. However, it brings some really nice things with it. It was built to be "streaming first", making it a fantastic choice when working with streaming data. And while it excels at streaming, Flink's design is flexible enough to work with batch data as well, meaning you can reuse knowledge across both paradigms. On top of that, it has a growing ecosystem of input and output integrations and a SQL abstraction called "Flink SQL", and it has been tried and tested at huge scale. (If you consider the EDA pattern described above, you might see how Flink sits quite nicely between a central streaming bus and anything to the right-hand side of it.)
Flink's core purpose was to be a stream processing engine, but as is often the case, it seems the market is finding a different, and perhaps more widely applicable, fit for it. Its flexibility in working with both streaming & batch makes it attractive in the varied environments where ETL tooling becomes painful. The ability to write code helps with complex scenarios, while the comprehensive SQL abstraction makes it adoptable by teams who don't have the resources to be effective with, or simply don't need, that level of control. And the huge range of integrations makes it a no-brainer. Perhaps what we'll see is teams initially adopting Flink to solve ETL, and then ramping into stream processing use cases in the future.
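For a sense of what that unified ETL can look like, here's a minimal PyFlink sketch of a Flink SQL job that reads a Kafka topic and maintains an aggregation; the topic, schema and connector settings are hypothetical, and the Kafka connector jar would need to be on the job's classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming mode here; swapping to in_batch_mode() lets the same SQL run over
# bounded data, which is a big part of Flink's unified ETL appeal.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a hypothetical Kafka topic of order events.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders.created',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-etl',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: print for illustration; a real job would target a warehouse,
# lakehouse table or another topic instead.
t_env.execute_sql("""
    CREATE TABLE customer_totals (
        customer_id STRING,
        total_spend DOUBLE
    ) WITH (
        'connector' = 'print'
    )
""")

# The "ETL" itself: a continuously maintained aggregation, expressed in SQL.
t_env.execute_sql("""
    INSERT INTO customer_totals
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
""").wait()
```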
Automation and Data as Code/Config (DaC)
"Do you have a Terraform provider?"
When I'm repping Tinybird (a real-time data platform) at an event, I get asked this question a lot. People want to know if they can deploy & build with the platform using Terraform.
To be honest, I'm still surprised at the level of adoption Terraform has seen in data teams, but it seems quite well entrenched now. Though I do wonder if that will change with IBM's acquisition of HashiCorp. Do people really want to be beholden to Ol' Big Blue?
I've seen this idea of "Data as Code" (or "as Config") deployed with spectacular results. It's a pretty simple concept where every part of a Data Platform is defined in files - some folks call it Code, others Config, but it's generally the same idea - including the platform itself, schemas, integrations, queries, jobs and all the artifacts of actual use cases.
Perhaps the most striking benefit is the impact it has on how people work.
If you are in a data team, or working with one, you'll probably be familiar with the pains around collaboration - who performs what work, who owns it, who reviews it, who supports it, etc. Some teams have gone for clear centralization, where the data team is the gatekeeper to everything "data", while others have gone the "data mesh" route and federated as much as possible into domain teams.
Both approaches have their benefits, but in reality, neither approach has perfectly solved every pain.
This (mostly technical) change improves both the centralized and decentralized ways of working, and it also opens a nice path to "controlled decentralization".
There are significant benefits to a data team being able to make well-informed decisions about data infrastructure, and centralizing knowledge and experience makes it easier to support and appropriately resource work. But we're all aware of the discussion around "domain expertise" and the push for data teams to be "closer to the business". This makes sense, but we've been saying it for over a decade and it isn't happening.
I think it's pretty unrealistic to expect that data teams will become self-sufficient in business domain knowledge. Businesses are too different, even within a single industry, let alone across industries (and most data engineers don't stay in the same industry their whole career).
Instead, this DaC model allows us to centralize knowledge while federating responsibilities. The responsibility for defining and owning the data platform is given to the data team, who are best placed for it. However, that knowledge is centralized in a repository that is open to all. Similarly, the specifics of building use cases are federated to the teams who understand them best, but they also push that knowledge to the central repository. The repository now becomes a place for those two teams to collaborate. The business teams can work at their own pace to build functionality, while the data team can put guardrails in place that allow for most "overhead" work (e.g. deploying) to be automated.
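As a purely hypothetical sketch of what such a guardrail could look like, here's a small CI-style check over a repository of data-as-code definitions; the `pipes/` layout, the YAML format and the required fields are all invented for illustration and would differ per platform:

```python
import sys
from pathlib import Path

import yaml  # PyYAML; assumes definitions are committed as YAML files

# Hypothetical repo layout: each use case lives in pipes/<team>/<name>.yaml
PIPES_DIR = Path("pipes")

# Guardrails the data team owns: fields every definition must declare
# before the automated deploy is allowed to run.
REQUIRED_KEYS = {"name", "owner", "schema", "query"}


def validate(path: Path) -> list[str]:
    """Return a list of guardrail violations for one definition file."""
    definition = yaml.safe_load(path.read_text()) or {}
    errors = [f"{path}: missing '{key}'" for key in REQUIRED_KEYS - set(definition)]
    if "select *" in str(definition.get("query", "")).lower():
        errors.append(f"{path}: 'SELECT *' is not allowed in production pipes")
    return errors


if __name__ == "__main__":
    all_errors = [e for f in sorted(PIPES_DIR.glob("**/*.yaml")) for e in validate(f)]
    for error in all_errors:
        print(error)
    sys.exit(1 if all_errors else 0)
```

The point isn't the specific checks; it's that the business teams ship use cases at their own pace while the data team's standards are enforced automatically in the shared repository.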