# 012 - Kafka

By Alexander Zabielski

Last Modified: November 25th, 2025

Title: Adoption of AWS MSK Serverless for Event Streaming

Status: Proposed

# Context

# What is the background to this decision?

As we modernize our architecture to decouple components and improve scalability, we require a robust, high-throughput event streaming platform. Currently, our services often rely on synchronous communication or point-to-point integrations, which creates tight coupling and limits our ability to scale components independently.

We need a centralized event bus to facilitate asynchronous communication between our Platform Services and future Microservices. However, standing up a fully provisioned Apache Kafka cluster carries significant operational overhead (ZooKeeper management, broker sizing, partition rebalancing) and high fixed costs, which may not be justifiable during the initial phases of our deployment, when traffic is variable and ramp-up is gradual.

# Decision

# What decision have you made?

We will adopt AWS MSK (Managed Streaming for Apache Kafka) Serverless as our standard event streaming platform.

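  • Infrastructure Management: The MSK cluster and associated resources (topics, IAM access policies, configurations) will be strictly defined and managed via Terraform. MSK Serverless authorizes clients through IAM access control rather than Apache Kafka ACLs, so access rules are expressed as IAM policies.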

  • Repository: All Terraform infrastructure code for this implementation will reside in the following repository: https://dev.azure.com/BPOSEC/IBP2/_git/kafka.

  • Deployment Strategy: We will launch using the Serverless tier to minimize initial costs and operational overhead. We reserve the option to migrate to Provisioned MSK instances in the future should traffic patterns become consistent enough to justify the reserved cost or if we hit serverless throughput limits.

  • Migration Plan: Adoption will occur in a phased approach, starting with Platform Services integration, followed by a wider rollout to Microservices.

# Rationale

# Why did you choose this decision?

This decision balances architectural necessity with cost efficiency and operational simplicity.

  1. Factors that influenced the decision:

    • Cost Efficiency for Phased Rollout: MSK Serverless utilizes a pay-as-you-go pricing model. Since our initial deployment will not have full production load immediately, Serverless prevents us from paying for idle brokers in a provisioned cluster.

    • Operational Simplicity: MSK Serverless abstracts away broker management, partition reassignment, and storage scaling. We still choose the partition count per topic, but the team can focus on producing and consuming events rather than managing Kafka infrastructure.

    • Infrastructure as Code (IaC): Centralizing the configuration in the specific Azure DevOps repository (.../IBP2/_git/kafka) ensures that our event bus is version-controlled, auditable, and reproducible.

    • Scalability: The Serverless tier automatically scales compute and storage in response to traffic, handling the "bursty" nature of initial integrations.

  2. Evidence/Research: Analysis of our projected initial throughput suggests that a Provisioned cluster would be more than 60% underutilized in the first 6-12 months. MSK Serverless avoids paying for that idle capacity. Furthermore, because both tiers expose the standard Apache Kafka APIs, clients can be moved to a Provisioned cluster if metrics later indicate that it is more cost-effective for sustained high loads. This is a migration to a new cluster (topics and retained data must be replicated), not an in-place conversion, and should be planned as such.

  3. Strengths of the chosen solution:

    • Native AWS Integration: Seamless integration with IAM for authentication (replacing complex SASL/SCRAM management).

    • Elasticity: Automatic scaling prevents "out of disk space" errors common in self-managed Kafka.

    • Defined Migration Path: The decision explicitly acknowledges the evolution from Platform Services to Microservices, allowing us to validate the pattern on core services before wide distribution.
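
The cost argument above can be checked with a small model. The following is a minimal sketch, not a pricing tool: the class and function names are hypothetical, and every rate passed in the usage example is a placeholder, not an AWS list price. It assumes Serverless is billed per cluster-hour, per partition-hour, per GB in/out, and per GB-month stored, and that a Provisioned cluster is billed per broker-hour plus provisioned storage.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common billing approximation


@dataclass(frozen=True)
class ServerlessRates:
    """Hypothetical unit prices; substitute current figures for your region."""
    cluster_hour: float
    partition_hour: float
    per_gb_in: float
    per_gb_out: float
    per_gb_month_stored: float


def serverless_monthly_cost(rates, partitions, gb_in, gb_out, gb_stored):
    """Fixed hourly charges plus usage-based charges for one month."""
    fixed = HOURS_PER_MONTH * (rates.cluster_hour + partitions * rates.partition_hour)
    usage = (
        gb_in * rates.per_gb_in
        + gb_out * rates.per_gb_out
        + gb_stored * rates.per_gb_month_stored
    )
    return fixed + usage


def provisioned_monthly_cost(broker_hour, brokers, gb_provisioned, per_gb_month):
    """Brokers are billed around the clock, regardless of utilization."""
    return HOURS_PER_MONTH * broker_hour * brokers + gb_provisioned * per_gb_month


def cheaper_option(serverless, provisioned):
    """Name of the cheaper tier; ties go to Serverless (lower operational cost)."""
    return "serverless" if serverless <= provisioned else "provisioned"


# Placeholder rates and volumes, for illustration only.
rates = ServerlessRates(
    cluster_hour=1.0, partition_hour=0.01,
    per_gb_in=0.1, per_gb_out=0.1, per_gb_month_stored=0.1,
)
serverless = serverless_monthly_cost(rates, partitions=10, gb_in=100, gb_out=200, gb_stored=50)
provisioned = provisioned_monthly_cost(broker_hour=0.5, brokers=3, gb_provisioned=300, per_gb_month=0.1)
print(cheaper_option(serverless, provisioned))
```

Re-running the comparison with measured monthly volumes and current regional prices gives a repeatable way to spot the point at which Provisioned becomes the cheaper option.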

# Implications

# What are the implications of this decision?

This decision introduces a new core component to our architecture and requires specific workflows.

  1. People/Training:

    • Development Team: Needs training on Kafka concepts (Producers, Consumers, Topics, Partitions, Offset management).

    • DevOps/Infra Team: Must maintain the Terraform modules in the BPOSEC/IBP2/_git/kafka repository and manage the Azure DevOps pipelines associated with it.

  2. Process Adjustments:

    • Topic Management: Requests for new Kafka topics must go through a Pull Request process in the kafka repository rather than ad-hoc creation.

    • Phased Migration:

      • Phase 1: Integrate Platform Services (e.g., Logging, Audit, Identity).

      • Phase 2: Integrate Business Domain Microservices.

    • Cost Monitoring: Regular reviews of MSK Serverless costs must be established to determine the tipping point for converting to Provisioned instances.

  3. Tooling:

    • Terraform: The primary tool for all MSK lifecycle management.

    • Azure DevOps: Hosting the repository and CI/CD pipelines.

    • IBP Core Libraries: Shared library for Kafka producer/consumer logic to standardize event handling across services.

  4. Risks:

    • Throughput Limits: MSK Serverless enforces hard quotas on write and read throughput, both per partition and per cluster. If a specific Platform Service exceeds these quotas, we may be forced to migrate to Provisioned sooner than expected.

    • IAM Complexity: Configuring Kafka clients to use AWS IAM can be trickier than standard username/password auth, potentially delaying initial integration.

    • Vendor Lock-in: Heavy reliance on MSK-specific features (such as IAM auth) makes moving to a generic Kafka provider harder in the future.
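
To make the throughput risk easier to reason about, the sketch below estimates how many partitions a topic needs so that peak traffic stays under the per-partition quotas. It is an illustration under stated assumptions: the function name and the `headroom` parameter are hypothetical, traffic is assumed to spread evenly across partitions, and the quota values must be taken from the current AWS MSK Serverless quota documentation (the numbers in the usage line are placeholders).

```python
import math


def min_partitions(peak_write_mbps, peak_read_mbps,
                   write_limit_mbps, read_limit_mbps, headroom=0.8):
    """Smallest partition count that keeps per-partition traffic below
    headroom * quota, assuming keys distribute traffic evenly.

    headroom is the fraction of the quota we are willing to use (0 < headroom <= 1).
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in the interval (0, 1]")
    if write_limit_mbps <= 0 or read_limit_mbps <= 0:
        raise ValueError("quota values must be positive")
    needed_for_writes = math.ceil(peak_write_mbps / (write_limit_mbps * headroom))
    needed_for_reads = math.ceil(peak_read_mbps / (read_limit_mbps * headroom))
    return max(1, needed_for_writes, needed_for_reads)


# Placeholder traffic and quota values, for illustration only.
partitions = min_partitions(
    peak_write_mbps=20, peak_read_mbps=60,
    write_limit_mbps=5, read_limit_mbps=10, headroom=0.5,
)
print(partitions)
```

A skewed partition key breaks the even-distribution assumption, so a hot partition can still hit its quota even when the computed count looks sufficient.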

# Trade-Offs

# What are the pros and cons of this decision?

  • Benefits:

    • Zero Cluster Operations: No patching, rebalancing, or disk management.

    • Cost aligned with Usage: Beyond small per-cluster and per-partition hourly charges, we pay for the data we stream and retain rather than for fixed broker capacity.

    • High Availability: Multi-AZ replication is built-in by default.

    • Secure: IAM integration eliminates the need to manage static secrets for Kafka access.

  • Drawbacks:

    • Configuration Limits: Less control over specific Kafka broker configurations (e.g., message.max.bytes limits) compared to Provisioned.

    • Cost at Scale: At very high, constant throughput, Serverless becomes more expensive than Provisioned.

    • Cold Starts/Latency: While latency is generally low, a Serverless cluster can in theory show slightly more latency variability than a warmed, over-provisioned cluster.
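
Because access is granted through IAM rather than static secrets, each service needs a policy scoped to its topics. The sketch below builds such a policy document as a minimal example. The function name and all argument values are hypothetical; the `kafka-cluster:*` action names and ARN layouts follow the MSK IAM access control model as we understand it and should be verified against current AWS documentation before being encoded in Terraform.

```python
import json


def topic_access_policy(region, account_id, cluster_name, cluster_uuid,
                        topic, group=None, write=False, read=False):
    """Build a least-privilege IAM policy document for one topic."""
    prefix = f"arn:aws:kafka:{region}:{account_id}"
    cluster_path = f"{cluster_name}/{cluster_uuid}"

    # Every client must be allowed to connect to the cluster.
    statements = [{
        "Effect": "Allow",
        "Action": ["kafka-cluster:Connect"],
        "Resource": f"{prefix}:cluster/{cluster_path}",
    }]

    # Topic-level permissions: describe is needed for both reads and writes.
    topic_actions = ["kafka-cluster:DescribeTopic"]
    if write:
        topic_actions.append("kafka-cluster:WriteData")
    if read:
        topic_actions.append("kafka-cluster:ReadData")
    statements.append({
        "Effect": "Allow",
        "Action": topic_actions,
        "Resource": f"{prefix}:topic/{cluster_path}/{topic}",
    })

    # Consumers that commit offsets also need access to their consumer group.
    if read and group:
        statements.append({
            "Effect": "Allow",
            "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
            "Resource": f"{prefix}:group/{cluster_path}/{group}",
        })

    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical identifiers, for illustration only.
policy = topic_access_policy(
    "eu-west-1", "123456789012", "events", "abc-123",
    topic="audit.events", group="audit-consumers", read=True,
)
print(json.dumps(policy, indent=2))
```

Generating policies from one helper keeps producer and consumer permissions consistent across services; the same structure can be expressed directly as Terraform policy documents in the kafka repository.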

# Key Evaluation Metrics

# How will success be measured?

The following criteria will be used to measure whether the decision solves the intended problems.

  • Cost Efficiency: Track cost per GB processed. Success is defined as remaining cheaper than a minimal Production Provisioned cluster (e.g., 3x kafka.m5.large) during the migration phase.

  • Uptime/Availability: Achieve 99.9% availability for the event bus.

  • Integration Velocity: Successful onboarding of the identified "Platform Services" within the first quarter.

  • Lag Metrics: Consumer lag should stay within an agreed threshold per consumer group, including during bursts, indicating that Serverless scaling is keeping up with traffic.
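
Two of these metrics reduce to simple arithmetic, sketched below under stated assumptions. The function names are hypothetical, offsets are plain dictionaries keyed by partition number (in practice they would come from the Kafka admin API or CloudWatch), and a partition with no committed offset is counted from offset 0, which may overstate lag depending on the consumer's reset policy.

```python
def total_lag(end_offsets, committed_offsets):
    """Sum over partitions of (log end offset - committed offset).

    Partitions without a committed offset are counted from offset 0.
    """
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    )


def downtime_budget_minutes(availability_pct, period_days=30):
    """Minutes of downtime allowed in the period for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_days * 24 * 60 * (1 - availability_pct / 100)


# Illustrative offsets for one consumer group on a three-partition topic.
lag = total_lag(
    end_offsets={0: 100, 1: 250, 2: 40},
    committed_offsets={0: 90, 1: 250},
)
budget = downtime_budget_minutes(99.9)
print(lag, round(budget, 1))
```

For the 99.9% target, the budget works out to roughly 43 minutes of unavailability per 30-day period, which gives the availability metric a concrete alerting threshold.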

# Conclusion

# What is the final recommendation?

We recommend the immediate adoption of AWS MSK Serverless managed via Terraform in the BPOSEC/IBP2/_git/kafka repository. This approach minimizes our financial and operational risk during the initial migration of Platform Services. It provides a modern, scalable event bus without the burden of managing infrastructure, allowing the team to focus on the application logic required to decouple our services. We will re-evaluate the cost-benefit of moving to Provisioned instances once the Microservices migration phase reaches maturity.
