# 009 - IBP Platform Websocket

By Mina Wahba

Last Modified: Aug 27th, 2025

Title: Adoption of Serverless WebSocket Architecture

Status: Implemented


# Context

We needed a scalable, secure, and reliable way to deliver real-time messages from backend services to clients over WebSocket. The solution had to support:

  • One-way communication (server → client only)
  • Topic-based subscriptions
  • User-specific messaging
  • Offline message persistence and delivery
  • Multi-environment deployment
  • Serverless infrastructure for scalability and cost-efficiency

The IBP WebSocket platform was designed to meet these needs using AWS services including API Gateway, Lambda, DynamoDB, SQS, and SNS.


# Decision

We adopted a fully serverless WebSocket architecture with the following components and design choices:

  • WebSocket API Gateway: Manages connection lifecycle with routes $connect, $disconnect, extendTTL, and ping.
  • DynamoDB: Stores connection metadata and topic subscriptions using multiple rows per connection to avoid scans and GSIs.
  • SNS: Two topics for broadcast and user-specific messages, allowing external services to publish messages.
  • Lambda: Functions for connection management, message routing, and offline message delivery.
  • SQS: Queues offline messages for reliable and scalable delivery.
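
As a rough illustration of how an external service might publish through the two SNS topics, the sketch below builds the arguments for an SNS `publish` call. The topic ARNs, the JSON message shape, and the `user_id` message attribute are assumptions made for illustration, not the platform's actual contract.

```python
import json

# Hypothetical topic ARNs; the real ARNs are environment-specific.
BROADCAST_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ibp-ws-broadcast"
USER_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ibp-ws-user"


def build_publish_args(topic, payload, user_id=None):
    """Return keyword arguments for an SNS publish call.

    A message with a user_id goes to the user-specific topic; otherwise
    it goes to the broadcast topic. The body layout is an assumption.
    """
    args = {
        "TopicArn": USER_TOPIC_ARN if user_id else BROADCAST_TOPIC_ARN,
        "Message": json.dumps({"topic": topic, "payload": payload}),
    }
    if user_id:
        args["MessageAttributes"] = {
            "user_id": {"DataType": "String", "StringValue": user_id}
        }
    return args
```

With boto3, the result could be passed along as `boto3.client("sns").publish(**build_publish_args("orders", {"id": 1}))`.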

# Rationale

This architecture was chosen to ensure scalability, reliability, and operational simplicity. Serverless components reduce infrastructure overhead and allow for automatic scaling. Using multiple rows in DynamoDB avoids the need for GSIs and scan operations, improving performance. SQS decouples message delivery from connection handling, enhancing reliability and fault tolerance.


# Implications

People/Training:

  • Team members will need familiarity with AWS serverless services: API Gateway (WebSocket), Lambda, DynamoDB, SQS, and SNS.
  • Onboarding documentation should include examples for message publishing and client connection flows.

Process Adjustments:

  • CI/CD pipelines updated to accommodate the new deployment model via Terraform.
  • API Gateway configurations adjusted to utilize the new Lambda authorizer.

Tooling:

  • Ubuntu workspaces with proper tooling installed.
  • Consolidated infrastructure-as-code via Terraform.

Risks:

  • Operational Complexity: Managing multiple rows per connection in DynamoDB increases data modeling complexity.
  • Connection Management: Persistent connections require careful TTL extension and ping handling to avoid premature disconnects.
  • Debugging: Tracing message delivery across SNS → Lambda → WebSocket may require enhanced observability tooling.
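
The connection-management risk can be made concrete with a small sketch of what an `extendTTL` handler might compute. It assumes each connection row carries an expiry timestamp in epoch seconds (the format DynamoDB's TTL feature expects); the 10-minute window and the function names are hypothetical.

```python
import time

TTL_SECONDS = 600  # hypothetical keep-alive window


def extended_expiry(now=None, ttl_seconds=TTL_SECONDS):
    """Return the new expiry timestamp (epoch seconds) for a connection."""
    current = int(time.time()) if now is None else int(now)
    return current + ttl_seconds


def is_expired(expires_at, now=None):
    """A connection counts as stale once its expiry timestamp has passed."""
    current = int(time.time()) if now is None else int(now)
    return expires_at <= current
```

A client that stops calling `extendTTL` would simply let its rows age out, which is why the extension interval has to be shorter than the TTL window.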

# Trade-Offs

Polling: Clients repeatedly check for updates.

  • X Inefficient, introduces latency, and doesn’t scale well in serverless environments.

Push-based services (SNS, SES, MQTT via IoT Core):

  • ✓ Great for mobile, email, or device notifications.
  • X Not ideal for instant, on-screen updates in web applications.

WebSockets:

  • ✓ Best suited for real-time, in-app notifications.
  • ✓ Enables instant updates directly to the user’s screen.
  • ✓ Integrates smoothly into a serverless, event-driven architecture.
  • X Requires persistent connection management (e.g., TTL extension, keep-alive pings).
  • X Adds complexity in handling offline users and reconnection logic.

# Architecture Design and Decisions

# DynamoDB

# Why Multiple Rows Per Connection (No GSI, No Scan)

Rather than relying on scan operations or Global Secondary Indexes (GSIs), the table is modeled so that every access is a direct key lookup or query. The cost is three or more writes or updates per connection.

# Problem 1: User ID Lookup by Connection ID

  • Challenge: The `disconnect` and `increase_connection_ttl` functions need to retrieve the user ID from the connection ID.
  • Initial Approach: Global Secondary Index (GSI) for connection ID lookups.
  • Solution: Store two rows per connection (connection_id ↔ user_id) to avoid expensive GSI operations.
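
A minimal in-memory sketch of the two-row pattern follows. A plain dictionary stands in for the DynamoDB table, and the `connection_lookup#...` key format for the reverse row is hypothetical; only the `connection#{user_id}#{connection_id}` shape appears in this document.

```python
def connection_rows(user_id, connection_id):
    """Build the forward row and the reverse-lookup row for one connection."""
    forward = {
        "PK": f"connection#{user_id}#{connection_id}",
        "user_id": user_id,
        "connection_id": connection_id,
    }
    # Reverse row: keyed by connection ID alone, so handlers that only know
    # the connection ID can find the user with a single key lookup.
    reverse = {
        "PK": f"connection_lookup#{connection_id}",  # hypothetical key format
        "user_id": user_id,
    }
    return [forward, reverse]


def user_id_for(table, connection_id):
    """Resolve the user ID from a connection ID without a GSI or scan."""
    row = table.get(f"connection_lookup#{connection_id}")
    return row["user_id"] if row else None


# A dict keyed by PK stands in for the table.
table = {row["PK"]: row for row in connection_rows("user-1", "conn-abc")}
```

The trade-off is that both rows must be written on connect and removed on disconnect, which is where the extra writes per connection come from.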

# Problem 2: Topic-Based Message Broadcasting

  • Challenge: The `send_message_to_all` function needs to find all connections subscribed to a specific topic.
  • Initial Approach: Store topics as a list attribute in a single connection row, requiring scan operations.
  • Solution: Store one row per topic subscription to enable efficient topic-based queries.
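
The one-row-per-subscription layout for Problem 2 can be sketched as follows. A list of rows stands in for the table, `query` mimics a DynamoDB Query on an exact partition key, and the `topic#...` partition key and `SK` sort key formats are hypothetical.

```python
def subscription_rows(connection_id, topics):
    """One row per (topic, connection) pair."""
    return [
        {"PK": f"topic#{topic}", "SK": f"connection#{connection_id}"}
        for topic in topics
    ]


def query(rows, pk):
    """Stand-in for a Query: returns only rows with an exact PK match.

    A real Query reads just the matching partition; the list filter here
    only imitates the result, not the cost.
    """
    return [row for row in rows if row["PK"] == pk]


def connections_for_topic(rows, topic):
    """Find every connection subscribed to a topic with one query."""
    prefix = "connection#"
    return [row["SK"][len(prefix):] for row in query(rows, f"topic#{topic}")]


rows = subscription_rows("conn-a", ["orders", "alerts"])
rows += subscription_rows("conn-b", ["orders"])
```
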

For Problem 1, the key `PK=connection#{user_id}#{connection_id}` cannot be used to look up a connection by `connection_id` alone. A DynamoDB query must supply the full partition key value (prefix matching is available only on sort keys), and the `increase_connection_ttl` and `disconnect` functions receive only the `connection_id`, not the `user_id`.

# SQS

# SQS vs Direct Integration Decision

# Why SQS Instead of Direct Integration?

We use SQS as an intermediary between the connect Lambda and offline message delivery for several important reasons:

  • Faster Connect: Users connect immediately without waiting for offline messages to be delivered
  • Reliability: DLQ, retries, no single point of failure
  • Scalability: Non-blocking, parallel processing, auto-scaling
  • Operational: Built-in monitoring, message isolation, easier debugging
  • Memory Safe: No timeout risks, efficient resource usage
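
The "faster connect" point can be sketched as a connect handler that only enqueues a delivery request and returns. The queue URL, the message body fields, and the handler name are hypothetical; the `sqs` argument is expected to behave like a boto3 SQS client, and a small stub stands in for it here so the sketch runs without AWS access.

```python
import json

# Hypothetical queue URL; the real one is environment-specific.
OFFLINE_QUEUE_URL = (
    "https://sqs.us-east-1.amazonaws.com/123456789012/ibp-ws-offline"
)


def on_connect(sqs, user_id, connection_id):
    """Accept the connection and defer offline delivery to a queue consumer."""
    sqs.send_message(
        QueueUrl=OFFLINE_QUEUE_URL,
        MessageBody=json.dumps(
            {"user_id": user_id, "connection_id": connection_id}
        ),
    )
    # Returning right away keeps $connect fast; a separate Lambda reads the
    # queue and pushes any stored messages to the new connection.
    return {"statusCode": 200}


class StubSqs:
    """In-memory stand-in that records what would be sent to SQS."""

    def __init__(self):
        self.sent = []

    def send_message(self, QueueUrl, MessageBody):
        self.sent.append((QueueUrl, MessageBody))


stub = StubSqs()
response = on_connect(stub, "user-1", "conn-abc")
```

Because the handler does no delivery work itself, a slow or failing delivery is retried from the queue (and eventually lands in the DLQ) instead of delaying or failing the connection.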

# SNS and Lambda concurrency usage

Kafka Integration (Planned): For high-throughput, batch message processing, we plan to integrate with the Kafka-based event-driven architecture: https://readme.ricohibp.com/-architectural-decisions/005-federated-architecture/


# Key Evaluation Metrics

  • Security: Authentication enforced via Bearer tokens; topic access scoped by realm and user ID.
  • Reliability: Offline message queuing via SQS ensures delivery even during client disconnects.
  • Developer Productivity: External services can publish messages using simple SNS integration; onboarding time reduced through clear client and server examples.
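
The security metric can be illustrated with a sketch of the two checks it implies: extracting a Bearer token and scoping a topic to the caller's realm or user ID. Token validation itself is out of scope here, and the `realm/<realm>/...` and `user/<user_id>/...` topic naming is a hypothetical convention, not the platform's actual scheme.

```python
def bearer_token(authorization_header):
    """Extract the token from an 'Authorization: Bearer <token>' value."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def may_subscribe(topic, realm, user_id):
    """Allow only topics under the caller's realm or own user ID.

    The topic prefixes checked here are assumptions made for this sketch.
    """
    return topic.startswith(f"realm/{realm}/") or topic.startswith(
        f"user/{user_id}/"
    )
```

A Lambda authorizer built along these lines would reject the connection when `bearer_token` returns `None` and drop any requested topic for which `may_subscribe` is false.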

# Conclusion

The serverless WebSocket architecture meets the requirements for scalable, secure, and reliable one-way messaging. It leverages AWS services to minimize operational overhead and maximize performance. The design decisions around DynamoDB and SQS ensure efficient message delivery and connection management, even in offline scenarios.

