# 011 - Migrating Bulk Import ETL

By Alexander Zabielski

Last Modified: October 29th, 2025

Title: Migrating Bulk Import ETL from AWS Glue to Lambda Orchestrated by Step Functions

Status: Implemented

# Context

# What is the background to this decision?

Our current process for handling bulk data imports relies heavily on AWS Glue ETL jobs. While Glue provides powerful distributed processing capabilities, its use for smaller, more frequent data processing and import tasks has created several operational and cost challenges:

  • High Latency/Cold Starts: Glue jobs often experience significant cold start times (several minutes) due to spinning up the Spark environment, which negatively impacts the latency of our time-sensitive bulk imports.
  • Inefficient Cost Structure: Glue charges per DPU and applies a minimum billing duration (10 minutes on older Glue versions, 1 minute on Glue 2.0 and later), which often leads to overpaying for jobs that complete quickly (e.g., 5-10 minutes).
  • Operational Complexity: Managing the different environments, dependencies, and monitoring for a mix of Spark-based Glue jobs alongside our standard Python Lambda functions adds complexity for development and infrastructure teams.
  • Code Management: The infrastructure definition for Glue jobs is often separated from the core application code, complicating versioning and deployment.

This decision is needed to achieve a more cost-effective, faster, and operationally consistent serverless architecture for our bulk import processing workflows.
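The effect of a minimum billing duration on short jobs can be sketched with a little arithmetic. The function names below are hypothetical, and the durations are illustrative inputs rather than measurements from our jobs.

```python
def billed_seconds(runtime_s, minimum_billed_s):
    """Seconds actually billed when a service enforces a minimum billed duration."""
    return max(runtime_s, minimum_billed_s)


def overbilling_factor(runtime_s, minimum_billed_s):
    """Ratio of billed time to actual runtime (1.0 means no overpayment)."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return billed_seconds(runtime_s, minimum_billed_s) / runtime_s


if __name__ == "__main__":
    # Illustrative only: a 5-minute job under a 10-minute and a 1-minute minimum.
    print(overbilling_factor(300, 600))
    print(overbilling_factor(300, 60))
```

Under these assumptions, the shorter the job relative to the minimum, the larger the share of the bill that pays for unused time.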

# Decision

# What decision have you made?

We will migrate the functionality of existing AWS Glue ETL jobs responsible for bulk import processing to a new serverless architecture utilizing AWS Lambda functions for granular processing steps. These Lambda functions will be orchestrated by AWS Step Functions to manage the complete, end-to-end import workflow, replacing the existing Glue-based process.
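As a rough illustration of this orchestration, the sketch below builds an Amazon States Language definition as a Python dictionary. The step names, the Lambda ARN prefix, the retry settings, and the `ImportFailed` state are hypothetical placeholders, not the deployed workflow.

```python
import json

# Hypothetical ARN prefix; real function ARNs would come from the IaC outputs.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:bulk-import-"

# Hypothetical step names for a linear import workflow.
STEPS = ["Download", "Validate", "Transform", "Load"]


def task_state(name, next_state):
    """Build one ASL Task state that invokes a Lambda function."""
    state = {
        "Type": "Task",
        "Resource": ARN_PREFIX + name.lower(),
        # Retry failed attempts with exponential backoff (5 s, 10 s, 20 s).
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        # Anything still failing after the retries goes to a terminal Fail state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ImportFailed"}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition(steps):
    """Chain the steps in order and add a shared failure state."""
    states = {}
    for i, name in enumerate(steps):
        next_state = steps[i + 1] if i + 1 < len(steps) else None
        states[name] = task_state(name, next_state)
    states["ImportFailed"] = {
        "Type": "Fail",
        "Error": "BulkImportFailed",
        "Cause": "A step failed after exhausting its retries.",
    }
    return {
        "Comment": "Bulk import workflow (illustrative sketch)",
        "StartAt": steps[0],
        "States": states,
    }


definition_json = json.dumps(build_definition(STEPS), indent=2)
```

A string like `definition_json` is the kind of value a Terraform `aws_sfn_state_machine` resource accepts in its `definition` argument, although in practice the definition may equally be kept as a JSON template in the repository.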

The entire infrastructure for this new workflow (Lambda, Step Functions, and supporting resources) will be defined and deployed using Terraform, and all application code will be versioned within standard code repositories.

# Rationale

# Why did you choose this decision?

The combination of Lambda and Step Functions offers significant advantages for our bulk import use case, particularly when processing times are relatively short and the distributed power of Spark is not strictly necessary.

  1. Factors that influenced the decision:

    • Cost Efficiency: Lambda bills per millisecond of execution, dramatically reducing costs for short-duration jobs compared to Glue's DPU-hour billing with its minimum billed duration.
    • Performance: Lambda cold starts are typically sub-second for Python functions, and provisioned concurrency can largely eliminate them, so our import workflows begin almost instantly instead of waiting minutes for a Spark environment. This significantly lowers overall processing latency.
    • Granularity and Control: Step Functions allows us to break down the import process (e.g., download, validate, transform, load) into small, independent, and observable Lambda steps. This improves debugging and error handling.
    • Standardized IaC: By managing the full stack (Lambda code, Step Function State Machines) via Terraform, we ensure a single, consistent, and auditable Infrastructure as Code definition.
    • Consistent Environment: Standardizing on Python within Lambda removes the need to maintain and troubleshoot a separate Spark environment for small ETL tasks.
  2. Evidence/Research: Industry benchmarks indicate that for ETL jobs whose individual steps complete in under 15 minutes and that do not need massive parallel processing (which is typical for our bulk imports), the Lambda/Step Functions pattern is significantly more cost-effective and faster than Glue. Step Functions also provides richer visual workflow monitoring and declarative per-step retry logic than native Glue job runs.

  3. Strengths of the chosen solution:

    • Serverless Orchestration: Step Functions provides a robust, visual, and highly durable state machine for coordinating complex workflows, including branching, parallel execution, and sophisticated error handling.
    • Operational Consistency: Consolidates our serverless compute workloads under Lambda, simplifying monitoring and maintenance.
    • Deployment Automation: Leveraging Terraform ensures the Lambda code and the Step Function state machine definition are deployed together from a single versioned configuration, reducing configuration errors.
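To make the idea of small, independent, observable steps more concrete, here is a minimal sketch of what a validation step could look like as a Python Lambda handler. The required fields, the event shape, and the `ImportValidationError` class are assumptions made for illustration; they are not taken from the existing import code.

```python
# Hypothetical schema: every record must carry these non-empty fields.
REQUIRED_FIELDS = ("id", "name")


class ImportValidationError(Exception):
    """Raised for bad input so the state machine can route on the error name."""


def is_valid(record):
    """A record is valid when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def handler(event, context):
    """Validate the records in the event and pass them on to the next step."""
    records = event.get("records", [])
    invalid = [i for i, record in enumerate(records) if not is_valid(record)]
    if invalid:
        # An unhandled exception fails the Task state; a Catch rule in the
        # state machine could match on this exception's class name.
        raise ImportValidationError(
            f"{len(invalid)} invalid record(s) at indexes {invalid}"
        )
    return {"records": records, "validated_count": len(records)}
```

Passing records inline keeps the sketch short. Because Step Functions caps the payload passed between states at 256 KB, a real import step would more likely exchange an S3 object key than the data itself.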

# Implications

# What are the implications of this decision?

This change impacts multiple areas, requiring new tools and refined processes.

  1. People/Training:

    • Architecture & Development Teams: Must be trained on designing and implementing complex Step Function State Machines in Amazon States Language (ASL).
    • Infrastructure Teams: Must master using Terraform for deploying and managing Lambda functions and Step Function resources, including defining complex IAM roles for orchestration.
    • Code Management: Developers will use standard code repositories (e.g., Azure DevOps) to manage the Lambda function code, which will be packaged and deployed by Terraform.
  2. Process Adjustments:

    • CI/CD Pipeline: The build and deployment pipeline will be refactored to execute Terraform for IaC deployment. This pipeline must handle packaging Lambda code from the repository and referencing the deployment artifacts in the Terraform configuration.
    • Deprecation: A phased plan will be developed to identify, migrate, and then retire the existing AWS Glue jobs and their associated resources.
    • New Design Review: All new bulk import processes must be designed and reviewed using the Step Functions pattern.
  3. Tooling:

    • Terraform: Mandatory IaC tool for deployment.
    • Azure DevOps Repositories: Mandatory source control for all Lambda function logic.
    • AWS Step Functions Console: Will become the primary interface for monitoring the status and history of import workflows.
  4. Risks:

    • Lambda Limits: Lambda functions are limited in execution duration (15 minutes per invocation), memory (up to 10 GB), and ephemeral storage (up to 10 GB). Large, highly complex ETL tasks might still necessitate Glue or other solutions.
    • Step Functions Complexity: Designing, testing, and maintaining large, intricate State Machines can introduce new complexity if not done with clear standards and modularity.
    • Initial Migration Effort: Significant effort is required to refactor existing Glue jobs, which may involve rewriting Python/PySpark code into pure Python for Lambda.
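One common way to work within the 15-minute limit is to let a step stop early and report a cursor, so the state machine can invoke it again for the remainder. The sketch below assumes a hypothetical event shape (`records`, `cursor`), a placeholder `process_record` function, and an arbitrary 30-second safety margin. `get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
# Assumed safety margin: stop taking new work when less than this remains.
SAFETY_MARGIN_MS = 30_000


def process_record(record):
    """Placeholder for the real per-record transform/load logic."""
    return record


def handler(event, context):
    """Process records until done or until the time budget runs low."""
    records = event["records"]
    cursor = event.get("cursor", 0)
    while cursor < len(records):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # leave the rest for the next invocation
        process_record(records[cursor])
        cursor += 1
    # A Choice state could loop back to this step while "done" is False.
    return {"records": records, "cursor": cursor, "done": cursor >= len(records)}
```

For imports that split cleanly into independent batches, a Step Functions Map state that fans batches out to parallel invocations may be a better fit than this sequential continuation.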

# Trade-Offs

# What are the pros and cons of this decision?

  • Benefits:

    • Significant Cost Reduction for short-running import jobs.
    • Faster Execution due to near-zero cold start times.
    • Improved Observability through visual Step Functions workflow monitoring.
    • Consistent IaC via Terraform management of all components.
    • Simplified Code Management using standard repositories for Lambda code.
    • Higher Reliability through built-in state management and retry logic in Step Functions.
  • Drawbacks:

    • Initial Refactoring Work: Requires rewriting Spark/PySpark code into standard Lambda-compatible Python, and possibly a ground-up redesign of some workflows.
    • Lambda Constraints: Inability to handle jobs that require more than 15 minutes of compute time in a single step or that require the Spark ecosystem, neither of which we should need for typical bulk imports.
    • New Skill Set: Requires development and operations teams to learn Step Functions and ASL (Amazon States Language).
    • State Machine Overhead: Over-engineering simple workflows with Step Functions can introduce unnecessary complexity.

# Key Evaluation Metrics

# How will success be measured?

Success will be measured against the following criteria, each tied to a problem this decision is intended to solve.

  • Cost Reduction: Compare the monthly cost of the migrated workflows on Lambda and Step Functions against their previous AWS Glue billing (target: 80%+ reduction).
  • Execution Latency: Track the average end-to-end execution time for bulk import processes, targeting a reduction by eliminating Glue cold start times.
  • Adoption Rate: Percentage of new bulk import workflows implemented using the Lambda/Step Functions pattern within the next six months.
  • Error Visibility: Measure the average time required to diagnose a failed import job, expecting a reduction due to the clear step-by-step logs provided by Step Functions.
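The cost and latency criteria come down to comparing a baseline against a post-migration figure. A minimal sketch of that check follows; the function names are hypothetical, and any numbers passed in would come from billing and execution data rather than from this document.

```python
def percent_reduction(baseline, current):
    """Percentage by which `current` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) * 100.0 / baseline


def meets_target(baseline, current, target_pct=80.0):
    """True when the reduction reaches the target (80% is the stated cost goal)."""
    return percent_reduction(baseline, current) >= target_pct


if __name__ == "__main__":
    # Made-up monthly costs, purely to show the call shape.
    print(meets_target(baseline=1000.0, current=150.0))
```

The same helper works for execution latency by passing average end-to-end durations and a latency-specific target.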

# Conclusion

# What is the final recommendation?

We strongly recommend proceeding with the migration to the Lambda/Step Functions architecture for bulk import ETL. This shift provides a modern, event-driven, and highly cost-efficient solution that aligns with serverless best practices. The operational consistency provided by managing all code via repositories and infrastructure via Terraform will significantly improve our team’s agility and confidence in the deployment process.

The next steps include:

  1. Identifying a low-risk Glue job for a Proof of Concept (PoC) migration.
  2. Establishing Terraform modules for Step Functions and Lambda deployment.
  3. Training the team on State Machine design principles.
  4. Planning the phased migration of existing Glue jobs.

# References

  • AWS Step Functions Developer Guide
  • AWS Lambda Best Practices