Last Modified: February 4th, 2026
#
ADR 013: Autocomplete Platform Selection
Title: Platform Selection for Autocomplete Functionality in IBP Orders
Status: Proposed
#
Context
#
What is the background to this decision?
IBP Orders requires autocomplete functionality across multiple use cases within the UI for various entities such as orders and users based on different attributes. Currently, IBP does not have a platform-level search solution to support these requirements.
Key Business Drivers:
- User Experience: Best practices recommend p99 response times under 200ms for autocomplete. Nielsen Norman Group research identifies 0.1 seconds as the limit for a response to feel instantaneous, which is critical for direct-manipulation UI patterns.
- Feature Requirements: The solution must support:
  - Wildcard/contains searching (prefix searching alone is insufficient)
  - Multiple entity indexes (Orders and Recipients/Users)
  - Complex search queries with filtering, grouping, and sorting capabilities
  - High performance at scale
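As an illustration of the contains-search requirement: Typesense exposes this through infix search, which is enabled per field in the collection schema and requested per query. The sketch below is only an assumption about how an Orders collection might look; the field names (`order_number`, `recipient_name`, `created_at`) are hypothetical and not taken from the IBP data model.

```python
# Hypothetical Orders collection schema; field names are illustrative assumptions.
orders_schema = {
    "name": "orders",
    "fields": [
        # "infix": True asks Typesense to build an infix index for the field,
        # which is what allows contains-style matching.
        {"name": "order_number", "type": "string", "infix": True},
        {"name": "recipient_name", "type": "string", "infix": True},
        {"name": "tenant_id", "type": "string"},
        {"name": "created_at", "type": "int64"},
    ],
    "default_sorting_field": "created_at",
}


def build_contains_query(text: str, tenant_id: str) -> dict:
    """Build search parameters for a contains-style autocomplete lookup."""
    query_fields = ["order_number", "recipient_name"]
    return {
        "q": text,
        "query_by": ",".join(query_fields),
        # One infix mode per query_by field: "off", "always", or "fallback".
        "infix": ",".join("always" for _ in query_fields),
        "filter_by": f"tenant_id:={tenant_id}",
        "sort_by": "created_at:desc",
        "per_page": 10,
    }
```

Using `fallback` instead of `always` would limit infix matching to queries where prefix matching finds nothing, which may be a cheaper default.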
Technical Challenges:
- Database-level solutions (PostgreSQL pg_trgm, MySQL n-gram) are CPU-intensive and do not scale adequately for the required performance targets and search requirements
- Existing database infrastructure cannot reliably meet p99 < 50ms server response time targets needed to achieve the 200ms end-to-end goal
#
Decision
#
What decision have you made?
We will deploy self-hosted Typesense as the autocomplete platform for IBP Orders, running on AWS ECS in a high-availability (HA) cluster configuration. Pre-production will run a single task.
Infrastructure
graph TB
    subgraph "Client Layer"
        User[👤 End Users]
        CF[☁️ Cloudflare CDN]
        Angular[🅰️ Angular App]
    end
    subgraph "AWS Cloud - us-east-1"
        subgraph "Edge/API Layer"
            APIGW[🚪 API Gateway<br/>REST API]
            LambdaAuth[λ IBP Authorizer<br/>Token Validation<br/>User Identity]
        end
        subgraph "Processing Layer"
            LambdaProxy[λ Proxy Function<br/>Add Tenant Filters<br/>Query Transformation]
        end
        subgraph "VPC - 10.0.0.0/16"
            subgraph "Load Balancing"
                ALB[⚖️ Application Load Balancer<br/>Internal<br/>Health Checks]
            end
            subgraph "Availability Zone 1a"
                ECSTask1[🐳 ECS Fargate Task 1<br/>Typesense Node<br/>1 vCPU, 4 GB RAM<br/>Leader/Follower]
                EFS1[📁 EFS Mount]
            end
            subgraph "Availability Zone 1b"
                ECSTask2[🐳 ECS Fargate Task 2<br/>Typesense Node<br/>1 vCPU, 4 GB RAM<br/>Follower]
                EFS2[📁 EFS Mount]
            end
            subgraph "Availability Zone 1c"
                ECSTask3[🐳 ECS Fargate Task 3<br/>Typesense Node<br/>1 vCPU, 4 GB RAM<br/>Follower]
                EFS3[📁 EFS Mount]
            end
            EFSStorage[(💾 EFS File System<br/>Shared Storage<br/>Multi-AZ)]
        end
        subgraph "Supporting Services"
            CloudWatch[📊 CloudWatch<br/>Logs & Metrics]
            Secrets[🔐 Secrets Manager<br/>API Keys]
            VPCLink[🔗 VPC Link<br/>Private Integration]
        end
    end
    subgraph "Raft Cluster Communication"
        Raft[⚡ Raft Consensus Protocol<br/>Port 8107<br/>Leader Election & Replication]
    end
    %% User Flow
    User -->|HTTPS| CF
    CF -->|Cached Assets| Angular
    Angular -->|API Requests<br/>+ Auth Token| APIGW
    %% API Gateway Flow
    APIGW -->|Authorize Request| LambdaAuth
    LambdaAuth -->|Validated<br/>User Context| APIGW
    APIGW -->|Forward Request<br/>+ User Info| LambdaProxy
    %% Lambda Proxy Flow
    LambdaProxy -->|Add tenant_id filter<br/>Transform query| VPCLink
    VPCLink -->|Private Network| ALB
    %% Load Balancer Flow
    ALB -->|Round Robin<br/>Health Check| ECSTask1
    ALB -->|Round Robin<br/>Health Check| ECSTask2
    ALB -->|Round Robin<br/>Health Check| ECSTask3
    %% EFS Storage
    ECSTask1 -.->|Mount /data| EFS1
    ECSTask2 -.->|Mount /data| EFS2
    ECSTask3 -.->|Mount /data| EFS3
    EFS1 -.->|Multi-AZ Replication| EFSStorage
    EFS2 -.->|Multi-AZ Replication| EFSStorage
    EFS3 -.->|Multi-AZ Replication| EFSStorage
    %% Raft Communication
    ECSTask1 <-->|Raft Peering<br/>8107| Raft
    ECSTask2 <-->|Raft Peering<br/>8107| Raft
    ECSTask3 <-->|Raft Peering<br/>8107| Raft
    %% Supporting Services
    LambdaProxy -.->|Get Config| Secrets
    ECSTask1 -.->|Send Logs/Metrics| CloudWatch
    ECSTask2 -.->|Send Logs/Metrics| CloudWatch
    ECSTask3 -.->|Send Logs/Metrics| CloudWatch
    ALB -.->|Send Metrics| CloudWatch
    %% Response Flow (dotted lines for clarity)
    ECSTask1 -.->|Search Results| ALB
    ECSTask2 -.->|Search Results| ALB
    ECSTask3 -.->|Search Results| ALB
    ALB -.->|Response| LambdaProxy
    LambdaProxy -.->|Filtered Results| APIGW
    APIGW -.->|JSON Response| Angular
    %% Styling
    classDef userLayer fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef apiLayer fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef computeLayer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef storageLayer fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef networkLayer fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef supportLayer fill:#f5f5f5,stroke:#424242,stroke-width:2px
    class User,CF,Angular userLayer
    class APIGW,LambdaAuth apiLayer
    class LambdaProxy,ECSTask1,ECSTask2,ECSTask3 computeLayer
    class EFSStorage,EFS1,EFS2,EFS3 storageLayer
    class ALB,VPCLink networkLayer
    class CloudWatch,Secrets,Raft supportLayer
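The three-node layout in the diagram follows from Raft's majority rule: a cluster stays writable only while more than half of its nodes are reachable. The sketch below works through that arithmetic and builds a nodes string in the `ip:peering_port:api_port` form that Typesense expects. The IP addresses are hypothetical, and the API port 8108 is the Typesense default rather than a value from this ADR.

```python
from typing import Iterable


def quorum(node_count: int) -> int:
    """Smallest number of nodes that forms a Raft majority."""
    return node_count // 2 + 1


def tolerated_failures(node_count: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return node_count - quorum(node_count)


def nodes_string(ips: Iterable[str], peering_port: int = 8107, api_port: int = 8108) -> str:
    """Comma-separated peer list, one ip:peering_port:api_port entry per node."""
    return ",".join(f"{ip}:{peering_port}:{api_port}" for ip in ips)


# Hypothetical private IPs inside the 10.0.0.0/16 VPC, one per Availability Zone.
example_peers = nodes_string(["10.0.1.10", "10.0.2.10", "10.0.3.10"])
```

With three nodes the cluster tolerates one failure; with two nodes it tolerates none, so a two-node deployment would not provide failover. Because Fargate tasks receive new IP addresses when they are replaced, the peer list would likely need to be regenerated whenever a task restarts.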
Autocomplete Flow
sequenceDiagram
actor User
participant CF as Cloudflare<br/>(Angular App)
participant APIGW as API Gateway
participant Auth as Lambda Authorizer
participant Proxy as Lambda Proxy
participant ECS as ECS Fargate<br/>(Typesense)
User->>CF: Types in search box
CF->>CF: Debounce input (300ms)
CF->>APIGW: POST /search<br/>Authorization: Bearer {token}<br/>Body: {query: "prod"}
APIGW->>Auth: Validate token
Auth->>Auth: Verify JWT<br/>Extract user_id, tenant_id
Auth-->>APIGW: Auth context<br/>{user_id, tenant_id, role}
alt Authentication Failed
APIGW-->>CF: 401 Unauthorized
CF-->>User: Show error
else Authentication Successful
APIGW->>Proxy: Forward request<br/>+ auth context
Proxy->>Proxy: Build Typesense query<br/>filter_by: tenant_id:=123
Proxy->>ECS: GET /collections/products/documents/search<br/>X-TYPESENSE-API-KEY: {admin_key}<br/>Params: {q: "prod", filter_by: "tenant_id:=123"}
ECS->>ECS: Search index with filter
ECS-->>Proxy: Search results<br/>[{id: 1, name: "Product A"}]
Proxy-->>APIGW: 200 OK<br/>Filtered results
APIGW-->>CF: Results
CF-->>User: Display suggestions
end
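The tenant-filtering step performed by the Lambda proxy in the flow above could be sketched as follows. This is a minimal illustration, not the actual proxy. The shape of `auth_context` and `body` mirrors the diagram, the function name is hypothetical, and the allowed `tenant_id` pattern is an assumption. The key point is that the filter comes only from the authorizer's context and never from client input.

```python
import re

# Assumed tenant_id format; adjust to whatever the IBP authorizer actually issues.
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]+")


def build_search_params(auth_context: dict, body: dict) -> dict:
    """Translate an authorized client request into Typesense search parameters."""
    tenant_id = str(auth_context["tenant_id"])
    if not _TENANT_ID.fullmatch(tenant_id):
        # Refuse anything that could alter the filter expression.
        raise ValueError("unexpected tenant_id format")
    return {
        "q": str(body.get("query", "")),
        # Any filter_by supplied by the client is deliberately ignored.
        "filter_by": f"tenant_id:={tenant_id}",
    }
```

Dropping client-supplied filters outright is the simplest way to avoid filter injection. If user-controlled filters are needed later, they would have to be parsed and validated before being combined with the tenant filter.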
#
Rationale
#
Why did you choose this decision?
This decision prioritizes long-term cost efficiency, operational control, and alignment with existing AWS infrastructure while maintaining excellent performance.
1. Performance Excellence
- Self-hosted Typesense delivers p99 latency of 20-50ms (same AZ) for wildcard searches, within the 50ms server-side target
- Supports all required query types: prefix, contains, fuzzy matching, and complex filtering
- Performance scales linearly with infrastructure; same latency characteristics as cloud version
- Cache hit rates of 70-85% on typical autocomplete patterns further improve user experience
- HA cluster eliminates single points of failure for internal tool reliability
2. Exceptional Cost Optimization
- Supports 3.5M records efficiently, with predictable costs for foreseeable growth to 10M+ records
3. Infrastructure Alignment
- Leverages existing AWS account, VPC, and IAM infrastructure
- ECS deployment aligns with current DevOps tooling and CI/CD pipelines
- Data remains in-house within controlled AWS environment
- Direct integration with existing observability stack (CloudWatch, Datadog, etc.)
4. Operational Control
- Full control over performance tuning, ranking algorithms, and index configuration
- Can implement custom analyzers and filters specific to IBP Orders domain
- Data remains within organizational control—no third-party dependency
- Direct access to all operational metrics and logs for debugging
- Ability to scale independently of SaaS provider tiers
5. Feature Set Alignment
- Provides typo tolerance (up to 2 typos), faceted search, complex filtering, and sorting, all of which are required capabilities
- Supports all IBP Order use cases: wildcard searching, multiple entity indexes, and complex grouping
- Open-source foundation enables custom feature development if needed
- No artificial limits on query throughput or data size
6. Future Flexibility and Scalability
- Linear scaling path: add more powerful instances or multi-region HA without architectural changes
- Open-source nature prevents vendor lock-in; can fork or migrate to OpenSearch if business needs diverge
- Skills and infrastructure investments directly benefit AWS ecosystem knowledge
- Option to implement advanced features (custom ranking, ML-based search, etc.) without SaaS limitations
7. Team Fit
- Leverages existing AWS and DevOps expertise within organization
- Infrastructure-as-Code (Terraform/CloudFormation) templates integrate with current practices
- Straightforward API comparable to cloud alternatives reduces integration complexity
- Excellent documentation and active open-source community support
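The p99 figures cited under Performance Excellence can be checked against measured latencies with a simple percentile calculation. The sketch below uses the nearest-rank method, which is one of several percentile definitions; monitoring tools may interpolate differently. The function names are hypothetical, and the 50ms default is the server-side target stated in this ADR.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_p99_target(samples_ms: Sequence[float], target_ms: float = 50.0) -> bool:
    """True when the p99 of the samples is at or below the server-side target."""
    return percentile(samples_ms, 99) <= target_ms
```

With 100 samples, a single slow request does not move the p99, but two slow requests do, which is why p99 targets need a reasonably large sample window to be meaningful.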
#
Implications
#
What are the implications of this decision?
1. People/Training
- Infrastructure expertise required: Team members (DevOps/SRE) need familiarity with ECS, auto-scaling, and infrastructure monitoring
- Developers need 4-8 hours for Typesense API and integration best practices (same as cloud option)
- Recommend 40-80 hours of architectural and implementation work upfront
- Plan 1-2 hours/month ongoing maintenance per DevOps engineer
2. Process Adjustments
- Infrastructure as Code: Develop Terraform/CloudFormation templates for cluster provisioning, backup, and disaster recovery
- Data Pipeline: Establish ETL process to sync Order and User entities to Typesense (daily or event-driven)
- Relevance Tuning: Initially configure relevance settings for different entity types; plan quarterly reviews based on user feedback
- Monitoring & Alerting: Integrate with CloudWatch, Datadog, or similar; set alerts for cluster health, disk space, memory usage, and p99 query latency exceeding 100ms
- Backup Strategy: Implement automated daily snapshots to S3; document recovery procedures
- Security: Manage API key rotation, network policies, and RBAC within VPC
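The Data Pipeline item above could look roughly like this at the document level. Typesense's bulk import endpoint (`POST /collections/{collection}/documents/import?action=upsert`) accepts newline-delimited JSON, so the sync job mainly has to map source rows to documents and serialize them. The source column names below are hypothetical and stand in for whatever the Orders tables actually contain.

```python
import json
from typing import Iterable


def to_document(order: dict) -> dict:
    """Map a source order row (hypothetical columns) to a Typesense document."""
    return {
        "id": str(order["order_id"]),  # Typesense document ids are strings
        "order_number": order["order_number"],
        "recipient_name": order["recipient_name"],
        "tenant_id": str(order["tenant_id"]),
    }


def to_import_payload(orders: Iterable[dict]) -> str:
    """Serialize orders as JSON Lines, one document per line, for a bulk upsert."""
    return "\n".join(json.dumps(to_document(order), sort_keys=True) for order in orders)
```

Using `action=upsert` keeps the job idempotent, so the same batch can be replayed after a failure without creating duplicates.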
3. Tooling
- AWS ECS: Container orchestration for Typesense deployment
- Terraform/CloudFormation: Infrastructure as Code for reproducible deployments
- Typesense JavaScript client library for backend/frontend integration
- AWS Secrets Manager: Secrets management for API keys and credentials
- CloudWatch/Datadog: Monitoring, logging, and alerting
- Data sync tooling: Develop ETL using Lambda, Glue, or managed message queues (SQS/SNS)
- Optional: Add front-end autocomplete component library (e.g., instantsearch.js-compatible solutions)
4. Risks and Mitigation
#
Trade-Offs
#
What are the pros and cons of this decision?
Benefits:
- ✅ Superior Long-Term Cost: $200-400/month regardless of search volume; breaks even vs. Typesense Cloud after ~10 months; dramatically cheaper than Algolia ($280-830/month) at scale
- ✅ Excellent Performance: 20-50ms p99 latency consistently beats requirements; scales linearly with infrastructure investment
- ✅ Full Operational Control: Custom tuning, data sovereignty, and zero vendor lock-in; infrastructure remains within organizational control
- ✅ Scalability: Linear scaling path without tier constraints; can handle 10M+ records and 1000+ QPS with vertical scaling
- ✅ AWS Integration: Native AWS services (ECS, CloudWatch, IAM, VPC) simplify operations; leverages existing DevOps expertise
- ✅ Future Flexibility: Open-source foundation enables forking, custom development, or seamless migration to OpenSearch if needed
- ✅ No Vendor Lock-In: Skills and investments transfer directly to broader DevOps/AWS ecosystem
- ✅ Modern Features: Built-in typo tolerance, faceted search, and complex query support with option for custom enhancements
Drawbacks:
- ❌ Higher Initial Setup Time: 40-80 hours upfront for architecture, IaC, and cluster provisioning (vs. 3-6 hours for cloud)
- ❌ Operational Overhead: Requires 1-2 hours/month maintenance (monitoring, patching, scaling decisions); lower than OpenSearch but higher than cloud solutions
- ❌ Infrastructure Complexity: Must manage HA across AZs, failover, backup/recovery, security patches, and incident response
- ❌ Operational Risk: Infrastructure failures become team responsibility; requires runbook documentation and incident response training
- ❌ DevOps Expertise Required: Team must have AWS, ECS, and observability stack competency; knowledge silos create risk
- ❌ Scaling Complexity: Requires proactive monitoring and planned scaling; unexpected traffic spikes may impact latency until scaled (vs. automatic scaling in cloud)
- ❌ No Built-in Analytics: Must develop custom tracking for search patterns, zero-result queries, and user behavior
- ❌ Single Region (Initial): Cross-region latency 60-120ms; multi-region HA requires significant additional infrastructure
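The break-even claim in the Benefits list rests on simple arithmetic: upfront setup cost divided by the monthly saving over the managed alternative. The sketch below shows the calculation only. The hourly rate and the managed-service price in the example are placeholder assumptions, not figures from the cost analysis; the 60 hours and $300/month are midpoints of the ranges quoted in this ADR.

```python
from typing import Optional


def break_even_months(upfront_cost: float, self_hosted_monthly: float, managed_monthly: float) -> Optional[float]:
    """Months until cumulative self-hosted cost drops below the managed option.

    Returns None when self-hosting is not cheaper per month, since it then never breaks even.
    """
    monthly_saving = managed_monthly - self_hosted_monthly
    if monthly_saving <= 0:
        return None
    return upfront_cost / monthly_saving


# Placeholder inputs for illustration only.
HOURLY_RATE = 100.0      # assumed engineering cost per hour
SETUP_HOURS = 60.0       # midpoint of the 40-80 hour estimate
MANAGED_MONTHLY = 900.0  # assumed managed-service price, not a quoted figure
example = break_even_months(SETUP_HOURS * HOURLY_RATE, 300.0, MANAGED_MONTHLY)
```

Substituting the actual quotes and loaded engineering rate would show how sensitive the break-even point is to the managed-service price.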
#
Key Evaluation Metrics
#
How will success be measured?
Scaling Decision Gate (Quarterly Review):
- If p99 latency > 100ms or CPU > 80%: Upgrade to the next compute tier (e.g., a larger Fargate task size than the current 1 vCPU / 4 GB)
- If search volume growth > 50% YoY: Plan vertical scaling; evaluate multi-region HA if global expansion needed
- If infrastructure costs exceed budget by > 15%: Review query patterns and optimize indexing strategy
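The quarterly gate above can be expressed as a small rule check so that the review is applied the same way each quarter. This is a sketch only; the metric names and the `QuarterlyMetrics` type are hypothetical, while the thresholds are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuarterlyMetrics:
    p99_latency_ms: float
    cpu_utilization_pct: float
    search_volume_growth_yoy_pct: float
    cost_over_budget_pct: float


def scaling_actions(metrics: QuarterlyMetrics) -> List[str]:
    """Return the actions triggered by the quarterly scaling decision gate."""
    actions = []
    if metrics.p99_latency_ms > 100 or metrics.cpu_utilization_pct > 80:
        actions.append("upgrade to next tier")
    if metrics.search_volume_growth_yoy_pct > 50:
        actions.append("plan vertical scaling")
    if metrics.cost_over_budget_pct > 15:
        actions.append("review query patterns and indexing strategy")
    return actions
```

An empty result means no threshold was crossed and no scaling action is due for that quarter.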
#
Cost Analysis - Self-Hosted Typesense
#
Infrastructure Costs
#
Scaling Cost Impact
#
Comprehensive Cost & Capacity Comparison
#
Small Cluster
#
Medium Cluster
#
Large Cluster
#
Conclusion
#
What is the final recommendation?
Deploy self-hosted Typesense on AWS ECS as IBP Orders' autocomplete platform.
This decision prioritizes long-term value creation and operational control while maintaining excellent performance:
- Superior Economics: $200-400/month infrastructure cost with minimal maintenance (Fargate compute is managed by AWS)
- Operational Control: Full transparency and customization; data remains within organizational control
- Technical Soundness: Exceeds all performance and feature requirements (< 50ms p99 latency); scales linearly to 10M+ records
- AWS Alignment: Leverages existing infrastructure, expertise, and tooling; no vendor lock-in
- OpenSearch Evaluation: Evaluate moving to OpenSearch once IBP Search is implemented
Why benefits outweigh challenges:
- $200-400/month fixed cost is dramatically cheaper than Algolia ($280-830/month) and Amazon OpenSearch Service, and provides better long-term value than Typesense Cloud
- Performance targets (< 50ms p99) are exceeded; HA configuration ensures reliability for internal tool
- DevOps overhead (1-2 hours/month) is reasonable given cost savings and organizational AWS expertise
- Open-source foundation and AWS integration enable future optimization without vendor constraints
#
Success Criteria
- ✅ Infrastructure deployed and tested across 3 AZs (matching the three-node cluster in the Infrastructure diagram) with automated failover
- ✅ Launch Orders autocomplete with p99 < 50ms in production
- ✅ Achieve < 5% zero-result search queries after relevance tuning
- ✅ Maintain > 99.5% uptime (HA validation during testing)
- ✅ Infrastructure cost tracking within 10% of $200-400/month forecast
- ✅ Operational team reports manageable 1-2 hours/month maintenance burden
- ✅ Feature-complete delivery within 12 weeks from decision
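Two of the criteria above translate directly into numbers that can be tracked. The sketch below converts the 99.5% uptime target into a downtime budget and checks a monthly bill against the $200-400 forecast with the 10% tolerance. The function names are hypothetical, and a 30-day month is assumed.

```python
def allowed_downtime_minutes(uptime_target: float, days: int = 30) -> float:
    """Downtime budget implied by an uptime target over a period of the given length."""
    return (1.0 - uptime_target) * days * 24 * 60


def within_forecast(actual_monthly: float, low: float = 200.0, high: float = 400.0, tolerance: float = 0.10) -> bool:
    """True when the actual monthly cost falls inside the forecast range widened by the tolerance."""
    return low * (1 - tolerance) <= actual_monthly <= high * (1 + tolerance)
```

At 99.5% over a 30-day month the budget is roughly 216 minutes (about 3.6 hours), and the cost check accepts anything between $180 and $440.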
#
References
- Performance Benchmarks: Tables 1-11 in solution_analysis.md provide detailed comparative analysis
- Typesense Documentation: https://typesense.org/docs/ and https://typesense.org/docs/guide/high-availability.html
- AWS EC2 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/
- Nielsen Norman Group Response Time Study: https://www.nngroup.com/articles/website-response-times/
- Cost Comparison Models: See solution_analysis.md TCO calculations
- Operational Complexity Analysis: Detailed in Operational Complexity section of solution_analysis.md
- AWS ECS Best Practices: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
- Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/