How the ilert Team Achieved a Seamless Migration from Community MySQL to AWS RDS Aurora with Minimal Customer Impact
As our customer base and data demands grew exponentially over the years, scaling our database infrastructure became imperative. Our vision was to set up an active-active database architecture that would ensure regional independence and exceptional service quality globally. Here’s an in-depth look at how our team managed to migrate our production data to AWS RDS Aurora, incorporating cutting-edge strategies to minimize impact during the transitional phase.
Understanding the Challenge
Facing limitations with our existing Community MySQL setup, we needed a scalable, high-availability solution that could handle our increasing load and improve global data access. Our aim was to implement an active-active configuration with AWS RDS Aurora to facilitate regional independence and enhance global service delivery.
Step 1: Strategic Pre-Migration Planning
Preparation was the first key step. Our team meticulously examined the existing database system, mapping out all dependencies and specifications. We defined all infrastructure components as code using Terraform, which not only facilitated a smoother setup in AWS but also ensured consistency across our environments, which was crucial for reducing potential errors during the migration.
Our team undertook an exhaustive planning process, evaluating every aspect of our existing database configuration and query demands. This analysis helped us precisely define the compute and storage specifications for our Aurora setup.
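To give a rough idea of this sizing exercise, here is a minimal sketch (the hostname is a placeholder) of the kind of queries one might run against the existing MySQL instance to estimate data volume and throughput before choosing Aurora instance classes and storage:
# Estimate data and index size per schema (feeds storage sizing)
mysql -h mysql-primary.internal -e "
  SELECT table_schema,
         ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
  FROM information_schema.tables
  GROUP BY table_schema;"

# Snapshot server activity counters (feeds compute sizing)
mysql -h mysql-primary.internal -e "SHOW GLOBAL STATUS LIKE 'Questions';"
mysql -h mysql-primary.internal -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"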
Step 2: Configuring AWS RDS Aurora and Read-Only Services
We set up the AWS RDS Aurora cluster to meet our specifications for high performance and reliability. For a smooth switch, we also deployed read-only services connected to an Aurora read replica. This step was vital: it allowed our services to keep operating on a read-only basis, without disruption, while the main database was temporarily unavailable for writes.
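We provisioned these resources through Terraform; purely for illustration, here is a sketch of the equivalent AWS CLI calls (cluster identifier, region, credentials, and instance class are hypothetical) that create an Aurora MySQL cluster with a writer and a reader instance, and read back the reader endpoint used by the read-only services:
# Create the Aurora MySQL cluster (identifiers and credentials are placeholders)
aws rds create-db-cluster \
  --db-cluster-identifier ilert-aurora \
  --engine aurora-mysql \
  --master-username admin \
  --master-user-password '<secret>' \
  --region eu-central-1

# Add two instances; the first becomes the writer, the second serves as a reader
aws rds create-db-instance \
  --db-instance-identifier ilert-aurora-writer \
  --db-cluster-identifier ilert-aurora \
  --db-instance-class db.r6g.xlarge \
  --engine aurora-mysql

aws rds create-db-instance \
  --db-instance-identifier ilert-aurora-reader \
  --db-cluster-identifier ilert-aurora \
  --db-instance-class db.r6g.xlarge \
  --engine aurora-mysql

# The reader endpoint is what the read-only services point at
aws rds describe-db-clusters \
  --db-cluster-identifier ilert-aurora \
  --query 'DBClusters[0].ReaderEndpoint' --output text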
In parallel, we configured NGINX Ingress inside our Kubernetes clusters, which played a pivotal role in managing traffic during the migration. By defining specific rules in NGINX Ingress, we could direct traffic between our normal and read-only service instances, maintaining service availability even while the main database was unavailable for writes.
Here is an example of the service-a Ingress rule:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  labels:
    app: service-a
    ingress-class: nginx
  name: service-a
  namespace: default
spec:
  ingressClassName: nginx
  rules:
    - host: '*.ilert.com'
      http:
        paths:
          - backend:
              service:
                name: service-a
                port:
                  number: 9999
            path: /
            pathType: Prefix
  tls:
    - hosts:
        - '*.ilert.com'
And here is an example script to manage the traffic between read-only and normal instances:
# Move traffic to readonly instances
kubectl patch service service-a -p '{"spec":{"selector": {"app": "service-a-readonly"}}}'
# Move traffic back to the write instances
kubectl patch service service-a -p '{"spec":{"selector": {"app": "service-a"}}}'
Step 3: Timing and Executing the Migration
To minimize customer impact, we scheduled the migration during our lowest-traffic period. The switch to "read-only" mode for the main database lasted only 4 minutes. During this window, our applications interacted seamlessly with the read-only services connected to the Aurora replica, so data remained continuously available for reading.
Simultaneously, we initiated the final synchronization of the last batch of data from the MySQL database to the Aurora database. At the end of this process, the Aurora cluster was promoted to handle both read and write operations.
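The exact replication tooling is not covered here; assuming classic binlog replication from the external MySQL into Aurora, the cutover can be sketched roughly as follows (hostnames are placeholders, and the stored procedures shown are the RDS/Aurora MySQL ones):
# 1. Put the old primary into read-only mode (start of the 4-minute window)
mysql -h mysql-primary.internal -e "SET GLOBAL read_only = 1; SET GLOBAL super_read_only = 1;"

# 2. Wait until Aurora has applied the last binlog events
mysql -h ilert-aurora.cluster-xxxxxxxx.eu-central-1.rds.amazonaws.com \
  -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_SQL_Running'

# 3. Stop and clear replication so the Aurora cluster becomes the standalone primary
mysql -h ilert-aurora.cluster-xxxxxxxx.eu-central-1.rds.amazonaws.com \
  -e "CALL mysql.rds_stop_replication; CALL mysql.rds_reset_external_master;"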
Step 4: Switching Over with Minimal Disruption
Following the successful synchronization and promotion of the Aurora cluster, we switched live traffic from the read-only instances back to the normal service instances, which now pointed to the newly promoted Aurora cluster. The switch was handled carefully through updated NGINX Ingress rules, redirecting all traffic to the new Aurora setup, capable of handling both read and write operations.
Step 5: Monitoring and Optimization Post-Migration
Post-migration, our team engaged in meticulous monitoring to ensure the system was functioning as expected. We paid close attention to performance metrics such as query efficiency, CPU usage, and storage utilization. Continuous optimizations were applied to ensure that our queries were fully leveraging Aurora’s advanced capabilities.
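For illustration, metrics like these can be pulled straight from CloudWatch for the Aurora cluster (the cluster identifier and time range are placeholders); the same pattern works for other AWS/RDS metrics such as SelectLatency, CommitLatency, or VolumeBytesUsed:
# Average CPU utilization of the Aurora cluster over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBClusterIdentifier,Value=ilert-aurora \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average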
Conclusion
Migrating to AWS RDS Aurora with just a 4-minute read-only window exemplifies our team's commitment to operational excellence and minimal customer impact. Our detailed preparation, use of sophisticated tools like Terraform, and strategic execution enabled us to not only enhance our database performance but also prepare our global infrastructure to better serve our customers through an innovative active-active setup.
Today, we are successfully operating our Aurora Database Cluster in an active-active configuration across two independent regions, spanning six availability zones. This configuration not only boosts performance and ensures higher availability but also reduces latency for our global customer base.
Looking ahead, we are planning to scale our operations even further, enhancing our infrastructure's resilience and efficiency. Our journey with AWS Aurora is a testament to our ongoing commitment to leveraging cutting-edge technology to deliver the best possible service to our customers.