The Art of the Migration: How I Moved 150K Users to a New Database Without Anyone Noticing
At Widespace, our WidePay platform was running on MySQL 5.7 — a solid database, but one that had accumulated years of technical debt: no foreign keys on critical tables, inconsistent indexing, and a character set that caused subtle bugs with Arabic names.
We needed to migrate to PostgreSQL 15. While the users slept.
Why We Couldn't Just Stop the World
WidePay processes financial transactions. That means:
- Zero downtime tolerance: The Saudi workaround for "the system is down" is calling your manager at 2 AM
- Data integrity is non-negotiable: Losing a single transaction record could mean regulatory penalties
- Arabic text support: MySQL's utf8 (not utf8mb4, just utf8) was silently corrupting Arabic names and addresses
The 8-Week Migration Playbook
Weeks 1-2: Shadow Writes
We set up a dual-write system where every write to MySQL also wrote to PostgreSQL. The PostgreSQL writes were asynchronous and non-blocking — if they failed, we logged but didn't crash.
# mysql, pg, and logger are the service's existing MySQL client, PostgreSQL
# client, and structured logger (wired up elsewhere in the app).
async def write_transaction(data):
    # Primary write to MySQL (synchronous, must succeed)
    mysql_result = await mysql.insert(data)

    # Shadow write to PostgreSQL (async, logged on failure)
    try:
        await pg.insert(data)
    except Exception as e:
        logger.error(f"Shadow write failed: {e}")
        # Alert on-call, but don't crash the transaction

    return mysql_result
This let us verify that PostgreSQL could handle our write patterns without any risk to production data.
Weeks 3-4: Consistency Verification
We built a nightly job that compared MySQL and PostgreSQL row counts and checksums. The initial results were humbling:
- 3,218 rows had different data between the two databases
- 47 rows were missing entirely from PostgreSQL
- 12 rows had Arabic text corruption in MySQL that was "fixed" during migration
We fixed every discrepancy before moving forward.
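For flavor, here's a minimal sketch of the kind of row-level comparison the nightly job boiled down to. The function names and the Python-side checksumming are illustrative, not our actual implementation; the real job also had to page through tables in batches.

import hashlib

def row_checksum(row: dict) -> str:
    """Checksum a row in a dialect-neutral way (computed in Python, not in SQL)."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def compare_tables(mysql_rows: list[dict], pg_rows: list[dict], key: str = "id"):
    """Return (missing_ids, mismatched_ids) relative to the MySQL source of truth."""
    mysql_sums = {r[key]: row_checksum(r) for r in mysql_rows}
    pg_sums = {r[key]: row_checksum(r) for r in pg_rows}

    missing = sorted(set(mysql_sums) - set(pg_sums))
    mismatched = sorted(
        rid for rid in set(mysql_sums) & set(pg_sums)
        if mysql_sums[rid] != pg_sums[rid]
    )
    return missing, mismatched

Computing the checksum in Python rather than in SQL keeps the comparison independent of each database's character-set and collation quirks, which matters when the corruption you're hunting is exactly character-set related.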
Weeks 5-6: Read Migration
We switched 10% of read traffic to PostgreSQL using a feature flag. Then 25%. Then 50%. Then 100%.
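The flag itself amounted to a percentage knob plus deterministic per-user bucketing, so a given user didn't flip between databases mid-session. A minimal sketch of that idea (the constant and function names are illustrative, not our real flag service):

import hashlib

READ_TRAFFIC_TO_PG_PERCENT = 10  # bumped to 25, 50, then 100 as confidence grew

def read_db_for(user_id: str) -> str:
    """Deterministically bucket a user into 0-99 and route low buckets to PostgreSQL."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "postgresql" if bucket < READ_TRAFFIC_TO_PG_PERCENT else "mysql"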
Each increase had a monitoring dashboard showing:
- Query latency (P50, P95, P99)
- Error rate
- Data consistency metrics
PostgreSQL was 40% faster on our most common queries. This wasn't surprising — our MySQL tables had no proper indexes for the query patterns we'd evolved over three years.
Week 7: The Big Switch
On a Friday night (lowest traffic in our Saudi market), we worked through the following steps (a rough sketch of the cutover follows the list):
- Enabled maintenance mode for writes (reads still served from PostgreSQL)
- Drained all pending MySQL writes
- Ran a final consistency check
- Switched writes to PostgreSQL
- Monitored for 4 hours
- Declared victory at 3 AM and went to sleep
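For the curious, the cutover amounted to something like the sketch below. Every helper here is an illustrative stub standing in for our real flag service, queue metrics, and the consistency checker from week 3; it shows the shape of the runbook, not the script we actually ran.

import asyncio

async def set_flag(name: str, value) -> None:
    print(f"flag {name} -> {value}")   # stand-in for the feature-flag service

async def pending_mysql_writes() -> int:
    return 0                           # stand-in for a queue-depth / in-flight-write check

async def final_consistency_check() -> bool:
    return True                        # stand-in for the nightly comparison job, run one last time

async def cut_over() -> None:
    await set_flag("writes_maintenance_mode", True)     # 1. pause writes; reads stay on PostgreSQL
    while await pending_mysql_writes() > 0:             # 2. drain anything still in flight to MySQL
        await asyncio.sleep(1)
    assert await final_consistency_check()              # 3. one last full comparison
    await set_flag("primary_write_db", "postgresql")    # 4. flip writes to PostgreSQL
    await set_flag("writes_maintenance_mode", False)    # 5. reopen, then watch the dashboards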
Week 8: MySQL Decommission
We kept MySQL running in read-only mode for two weeks as a safety net. Nobody needed it. We turned it off on a Saturday and deleted the EC2 instance that Sunday.
The Three Things That Saved Us
- Shadow writes BEFORE switching — We caught 47 write-path bugs before any real data was at risk
- Automated consistency checks — A human would never have caught all 3,218 discrepancies
- Feature flags for traffic shifting — Being able to roll back to MySQL in seconds gave everyone confidence
What I'd Do Differently
I'd start the consistency verification from day one instead of week 3. Those shadow writes were writing data incorrectly for two weeks, and we had to replay and re-verify everything. Lesson learned: verify as you shadow, not after.
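Concretely, that would mean a read-back check inside the shadow-write path itself, so a mismatch surfaces the moment it's written rather than weeks later. A rough sketch of the idea; pg.fetch_one, the table name, and the field access are all illustrative stand-ins, not our actual client API:

async def shadow_write_and_verify(data):
    # Shadow write to PostgreSQL as before
    await pg.insert(data)
    # Immediately read the row back and compare against what we meant to write
    stored = await pg.fetch_one("transactions", data["id"])
    if stored != data:
        logger.warning(f"Shadow mismatch for transaction {data['id']}")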