Engineering Leadership · Jul 22, 2024 · 6 min read

On-Prem to Cloud Migration With Under an Hour of Downtime

Big-bang migrations fail loudly. Here's the incremental, reversible approach I use to move legacy systems to the cloud while the business keeps running.

Oshri Cohen

Chief Product & Technology Officer

Every catastrophic migration story starts the same way: a weekend cutover, a rollback plan nobody tested, and a Monday morning that becomes a week. I've re-platformed legacy enterprise systems to Kubernetes across Azure, AWS and GCP and kept downtime under an hour. Luck had nothing to do with it. I just never let anyone do a big bang.

Reversible, always

The governing rule is that every step must be reversible. If a change can't be rolled back in minutes, it gets broken into smaller changes until it can. That single constraint shapes everything else.

Strangle, don't replace: route traffic to new services incrementally behind a proxy, leaving the legacy path live.
Dual-write and verify: write to old and new data stores in parallel, comparing results before you trust the new one.
Shadow traffic: replay production load against the new system with no user impact until it earns confidence.
Cut over a slice: move one tenant, one region, one feature at a time, with an instant route back.
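
To make the dual-write step concrete, here is a minimal sketch. It is an illustration under assumptions, not the tooling from any of the migrations described here. The `DualWriter` class, its method names, the in-memory dicts standing in for the two data stores, and the thresholds in `ready_to_trust` are all hypothetical. The idea it shows is that legacy stays the source of truth while every read doubles as a comparison.

```python
from dataclasses import dataclass, field


@dataclass
class DualWriter:
    """Hypothetical wrapper: legacy is the source of truth, new is verified on read."""
    legacy: dict = field(default_factory=dict)      # stand-in for the legacy store
    new: dict = field(default_factory=dict)         # stand-in for the new store
    mismatches: list = field(default_factory=list)  # (key, legacy_value, new_value)
    reads: int = 0

    def write(self, key, value):
        # Write to both stores; callers only ever depend on the legacy result.
        self.legacy[key] = value
        self.new[key] = value

    def read(self, key):
        # Serve from legacy and compare against the new store on every read.
        self.reads += 1
        expected = self.legacy.get(key)
        actual = self.new.get(key)
        if expected != actual:
            self.mismatches.append((key, expected, actual))
        return expected

    def mismatch_rate(self):
        # Fraction of reads where the two stores disagreed.
        return len(self.mismatches) / self.reads if self.reads else 0.0

    def ready_to_trust(self, min_reads=1000, max_rate=0.0):
        # Assumed thresholds: enough compared reads, and no disagreement.
        return self.reads >= min_reads and self.mismatch_rate() <= max_rate
```

Because reads are always served from legacy, removing the wrapper is the rollback: nothing downstream has depended on the new store yet.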

Downtime is a function of batch size. Shrink the batch and the risk shrinks with it.
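
A small batch is easiest to see in the routing layer. The sketch below is one possible shape for slice-based cutover, assuming a proxy that asks a router which backend should serve each tenant. `SliceRouter`, its method names, and the `"legacy"` / `"new"` labels are hypothetical. The hash-based bucketing is my assumption about how a percentage slice might be chosen, not something prescribed above.

```python
import hashlib


class SliceRouter:
    """Hypothetical router: decides per tenant whether the new platform serves it."""

    def __init__(self):
        self.migrated_tenants = set()  # tenants explicitly moved to the new platform
        self.percent_on_new = 0        # share of remaining tenants routed by hash

    def _bucket(self, tenant_id):
        # Stable bucket in 0..99 so a tenant always lands on the same side.
        digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:2], "big") % 100

    def cut_over(self, tenant_id):
        self.migrated_tenants.add(tenant_id)

    def roll_back(self, tenant_id):
        # The instant route back for a single slice.
        self.migrated_tenants.discard(tenant_id)

    def roll_back_all(self):
        # Send everything back to the legacy path in one step.
        self.migrated_tenants.clear()
        self.percent_on_new = 0

    def backend_for(self, tenant_id):
        if tenant_id in self.migrated_tenants:
            return "new"
        if self._bucket(tenant_id) < self.percent_on_new:
            return "new"
        return "legacy"
```

Each call to `cut_over` moves exactly one tenant, so the blast radius of a bad step is one tenant and the fix is one call to `roll_back`.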

The cutover that wasn't an event

By the time the "final" cutover arrives, almost everything already runs on the new platform. The remaining switch is small, you've rehearsed it, and you can still walk it back. That's why the sub-hour window holds up in practice instead of being a number you hope for. On one year-long transit re-platform we hit zero downtime doing exactly this.

<1hr: Downtime on enterprise on-prem to cloud cutovers

Zero: Downtime on a year-long logistics re-platform

Three: Clouds in production (Azure, AWS, GCP)

The cloud isn't the hard part anymore. Doing the move without betting the business on a single weekend is, and that's entirely a function of how you sequence the work. Sequencing migrations like this is core to the fractional CTO work I do. If you have one looming, let's talk →
