Sector
Retail & E-Commerce
Engagement type
Fractional CTO — 6 months
Systems & platforms
Fastly (CDN / edge caching), Redis, LaunchDarkly
Business context
The previous Black Friday had not gone well. A widely circulated outage during the Friday evening email push took the site offline for just over an hour and cost a quantifiable amount of revenue, a less quantifiable amount of brand equity, and the CTO their job.
The business itself was in good shape — a pure-play online retailer doing mid-nine-figures in annual revenue, with a board-level commitment to “double down on trading” heading into the back half of the year. The replacement search for a permanent CTO was underway, but the board didn’t want to gamble the biggest trading window of the year on a new hire who would still be finding the coffee machine in October.
The challenge
The site wasn’t broken in any single obvious way. It was broken in eleven small ways that only manifested under load — a cache stampede on the category pages, a third-party tag that blocked the critical rendering path, a database connection pool sized for a quieter era, a feature flag service with no circuit breaker. Each issue was survivable on its own. Together, they were a queue of outages waiting for enough traffic to trip them.
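For a sense of what one of these looks like in code: a cache stampede is what happens when a hot key (a category page, say) expires and every concurrent request recomputes it at once. The guard that was missing is roughly a single-flight wrapper like the sketch below. This is illustrative only; the names and shape are hypothetical, not the codebase's actual fix.

```typescript
// Illustrative sketch: a single-flight guard against cache stampedes.
// When a hot key expires, only the first request recomputes it; concurrent
// requests await the same in-flight promise instead of all hitting the
// database at once. All names here are hypothetical.
const inFlight = new Map<string, Promise<string>>();

async function getWithSingleFlight(
  key: string,
  loadFromDb: () => Promise<string>,
  cacheGet: (k: string) => Promise<string | null>,
  cacheSet: (k: string, v: string) => Promise<void>,
): Promise<string> {
  const cached = await cacheGet(key);
  if (cached !== null) return cached;

  // Reuse the in-flight load if another request already triggered it.
  const existing = inFlight.get(key);
  if (existing) return existing;

  const load = (async () => {
    try {
      const value = await loadFromDb();
      await cacheSet(key, value);
      return value;
    } finally {
      inFlight.delete(key);
    }
  })();

  inFlight.set(key, load);
  return load;
}
```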
The engineering team knew most of this. They didn’t have permission — or bandwidth — to do anything about it while also shipping the roadmap.
My role
Fractional CTO, four days a week for the lead-up to peak, then scaled back to two days post-Boxing Day for the 90-day review and handover to the permanent hire. Full accountability for trading readiness, with a direct line to the CEO and a standing slot at the weekly exec meeting.
What I did
Scope cut (week 1). Cancelled or deferred every feature in flight that wasn’t already in user testing. Not popular, but non-negotiable — the platform needed to be hardened, not extended, for the next twelve weeks. Re-framed the trading window as a product in its own right, with its own backlog and its own definition of done.
Load testing with teeth (weeks 2–5). Established a repeatable load test against production-like infrastructure, calibrated against the previous peak’s traffic curve plus 40%. Every sprint shipped a test result, a list of what broke, and a fix. The first run fell over at 30% of target. The last run held at 140%.
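As an illustration of the pattern rather than the actual test suite, a staged ramp of this shape can be expressed in a few lines of k6. The tool, concurrency targets, thresholds and URL below are placeholders, not the real calibration.

```typescript
// Illustrative k6 load profile: warm up, hold at last year's peak, then
// hold at peak + 40%, and fail the run if the latency or error budget is
// breached. Numbers and URL are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '10m', target: 400 },  // warm up to a normal trading day
    { duration: '20m', target: 1000 }, // last year's peak concurrency
    { duration: '20m', target: 1400 }, // peak + 40%
    { duration: '10m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<1000'], // p95 under 1 second
    http_req_failed: ['rate<0.01'],    // under 1% errors
  },
};

export default function () {
  const res = http.get('https://staging.example-retailer.com/category/coats');
  check(res, { 'status is 200': () => res.status === 200 });
  sleep(1);
}
```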
Architecture hardening. Introduced edge caching on Fastly for the full category and product detail page tier. Added Redis as a proper first-class cache with well-defined TTLs and explicit invalidation — not the previous arrangement where half the codebase assumed Redis existed and the other half assumed it didn’t. Put LaunchDarkly kill switches in front of every third-party integration on the critical path so marketing’s next “quick tag add” couldn’t take the checkout down.
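The kill-switch pattern is simple enough to sketch: every third-party call on the critical path goes through a flag check, a hard timeout, and a safe fallback, so a slow or broken vendor degrades the page rather than taking checkout down. The sketch below is illustrative; the helper names and the LaunchDarkly wiring are placeholders, not the production code.

```typescript
// Illustrative kill-switch wrapper for vendor calls on the critical path.
// The flag lookup is abstracted behind a hypothetical isEnabled() helper;
// the real LaunchDarkly SDK wiring is not shown here.
async function withKillSwitch<T>(
  flagKey: string,
  isEnabled: (flag: string) => Promise<boolean>,
  call: () => Promise<T>,
  fallback: T,
  timeoutMs = 800,
): Promise<T> {
  // Kill switch flipped off: skip the vendor entirely.
  if (!(await isEnabled(flagKey))) return fallback;

  // Hard timeout: never let a third party hold the checkout hostage.
  const timeout = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), timeoutMs),
  );

  try {
    return await Promise.race([call(), timeout]);
  } catch {
    return fallback;
  }
}

// Hypothetical usage: a tax-estimation vendor on the checkout path.
// const taxes = await withKillSwitch('vendor-tax-service', flags.isEnabled,
//   () => taxVendor.estimate(cart), FALLBACK_FLAT_RATE);
```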
Incident response rebuild. Rewrote the runbooks. Ran game-day exercises where we deliberately injected production-style failures into the staging environment and timed the team’s response. The first drill took 47 minutes to diagnose. By the fourth, the team was ack-to-root-cause in under 10 minutes.
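The write-up doesn’t record how the failures were injected; as a hedged illustration only, one way to do it is a fault-injection wrapper the drill runner can flip on in staging, like the hypothetical sketch below.

```typescript
// Hypothetical fault-injection wrapper for game-day drills in a staging
// service. The drill runner flips a fault on; the wrapped dependency client
// then adds latency or throws, and the on-call team has to detect it.
type Fault =
  | { kind: 'latency'; ms: number }
  | { kind: 'error'; message: string }
  | null;

let activeFault: Fault = null;

// Called by the drill runner, never by application code.
export function setFault(fault: Fault): void {
  activeFault = fault;
}

// Wrap any async dependency call (payments, search, flags) so a drill can
// degrade it without touching the caller.
export function withFaultInjection<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const fault = activeFault;
    if (fault && fault.kind === 'latency') {
      await new Promise((resolve) => setTimeout(resolve, fault.ms));
    } else if (fault && fault.kind === 'error') {
      throw new Error(fault.message);
    }
    return fn(...args);
  };
}
```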
Contract renegotiation. Renegotiated the CDN and observability contracts — both were on auto-renewing terms from two growth stages ago. Saved enough to fund the peak engineering effort without asking the board for more money.
Outcomes
The four trading events of Q4 passed without a single unplanned outage. Checkout conversion was up 11% year-on-year during the peak window; not all of that is attributable to the platform, but the year-on-year reduction in checkouts abandoned to latency and error states was measurable. P95 checkout latency dropped from 4.8 seconds to under one second. Infrastructure cost per order fell 28% despite handling record volume.
The permanent CTO came in during January to a platform that was stable, a team that had just shipped the hardest quarter of its life, and a list of well-documented things to build next. That was the point.