Cloud-Native Migration Without the Guesswork: A Dependency-First Approach
When a cloud-native migration fails, the reason is almost always the same: the team understood the new architecture perfectly and understood the old system much less well than they thought.
This post is about reversing that order. Before the architecture diagram, before the infrastructure decisions, before a single line of new code: understand what the system you are replacing actually does. That understanding is not a preliminary step you get through quickly. It is the work. Everything else follows from it.
The Dependency Mapping Problem
Legacy APIs accumulate behavior the same way cities accumulate roads: incrementally, in response to immediate pressures, without a master plan. An endpoint added to support a mobile app in 2017 is still being called by three internal services and a batch job that nobody documented. A response format that was a temporary workaround became load-bearing when six downstream consumers adapted to it.
The documented architecture of a legacy API, if documentation exists at all, describes the system as it was designed. The access logs describe the system as it is used. Those two things diverge over time, and the divergence is where migrations fail.
Dependency mapping is the process of closing that gap. It requires three inputs:
Access logs, at least 90 days of them. Not just aggregate traffic numbers. The full request log: endpoint, caller identity, timestamp, response code, latency. You are looking for the calls that only happen on the 1st of the month, at 3am on Sundays, or once per quarter during a settlement run. These are the ones that will break silently if you miss them during the cutover.
Actual response shapes, not documented ones. Pull production responses for each endpoint and compare them against the specification. Legacy APIs drift. The documentation says a field is a string; production returns null for 2% of requests. Consumers have adapted to that null. Your new service needs to replicate it until those consumers are updated, or the migration will produce failures that are impossible to reproduce in staging.
A caller inventory. Cross-reference your access logs against every known consumer. Internal services, batch jobs, external integrations, mobile clients on old versions that cannot be force-updated. Every caller you cannot identify is a risk you are carrying into production.
For large polyglot codebases spanning mainframe COBOL, distributed Java services, and cloud-native extensions, this cross-system dependency tracing is where static analysis tools earn their place. Platforms like SMART TS XL construct comprehensive dependency graphs across languages and platforms, surfacing call relationships and data lineage that manual log analysis alone cannot reliably recover, particularly in environments where knowledge has walked out the door with retiring engineers.
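The access-log pass described above can be sketched in a few lines. This is a minimal illustration, not a production log pipeline: the entry shape (`caller`, `endpoint`, `timestamp`) and the function name `findRareCallers` are assumptions made for the example, and real logs would need parsing and endpoint normalization first.

```javascript
// Flags caller/endpoint pairs that were active on only a few distinct days
// in the analysis window. These are the monthly or quarterly jobs that
// aggregate traffic numbers hide.
// Assumed (hypothetical) entry shape: { caller, endpoint, timestamp },
// where timestamp is an ISO 8601 string.
function findRareCallers(entries, maxActiveDays = 3) {
  const pairs = new Map(); // JSON key -> { caller, endpoint, days: Set }
  for (const { caller, endpoint, timestamp } of entries) {
    const key = JSON.stringify([caller, endpoint]);
    if (!pairs.has(key)) {
      pairs.set(key, { caller, endpoint, days: new Set() });
    }
    // Bucket by UTC calendar day.
    pairs.get(key).days.add(new Date(timestamp).toISOString().slice(0, 10));
  }
  const rare = [];
  for (const { caller, endpoint, days } of pairs.values()) {
    if (days.size <= maxActiveDays) {
      rare.push({ caller, endpoint, days: [...days].sort() });
    }
  }
  return rare;
}
```

Every pair this returns is worth tracing to an owner before the proxy goes in. The threshold is a judgment call; over a 90-day window, three active days is enough to catch a monthly job.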
Why the Proxy Layer Comes Second
Once you have an accurate dependency map, the Strangler Fig pattern becomes genuinely useful. The proxy sits between all callers and both the legacy and new API, routing traffic based on a configuration you can change without a deployment.
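const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

// Build one proxy per backend at startup. Creating the proxy inside the
// request handler would construct a new proxy instance on every request.
function makeProxy(name, target) {
  return createProxyMiddleware({
    target,
    changeOrigin: true,
    // Match on the full path instead of mounting with app.use(path, ...):
    // http-proxy-middleware v3 (the version that takes the `on` option)
    // forwards req.url, which Express strips of the mount path.
    pathFilter: '/api/v1/orders',
    on: {
      proxyReq: (proxyReq) => {
        proxyReq.setHeader('X-Api-Source', `proxy-${name}`);
      }
    }
  });
}

const proxies = {
  legacy: makeProxy('legacy', process.env.LEGACY_API_URL),
  new: makeProxy('new', process.env.NEW_API_URL)
};

app.use(async (req, res, next) => {
  try {
    // Read the flag per request so a change takes effect without a deployment.
    // loadRoutingConfig is assumed to be backed by a cached feature flag client.
    const config = await loadRoutingConfig();
    const target = config.orders.useNew ? 'new' : 'legacy';
    return proxies[target](req, res, next);
  } catch (err) {
    return next(err);
  }
});

app.listen(3000);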
The X-Api-Source header matters more than it looks. It gives you a metric in both the legacy and new service logs that tracks exactly how much traffic is still arriving via the old path. When that number reaches zero across all time windows, including the monthly batch jobs you found in the dependency mapping step, the cutover window is open.
But none of this works if the proxy is placed in front of a system you do not yet understand. A proxy in front of a known system is a migration tool. A proxy in front of an unknown system is a liability that routes traffic to failures you cannot predict.
Canary Deployment: The Mechanics That Matter
Canary deployment during an API migration means routing a percentage of production traffic to the new service while the legacy endpoint handles the rest. The operational detail that most example implementations skip is deterministic routing by caller identity.
function routeRequest(req, config) {
  const endpoint = normalizeEndpoint(req.path);
  const endpointConfig = config[endpoint];
  if (!endpointConfig?.canaryEnabled) return 'legacy';

  // Pin the same caller to the same backend for the duration of the canary.
  // Random routing per request produces state inconsistencies in
  // any API that has side effects.
  const callerId = req.headers['x-user-id'] || req.headers['x-service-id'];
  if (callerId) {
    // stableHash must return a non-negative integer; a negative hash would
    // produce a negative bucket and route that caller to 'new' at any percentage.
    const bucket = stableHash(callerId) % 100;
    return bucket < endpointConfig.canaryPercentage ? 'new' : 'legacy';
  }

  // No caller identity available: this fallback is random per request,
  // so unidentified callers are not pinned to a backend.
  return Math.random() * 100 < endpointConfig.canaryPercentage ? 'new' : 'legacy';
}
If a caller's requests are split between backends within the same session, you get state inconsistencies: a write on the new service the legacy system does not see, or vice versa. Pinning a caller to one backend eliminates this class of error.
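The routing function above calls a stableHash helper that it does not define. One possible implementation is sketched here, assuming Node's built-in crypto module; the canaryBucket wrapper is a hypothetical name added for illustration. Any hash works as long as it is deterministic, non-negative, and spreads caller IDs reasonably evenly.

```javascript
const { createHash } = require('node:crypto');

// Deterministic, non-negative hash of a caller ID.
// Reads the first four bytes of a SHA-256 digest as an unsigned 32-bit integer.
function stableHash(value) {
  const digest = createHash('sha256').update(String(value)).digest();
  return digest.readUInt32BE(0);
}

// Maps a caller ID to a bucket in the range 0..99.
function canaryBucket(callerId) {
  return stableHash(callerId) % 100;
}
```

Because the bucket depends only on the caller ID, raising the canary percentage moves new callers to the new service without moving anyone already on it back to legacy.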
The progression should be tied to observed error rates, not to a schedule. A reasonable sequence is 1%, 5%, 10%, 25%, 50%, 100%, with a hold at each step until the new service's error rate matches the legacy baseline at equivalent load. A spike at any step rolls the percentage back; you investigate before proceeding.
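The progression rule can be written down as a small decision function. This is a sketch under stated assumptions: the function name, the tolerance, the minimum sample size, and the choice to roll back exactly one step on a spike are all illustrative, not prescribed by the sequence above.

```javascript
const STEPS = [1, 5, 10, 25, 50, 100];

// Decides the next canary percentage after a hold window.
// Assumptions (hypothetical defaults): error rates are fractions (0.01 = 1%),
// a "spike" means exceeding the legacy baseline by more than `tolerance`,
// and a spike rolls back one step (or to 0 from the first step).
function nextCanaryPercentage(
  { current, newErrorRate, legacyErrorRate, requests },
  { tolerance = 0.001, minRequests = 1000 } = {}
) {
  const index = STEPS.indexOf(current);
  if (index === -1) throw new Error(`Unknown canary step: ${current}`);

  // Roll back first: a spike overrides everything else.
  if (newErrorRate > legacyErrorRate + tolerance) {
    return index === 0 ? 0 : STEPS[index - 1];
  }
  // Hold: not enough traffic yet to compare at equivalent load.
  if (requests < minRequests) return current;

  // Advance one step, staying at 100 once reached.
  return STEPS[Math.min(index + 1, STEPS.length - 1)];
}
```

Nothing in the function refers to a date. The only way forward is observed parity over enough traffic, which is the point of tying the progression to error rates rather than a schedule.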
One step that dramatically reduces the risk of this progression is running the new service in shadow mode before enabling the canary at all. Shadow mode means the proxy sends every production request to both the legacy and new service, but only returns the legacy response to the caller. The new service processes the request and logs the result. You diff the outputs. This surfaces behavioral discrepancies under real production load before any user is affected.
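const { isDeepStrictEqual } = require('node:util');

async function shadowRequest(req, legacyProxy, newServiceUrl) {
  // The caller only ever waits on the legacy response.
  const legacyResponse = await forwardToLegacy(req, legacyProxy);

  // Fire-and-forget to the new service for comparison. The promise is not
  // awaited, so a slow or failing new service cannot delay the caller.
  // For write endpoints this assumes the new service writes to an isolated
  // store while shadowing; otherwise side effects would happen twice.
  forwardToNew(req, newServiceUrl)
    .then(newResponse => {
      logDiff({
        endpoint: req.path,
        callerId: req.headers['x-user-id'] || req.headers['x-service-id'],
        legacyStatus: legacyResponse.status,
        newStatus: newResponse.status,
        bodyMatch: isDeepStrictEqual(legacyResponse.body, newResponse.body),
        timestamp: Date.now()
      });
    })
    .catch(err => log.error('Shadow request failed', { err, path: req.path }));

  return legacyResponse;
}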
The diff log from shadow mode is one of the most valuable artifacts a migration team can produce. It tells you, before a single user is affected, exactly where the new service diverges from the legacy behavior. Some of those divergences are intentional improvements. Others are bugs. The log makes the distinction visible.
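To make the diff log actionable, it helps to roll it up per endpoint. The sketch below assumes entries with the fields logged by the shadow request code (endpoint, legacyStatus, newStatus, bodyMatch) and assumes endpoint has already been normalized so that resource IDs do not fragment the grouping. The function name summarizeDiffs is hypothetical.

```javascript
// Aggregates shadow-mode diff entries into a per-endpoint mismatch rate,
// sorted so the endpoints that diverge most come first.
function summarizeDiffs(diffEntries) {
  const byEndpoint = new Map();
  for (const entry of diffEntries) {
    const stats = byEndpoint.get(entry.endpoint) ?? { total: 0, mismatched: 0 };
    stats.total += 1;
    // A divergence is either a different status code or a different body.
    if (entry.legacyStatus !== entry.newStatus || !entry.bodyMatch) {
      stats.mismatched += 1;
    }
    byEndpoint.set(entry.endpoint, stats);
  }
  return [...byEndpoint]
    .map(([endpoint, stats]) => ({
      endpoint,
      ...stats,
      mismatchRate: stats.mismatched / stats.total
    }))
    .sort((a, b) => b.mismatchRate - a.mismatchRate);
}
```

The summary only says where to look. Deciding whether a given divergence is an intentional improvement or a bug still means reading the individual entries.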
Idempotency Across the Cutover Boundary
During a migration, requests can be processed by both the legacy and new service. A request times out at the proxy but was already processed upstream. A client retries on a network error without knowing whether the original succeeded. A canary routing decision changes between a client's retry attempts.
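The idempotency key pattern handles this. The client generates a unique key per logical operation and sends it as a header. Both services record processed keys in a store they share and return the cached response for duplicates. The store has to be shared: a key recorded only by the legacy service does nothing for a retry that the proxy routes to the new one.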
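async function idempotencyMiddleware(req, res, next) {
  const key = req.headers['x-idempotency-key'];
  if (!key) return next();

  try {
    // Replay the stored response if this key was already processed.
    const cached = await redis.get(`idem:${key}`);
    if (cached) {
      const { status, body } = JSON.parse(cached);
      return res.status(status).json(body);
    }
  } catch (err) {
    // Redis failure should not block the request
    log.warn('Idempotency cache read failed', { err, key });
    return next();
  }

  // Wrap res.json so the response is stored before it is sent.
  const originalJson = res.json.bind(res);
  res.json = async (body) => {
    if (res.statusCode < 500) {
      await redis.setex(
        `idem:${key}`,
        86400, // TTL in seconds (24 hours)
        JSON.stringify({ status: res.statusCode, body })
      ).catch(err => log.warn('Idempotency cache write failed', { err, key }));
    }
    return originalJson(body);
  };

  // Note: two requests with the same key that arrive concurrently both miss
  // the cache and are both processed; this middleware only covers retries
  // that arrive after the first response has been stored.
  next();
}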
Two implementation notes that are frequently overlooked:
The 24-hour TTL is a default, not a rule. If your dependency mapping found batch jobs with weekly retry windows, or settlement processes that retry across month boundaries, your TTL needs to cover those windows. A TTL shorter than the longest legitimate retry window means duplicate processing is still possible for exactly the callers you least want it to happen to.
The statusCode < 500 condition means server errors are not cached. If the new service fails, the next retry gets a fresh attempt. A cached 500 would make the failure look permanent to the client even after the underlying issue is fixed.
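One gap the middleware leaves open is the concurrent duplicate: two requests with the same key that both arrive before either response has been cached. Closing it requires claiming the key atomically before processing. The sketch below shows the idea with an in-memory stand-in so that it runs on its own; the class and method names are hypothetical. In Redis, the equivalent claim is typically a SET with the NX and EX options.

```javascript
// In-memory stand-in for a shared idempotency store (illustration only;
// a real deployment needs a store shared by both services).
class InMemoryIdempotencyStore {
  constructor() {
    this.entries = new Map(); // key -> { state, response, expiresAt }
  }

  // Returns true only for the first caller to claim a live key.
  claim(key, ttlMs, now = Date.now()) {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > now) return false;
    this.entries.set(key, { state: 'in-progress', expiresAt: now + ttlMs });
    return true;
  }

  // Records the final response so later duplicates can replay it.
  complete(key, response, ttlMs, now = Date.now()) {
    this.entries.set(key, { state: 'done', response, expiresAt: now + ttlMs });
  }

  // Returns the live entry for a key, or undefined if absent or expired.
  lookup(key, now = Date.now()) {
    const entry = this.entries.get(key);
    return entry && entry.expiresAt > now ? entry : undefined;
  }
}
```

A duplicate that loses the claim can look up the entry: if it is done, replay the response; if it is still in progress, one reasonable choice is to answer with a conflict status and let the client retry. The same TTL reasoning applies here as above, so the claim must outlive the longest legitimate retry window.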
What the Patterns Actually Miss
The Strangler Fig pattern, canary deployment, and idempotency handling are all well-documented. The failure mode that is less documented is the one that happens not because the patterns were applied incorrectly, but because the system being migrated encoded undocumented business logic that nobody knew was there.
This is not a theoretical risk. Legacy APIs that have been running in production for years accumulate conditional branches that reflect business decisions made and forgotten. A status code of 7 produces a different response shape. A specific combination of parameters triggers a code path that was intended as a temporary workaround in 2019 and became permanent when the tickets to remove it were never prioritized. A user class that was migrated from an older system carries a flag that routes them through a different calculation.
None of these appear in tests, because nobody knew to write the test. None appear in documentation, because nobody wrote that either. They appear in production, two weeks after a migration that passed every check.
The practical response is to treat the instrumentation of the legacy API as a first-class deliverable, not a monitoring afterthought. Before building the new service, capture the full request and response shapes of every endpoint under production load. Log every conditional branch that gets exercised. Build a behavioral profile of what the system actually does.
Shadow mode, described in the canary section above, is part of this. But behavioral profiling should begin earlier, before the proxy is deployed, as pure observation of the legacy system under real traffic. The goal is not monitoring. It is organizational memory: making explicit the tacit knowledge that lives only in the system's behavior and in the heads of engineers who may not be available when the migration encounters the edge case they would have recognized.
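A behavioral profile can start very small. The sketch below records which types each top-level response field actually takes in production and compares them with the documented type. The class name, the flat spec format (field name to type name), and the restriction to top-level fields are simplifying assumptions for illustration.

```javascript
// Distinguishes null and arrays, which typeof reports as 'object'.
function typeOf(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) return 'array';
  return typeof value;
}

// Accumulates observed field types for one endpoint's responses.
class ShapeProfile {
  constructor() {
    this.samples = 0;
    this.fields = new Map(); // field -> { typeName: count }
  }

  // Records the type of every top-level field in one response body.
  observe(body) {
    this.samples += 1;
    for (const [field, value] of Object.entries(body)) {
      const counts = this.fields.get(field) ?? {};
      const type = typeOf(value);
      counts[type] = (counts[type] ?? 0) + 1;
      this.fields.set(field, counts);
    }
  }

  // Lists fields observed with a type other than the documented one.
  // spec is a hypothetical flat map, e.g. { id: 'string', note: 'string' }.
  drift(spec) {
    const result = [];
    for (const [field, counts] of this.fields) {
      const unexpected = Object.keys(counts).filter(t => t !== spec[field]);
      if (unexpected.length > 0) {
        result.push({ field, documented: spec[field], observed: counts });
      }
    }
    return result;
  }
}
```

The drift report is where the "documented as a string, null in 2% of requests" cases show up. Fields that the spec does not mention at all appear too, with documented left undefined.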
Versioning the Transition
One decision that shapes the long-term maintenance cost of a migration is how to handle API versioning at the proxy layer. The temptation is to map the legacy paths directly to the new service's versioned endpoints:
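// Proxy rewrites legacy path to versioned new API.
// pathFilter matches against the full request path. Mounting with
// app.use('/api/orders', ...) would let Express strip that prefix from
// req.url, and in http-proxy-middleware v3 the pathRewrite rule below
// would then never match.
app.use(createProxyMiddleware({
  target: process.env.NEW_API_URL,
  changeOrigin: true,
  pathFilter: '/api/orders',
  pathRewrite: { '^/api/orders': '/api/v1/orders' },
  on: {
    proxyReq: (proxyReq) => {
      proxyReq.setHeader('X-Api-Source', 'legacy-proxy');
    }
  }
}));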
This works. It also creates a rewrite layer in the proxy that you will be maintaining until every legacy client is updated, which in practice can mean indefinitely. A cleaner position is to treat the proxy itself as the versioning layer for the duration of the transition, and only for that duration. Clients do not change. The new API is clean and versioned from the start. When the migration completes and legacy clients are updated or retired, the rewrite rules are removed and the proxy is decommissioned.
The X-Api-Source header in the rewrite above is how you track progress toward that decommission. When it stops appearing in the new service's logs across all time windows, including the off-hours operations you found in the dependency mapping step, the proxy's job is done.
The Cutover Decision
The decision to route 100% of traffic to the new service should be driven by four signals, not by a project deadline.
Error rate parity. The new service's error rate at full traffic should be within the same range as the legacy system's at equivalent load. Not zero errors: the legacy system probably did not have zero errors either.
Latency equivalence. Compare the 99th percentile, not just the median. Legacy systems often have consistent median latency but occasional outliers. The new system may trade those outliers differently. Know the shape of the distribution before the cutover, not after.
Business metric stability. For APIs that sit in transactional flows, downstream business metrics should hold through the entire canary window. A migration that degrades a conversion rate or an order completion rate is not a successful migration, regardless of whether the API is technically correct.
Legacy traffic at zero. The X-Api-Source metric goes to zero across all relevant time windows, including the periodic operations identified in the dependency mapping step.
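The four signals can be combined into a single readiness check. The sketch below is illustrative only: the input shape, the threshold names, and the use of order completion rate as the business metric are assumptions, and the percentile uses the simple nearest-rank method.

```javascript
// Nearest-rank percentile over raw latency samples.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Evaluates the four cutover signals. All field and threshold names are
// hypothetical. legacyRequestsByWindow maps each relevant time window
// (daily, monthly batch, quarterly settlement) to the count of requests
// still tagged as arriving via the old path.
function cutoverReadiness({ legacy, candidate, legacyRequestsByWindow, thresholds }) {
  const checks = {
    errorRateParity:
      candidate.errorRate <= legacy.errorRate + thresholds.errorRateTolerance,
    latencyEquivalence:
      percentile(candidate.latenciesMs, 99) <=
      percentile(legacy.latenciesMs, 99) * thresholds.p99Ratio,
    businessMetricStable:
      candidate.orderCompletionRate >=
      legacy.orderCompletionRate - thresholds.businessMetricTolerance,
    legacyTrafficZero:
      Object.values(legacyRequestsByWindow).every(count => count === 0)
  };
  return { ready: Object.values(checks).every(Boolean), checks };
}
```

Returning the individual checks alongside the verdict matters more than the verdict itself: when the answer is "not yet", the team needs to know which signal is holding the cutover back.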
After the cutover, the legacy system stays running but unrouted for a defined standby period, typically 30 days. The proxy rules remain in place, pointing entirely to the new service. The legacy system is the rollback path, and you want it warm until you are confident the rollback will not be needed.
A Realistic Timeline
Organizations that do this well follow a consistent sequence:
Before any code changes: 90-day access log analysis, contract documentation for all endpoints, caller inventory with contact information for external consumers.
Behavioral profiling: Instrument the legacy API for learning, not just monitoring. Run for at least two full business cycles before building anything.
Proxy deployment: Introduce the proxy routing 100% to legacy. Verify behavior is unchanged against the baseline metrics you have just established.
New service development: Build against the documented contracts, with idempotency middleware and structured logging from day one.
Shadow mode: Run the new service against production traffic without routing any users to it. Diff outputs. Investigate every discrepancy.
Canary progression: 1%, 5%, 10%, 25%, 50%, each with a hold through at least one full business cycle. Longer if the dependency mapping found monthly or quarterly operations.
Cutover and standby: 100% to the new service, legacy warm but unrouted for 30 days, then decommission.
The total timeline from proxy deployment to decommission is typically 10 to 18 weeks for a moderately complex legacy API, assuming the dependency mapping and behavioral profiling were completed first. Teams that skip those steps typically discover what they skipped somewhere around the 10% canary mark.
From Guesswork to Evidence
Cloud-native migration is not primarily an infrastructure problem. The infrastructure decisions are well-understood, and the patterns that handle traffic routing, versioning, idempotency, and rollback are all documented and battle-tested.
The hard part is the knowledge problem: understanding the system you are replacing well enough to replicate its behavior faithfully while you replace it. That understanding does not come from documentation, because the documentation has not kept up. It comes from the system's own behavior under production load, and from the tooling and discipline to capture that behavior systematically before the migration begins.
Teams that invest in that capture phase do not eliminate risk. They do something more useful: they convert unknown risks into known ones. Known risks can be planned for, tested against, and handled. Unknown risks appear as production incidents two weeks after a successful migration.
The dependency map, the behavioral profile, the shadow mode diff log: these are not preliminary steps on the way to the real work. They are the real work. The proxy configuration and the canary routing and the idempotency middleware are how you execute once you actually know what you are doing.

