PQC Engineering Series: Deep Dive 5
Crypto-Agility Is a Runtime Property, Not a Compliance Checkbox
Post-quantum cryptography is often discussed as if the hard part were choosing the right algorithm.
That framing is dangerously incomplete.
The first finalized NIST post-quantum cryptography standards, published in August 2024, gave the industry a concrete cryptographic foundation: FIPS 203 for ML-KEM, derived from CRYSTALS-Kyber; FIPS 204 for ML-DSA, derived from CRYSTALS-Dilithium; and FIPS 205 for SLH-DSA, derived from SPHINCS+. These standards matter. They create a stable target for implementation, certification, procurement, and migration planning. They also remove one of the laziest excuses the industry had: “we are waiting for standardization.” (NIST)
But standards do not migrate systems.
Algorithms do not rotate themselves through legacy services, constrained devices, certificate authorities, service meshes, firmware pipelines, mobile applications, industrial gateways, HSMs, backup archives, telemetry streams, audit logs, and forgotten internal APIs that nobody has touched since the last infrastructure reorganization.
The real post-quantum problem begins after the algorithm is selected.
Crypto-agility is usually described as the ability to replace one cryptographic algorithm with another. That definition is too weak. It sounds clean on paper, which is exactly how bad engineering ideas usually enter polite society. A system is not crypto-agile because a configuration file contains an enum named Algorithm. It is not crypto-agile because a vendor dashboard lets someone select “PQC-ready” from a dropdown. It is not crypto-agile because a library claims support for ML-KEM while the surrounding protocol still assumes ECDH-shaped semantics, fixed certificate sizes, static identity binding, and classical operational timelines.
A system is crypto-agile only if it can safely change its cryptographic assumptions while preserving its security invariants.
That is a much harder property.
It means the system can move from one algorithm family to another without breaking identity, authentication, authorization, confidentiality, integrity, auditability, availability, rollback safety, operational recovery, or policy enforcement. It means migration is not treated as a one-time deployment event, but as a controlled state transition across the entire lifecycle of the system.
Crypto-agility is not a checkbox.
It is a runtime property.
The distinction matters because post-quantum migration is not merely about replacing RSA, ECDSA, ECDH, or classical Diffie-Hellman. It is about changing the assumptions under which distributed systems establish trust. Those assumptions appear everywhere: in handshake transcripts, certificate chains, session negotiation, firmware manifests, device provisioning, key wrapping, data retention policies, build pipelines, artifact signing, remote attestation, secure boot, message queues, service identity, and long-term archival confidentiality.
The algorithm is the visible artifact. The dependency graph is the real battlefield.
A system that cannot describe where cryptography is used cannot migrate safely. A system that cannot describe why cryptography is used cannot preserve semantics during migration. A system that cannot describe what security property each cryptographic primitive is supposed to enforce is not crypto-agile. It is simply cryptographically decorated.
This is where many post-quantum readiness programs quietly fail.
They begin with inventory. Inventory is necessary, but inventory alone is not enough. A cryptographic inventory that says “RSA-2048 is used here” is useful, but incomplete. Used for what? Transport authentication? Firmware signing? Database encryption? Token verification? Key wrapping? Long-term document integrity? Internal mTLS? Customer-facing TLS? Offline license validation? Root CA operations? Recovery keys? Legacy VPN tunnels? Inter-device trust?
The same primitive can represent completely different risk depending on context.
RSA used in an ephemeral transport context does not create the same migration pressure as RSA used to protect data that must remain confidential for twenty years. ECDSA used for short-lived service certificates does not carry the same implications as ECDSA embedded into firmware trust anchors deployed across industrial devices with fifteen-year lifespans. A signature algorithm used in CI/CD artifact provenance does not have the same operational constraints as a signature algorithm burned into a boot ROM trust model.
Post-quantum readiness requires semantic inventory, not just cryptographic inventory.
That means every cryptographic usage must be mapped to the property it protects, the asset it binds, the lifetime of that asset, the migration path available, the rollback behavior, the failure mode, and the authority responsible for changing it. Without that semantic layer, the organization does not have crypto-agility. It has a spreadsheet with anxiety attached.
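A semantic inventory entry can be pictured as a typed record rather than a spreadsheet row. The sketch below is illustrative only: the class name `CryptoUsage`, its fields, the helper `missing_semantics`, and the sample values are assumptions introduced here, not part of any standard or tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CryptoUsage:
    """One semantic inventory entry. The schema is hypothetical."""
    primitive: str             # e.g. "RSA-2048"
    protected_property: str    # property that fails if the primitive breaks
    bound_asset: str           # what the primitive binds or protects
    asset_lifetime_years: int  # how long the asset must stay protected
    migration_path: str        # empty string means none identified yet
    rollback_behavior: str
    failure_mode: str
    change_authority: str      # who may change this usage


def missing_semantics(entry):
    """Return the names of fields that are blank or non-positive."""
    gaps = []
    for f in fields(entry):
        value = getattr(entry, f.name)
        if value == "" or (isinstance(value, int) and value <= 0):
            gaps.append(f.name)
    return gaps


firmware_root = CryptoUsage(
    primitive="ECDSA-P256",
    protected_property="firmware authenticity",
    bound_asset="boot ROM trust anchor",
    asset_lifetime_years=15,
    migration_path="dual-signed manifests, then trust anchor update",
    rollback_behavior="anti-rollback counter blocks older images",
    failure_mode="device rejects the update and keeps the current image",
    change_authority="device security board",
)

# A bare "RSA-2048 is used here" row carries no semantics at all.
bare_row = CryptoUsage("RSA-2048", "", "", 0, "", "", "", "")

print(missing_semantics(firmware_root))  # complete entry, no gaps
print(missing_semantics(bare_row))       # every field except the primitive
```

The point of the gap check is that an entry with only a primitive name is flagged as incomplete instead of counted as inventory coverage.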
The core engineering question is not “do we support ML-KEM?”
The question is: can the system transition from one key establishment assumption to another without violating the security model of the protocol?
That transition must be modeled explicitly.
In a classical ECDH-based system, the protocol may assume certain key sizes, message sizes, CPU profiles, latency boundaries, certificate structures, and handshake flows. ML-KEM changes several of those operational properties. The cryptographic operation is different: in a KEM, one party encapsulates a secret to the other party's public key, rather than both parties contributing symmetric Diffie-Hellman shares. The message sizes are different: an X25519 public key is 32 bytes, while an ML-KEM-768 encapsulation key is 1,184 bytes and its ciphertext is 1,088 bytes. The failure modes, the performance profile, and the implementation risks are different as well. Even when the formal security goal appears similar, the system context around that goal changes.
This is especially important in hybrid deployments.
Hybrid cryptography is often presented as a safe intermediate state: combine classical and post-quantum mechanisms, derive a shared secret from both, and preserve security as long as at least one component remains secure. Conceptually, that is attractive. Operationally, it is a minefield wearing a lab coat.
A hybrid key exchange must prevent downgrade. It must bind the negotiated algorithms into the transcript. It must ensure both parties agree on the same cryptographic context. It must handle partial support without silently falling back into classical-only mode. It must define what happens when the PQC component fails, when the classical component succeeds, when peers disagree on supported groups, when middleboxes interfere, when old clients appear, and when policy requires stronger guarantees than compatibility allows.
The failure mode is not merely “connection failed.”
The dangerous failure mode is “connection succeeded under weaker assumptions than the system operator believed.”
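A minimal sketch of two of these requirements, refusing silent fallback and binding the negotiation into the derived key, might look like the following. Everything here is an assumption made for illustration: the group label `x25519+mlkem768`, the function names, and above all the combiner, which is a toy built on HMAC-SHA256 and not the key schedule of TLS or of any other protocol. A real deployment should use the combiner its protocol specifies.

```python
import hashlib
import hmac

HYBRID_GROUPS = {"x25519+mlkem768"}  # hypothetical label, not a registry name


class NegotiationError(Exception):
    """Raised instead of silently accepting a weaker configuration."""


def select_group(client_offers, server_supported, require_hybrid):
    """Pick a mutually supported group, preferring hybrid ones."""
    common = [g for g in client_offers if g in server_supported]
    hybrid = [g for g in common if g in HYBRID_GROUPS]
    if hybrid:
        return hybrid[0]
    if require_hybrid or not common:
        raise NegotiationError("no acceptable group; refusing to fall back")
    return common[0]


def transcript_hash(client_offers, server_supported, selected):
    """Hash everything both sides claimed, with length prefixes."""
    h = hashlib.sha256()
    for part in (",".join(client_offers), ",".join(server_supported), selected):
        data = part.encode()
        h.update(len(data).to_bytes(4, "big") + data)
    return h.digest()


def combine_secrets(classical_ss, pq_ss, transcript):
    """Illustrative combiner only: HMAC-SHA256 keyed by the transcript hash."""
    if not classical_ss or not pq_ss:
        raise NegotiationError("hybrid mode needs both shared secrets")
    material = b"".join(
        len(s).to_bytes(4, "big") + s for s in (classical_ss, pq_ss)
    )
    return hmac.new(transcript, material, hashlib.sha256).digest()


offers = ["x25519+mlkem768", "x25519"]
supported = ["x25519+mlkem768", "x25519"]
group = select_group(offers, supported, require_hybrid=True)

# If an attacker strips the hybrid offer in transit, the two sides hash
# different transcripts and therefore derive different keys.
honest = transcript_hash(offers, supported, group)
stripped = transcript_hash(["x25519"], supported, "x25519")
print(honest != stripped)
```

With `require_hybrid=True`, a classical-only peer produces a `NegotiationError` rather than a connection that quietly succeeded under weaker assumptions.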
That is where compliance-driven crypto-agility collapses. A compliance view asks whether the system supports an approved algorithm. A systems view asks whether every reachable execution path preserves the intended security property. Those are not the same thing. One produces procurement comfort. The other produces survivable infrastructure.
Crypto-agility must be understood as a state machine.
At minimum, a crypto-agile system has states such as classical-only, hybrid-supported, hybrid-required, PQC-preferred, PQC-required, deprecated-classical, revoked-classical, and emergency-rollback. Each state has allowed transitions. Each transition has preconditions. Each precondition depends on observed deployment coverage, client capability, certificate issuance readiness, telemetry confidence, operational fallback policy, and risk acceptance.
This is not bureaucracy. This is how one prevents migration from becoming a distributed outage generator.
For example, moving from hybrid-supported to hybrid-required should not be a marketing decision. It should depend on measurable properties: which clients support the required KEM, which services have been upgraded, which certificates or credentials encode the correct capabilities, which dependencies still terminate TLS using old stacks, which devices cannot be patched, and which communication paths are exempted under documented compensating controls.
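The idea of transitions gated by measurable preconditions can be sketched as a small table-driven state machine. This models only a subset of the states named above, and the `Evidence` fields, the function names, and the numeric thresholds are placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CLASSICAL_ONLY = "classical-only"
    HYBRID_SUPPORTED = "hybrid-supported"
    HYBRID_REQUIRED = "hybrid-required"
    PQC_REQUIRED = "pqc-required"


@dataclass(frozen=True)
class Evidence:
    """Observed deployment facts; field names are hypothetical."""
    hybrid_capable_clients: float  # fraction of clients seen, 0.0 to 1.0
    services_upgraded: float       # fraction of services, 0.0 to 1.0
    unexempted_legacy_paths: int   # legacy paths with no documented exemption
    ca_issuance_ready: bool


def may_support_hybrid(e):
    return e.ca_issuance_ready


def may_require_hybrid(e):
    # Thresholds are placeholders, not recommendations.
    return (
        e.hybrid_capable_clients >= 0.99
        and e.services_upgraded >= 1.0
        and e.unexempted_legacy_paths == 0
    )


# Only transitions listed here exist; everything else is rejected.
TRANSITIONS = {
    (State.CLASSICAL_ONLY, State.HYBRID_SUPPORTED): may_support_hybrid,
    (State.HYBRID_SUPPORTED, State.HYBRID_REQUIRED): may_require_hybrid,
}


def transition(current, target, evidence):
    """Return the new state, or raise if the move is undefined or premature."""
    check = TRANSITIONS.get((current, target))
    if check is None:
        raise ValueError(
            f"{current.value} -> {target.value} is not an allowed transition"
        )
    if not check(evidence):
        raise PermissionError(f"preconditions for {target.value} are not met")
    return target
```

Because the table is the only source of allowed moves, skipping directly from classical-only to PQC-required fails loudly, and moving to hybrid-required is a function of telemetry rather than of a calendar.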
Even rollback must be treated as a cryptographic state transition.
A rollback that re-enables classical-only negotiation may restore availability while destroying the security property the migration was meant to enforce. A rollback that preserves hybrid negotiation but disables strict policy may create a downgrade window. A rollback that changes certificate validation behavior may invalidate audit assumptions. A rollback that reintroduces legacy trust anchors may resurrect old attack surfaces.
The system must know the difference between operational recovery and security regression.
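One way to make that difference checkable is to diff the policy before and after a proposed rollback and report anything that weakens it. The sketch below assumes a deliberately tiny `Policy` shape and a hypothetical classification of `x25519` and `secp256r1` as classical-only groups; the trust anchor names are invented.

```python
from dataclasses import dataclass

CLASSICAL_ONLY_GROUPS = frozenset({"x25519", "secp256r1"})


@dataclass(frozen=True)
class Policy:
    allowed_groups: frozenset
    strict_hybrid: bool
    trust_anchors: frozenset


def rollback_regressions(before, after):
    """List the ways a rollback would weaken policy; empty means none found."""
    findings = []
    reopened = (after.allowed_groups - before.allowed_groups) & CLASSICAL_ONLY_GROUPS
    if reopened:
        findings.append(
            "classical-only groups re-enabled: " + ", ".join(sorted(reopened))
        )
    if before.strict_hybrid and not after.strict_hybrid:
        findings.append("strict hybrid policy disabled")
    resurrected = after.trust_anchors - before.trust_anchors
    if resurrected:
        findings.append(
            "trust anchors reintroduced: " + ", ".join(sorted(resurrected))
        )
    return findings


current = Policy(
    frozenset({"x25519+mlkem768"}), True, frozenset({"root-2025"})
)
# Reverting a software release while keeping the same cryptographic policy.
recovery = Policy(
    frozenset({"x25519+mlkem768"}), True, frozenset({"root-2025"})
)
# Reverting to an older configuration bundle.
regression = Policy(
    frozenset({"x25519+mlkem768", "x25519"}),
    False,
    frozenset({"root-2025", "root-2012"}),
)

print(rollback_regressions(current, recovery))
print(rollback_regressions(current, regression))
```

An empty result suggests operational recovery; a non-empty one is a security regression that should require explicit emergency authorization rather than a routine revert.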
This is why crypto-agility belongs in runtime policy, not just build-time configuration.
A runtime crypto-agility layer should be able to answer concrete questions while the system is operating. Which cryptographic policies are active? Which peers negotiated which algorithms? Which services are still accepting deprecated primitives? Which certificates were issued under legacy assumptions? Which devices cannot support the required transition? Which data flows remain exposed to store-now-decrypt-later risk? Which logs prove that a specific security boundary was enforced at a specific time?
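Two of those questions can be answered directly from handshake records, provided the system actually records what was negotiated. The record shape, the service and peer names, and the deprecation list below are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

DEPRECATED_GROUPS = {"x25519", "ffdhe2048"}  # hypothetical deprecation list


@dataclass(frozen=True)
class HandshakeRecord:
    service: str
    peer: str
    group: str  # key establishment group that was actually negotiated


def services_accepting_deprecated(records):
    """Which services completed at least one handshake with a deprecated group?"""
    return sorted({r.service for r in records if r.group in DEPRECATED_GROUPS})


def groups_by_peer(records):
    """Which peers negotiated which groups?"""
    seen = defaultdict(set)
    for r in records:
        seen[r.peer].add(r.group)
    return dict(seen)


records = [
    HandshakeRecord("billing", "mobile-app-7.2", "x25519+mlkem768"),
    HandshakeRecord("billing", "batch-exporter", "x25519"),
    HandshakeRecord("telemetry", "gateway-eu-1", "x25519+mlkem768"),
]

print(services_accepting_deprecated(records))
print(groups_by_peer(records))
```

The queries are trivial on purpose. The hard part is upstream: getting every termination point to emit the negotiated parameters instead of only "connection succeeded".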
Without runtime visibility, migration becomes theater.
The industry loves theater, naturally. Dashboards, maturity models, readiness percentages, executive summaries, and the sacred green checkmark. But post-quantum migration is hostile to vague confidence. Either the system can prove what happened at the cryptographic boundary or it cannot.
This is especially true for long-lived systems.
In cloud-native environments, migration can sometimes be accelerated through automated deployment, service mesh policy, certificate rotation, and centralized observability. That is still difficult, but at least the infrastructure pretends to be modern. In industrial systems, aerospace systems, medical devices, embedded platforms, and IIoT deployments, the situation is worse in a way that deserves its own small monument to human optimism.
Devices may remain in the field for ten, fifteen, or twenty years. Some cannot receive frequent firmware updates. Some rely on constrained processors, limited memory, fragile bootloaders, narrow communication channels, or vendor-specific protocols. Some use cryptographic assumptions embedded during manufacturing. Some are physically inaccessible. Some are connected to operational processes where downtime is expensive or dangerous.
For these systems, post-quantum migration cannot be improvised later.
The cryptographic lifecycle must be designed before deployment. Device identity must be renewable. Firmware verification must support algorithm transition. Secure boot chains must account for larger signatures and different verification costs. Provisioning must avoid permanent dependence on a single primitive. Remote attestation must bind not only device state, but also cryptographic capability. Recovery procedures must avoid reintroducing legacy trust as a convenience mechanism.
The brutal truth is simple: many systems deployed today are already accumulating post-quantum technical debt.
That debt will not appear as a bug report. It will appear as an inability to migrate, an inability to prove exposure, an inability to rotate identity, an inability to update trust anchors, an inability to preserve confidentiality guarantees for retained data, and an inability to distinguish secure fallback from silent downgrade.
The post-quantum threat is not only future quantum computation.
It is present architectural rigidity.
Store-now-decrypt-later makes this even more urgent. Any data encrypted today with quantum-vulnerable key establishment may be captured now and decrypted later when sufficient quantum capability exists. This changes the meaning of confidentiality over time. Data with long-term sensitivity must be protected before the cryptanalytic capability arrives, not after a press release announces that the threat has become convenient enough for budget approval.
This is why migration timelines must be tied to data lifetime.
If a dataset must remain confidential for twenty years, and migration takes five years, then quantum attacks must stay impractical for twenty-five years from today for that data to be safe. Waiting until quantum attacks are practical is already a failure. The relevant question is not “when will a cryptographically relevant quantum computer exist?” The relevant question is “will the protected data still matter when one does?”
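This reasoning is often summarized as Mosca's inequality: if the required confidentiality lifetime plus the migration time exceeds the time until a cryptographically relevant quantum computer exists, the data is exposed. A small worked version follows. The function names are invented for this sketch, and `years_until_crqc` is an input assumption supplied by the reader, not a forecast.

```python
def exposure_years(confidentiality_years, migration_years, years_until_crqc):
    """Years of required confidentiality left unprotected.

    Data encrypted just before migration completes must stay secret until
    migration_years + confidentiality_years from today. Any part of that
    span beyond years_until_crqc is exposed. Zero or negative means none.
    """
    return confidentiality_years + migration_years - years_until_crqc


def must_start_now(confidentiality_years, migration_years, years_until_crqc):
    """True when waiting any longer already implies exposure."""
    return exposure_years(
        confidentiality_years, migration_years, years_until_crqc
    ) > 0


# The example from the text, under an assumed 15-year horizon.
print(exposure_years(20, 5, 15))   # 10
print(must_start_now(20, 5, 15))   # True

# A short-lived session token under the same assumed horizon.
print(must_start_now(1, 1, 15))    # False
```

The useful property of this framing is that the answer for the twenty-year dataset stays "start now" for any assumed horizon shorter than twenty-five years, so the argument does not depend on predicting the arrival date precisely.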
PQC readiness is therefore a time-horizon problem.
Every system should classify cryptographic dependencies by the lifetime of the protected asset, the expected lifetime of the deployed component, the difficulty of migration, and the exposure value to a future adversary. A database backup has a different risk profile from a live API session. A firmware signing root has a different risk profile from an internal short-lived access token. A satellite component, industrial sensor, or medical device has a different migration profile from a containerized backend service.
Crypto-agility must encode those differences.
A mature crypto-agile architecture should include at least five layers.
The first layer is discovery. The system must identify cryptographic primitives, libraries, protocols, certificates, keys, hardware dependencies, and trust anchors.
The second layer is semantics. The system must understand what each cryptographic use protects and what property would fail if the primitive became unsafe.
The third layer is policy. The system must define which algorithms, modes, key sizes, certificate profiles, negotiation paths, and fallback behaviors are allowed in each operational context.
The fourth layer is enforcement. The system must prevent invalid cryptographic states at runtime, not merely document them after the fact.
The fifth layer is evidence. The system must produce logs, attestations, traces, or audit artifacts showing that the intended policy was actually enforced.
Without all five, crypto-agility remains aspirational.
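The policy, enforcement, and evidence layers can be sketched together in a few lines. This is a toy under stated assumptions: the context names, the group labels, the `POLICY` table, and the `EvidenceLog` class are all hypothetical, and the hash chain is tamper-evident only if its latest hash is also anchored somewhere the writer cannot modify. A production record would also carry timestamps and peer identity.

```python
import hashlib
import json

# Policy layer: which groups each operational context may negotiate.
POLICY = {
    "internal-mtls": {"x25519+mlkem768"},
    "public-tls": {"x25519+mlkem768", "x25519"},
}
GENESIS = "0" * 64


class EvidenceLog:
    """Evidence layer: append-only log where each entry hashes its predecessor."""

    def __init__(self):
        self.entries = []
        self._head = GENESIS

    @staticmethod
    def _digest(prev, record):
        body = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev + body).encode()).hexdigest()

    def append(self, record):
        digest = self._digest(self._head, record)
        self.entries.append({"prev": self._head, "record": record, "hash": digest})
        self._head = digest

    def verify(self):
        """Recompute the chain and report whether every link still matches."""
        prev = GENESIS
        for entry in self.entries:
            expected = self._digest(prev, entry["record"])
            if entry["prev"] != prev or expected != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


def enforce(context, group, log):
    """Enforcement layer: record the decision, then reject disallowed groups."""
    allowed = group in POLICY.get(context, set())
    log.append({"context": context, "group": group, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{group} is not permitted in {context}")
```

Note the ordering inside `enforce`: the decision is logged before the rejection is raised, so denied attempts leave evidence too. An unknown context has an empty allow-list and is therefore denied by default.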
The most dangerous missing layer is usually enforcement.
Organizations often know what they would like their systems to do. They may even document approved cryptographic baselines. But the runtime system continues accepting deprecated algorithms, old certificates, weak negotiation paths, unmanaged keys, legacy clients, and unverified dependencies because enforcement would break something. And breaking something would require ownership. And ownership would require engineering. A tragic sequence, obviously.
This creates the classic security gap: policy says one thing, runtime permits another.
Post-quantum migration makes this gap worse because partial deployment is unavoidable. There will be old clients and new clients. Old devices and new devices. Classical services and hybrid services. PQC-capable libraries and wrappers that expose only classical abstractions. Certificates with different profiles. HSMs with different firmware support. Gateways translating between worlds. During this period, the system will contain mixed cryptographic states.
Mixed states are not automatically insecure.
Unmodeled mixed states are.
A partial migration is safe only if the system knows which combinations are allowed, which are temporary, which are forbidden, and which require compensating controls. Otherwise, the migration becomes an accidental protocol negotiation experiment, and the adversary gets to grade the paper.
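Modeling mixed states can be as simple as an explicit table with a default of "forbidden" for anything not listed. The capability labels, the verdicts assigned to each pair, and the sunset date below are assumptions chosen to illustrate the shape, not a recommended policy.

```python
from datetime import date
from enum import Enum


class Verdict(Enum):
    ALLOWED = "allowed"
    TEMPORARY = "temporary"
    COMPENSATED = "requires compensating controls"
    FORBIDDEN = "forbidden"


# (client capability, server capability) -> (verdict, sunset date or None)
MATRIX = {
    ("hybrid", "hybrid"): (Verdict.ALLOWED, None),
    ("classical", "hybrid"): (Verdict.TEMPORARY, date(2027, 1, 1)),
    ("hybrid", "classical"): (Verdict.COMPENSATED, None),
    ("classical", "classical"): (Verdict.FORBIDDEN, None),
}


def evaluate(client, server, today):
    """Look up a combination; unmodeled or expired combinations are forbidden."""
    verdict, sunset = MATRIX.get((client, server), (Verdict.FORBIDDEN, None))
    if verdict is Verdict.TEMPORARY and today >= sunset:
        return Verdict.FORBIDDEN
    return verdict


print(evaluate("classical", "hybrid", date(2026, 6, 1)))
print(evaluate("classical", "hybrid", date(2027, 1, 1)))
print(evaluate("pqc-only", "hybrid", date(2026, 6, 1)))
```

Two design choices carry the argument: temporary states expire on their own instead of lingering until someone remembers them, and a combination nobody modeled is rejected rather than negotiated.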
This is where formal methods become practical rather than decorative.
Crypto-agility can be specified as a set of invariants. For example: no session may be established using a deprecated key exchange after a policy cutoff date; no firmware artifact may be accepted unless signed by an approved algorithm under an active trust root; no hybrid negotiation may complete unless both classical and post-quantum components are transcript-bound; no rollback may re-enable classical-only mode without explicit emergency policy; no device identity may remain valid beyond its cryptographic capability declaration; no retained high-sensitivity data may be encrypted under a key hierarchy rooted only in quantum-vulnerable establishment.
These are not philosophical statements. They are properties that can be modeled, tested, monitored, and in some cases formally verified.
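Two of the invariants listed above can be written as executable predicates over a session record, which is the first step toward testing and monitoring them. The `Session` fields, the deprecated set, and the cutoff date are hypothetical, and "after the cutoff" is interpreted here as on or after that date.

```python
from dataclasses import dataclass
from datetime import date

DEPRECATED_KEX = {"x25519"}   # hypothetical deprecation list
CUTOFF = date(2030, 1, 1)     # hypothetical policy cutoff


@dataclass(frozen=True)
class Session:
    key_exchange: str
    is_hybrid: bool
    established: date
    binds_classical: bool  # classical component is covered by the transcript
    binds_pq: bool         # post-quantum component is covered by the transcript


def no_deprecated_kex_after_cutoff(s):
    """No session uses a deprecated key exchange on or after the cutoff."""
    return not (s.key_exchange in DEPRECATED_KEX and s.established >= CUTOFF)


def hybrid_is_transcript_bound(s):
    """A hybrid session must bind both components into the transcript."""
    return (not s.is_hybrid) or (s.binds_classical and s.binds_pq)


INVARIANTS = [no_deprecated_kex_after_cutoff, hybrid_is_transcript_bound]


def violations(session):
    """Return the names of every invariant the session breaks."""
    return [check.__name__ for check in INVARIANTS if not check(session)]


good = Session("x25519+mlkem768", True, date(2031, 3, 1), True, True)
late_classical = Session("x25519", False, date(2031, 3, 1), True, False)
unbound_hybrid = Session("x25519+mlkem768", True, date(2026, 3, 1), True, False)

print(violations(good))
print(violations(late_classical))
print(violations(unbound_hybrid))
```

The same predicates can run in unit tests against recorded handshakes, in a monitor over live telemetry, or serve as the informal specification for a formal model.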
A runtime crypto-agility architecture should expose cryptographic state as a first-class object. Not as scattered library calls. Not as buried configuration. Not as tribal knowledge locked inside the brain of the one senior engineer everyone is afraid to disturb. Cryptographic state should be inspectable, policy-bound, versioned, auditable, and tied to system identity.
That means a service should not merely “use TLS.” It should declare and expose the cryptographic policy under which it accepts connections. A device should not merely “support secure boot.” It should expose its trust root version, accepted signature schemes, firmware policy, rollback constraints, and update capability. A CI/CD pipeline should not merely “sign artifacts.” It should bind signatures to provenance, build context, algorithm policy, key lifecycle, and verification environment.
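As a rough illustration of a service declaring its policy rather than merely "using TLS", the descriptor below serializes deterministically and carries a digest that can be logged or compared across replicas. The class, its fields, and the service name are invented for this sketch; the scheme names are examples only.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CryptoPolicyDescriptor:
    """Hypothetical shape for a policy a service publishes about itself."""
    service: str
    policy_version: int
    accepted_groups: tuple
    accepted_signature_schemes: tuple
    classical_fallback: bool

    def to_json(self):
        # Sorted keys keep the serialization stable across processes.
        return json.dumps(asdict(self), sort_keys=True)

    def digest(self):
        # A stable fingerprint of the exact policy in force.
        return hashlib.sha256(self.to_json().encode()).hexdigest()


v3 = CryptoPolicyDescriptor(
    service="billing",
    policy_version=3,
    accepted_groups=("x25519+mlkem768",),
    accepted_signature_schemes=("ML-DSA-65",),
    classical_fallback=False,
)
# The same service with fallback quietly switched on.
weakened = CryptoPolicyDescriptor(
    service="billing",
    policy_version=3,
    accepted_groups=("x25519+mlkem768",),
    accepted_signature_schemes=("ML-DSA-65",),
    classical_fallback=True,
)

print(v3.to_json())
print(v3.digest() == weakened.digest())
```

Because the digest covers every field, a replica running a weakened policy under the same version number shows up as a fingerprint mismatch instead of passing as compliant.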
Cryptography must stop being hidden plumbing.
In post-quantum systems, cryptography is part of the control plane.
This shift has architectural consequences.
Key management systems must support algorithm transition, not only key storage. Certificate authorities must handle new profiles, hybrid certificates where applicable, policy signaling, and lifecycle constraints. Service meshes must expose negotiated cryptographic properties, not just connection success. Observability systems must treat cryptographic downgrade as a security event. Asset management must connect cryptographic dependencies to business-critical functions. Incident response must include cryptographic rollback analysis. Governance must define who is allowed to weaken policy and under what conditions.
This is why “PQC-ready” as a vendor claim is mostly useless without precise context.
Ready for what? ML-KEM in TLS? ML-DSA for signatures? Hybrid negotiation? Firmware verification? HSM-backed key operations? Certificate issuance? Code signing? Long-term archival protection? Embedded verification? Remote attestation? Regulated environments? Constrained devices? Multi-region service identity? Offline verification? Emergency rotation?
A meaningful claim must specify the boundary.
A system is not PQC-ready in the abstract. It is PQC-ready for a particular security property, within a particular protocol, under a particular operational model, with a particular migration path, and with evidence that the property holds under realistic failure modes.
Anything less is branding.
The deeper issue is that cryptographic migration is usually treated as a project, but it is actually a capability. Projects end. Capabilities remain. If an organization treats PQC migration as a temporary initiative, it may replace some algorithms and still remain fragile. The next cryptographic transition will produce the same pain. The next deprecation event will reveal the same unknown dependencies. The next protocol weakness will require the same emergency inventory. The next compliance demand will trigger the same spreadsheet ritual.
Crypto-agility should be built as permanent infrastructure.
That infrastructure should allow the system to absorb cryptographic change without requiring institutional panic. It should make cryptographic dependencies visible. It should make policy enforceable. It should make unsafe states observable. It should make migration measurable. It should make rollback accountable. It should make exceptions explicit. It should make trust transitions boring.
Boring is good.
In security-critical systems, boring means the failure modes were considered before they became incidents.
The post-quantum transition is often described as a race against future quantum computers. That is only partially true. It is also a race against system complexity, organizational denial, vendor opacity, undocumented dependencies, long-lived devices, and operational inertia. Quantum computing may be the external pressure, but architectural unreadiness is the internal weakness.
A cryptographically relevant quantum computer does not need to break every system at once.
It only needs organizations to have spent the previous decade pretending migration was a procurement option.
The right engineering response is not panic. Panic produces bad architecture with expensive logos. The right response is disciplined system design.
Treat cryptographic primitives as replaceable components, but do not pretend replacement is enough. Treat protocols as state machines. Treat identity as lifecycle-bound. Treat key establishment as a negotiated security boundary. Treat signatures as operational commitments. Treat certificates as policy-bearing artifacts. Treat firmware trust as long-term infrastructure. Treat logs and evidence as part of the security model. Treat rollback as a dangerous transition, not an innocent recovery button.
Above all, treat crypto-agility as something the system must do while running.
Because that is where the real test happens.
Not in the compliance document.
Not in the architecture slide.
Not in the vendor announcement.
Runtime is where peers negotiate. Runtime is where old clients appear. Runtime is where policy exceptions accumulate. Runtime is where fallback paths execute. Runtime is where attackers probe the difference between what the system claims and what the system accepts.
A post-quantum-ready system is not the one that has merely imported ML-KEM, ML-DSA, or SLH-DSA. Those standards are necessary foundations, and the industry is better for having them. But the presence of standardized algorithms does not automatically create migration safety, downgrade resistance, lifecycle discipline, or operational resilience. (NIST)
A post-quantum-ready system is the one that can change its cryptographic assumptions without losing control of its security properties.
That is crypto-agility.
Not a checkbox.
A runtime property.
And for long-lived, distributed, security-critical systems, it may become one of the most important engineering properties of the next decade.
