Engineering Full Application Platform Resilience with Quarkus, Java, GitOps, Ansible, and OpenShift

Introduction

The original Application Platform Resilience Framework (APRF) article explains the framework: resilience has to be designed across capacity, workload design, placement, lifecycle, disruption handling, application hardening, guarded chaos, and governance with observability.

This guide shows what that looks like in code. It uses a small but real three-worker, three-zone Azure Red Hat OpenShift (ARO) implementation, repeatable proof tests, dated evidence, and a scorecard across the same eight disciplines.

The cluster is large enough to make worker loss, zone placement, maintenance, alerting, and policy enforcement real. The point is to show how a platform and application team can keep a business service working when nodes fail, dependencies slow down, maintenance starts, or policy has to intervene.

Platform Baseline

The baseline assumptions are as follows:

  • Three-worker, three-zone ARO cluster — makes worker and zone failure domains real, not notional.
  • Private-cluster access through a bastion path — checks run the same way an operator or auditor would.
  • User-workload monitoring — alerts and governance act on live signals.
  • GitOps publication and reconciliation — every change, normal or corrective, flows through one auditable delivery path.

On the captured baseline, that environment stood at 3/3 workers Ready, 33 cluster operators healthy, 7 Argo applications Synced and Healthy, and 5/5 routed checks returning 200.

The Workloads

The business model is intentionally simple: a catalog service owns a product record, and an experience service turns that product into the user-facing product view. The public catalog route is GET /products/{productId}. The public experience route is GET /product-views/{productId}, which calls the matching catalog service behind the scenes. Both Java pairs also expose bounded fault-test routes such as GET /product-views/{productId}/fault-test/slow and GET /products/{productId}/fault-test/error, so the test suite can make the dependency slow, failing, timing out, or flaky without inventing a separate demo system. The Python basic-app is simpler: it gives the platform an easy health, metrics, quota, policy, and GitOps self-heal target.
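To make the route shapes concrete, here is a minimal sketch of what the experience-side resource could look like in Quarkus. Only the paths, the ExperienceStatus type, and the faultTest(String) call are taken from this article; the class name, the productView method, and the jakarta.ws.rs imports (Quarkus 3.x style) are illustrative assumptions.

// Sketch only: paths, ExperienceStatus, and faultTest(String) come from the article;
// the class name, productView method, and imports are assumptions.
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;

@Path("/product-views")
public class ProductViewResource {

    @Inject
    ExperienceJvmService experienceService;

    // Public experience route: builds the user-facing product view.
    @GET
    @Path("/{productId}")
    public ExperienceStatus productView(@PathParam("productId") String productId) {
        return experienceService.productView(productId);
    }

    // Bounded fault-test route: slow, error, timeout, flaky, or normal.
    @GET
    @Path("/{productId}/fault-test/{faultType}")
    public ExperienceStatus faultTest(@PathParam("productId") String productId,
                                      @PathParam("faultType") String faultType) {
        return experienceService.faultTest(faultType);
    }
}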

At the functionality level, the workloads interact like this:

[Diagram: functional interaction between basic-app, the experience services, and the catalog services]

This diagram is only about application functionality. It intentionally leaves out pods, Services, Argo CD, monitoring, scheduling, and failure-domain placement; those appear later when the article moves from business behaviour to platform evidence.

The implementation uses three workload groups. Each one has a reader-visible job in the story:

  • Python basic-app — namespace basic-app, 2 replicas, no PDB (PodDisruptionBudget). This is the operational spine: quota and default limits, network-policy boundaries, route and monitoring wiring, GitOps ownership markers, and a simple "alive" route check. It anchors the governance and self-heal story.
  • Quarkus JVM pair (catalog-jvm + experience-jvm) — 3 replicas per service, PDB minAvailable: 2 on both. This is the traditional bytecode-based Java reference: dependency path, planned maintenance, lifecycle proof, and the direct side of D6 Hardened application patterns.
  • Quarkus native pair (catalog-native + experience-native) — 3 replicas per service, PDB minAvailable: 2 on both. This is the ahead-of-time compiled Java path: the same broad business shape, much lower memory use, a hardened downstream client, and the full alert-to-GitOps corrective-action chain.

The live tests show where Quarkus native matters, and what tradeoff comes with it:

  • Lower steady memory. The Quarkus native pair used 264.0 Mi of live memory versus 618.0 Mi for the Quarkus JVM pair — a live-memory ratio of about 0.43.
  • Cleaner hard-worker-loss result. In D7 Guarded chaos, the hard worker loss test removed a worker without a polite drain. The JVM pair completed 468/480 requests and recorded 12 transport errors, with routed p95 bounded at 237.0 ms for successful requests. The native pair completed 480/480 requests, recorded 0 transport errors, and held routed p95 at 234.2 ms. The cleaner result may also reflect the native pair's hardened dependency path: timeout, retry, circuit breaker, bulkhead, and fallback give the route more controlled behaviour while the platform is replacing capacity.
  • Full D8 Governance and observability corrective action. The native pair does not only prove that an alert can fire. The proof keeps traffic on the public experience-native route, lets the latency alert enter pending and firing, pushes a small GitOps change that enables cache on catalog-native, measures the improvement, and then restores the authored baseline. In that corrective window, routed p95 fell from 325.7 ms to 16.3 ms with zero errors.
  • Native build tradeoff. The native pair pays for those runtime benefits with a heavier build path. Ahead-of-time compilation is more resource- and time-consuming than a normal JVM build, and it can require more attention to reflection, native-image compatibility, and build-container capacity.

This Demo APRF Implementation At a Glance

Before moving discipline by discipline, here is the operating shape of the implementation: workloads, the basic test cycle, and the alert-to-action path used in D8 Governance and observability.

Workloads and traffic

Requests enter through the public routes, land on Services, and reach the pods behind those Services. The two Argo CD instances sit beside that path because they do not carry user traffic; they deliver and reconcile the platform and workload state around it.

[Diagram: workloads and traffic, with the two Argo CD instances and monitoring beside the request path]

Three things are deliberate: request flow on the left and centre, delivery control through the two Argo CD instances, and monitoring around the outside. The request path serves users; the other paths change or observe the system.
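The public entry point for each service is an OpenShift Route in front of its Service. A minimal sketch for the native experience route follows; the TLS termination mode and the 8080 target port are assumptions rather than values taken from the article's manifests.

# Sketch of the public Route in front of the experience-native Service.
# TLS termination and target port are assumptions.
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: experience-native
  namespace: experience-native
spec:
  to:
    kind: Service
    # The Route sends user traffic to the Service, which balances across pods.
    name: experience-native
  port:
    targetPort: 8080
  tls:
    termination: edge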

APRF Test Suite Flow

Each proof test starts with one question: "What would prove that this resilience claim is true?" It then introduces one controlled pressure or fault, measures the routed and operational outcome, and restores the authored baseline.

Technically, the loop is implemented by a containerised APRF runner and Ansible roles. The tests use Kubernetes and OpenShift APIs through the documented bastion path, generate bounded HTTP traffic with hey, apply or restore GitOps changes through the runtime repository and Argo CD, and write dated Markdown reports plus centralised converge logs.

[Diagram: APRF test-suite loop of introducing a fault, measuring the routed outcome, and restoring the baseline]

Every routed check talks to a public OpenShift Route rather than to individual pods, so the result reflects what a real caller would observe while the fault is in flight.
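For orientation, a single bounded routed check can be reproduced by hand. In this sketch the route host, product id, and concurrency are placeholders; the 96-request volume matches the bounded checks reported later in the article.

# Bounded HTTP load against the public Route, not against individual pods.
EXPERIENCE_ROUTE="experience-native.apps.example.aro.azure.com"
hey -n 96 -c 4 "https://${EXPERIENCE_ROUTE}/product-views/p-1001"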

The scorecards are a compact way to read those results. Each discipline has a weight, each workload earns a score from its evidence, and the summary shows where the platform, JVM pair, and native pair are strongest. The scores do not replace the evidence; they help the reader navigate it.

The useful reading questions stay simple:

  1. What is the resilience claim?
  2. What controlled fault or pressure condition is introduced?
  3. What happens on the routed path that users would feel?
  4. What happens operationally in the platform around it?
  5. Does the system return cleanly to the authored baseline?

Key choices for this APRF reference implementation

  1. Rolling updates preserve serving posture. maxSurge: 0 means the rollout does not create extra temporary pods. maxUnavailable: 1 means only one existing replica may be unavailable during the rollout. On each three-replica Java service, that keeps two serving replicas in place during updates without depending on spare capacity.
  2. Spread protects against worker loss first. DoNotSchedule is a hard rule: if two replicas of the same service would land on the same worker, Kubernetes must reject that placement. ScheduleAnyway is softer: the scheduler should prefer zone spread, but it can still place the pod if a strict zone rule would block a small three-worker rollout.
  3. Corrective action follows the normal delivery path. In this article, the normal delivery path is GitOps: write the intended change to Git, let Argo CD apply it, measure the route, and use the same path to restore the baseline. D8 Governance and observability uses that path for the corrective cache change instead of making an untracked manual cluster patch.
  4. The preferred correction is predictable. The public D8 Governance and observability path starts with bounded reconfiguration (cache enable on catalog-native) rather than sudden scale-out. A smaller safe fix is better than a larger fix that may not land cleanly while the cluster is already under stress.

Alert to action through governance and observability

This is the most operational part of the implementation. The reference implementation does not stop at detecting a problem. It also acts through the normal GitOps path, measures the effect of that action, and returns safely to the authored baseline. The detailed proof appears later in the section "Discipline 8: Governance and Observability Are Enforced"; this diagram is the short version.

[Diagram: alert-to-action loop through the normal GitOps delivery path]

With that shape in mind, the rest of the article walks through the eight APRF disciplines one by one: what each discipline is trying to prove, what the test does, and what the evidence showed.

Discipline 1: Capacity Is Mathematically Defensible

Capacity comes first. If requests, limits, and allocatable headroom are wrong, every later resilience claim is weak. Capacity must be measured, not assumed, and it must hold up under N-1 conditions rather than only at full strength.

The more useful question is not "what fits when everything is healthy?" but "what still fits when a worker is lost, maintenance starts, or another workload starts consuming the headroom you were counting on?"

During the live tests

[Diagram: scheduler admission under pressure, with the protected workload admitted and the low-priority overflow refused]

The key D1 Capacity result in one picture: the important workload is admitted, and the expendable one is refused.

In simple terms: If a business-critical service needs room during a busy period, the cluster still protects that service instead of letting background filler work crowd it out. Capacity stays an engineered operating margin.

Steady-state resource envelope

The worker pool used three Azure Standard_D8s_v5 VMs under the hood, each with 8 vCPU and 32 GiB RAM before OpenShift/Kubernetes reservations. That gave this APRF test-suite run plenty of CPU and memory headroom: the full reference implementation requested less than one CPU core and well under 2 GiB of memory:

  • basic-app stayed intentionally tiny: 2 pods, each requesting 50m CPU and 128 Mi memory.
  • catalog-jvm used 3 pods, each requesting 100m CPU and 256 Mi memory.
  • experience-jvm used 3 pods, each requesting 100m CPU and 256 Mi memory.

OpenShift scheduler decision under bounded pressure

The scheduler-pressure lab turns capacity from a spreadsheet claim into an admission decision made by the Kubernetes scheduler. The test first fills part of the cluster with low-priority placeholder pods, then asks whether one more important pod can still land, and finally checks that extra low-priority work is refused.

To create that pressure, the test introduced a low-priority filler load of about 3 CPU cores and 12 GiB memory, asked the cluster to place a protected workload that still needed another 0.5 core and 0.75 GiB (admitted cleanly), and then asked for one more low-priority pod requesting about 4625m CPU beyond the bounded plan. That extra pod was refused with Insufficient cpu, no worker was selected, and Kubernetes also reported that preemption would not make room for it.

The pressure lab uses two priority classes so Kubernetes can tell the two test pod types apart. aprf-capacity-placeholder-low is assigned to the filler and overflow pods. aprf-capacity-protected-high is assigned to the protected pod. The protected pod is expected to be scheduled; the later overflow pod is expected not to be scheduled.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  # Low priority: filler and overflow pods, the first to be refused.
  name: aprf-capacity-placeholder-low
value: 100
preemptionPolicy: PreemptLowerPriority

---

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  # High priority: the protected workload that must still be admitted.
  name: aprf-capacity-protected-high
value: 1000000
preemptionPolicy: PreemptLowerPriority
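A sketch of how the protected test workload could reference that class. The pod name and image are placeholders; the priority class name and the 0.5 core / 0.75 GiB request come from the pressure-lab description above.

# Sketch of the protected pod; name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: aprf-capacity-protected
spec:
  # The scheduler treats this pod as protected relative to the low-priority filler.
  priorityClassName: aprf-capacity-protected-high
  containers:
    - name: protected
      image: registry.access.redhat.com/ubi9/ubi-minimal
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 500m      # the extra 0.5 core the protected workload still needs
          memory: 768Mi  # roughly the 0.75 GiB from the pressure-lab description
        limits:
          cpu: 500m
          memory: 768Mi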

The result is exactly what a platform team wants to see. Under real headroom pressure, the Kubernetes scheduler still places the protected high-priority pod and refuses the expendable low-priority overflow pod. D1 Capacity carries 14% in the scorecards because weak capacity assumptions undermine every later claim.

Discipline 2: Only Disciplined, Never Accidental Overcommitment

D2 Overcommitment is the governance twin of D1 Capacity. Capacity says what should fit. Disciplined overcommitment says how shared headroom is controlled: requests and limits define pod size, ResourceQuota caps the namespace, LimitRange supplies safe defaults, and priority classes decide which work is protected first. The proof then checks that the user-facing route is still healthy while lower-priority overflow work is refused.

The cluster is allowed to get busy, but it is not allowed to get busy in a way that silently steals room from the main service path.

The manifest pieces that make the boundary real

The Python basic-app namespace carries the simplest visible governance boundary. It is not the main Java business path; it is the easiest place to show the platform rule in a small, readable form: this namespace has a hard envelope, and pods inside it get safe defaults when they do not declare their own size.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: basic-app
spec:
  hard:
    # Maximum total limits allowed across this namespace.
    limits.cpu: "1"
    limits.memory: 1Gi
    # Maximum number of pods allowed in this namespace.
    pods: "10"
    # Maximum total requested capacity allowed in this namespace.
    requests.cpu: 500m
    requests.memory: 512Mi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: basic-app
spec:
  limits:
    - default:
        # Default limit applied when container omits its own limit.
        cpu: 250m
        memory: 256Mi
      defaultRequest:
        # Default request used by scheduler for placement decisions.
        cpu: 50m
        memory: 128Mi        

The quota sets the hard namespace envelope. The LimitRange stops "tiny request, huge actual usage" behaviour from slipping in as an unspoken default. Together they make overcommitment explicit: the cluster may share spare capacity, but each namespace still has a known boundary.
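On a live cluster, the same boundary can be read back directly. These are standard oc commands against the objects above; the names come from the manifests.

# Show hard limits versus current usage for the basic-app namespace.
oc describe resourcequota basic-app -n basic-app

# Confirm the defaults that the LimitRange injects into unsized containers.
oc describe limitrange basic-app -n basic-app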

During the live tests

[Diagram: bounded routed load on the main Java path with governed low-priority placeholder pressure]

A bounded routed load is kept on the main Java path while the test adds governed placeholder work through the Kubernetes API. The placeholder work is low-priority and expendable: it represents background work that may use spare room, but must not steal capacity from the user-facing service.

After that pressure is in place, the proof tries to create one more low-priority pod. That is the admission test. If the cluster still has safe room for expendable work, Kubernetes may place it. If not, the pod should stay pending rather than pushing the main route into trouble.

The most important D2 Overcommitment outcome is shared across both Java pairs: under bounded extra pressure, the main route stayed healthy and the excess work was the part that was constrained first.

  • Routed requests served — both pairs 96/96; hey sent 96 HTTP requests to the public route in each run, and every request completed successfully.
  • Fallback count — both pairs 0; in this context, fallback would mean the application had to return a controlled degraded response because a dependency was unhealthy. D6 Hardened application patterns explains that behaviour in detail.
  • Circuit-open count — both pairs 0; a circuit opens when the application temporarily stops calling a dependency that is repeatedly failing or too slow. D6 Hardened application patterns shows where the native path uses that protection deliberately.
  • Overflow request — refused at the quota boundary in both runs while the main user-facing route kept serving normally

The latency check used the same business route shape on both Java pairs: experience-* /product-views/{productId}. It was measured in the uncached mode so the result still included the catalog call behind the product view, instead of being hidden by a fast cache hit. On that aligned path, p95 latency was 254.9 ms on the JVM pair and 249.6 ms on the native pair. Both pairs kept the routed path healthy while the quota boundary rejected expendable overflow work. Finally, the proof asks the API server directly whether the next low-priority pod would be admitted, and the correct answer is "no". D2 Overcommitment carries 10% in the scorecards.

Discipline 3: Failure Domains Are Respected

Replica count alone is not a resilience story. The replicas also need to be distributed sensibly, and the allowed traffic paths between them need to respect the same architecture.

On this baseline that becomes a practical shape: three replicas where the discipline matters, hard spread across workers, soft spread across zones, and explicit allow and deny paths through NetworkPolicy.

During the live tests

[Diagram: one pod per worker and one pod per zone for the native pair]

In the native run, both catalog-native and experience-native landed one pod on each worker and one pod in each zone, with hostname_skew: 0 and zone_skew: 0. The direct JVM zone proof showed the same authored spread contract and the same final one-pod-per-worker, one-pod-per-zone posture.

Placement and traffic results

The placement result answers the physical failure-domain question: if one worker or zone has trouble, are all replicas of the same service exposed to that same fault? The traffic result answers the logical failure-domain question: can callers use only the intended service path, or can they bypass it with a direct shortcut?

Placement, native pair:

  • catalog-native — 3 replicas across 3 workers and 3 zones, hostname skew 0, zone skew 0
  • experience-native — 3 replicas across 3 workers and 3 zones, hostname skew 0, zone skew 0

Traffic-path checks under NetworkPolicy:

  • basic-app → experience-native — allowed and succeeded
  • experience-native → catalog-native — allowed and succeeded
  • basic-app → catalog-native (the direct shortcut) — denied and timed out

That last result matters. D3 Failure domains is not only about spreading pods out. It is also about forcing traffic through the correct upstream service rather than letting any workload talk directly to any other.
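A sketch of how those traffic-path checks could be reproduced by hand, assuming the workload images carry curl and the Service names match the workload names; the product ids and the 5-second wait are placeholders for the proof's bounded timeout.

# Allowed path: basic-app -> experience-native should answer.
oc exec -n basic-app deploy/basic-app -- \
  curl -s -m 5 -o /dev/null -w '%{http_code}\n' \
  http://experience-native.experience-native.svc.cluster.local:8080/product-views/p-1001

# Denied shortcut: basic-app -> catalog-native should time out under NetworkPolicy.
oc exec -n basic-app deploy/basic-app -- \
  curl -s -m 5 -o /dev/null -w '%{http_code}\n' \
  http://catalog-native.catalog-native.svc.cluster.local:8080/products/p-1001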

How the manifests express the placement and traffic rules

The spread rules live on each Java pair Deployment template. This excerpt is the part that tells OpenShift how to place replicas for one of the services:

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          # Spread replicas across workers first.
          topologyKey: kubernetes.io/hostname
          # Do not place the pod if worker spread would be broken.
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              # Count only this service's own replicas when computing skew.
              app.kubernetes.io/name: <service-name>
        - maxSkew: 1
          # Prefer spreading replicas across availability zones.
          topologyKey: topology.kubernetes.io/zone
          # Prefer (soft) zone spread, but do not block.
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: <service-name>

Worker spread is hard, so two replicas should not land on the same worker. Zone spread is preferred, so the scheduler tries to spread across zones without turning every rollout into a scheduling dead end.

The traffic rule lives in NetworkPolicy. Each namespace starts from default-deny ingress, then adds an explicit allow-list. Two policies make the diagram above true: one lets basic-app reach experience-native, and the other lets experience-native reach catalog-native. The policy name is the same in both namespaces because Kubernetes names NetworkPolicy objects per namespace; the namespace is what makes each rule distinct.
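The default-deny starting point itself is the standard empty-ingress policy; a minimal sketch for one of the namespaces (the policy name here is illustrative) looks like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: catalog-native
spec:
  # Select every pod in the namespace; with no ingress rules listed,
  # all inbound traffic is denied until an allow-list policy matches.
  podSelector: {}
  policyTypes:
    - Ingress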

The experience-native policy allows user-route traffic and the intended basic-app -> experience-native path:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-and-metrics-ingress
  namespace: experience-native
spec:
  podSelector:
    matchLabels:
      # This policy protects experience-native pods.
      app.kubernetes.io/name: experience-native
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # experience-native pods can call each other.
              kubernetes.io/metadata.name: experience-native
        - namespaceSelector:
            matchLabels:
              # basic-app may call experience-native.
              kubernetes.io/metadata.name: basic-app
        - namespaceSelector:
            matchLabels:
              # Public route traffic through the OpenShift router.
              kubernetes.io/metadata.name: openshift-ingress
      ports:
        - port: 8080
          protocol: TCP
    - from:
        - namespaceSelector:
            matchLabels:
              # Same namespace metrics or local callers.
              kubernetes.io/metadata.name: experience-native
        - namespaceSelector:
            matchLabels:
              # Cluster monitoring may scrape metrics.
              kubernetes.io/metadata.name: openshift-monitoring
        - namespaceSelector:
            matchLabels:
              # User-workload monitoring scrapes application metrics.
              kubernetes.io/metadata.name: openshift-user-workload-monitoring
      ports:
        - port: 8080
          protocol: TCP        

The catalog-native policy follows the same structure, so the excerpt below shows only the application-path part. It allows user-route traffic and the intended experience-native -> catalog-native path. Notice that basic-app is not on this allow-list:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-and-metrics-ingress
  namespace: catalog-native
spec:
  podSelector:
    matchLabels:
      # This policy protects catalog-native pods.
      app.kubernetes.io/name: catalog-native
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Catalog-native pods can call each other.
              kubernetes.io/metadata.name: catalog-native
        - namespaceSelector:
            matchLabels:
              # experience-native may call catalog-native.
              kubernetes.io/metadata.name: experience-native
        - namespaceSelector:
            matchLabels:
              # Public route traffic enters through the router.
              kubernetes.io/metadata.name: openshift-ingress
      ports:
        - port: 8080
          protocol: TCP        

That is why the direct basic-app -> catalog-native check should time out: basic-app is not on the catalog allow-list. The intended application path is visible in policy as well as in code. D3 Failure domains carries 12% in the scorecards.

Discipline 4: Lifecycle Is Graceful

Graceful lifecycle behaviour is about startup, readiness, rollout semantics, in-flight behaviour, shutdown, and recovery.

A normal deployment, restart, or node drain should not turn into visible instability simply because a service takes too long to become ready, drops traffic too early, or comes back unevenly.

During the live tests

[Diagram: rolling-update window with a 2-of-3 serving posture while the replacement pod starts]

The important part of D4 Lifecycle is the middle window. The service is still up, but it is temporarily living on 2-of-3 healthy replicas while the replacement pod starts and proves it is ready to serve routed traffic.

With maxSurge: 0, the rollout does not add a fourth normal serving pod. One old replica is selected for replacement first. It may still finish in-flight work during the grace period while the replacement starts, but it is no longer the replica the route should depend on for new traffic. The route stays healthy because the other old replicas keep serving until the new replica passes readiness and joins the service.

The lifecycle contract in YAML

Both Java implementations share the same rollout shape:

apiVersion: apps/v1
kind: Deployment
spec:
  # 3 replicas let the service keep a 2-of-3 serving posture while 1
  # replica is being replaced.
  replicas: 3
  strategy:
    rollingUpdate:
      # Do not create extra temporary pods during rollout.
      maxSurge: 0
      # Allow only one existing replica to be unavailable at a time.
      maxUnavailable: 1
  template:
    spec:
      # Give requests time to finish before the old pod exits.
      terminationGracePeriodSeconds: 30        

The probes show where readiness has to become true before the route gets the pod back:

startupProbe:
  # Give new pod 12 checks, 5 seconds apart, to finish startup.
  failureThreshold: 12
  periodSeconds: 5
  httpGet:
    path: /q/health/ready
    # Quarkus HTTP port used throughout the implementation.
    port: 8080

readinessProbe:
  # Keep the pod out of route traffic until it is ready to serve.
  failureThreshold: 6
  periodSeconds: 10
  httpGet:
    path: /q/health/ready
    port: 8080

Only one replica changes at a time, and the old pod gets a grace period to drain before it disappears. On a three-worker baseline, that keeps the service in a predictable 2-of-3 posture instead of gambling on extra temporary capacity.

Lifecycle timing insights

These timings measure operator-visible stability: pods ready, public route healthy, and repeated route checks passing.
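A sketch of how those operator-visible timings can be observed by hand; the route host is a placeholder and the loop is a simplification of the runner's repeated route checks.

# Trigger the one-replica-at-a-time rollout and wait for the Deployment to settle.
oc rollout restart deployment/experience-native -n experience-native
oc rollout status deployment/experience-native -n experience-native --timeout=5m

# Repeated routed checks: the public route should keep answering 200 throughout.
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code}\n' "https://${EXPERIENCE_ROUTE}/product-views/p-1001"
  sleep 3
done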

For the user-facing experience route, the native service reached steady routed readiness faster: 22.8 s versus 35.0 s for the JVM service. That is the lifecycle timing that matters most to a user-facing path: the native experience service returned to a stable public route sooner while still using the same one-replica-at-a-time rollout contract.

Those numbers are operational rollout timings. The D4 Lifecycle point is that both Java pairs keep the one-replica-at-a-time rollout contract and return to stable routed service.

The direct JVM lifecycle proof also probed graceful termination directly: the pod was deleted during a real in-flight request, and the request still completed with HTTP 200 in about 1.33 s. That matters because lifecycle is not only about starting cleanly; it is also about leaving cleanly. A pod can be removed for rollout or maintenance without cutting off a request that is already being served. D4 Lifecycle carries 12% in the scorecards.

Discipline 5: Disruption Is Controlled

Planned disruption counts. A platform that survives random faults but behaves badly during ordinary maintenance is not yet mature.

D5 Controlled disruption is about voluntary disruption: drains, upgrades, controlled rollouts. It is different from D7 Guarded chaos. The question here is not "can the system survive a sudden failure?" but "can the platform perform routine maintenance without violating the availability promise it claims to operate under?"

The contract is simple and readable: three replicas; PodDisruptionBudget minAvailable: 2 preserves the same 2-of-3 story; worker-level spread stays hard. Planned maintenance is allowed to make the service smaller for a while, but not small enough to stop being trustworthy.

During the live tests

[Diagram: scoped drain of one worker during planned maintenance]

The proof drains one worker with a selector limited to the application pair being checked. The service must keep enough replicas serving during the drain and return to full strength after the worker comes back.

Planned drain, protected capacity, and return to full strength

Each Java service has three replicas and a PodDisruptionBudget requiring two available replicas during voluntary disruption. This example shows the native experience service:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  annotations:
    # Argo CD applies this after the namespace and workload exist.
    argocd.argoproj.io/sync-wave: "64350"
  # Each service has its own PDB. 
  name: experience-native
spec:
  # During a voluntary disruption such as `oc adm drain`, keep
  # two replicas available. On a 3-replica service, maintenance
  # may remove one pod at a time.
  minAvailable: 2
  selector:
    matchLabels:
      # This label ties the budget to the experience-native pods.
      app.kubernetes.io/name: experience-native        

Planned disruption may remove one replica, not two. The bounded drain is scoped to the checked catalog and experience services:

oc adm drain <worker-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=15m \
  --pod-selector='app.kubernetes.io/name in (<catalog-service>,<experience-service>)'        

The selector keeps the test focused: drain one worker, watch the two services, check the public route, then confirm full recovery.
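Around that drain, a few standard checks make the contract visible; the worker name is a placeholder.

# Confirm the budgets before draining: ALLOWED DISRUPTIONS should be 1 per service.
oc get pdb -n catalog-native
oc get pdb -n experience-native

# Watch the pair stay at 2-of-3 while the scoped drain (shown above) runs.
oc get pods -n experience-native -w

# Return the worker to service after maintenance and confirm 3-of-3 recovery.
oc adm uncordon <worker-name>
oc get pods -n experience-native -o wide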

The observed state matched the contract:

[Diagram: replica counts during and after the drain, 2-of-3 during and 3-of-3 after]

Both pairs stayed at 2-of-3 during the drain and returned to 3-of-3 after the worker came back. basic-app also stayed in the route set, showing that the wider public baseline stayed healthy while the targeted pair was being maintained. D5 Controlled disruption carries 12% in the scorecards.

Discipline 6: Application Patterns Are Hardened

When one service depends on another, what happens to the user when that dependency slows down or fails?

D6 Hardened application patterns is about resilience inside the application, not only in the platform. The experience service calls catalog; the proof checks whether that caller handles a slow or failing dependency safely.

The two Java pairs make two application contracts visible:

  • the Quarkus JVM pair is the minimal direct baseline
  • the Quarkus native pair is the degraded-successful hardened path

Here, degraded-successful means the route still returns a successful response, but the payload clearly says it came from fallback instead of live catalog data. The user receives a controlled lower-fidelity answer, not a raw dependency failure.

The native path uses MicroProfile Fault Tolerance annotations, implemented by SmallRye in Quarkus, around the downstream catalog call:

  • Retry: tries a short-lived dependency failure again before giving up. This helps with brief network blips, pod handoffs, or one unlucky slow call.
  • Timeout: sets a hard wait budget. If catalog does not answer quickly enough, the experience service stops waiting instead of holding the user request open.
  • Circuit breaker: watches repeated failures. When the dependency is clearly unhealthy, it stops sending every request into the same bad path for a short time and lets the fallback handle the response.
  • Bulkhead: limits how many requests can be inside this dependency path at the same time, so a slow catalog call cannot consume all caller capacity.
  • Fallback: returns a controlled degraded response when the protected catalog call cannot complete.

Together, they form a layered defense: limit concurrency, bound waiting time, retry short failures, fail fast during sustained trouble, and return a safe fallback instead of exposing a raw dependency failure.

[Diagram: layered fault-tolerance defence around the downstream catalog call]

Where this behaviour lives in code

The D6 code sits in the two experience workloads because they are the callers exposed through the public route. The downstream catalog workloads provide the normal and fault-test product data paths. In Java terms: ExperienceJvmService shows the direct JVM caller, ExperienceNativeService routes native requests, ResilientCatalogService contains the native hardening annotations, and each pair uses its own CatalogGateway client to call catalog.

The public fault-test route /product-views/{productId}/fault-test/{faultType} uses faultType to choose the catalog behaviour to exercise: slow, error, timeout, flaky, or normal. In the JVM baseline, that choice is passed directly to the catalog client:

public ExperienceStatus faultTest(String faultType) {
  CatalogStatus status = switch (faultType.toLowerCase()) {
      case "slow" -> catalogGateway.getSlowStatus();
      case "error" -> catalogGateway.getErrorStatus();
      case "timeout" -> catalogGateway.getTimeoutStatus();
      case "flaky" -> catalogGateway.getFlakyStatus();
      default -> catalogGateway.getStatus();
  };
  return envelopeFor(status, "fault-test-" + faultType.toLowerCase());
}        

In the native path, the same slow dependency shape is wrapped by ResilientCatalogService.callSlow():

// Retry short transient failures before returning fallback.
@Retry(maxRetries = 3, delay = 50)

// Stop waiting after 200 ms so one slow catalog call does not hang the request.
@Timeout(200)

// If repeated calls fail, stop calling catalog and fail fast to fallback.
@CircuitBreaker(requestVolumeThreshold = 6, failureRatio = 0.80, delay = 500)

// Limit concurrent calls inside this dependency path.
@Bulkhead(15)

// Controlled degraded response when the protected call cannot complete.
@Fallback(fallbackMethod = "fallbackCatalog")
public CatalogStatus callSlow() {
    return catalogGateway.getSlowStatus();
}        
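The fallback target named by the @Fallback annotation is not shown above. A minimal sketch follows; the method name and return type come from the article, while the CatalogStatus.fallback(...) factory is an assumption about how the payload marks itself as fallback data.

// Sketch of the fallback target named by @Fallback above; the factory method
// is an assumption about how the payload is labelled as fallback data.
CatalogStatus fallbackCatalog() {
    // Degraded-successful: a well-formed response, clearly labelled as
    // fallback rather than live catalog data.
    return CatalogStatus.fallback("catalog unavailable, serving fallback data");
}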

The JVM path deliberately shows the dependency behaviour directly. The native path wraps the same dependency shape with bounded waiting, retry, circuit breaking, concurrency isolation, and fallback.

JVM pair proof: direct dependency behaviour

For the JVM pair, steady hey traffic hits the public experience-jvm route. The experience service calls catalog-jvm through the selected fault-test path. There is no retry, timeout, circuit breaker, bulkhead, or fallback around that call, so downstream trouble appears directly in the routed behaviour.

[Diagram: JVM pair, direct dependency behaviour on the routed path under injected faults]

JVM results:

With injected delay, the JVM route showed the delay directly. Routed p95 rose to 708.0 ms, compared with 254.9 ms on the aligned normal route without injected trouble. That is expected for the direct baseline: the caller waits on the slow catalog path.

There is no fallback or circuit breaker in this JVM path, so fallback rate and circuit-open count stay at 0. Under injected downstream errors, that means the route exposes direct failures rather than switching to a protected response. The error-phase p95 was 113.6 ms; that lower latency only means some failing calls ended quickly, not that the user experience was better.

The JVM response body confirms the simple contract: no fallback marker, a catalog timeout mode, and about 250 ms of catalog-side work. The route remained tied to the live dependency behaviour.

Native pair proof: hardened caller behaviour

For the native pair, the same style of hey traffic hits the public experience-native route. The experience service calls catalog-native through ResilientCatalogService, so the dependency call is protected before the route response is returned.

[Diagram: native pair, hardened caller behaviour on the routed path under injected faults]

Native results:

This is the useful native behaviour: users still receive a controlled response from the route, while the experience service avoids piling more work onto an unhealthy catalog service. After the circuit delay, the caller can try catalog again automatically.

With injected delay, the native route mostly moved into that protected path. Fallback rate was 97.5%, circuit-open count was 117, and routed p95 was 972.8 ms, compared with 249.6 ms on the aligned normal route without injected trouble. The higher p95 includes resilience work such as retries, timeout handling, and fallback. The important signal is that the caller identified sustained trouble and returned controlled fallback responses.

With injected downstream errors, the same protection showed up more sharply: fallback rate was 97.5%, circuit-open count was 117, and routed p95 was 344.6 ms. The route kept returning a controlled response while the circuit breaker reduced repeated calls into the unhealthy catalog path.

The native response body confirms the protected contract: it is marked as a fallback response, and the circuit-open state explains why the caller stopped sending every request into the same failing catalog path.

This is why D6 Hardened application patterns is application resilience rather than platform resilience alone. The native path does more work to keep the caller safe when the dependency is unhealthy: bound the call, isolate it, stop repeating known-bad calls, and return a degraded-successful response. D6 Hardened application patterns carries 15% in the scorecards.

Discipline 7: Guarded Chaos Through Pod Loss And Hard Node Loss

Chaos is useful when it stays close to the resilience claims the team cares about. The active reference baseline keeps D7 Guarded chaos narrow on purpose: loss of one pod, and hard loss of one worker through Machine deletion with machine.openshift.io/exclude-node-draining=true.

Guarded is important here. These are not big random failures added to look dramatic. Each fault is bounded, repeatable, and tied to a specific engineering question. Chaos with governance builds confidence; chaos without control is recklessness.

[Diagram: guarded chaos scope, single-pod loss and hard worker loss]

Single-pod loss

Both Java pairs served 96/96 requests successfully:

  • Quarkus JVM pair — p95 343.3 ms, no transport errors, no HTTP 5xx
  • Quarkus native pair — p95 254.8 ms, no transport errors, no HTTP 5xx

This is the "ordinary bad day" fault: one pod disappears, but the deployment still has enough healthy replicas and routing remains stable.

Hard worker loss

This is the sharper test. The Machine is annotated to skip polite node draining and then deleted:

oc annotate machine <machine-name> -n openshift-machine-api \
  machine.openshift.io/exclude-node-draining=true --overwrite
oc delete machine <machine-name> -n openshift-machine-api        

Quarkus JVM pair:

The JVM pair kept serving most traffic while the worker disappeared, but the route still saw a small interruption window: 468/480 requests completed and 12 transport errors were recorded. Routed p95 stayed bounded at 237.0 ms for the successful requests. The platform observed the failed node gone after 62.4 s, a replacement worker became Ready 58.6 s later, and catalog-jvm returned to healthy service in 26.8 s.

Quarkus native pair:

The native pair kept the route cleaner through the same failure shape: 480/480 requests completed, with 0 transport errors and routed p95 234.2 ms. The failed node was observed gone after 57.2 s, a replacement worker became Ready 46.8 s later, and catalog-native recovered in 15.8 s.

Read the fields this way:

  • Requests served: whether user-like traffic kept getting responses while one worker disappeared.
  • Transport errors: whether the route had connection-level misses during the failure window.
  • p95: latency for successful routed requests during the disruption.
  • Node gone observed: how long it took the platform to recognise the deleted worker as gone.
  • Replacement worker Ready: how long the infrastructure repair took after the old worker was gone.
  • Catalog recovery: how quickly the affected downstream service returned to a healthy pod set.

The native pair is stronger in this proof because it completed every routed request, avoided transport errors, and recovered its catalog service faster.

Without exclude-node-draining=true, the platform would first try to drain the node cleanly and the test would start to look more like D5 Controlled disruption. D7 Guarded chaos wants the sharper failure: the worker disappears, the application has to keep going, and the infrastructure has to repair itself.

The fault is real, but the user-facing path stays within a bounded and measured operating envelope while the platform repairs itself. D7 Guarded chaos carries 15% in the scorecards because it shows most directly whether the system keeps serving under real loss.

Discipline 8: Governance and Observability Are Enforced

D8 Governance and observability is where the operational value becomes most visible. The earlier alert-to-action overview introduced the short pattern: detect a problem, act through GitOps, measure the effect, and restore the authored baseline. This section shows that loop in detail.

The question is no longer "can we observe a problem?" It is "can we observe it, enforce the right constraints, restore drift, act through the normal delivery path, and return to baseline without relying on manual intervention?"

[Diagram: governance and observability enforcement loop]

How Route-Latency Alerts Were Defined And Proved

Both Java implementations had direct alert proofs. The alerts live in PrometheusRule manifests beside the workloads and watch p95 latency from http_server_requests_seconds_bucket on the public experience routes.

The monitoring path has three parts:

  • Argo CD delivers the workload namespaces, Deployments, Services, Routes, ServiceMonitors, and PrometheusRules from Git.
  • User-workload monitoring is enabled on the application namespaces with openshift.io/user-monitoring: "true", so OpenShift's user-workload Prometheus is allowed to scrape those namespaces (the cluster-level enablement is sketched just after this list).
  • A ServiceMonitor tells Prometheus where to scrape metrics, for example /q/metrics on the experience-native Service every 30 s. The PrometheusRule then turns those scraped metrics into alert states.
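The second item above relies on user-workload monitoring being switched on at the cluster level. That enablement is not part of the workload manifests in this article; the standard OpenShift mechanism is a ConfigMap in openshift-monitoring, roughly:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Starts the user-workload Prometheus stack so application
    # namespaces can be scraped through ServiceMonitors.
    enableUserWorkload: true

With that in place, the ServiceMonitor and PrometheusRule delivered from Git look like this: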

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: experience-native
  labels:
    openshift.io/user-monitoring: "true"
spec:
  endpoints:
    - path: /q/metrics
      interval: 30s
  selector:
    matchLabels:
      # Scrape the experience-native Service.
      app.kubernetes.io/name: experience-native
  namespaceSelector:
    matchNames:
      - experience-native
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: experience-native
  labels:
    openshift.io/user-monitoring: "true"
spec:
  groups:
    - name: experience-native.rules
      rules:
        - alert: ExperienceNativeRouteLatencyHigh
          # JVM uses the same shape, but watches its fault-test route.
          expr: >-
            histogram_quantile(
              0.95,
              sum by (le) (
                rate(http_server_requests_seconds_bucket{
                  namespace="experience-native",
                  uri="/product-views/{productId}",
                  method="GET"
                }[2m])
              )
            ) > 0.20
          for: 2m        

ExperienceJvmFaultRouteLatencyHigh and ExperienceNativeRouteLatencyHigh are not separate OpenShift object kinds. They are Prometheus alert rule names inside PrometheusRule resources. The OpenShift object kind is PrometheusRule; the alert: field names the alert Prometheus evaluates.

The two alert names describe which route is being watched:

  • ExperienceJvmFaultRouteLatencyHigh means: the JVM experience-jvm fault-test route stayed above the p95 latency threshold. This route is used for the direct JVM alert exercise with injected latency.
  • ExperienceNativeRouteLatencyHigh means: the native experience-native product-view route stayed above the p95 latency threshold. This is the route improved by the D8 corrective action.

Using the wall-clock observations from the proof reports here:

  • ExperienceJvmFaultRouteLatencyHigh — first observed pending at about 49 s, first observed firing at about 164 s
  • ExperienceNativeRouteLatencyHigh — first observed pending at about 27 s, first observed firing at about 142 s

Read those states this way:

  • Pending: the route p95 crossed the threshold; OpenShift Console may group this as an active alert.
  • Firing: the high-latency condition lasted long enough to count as an operational alert, not a one-sample spike. With for: 2m, this cannot happen until the condition stays true for roughly two minutes.
  • Cleared: the route returned below the threshold after the load ended. There is no extra for: delay on clear, but the two-minute PromQL rate window and polling cadence affect when clear is observed.

The alert rules detect real route latency, wait long enough to avoid instant noise, and clear when the service recovers.

[Screenshot: OpenShift Console view of the ExperienceNativeRouteLatencyHigh alert rule]

The OpenShift Console view makes the rule readable without asking the reader to inspect YAML first: this is a user-workload Prometheus alert, it watches the experience-native product-view route, it uses a 200 ms p95 threshold, and it requires the condition to hold for 2m.

[Screenshot: the same alert rule in the Firing state]

After the condition stayed above the threshold long enough, the same rule moved to Firing. This is the operational signal used by the D8 Governance and observability proof before the corrective GitOps change is applied.

The native corrective timeline

The alert-only proof shows detection. The native corrective proof shows the full loop: detect, change, measure, and restore. The proof kept a small steady flow on the public experience-native /product-views/{productId} route while the alert watched the latency condition behind that route:

[Diagram: native corrective timeline, detect, change, measure, restore]

Before-and-after effect on the routed path

The corrective change itself was intentionally small: it set APRF_CATALOG_CACHE_ENABLED in catalog-native-config from false to true, then synced only the affected ConfigMap and Deployment. The native catalog service stopped recomputing the same bounded /products/{productId} response on every request and served it from cache for the duration of the corrective window. The restore step then set the same key back to false.

Article content

The route stayed healthy in both phases, but it moved into a much better operating envelope under the same bounded traffic.

[Chart: route p95 against the 200 ms threshold across the corrective window and after restore]

The metric view shows the same story from the user's route. The blue route-p95 line is above the 200 ms threshold before correction. During the corrective window it drops sharply, then returns when the cache is restored to the authored disabled baseline. If the same load continues after restore and p95 stays above 200 ms for the full alert window, the alert should trigger again. That is expected: the proof demonstrates a governed, reversible correction, not a permanent decision to keep product data cached.

The corrective mutation

The native corrective mutation is deliberately small. During correction, the Git change enables cache. During restore, the same key returns to false:

apiVersion: v1
kind: ConfigMap
metadata:
  name: catalog-native-config
data:
  # Correction: "true" for the bounded cache window.
  # Restore: "false" for the authored baseline.
  APRF_CATALOG_CACHE_ENABLED: "false"        

Argo was asked to sync only the two affected resources:

sync_resources:
  - group: ""
    kind: ConfigMap
    namespace: catalog-native
    name: catalog-native-config
  - group: apps
    kind: Deployment
    namespace: catalog-native
    name: catalog-native        

The action is still declarative and auditable, but it is also intentionally fast. The proof does not wait for a broad eventual sync if a bounded explicit sync can carry the same authored change through sooner. The same path later restores the authored baseline.
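For reference, the equivalent hand-run sync, assuming an Argo CD application named catalog-native, would look roughly like this; the automation above achieves the same scoped sync.

# Sync only the corrective ConfigMap and the Deployment that consumes it.
argocd app sync catalog-native \
  --resource :ConfigMap:catalog-native-config \
  --resource apps:Deployment:catalog-native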

[Screenshot: Argo CD view of the synced catalog-native-config change]

The Argo CD view is the governance part of the proof. The change is not a console edit on a live pod. It is a synced catalog-native-config ConfigMap in the catalog-native namespace, carrying the bounded corrective value APRF_CATALOG_CACHE_ENABLED: 'true' through the GitOps-managed application.

Why restore is part of the business claim

The product data is intentionally business-like. catalog-native returns a product id, name, current price, stock level, and last-updated timestamp:

public record Product(
        String productId,
        String productName,
        double currentPrice,
        int stockLevel,
        Instant lastUpdated
) {}        

That matters because caching is a tradeoff. It improves latency during the corrective window, but product price and stock can become stale while cache is enabled. That may be acceptable for a bounded incident response, but it should not silently become the permanent behaviour.

The proof only counts as complete because the authored baseline is restored through the same GitOps path afterwards. The operational claim is:

  • the system can change
  • the system can improve
  • the system can avoid turning a temporary latency fix into a permanent stale product-data policy
  • the system can return to the authored baseline

D8 Governance and observability carries 10% in the scorecards because it turns the rest of the disciplines into an auditable operational story rather than a static configuration view.

What The Scorecards Show

The scorecards summarise how much direct evidence each workload has across the eight APRF disciplines. Higher scores mean the workload has stronger proof for capacity, placement, lifecycle, disruption handling, application hardening, chaos, and operational governance.

Score summary

  • Python basic-app — canonical APRF score 72.4, runtime-emphasis score 71.6, band developing. It is the governance spine: quota, default limits, network policy, route, metrics, alerts, and GitOps ownership are visible. Its lighter score is expected because it is intentionally not a downstream-service resilience example; D6 Hardened application patterns is therefore lighter for this workload.
  • Quarkus JVM pair — canonical score 84.3, runtime-emphasis score 84.3, band strong. It is the clear bytecode-based Java reference and the direct dependency baseline for D6 Hardened application patterns. It carries strong direct evidence in capacity, lifecycle, controlled disruption, and guarded chaos. It also makes the failure posture easy to read: under injected dependency errors, it exposes direct routed failures rather than hiding them behind fallback.
  • Quarkus native pair — canonical score 87.6, runtime-emphasis score 89.0, band strong. It leads overall because it combines low memory use, hardened dependency behaviour, the cleanest hard-worker-loss result, and the full D8 alert-to-GitOps corrective-action loop. The useful supporting signals are concrete: 264.0 Mi live memory versus 618.0 Mi for JVM, 480/480 requests and 0 transport errors under hard worker loss, catalog recovery in 15.8 s, and D8 corrective p95 improvement from 325.7 ms to 16.3 ms.

The comparison is best read as "native carries the strongest complete proof package," not as "JVM is weak." The JVM pair is intentionally useful because it is the direct baseline. The native pair ends up ahead because several advantages reinforce each other: lower memory, direct spread and service-path evidence, stronger chaos continuity, hardened dependency behaviour, and the only full D8 Governance and observability corrective chain.

How the weighting worked in this APRF demo

The weighting gives extra influence to the disciplines that decide whether a system keeps serving during real trouble:

  • D1 Capacity — 14% — real headroom, bounded limits, scheduler admission under pressure
  • D2 Overcommitment — 10% — governed sharing of spare capacity without leaking into the main service path
  • D3 Failure domains — 12% — worker and zone spread, plus the correct allowed service paths
  • D4 Lifecycle — 12% — startup, restart recovery, and the time spent in the reduced 2-of-3 window
  • D5 Controlled disruption — 12% — planned drains and maintenance with PodDisruptionBudget protection
  • D6 Hardened application patterns — 15% — timeouts, retries, circuit breaking, fallback, and clean recovery
  • D7 Guarded chaos — 15% — bounded pod loss and worker loss under live routed traffic
  • D8 Governance and observability — 10% — policy enforcement, alerting, self-heal, and bounded corrective GitOps action

How Your Organisation Can Benefit From APRF

The original APRF article explains the maturity model. This implementation is the next step — use it as a practical pattern:

  1. Start with one small workload and prove the platform spine: route, metrics, namespace policy, quotas, and GitOps ownership.
  2. Add one realistic service dependency and prove the operating envelope: capacity, placement, rollout, planned drain, and route continuity.
  3. Add application-level resilience where it matters: timeout, retry, circuit breaker, bulkhead, fallback, and dependency fault tests.
  4. Add bounded chaos only after the steady state is clear: pod loss first, hard worker loss second.
  5. Close the loop with observability and governance: alert, apply a small GitOps correction, measure the result, and restore the authored baseline.

The value is different for each audience:

  • Platform and SRE teams get proof that capacity, spread, drains, monitoring, and worker loss behave as designed.
  • Application teams get concrete code examples for direct dependency paths and hardened dependency paths.
  • Architecture and leadership teams get scorecards that connect runtime choices to measured operational behaviour, not preference or assumptions.

Conclusion

APRF becomes useful when resilience claims are turned into runnable proof. Capacity, lifecycle, disruption handling, application hardening, chaos, and observability all matter, but the value is in seeing them work together on real routes, real manifests, and dated evidence.

This implementation gives that shape. The basic-app anchors the governance spine. The Quarkus JVM pair gives a clear bytecode-based direct baseline. The Quarkus native pair shows the stronger complete proof package here: lower memory use, hardened dependency handling, cleaner hard-worker-loss behaviour, and the full alert-to-GitOps corrective-action loop.

You are welcome to explore the APRF article’s companion repository, published under the MIT License: https://github.com/mikhailknyazev/aprf

Thank you!

