Azure Red Hat OpenShift (ARO) — Design Considerations
Table of Contents
- Deployment Model
- Network Planning
- Access Control & Identity
- Scalability & High Availability
- Storage Integration
- Security
- Logging & Monitoring
- Disaster Recovery & Backup
- Day 2 Operations
- Shared Responsibility Model
- Cost Management
1. Deployment Model
Options
| Model | API Server | Ingress | Outbound | Use Case |
|---|---|---|---|---|
| Public | Public IP | Public LB | Public LB | Dev/test, non-sensitive workloads |
| Private | Private IP (control plane subnet) | Internal LB | Public LB (default) | Production, regulated workloads |
| Private without Public IP | Private IP | Internal LB | UDR (User Defined Routing) | Highest security, air-gapped/disconnected |
Recommendation for Regulated / Enterprise Environments
- Private cluster without public IP (--apiserver-visibility Private --ingress-visibility Private --outbound-type UserDefinedRouting)
- API server is only accessible from within the VNET or peered networks
- Egress traffic is routed through Azure Firewall or an NVA via UDR; no public IP is provisioned on the cluster
- Access to the OpenShift console and API requires a jump box, VPN, or Azure Bastion within the peered network
- Custom domain is recommended (e.g., aro.example.com) with enterprise-managed TLS certificates for both ingress and API server
Key Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Outbound type | LoadBalancer / UserDefinedRouting | UDR: required for no public IP |
| Custom domain | Default (*.aroapp.io) / Custom | Custom domain: enterprise branding and certificate control |
| Pull secret | With / Without Red Hat pull secret | With: enables Operator Hub and Red Hat registry access |
| Deployment method | Azure Portal / CLI / Terraform / ARM | Terraform: infrastructure-as-code, repeatable, auditable |
| Subscription model | Single / Separate for Prod & Non-Prod | Separate subscriptions: per-subscription resource limits, security isolation, billing separation |
| Cluster per environment | Shared / Dedicated per env | Dedicated clusters for Prod and Non-Prod: workload isolation, independent upgrade cycles |
Bastion / Jump Box
For a private cluster, a bastion host is required for administrative access:
- Azure Bastion in the Hub VNET, or a dedicated jump box VM in a peered subnet
- Jump box specs: 4 vCPU, 8 GB RAM, 100 GB disk, RHEL 9+
- Required CLIs on jump box: az (Azure CLI), oc (OpenShift CLI), terraform, helm
- Alternative: Azure VPN Point-to-Site for developer access to private API
Cluster Limits
| Parameter | Limit |
|---|---|
| Max worker nodes | 250 |
| Max pods per node | 250 |
| Control plane nodes | 3 (fixed, managed by ARO) |
| Min worker nodes | 3 |
| Cluster relocation | Not supported: a cluster cannot be moved between regions or subscriptions after deployment |
2. Network Planning
VNET & Subnet Design
ARO requires a VNET with two dedicated subnets, one for the control plane and one for worker nodes. Each subnet must be at least /27; /23 is recommended for production.
| Component | CIDR Example | Minimum Size | Notes |
|---|---|---|---|
| ARO VNET | 10.100.0.0/16 | /22 | Can use existing VNET |
| Control plane subnet | 10.100.0.0/23 | /27 | Service endpoints: Microsoft.ContainerRegistry |
| Worker node subnet | 10.100.2.0/23 | /27 | Service endpoints: Microsoft.ContainerRegistry |
| Pod network (overlay) | 10.128.0.0/14 | /18 | Non-routable, internal to OVN-Kubernetes SDN. Each node gets a /23 (512 IPs) |
| Service network | 172.30.0.0/16 | /16 | Cluster-internal service IPs |
Key Rules
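- Pod and Service CIDRs must not overlap with the VNET address range or any peered network
- Pod CIDR minimum is /18; each node is allocated a /23 subnet (512 pod IPs per node, not changeable)
- Plan for growth: if you expect 50 worker nodes, the worker subnet needs at least 50 node IPs, plus the 5 addresses Azure reserves in every subnet, plus a buffer for internal load balancer frontends and for the extra nodes created during upgrades and scale-out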
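To make the pod-network rule concrete, the cluster's network configuration can be read back after installation with oc get network.config cluster -o yaml. The sketch below shows roughly what that resource looks like for the example CIDRs used in this section. The values are this document's examples rather than required settings, and on ARO they are fixed at creation time, so treat this as a read-only reference.

```yaml
# Approximate shape of the cluster Network config for the example CIDRs.
# Read-only reference: ARO sets these values when the cluster is created.
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
    - cidr: 10.128.0.0/14   # pod network (overlay)
      hostPrefix: 23        # each node receives a /23 (512 pod IPs)
  serviceNetwork:
    - 172.30.0.0/16         # cluster-internal service IPs
  networkType: OVNKubernetes
```

With a /14 pod network and a host prefix of 23 there are 2^(23-14) = 512 per-node blocks. A /18 pod network yields only 2^(23-18) = 32 blocks, which caps the cluster at 32 nodes including the three control plane nodes.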
Private Link Endpoints
For a regulated / enterprise environment, all Azure PaaS services should be accessed via Private Link endpoints to prevent data traversing the public internet:
| Azure Service | Private Endpoint Required | Subnet |
|---|---|---|
| Azure Key Vault | Yes | Dedicated Private Endpoints subnet |
| Azure Container Registry | Yes | Dedicated Private Endpoints subnet |
| Azure Storage (Blob, Files) | Yes | Dedicated Private Endpoints subnet |
| Azure SQL / CosmosDB | Yes | Dedicated Private Endpoints subnet |
| Azure Monitor (Log Analytics) | Yes (AMPLS) | Dedicated Private Endpoints subnet |
| Azure Service Bus / Event Hub | Yes | Dedicated Private Endpoints subnet |
- Create a dedicated subnet (e.g., /24) in the spoke VNET for Private Link endpoints
- Register Private DNS Zones (e.g., privatelink.vaultcore.azure.net) in the Hub VNET and link them to the spoke
- Disable public access on all PaaS services
Ingress Control
| Approach | Description | Recommendation |
|---|---|---|
| Default OpenShift Router | HAProxy-based ingress controller, deployed on worker nodes by default | Use as primary ingress |
| Azure Application Gateway + WAF | L7 load balancer with Web Application Firewall, SSL offloading, URL-based routing | Recommended for external-facing apps: deploy in a dedicated subnet in the spoke VNET; WAF policy provides OWASP rule sets, bot protection, and rate limiting |
| Internal Load Balancer | Private ingress (--ingress-visibility Private) | Required for private cluster |
| Custom ingress controller | NGINX, Traefik, etc. | Only if specific features needed |
Egress Control
For a private cluster with UDR:
- Azure Firewall or NVA in the Hub VNET controls all egress traffic
- Route table on worker/control plane subnets with default route (0.0.0.0/0) pointing to the firewall
- Required egress destinations (proxied through the ARO service; no explicit firewall rules needed):
  - arosvc.azurecr.io (system container images)
  - management.azure.com (Azure APIs)
  - login.microsoftonline.com (authentication)
  - monitor.core.windows.net (Geneva monitoring)
- Optional egress destinations (require explicit firewall allow rules):
  - registry.redhat.io, quay.io, cdn*.quay.io: Red Hat container registry and Operator Hub
  - registry.access.redhat.com, registry.connect.redhat.com: certified operators
  - mirror.openshift.com: cluster updates
  - api.openshift.com: update graph
- For disconnected/air-gapped: mirror required images to an internal Azure Container Registry
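For the disconnected case, image pulls are usually redirected to the mirror through a cluster-level mirror configuration. The following is a minimal sketch, assuming OpenShift 4.13 or later (earlier versions use ImageContentSourcePolicy instead) and assuming the images have already been copied into the registry. The registry name contosoacr.azurecr.io and the repository paths are hypothetical, and whether this resource may be changed on a managed ARO cluster should be confirmed against the ARO support policies.

```yaml
# Sketch only: registry name and repository paths are hypothetical.
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: acr-mirror
spec:
  imageDigestMirrors:
    - source: registry.redhat.io/ubi9
      mirrors:
        - contosoacr.azurecr.io/mirror/ubi9
      # Never fall back to the original source when the mirror is unreachable
      mirrorSourcePolicy: NeverContactSource
    - source: quay.io/example-org
      mirrors:
        - contosoacr.azurecr.io/mirror/example-org
      mirrorSourcePolicy: NeverContactSource
```

Digest-based mirroring only applies to images referenced by digest; images referenced by tag need a separate tag mirror configuration.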
Connectivity to On-Premises & Other VNETs
| Connectivity | Method | Notes |
|---|---|---|
| On-premises | Azure ExpressRoute or Site-to-Site VPN | ExpressRoute recommended for regulated environments: dedicated, private connection |
| Other Azure VNETs | VNET Peering | ARO spoke peered to Hub VNET; Hub peers to other spokes |
| DNS resolution | Azure Private DNS Zones + conditional forwarding | Forward on-prem domains to on-prem DNS; ARO uses CoreDNS with configurable forwarding |
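The CoreDNS forwarding mentioned in the DNS row is configured through the cluster DNS operator. A minimal sketch follows, assuming a hypothetical on-premises zone corp.example.com and a hypothetical resolver at 10.0.4.4 reachable from the cluster; both values are placeholders rather than part of the reference design.

```yaml
# Sketch only: zone name and upstream resolver IP are hypothetical.
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
    - name: onprem-corp
      zones:
        - corp.example.com      # queries for this zone are forwarded
      forwardPlugin:
        upstreams:
          - 10.0.4.4            # on-prem or hub DNS forwarder
```

Names outside the listed zones continue to resolve through the default upstream, which on Azure is the VNET's configured DNS.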
Landing Zone Integration
- Deploy ARO in a Hub-Spoke topology aligned with Azure Landing Zone best practices
- ARO cluster in a dedicated spoke VNET
- Hub VNET contains: Azure Firewall, VPN/ExpressRoute Gateway, Azure Bastion, DNS
- Network Security Groups (NSGs) are auto-created and managed by ARO — do not modify
- Private Link is used by Microsoft/Red Hat SRE to manage the cluster
Recommended Architecture — Private ARO with Internal & External Apps
The following architecture shows a private ARO cluster (no public IP) serving both internal-only apps (accessed by employees via corporate network) and external-facing apps (accessed by customers via the internet).
There are two approaches to expose internet-facing applications from a private ARO cluster:
Approach A: Custom Domain at Cluster Level
Set a custom domain during cluster installation using the --domain flag. This replaces the default *.aroapp.io domain for all cluster routes (console, API, application routes).
- Set at creation time: az aro create ... --domain example.com
- All routes use this domain: *.apps.example.com
- Organization manages DNS (Azure DNS or corporate DNS) and TLS certificates
- Cannot be changed after cluster creation
- Reference: https://cloud.redhat.com/experts/aro/custom-domain-private-cluster/
Approach B: Additional IngressController with Dedicated Domain
Create a second IngressController post-install with its own domain and its own Azure Load Balancer. This is the recommended approach for exposing specific internet-facing apps while keeping the default router for internal traffic.
- Default IngressController remains internal (corporate traffic)
- Additional IngressController gets a dedicated domain (e.g., *.api.example.com) and its own Load Balancer
- The additional IngressController's Load Balancer can be External (public IP), even though the cluster itself has no public IP on its nodes. The --ingress-visibility Private flag only applies to the default IngressController
- Reference: https://cloud.redhat.com/experts/aro/additional-ingress-controller/
| Aspect | Approach A (Custom Domain) | Approach B (Additional IngressController) |
|---|---|---|
| When to set | Cluster creation (cannot change later) | Post-install (can add/remove anytime) |
| Scope | All cluster routes (console, API, apps) | Only routes matching routeSelector |
| Internet exposure | Still needs a separate mechanism (App Gateway, Front Door, or additional IngressController) | Built-in: additional IngressController can have its own public LB |
| Flexibility | One domain for everything | Different domains for different app groups |
| Recommendation | Use for custom branding of the cluster domain | Recommended for exposing specific apps to the internet |
Note: Approaches A and B are complementary, not mutually exclusive. You can set a custom domain at cluster creation (Approach A) AND create additional IngressControllers (Approach B) for specific internet-facing apps.
Architecture Diagram
INTERNET
│
┌────────────────┼────────────────┐
│ (Option 1) │ │ (Option 2)
│ │ │
┌────────▼────────┐ │ ┌────────▼────────┐
│ Azure Front │ │ │ Public DNS │
│ Door + WAF │ │ │ (example.com) │
│ (global L7) │ │ │ │
└────────┬─────────┘ │ └────────┬────────┘
│ Pvt Link │ │
▼ │ ▼
┌──────────────┐ │ ┌───────────────────────┐
│ App Gateway │ │ │ Public LB │
│ + WAF v2 │ │ │ (from additional │
│ (optional) │ │ │ IngressController) │
└──────┬───────┘ │ └───────────┬───────────┘
│ │ │
▼ │ ▼
═══════════════════════════════════════════════════════════════════════
HUB VNET (10.0.0.0/16)
═══════════════════════════════════════════════════════════════════════
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐
│Azure Firewall│ │ VPN/Express │ │Azure Bastion │ │ DNS │
│(10.0.1.0/26) │ │ Route GW │ │(10.0.3.0/26) │ │ Forwarder │
│ │ │(10.0.2.0/27) │ │ │ │ │
│ • Egress │ │ │ │ • Admin │ │ • Private │
│ filtering │ │ • On-prem │ │ access to │ │ DNS │
│ • FQDN rules │ │ connect │ │ jump box │ │ Zones │
└──────┬───────┘ └──────────────┘ └──────────────┘ └───────────┘
│ UDR (0.0.0.0/0)
│
═══════════════════════════╤══════════════════════════════════════════
│ VNET Peering
═══════════════════════════╧══════════════════════════════════════════
SPOKE VNET (10.100.0.0/16)
═══════════════════════════════════════════════════════════════════════
│
┌────────────────────────────────────────────────────────────────┐
│ ARO CLUSTER (Private, No Public IP) │
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Control Plane │ │ Worker Node Subnet (10.100.2.0/23) │ │
│ │ Subnet │ │ │ │
│ │ (10.100.0.0/23) │ │ ┌───────────┐ ┌───────────────┐ │ │
│ │ │ │ │ Workers │ │ Infra Nodes │ │ │
│ │ • 3x Control │ │ │ (general) │ │ (post-install)│ │ │
│ │ Plane Nodes │ │ │ │ │ │ │ │
│ │ • API Server │ │ │ • App │ │ • Router(s) │ │ │
│ │ (Private IP) │ │ │ pods │ │ • Prometheus │ │ │
│ │ │ │ │ │ │ • Logging │ │ │
│ │ │ │ │ │ │ • Registry │ │ │
│ └─────────────────┘ │ └───────────┘ └───────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Default IngressController │ │ │
│ │ │ Internal LB (10.100.2.x) │ │ │
│ │ │ → internal apps only │ │ │
│ │ └─────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Additional IngressController│ │ │
│ │ │ Public LB or Internal LB │ │ │
│ │ │ → internet-facing apps │ │ │
│ │ └─────────────────────────────┘ │ │
│ └─────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────────────────────────┐
│ App Gateway Subnet (10.100.4.0/24) — OPTIONAL │
│ ┌──────────────────────────────────────┐ │
│ │ Azure Application Gateway + WAF v2 │ │
│ │ • Backend pool → Internal LB IP │ │
│ │ • WAF rules (OWASP, bot, rate limit) │ │
│ │ • Path-based routing to services │ │
│ └──────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────────────────────────┐
│ Private Endpoints Subnet (10.100.5.0/24) │
│ • Key Vault • ACR • Storage Account │
│ • Log Analytics • SQL/Cosmos • Service Bus │
└────────────────────────────────────────────────────────────────┘
│
┌────────────────────────────────────────────────────────────────┐
│ Jump Box Subnet (10.100.6.0/27) │
│ • Admin VM (RHEL 9, oc/az/terraform/helm CLI) │
│ • Accessed via Azure Bastion from Hub │
└────────────────────────────────────────────────────────────────┘
Traffic Flows
External-facing apps — Two options depending on security requirements:
Option 1: Additional IngressController with Public LB (simpler, no Front Door)
Internet → Public DNS → Public LB (Additional IngressController)
→ OpenShift Router (external) → external-app pods
- Simplest approach — no extra Azure services required
- Additional IngressController creates its own Azure Public Load Balancer
- TLS terminated at the Router (edge or passthrough)
- Add Azure DDoS Protection Standard on the VNET for DDoS mitigation
- Suitable when WAF is not required or handled at application level
Option 2: Front Door + App Gateway (enterprise-grade, WAF at edge)
Internet → Azure Front Door (WAF, TLS, DDoS) → Private Link → App Gateway (WAF v2)
→ Internal LB (Additional IngressController) → OpenShift Router → external-app pods
- Azure Front Door provides global L7 load balancing, DDoS protection, and edge WAF
- Application Gateway provides regional WAF with OWASP rules and URL-based routing
- Both the additional IngressController and default IngressController use Internal LBs in this option
- End-to-end TLS: Front Door → App Gateway → Router → pod (re-encrypt or passthrough)
- Recommended for apps requiring WAF, global distribution, or regulatory-mandated edge security
Internal-only apps (e.g., staff portals, back-office systems):
Corporate network → ExpressRoute/VPN → Hub VNET → Peering → Spoke VNET
→ Internal LB (Default IngressController) → OpenShift Router → internal-app pods
- No internet exposure — only reachable from corporate network
- OpenShift Routes with host: staff.internal.example.com route to the correct service
- DNS: internal apps resolve via Azure Private DNS Zones linked to corporate DNS
Admin / Developer access:
Admin laptop → Azure Bastion → Jump Box VM → oc login https://api.aro.example.com:6443
- API server has no public IP — only accessible from within the VNET
- Developers can also use Azure VPN Point-to-Site for oc CLI access
Choosing an Internet Exposure Option
| Criteria | Option 1: Public LB | Option 2: Front Door + App Gateway |
|---|---|---|
| Complexity | Low | High |
| Cost | Low (LB only) | High (Front Door + App Gateway) |
| WAF | No (unless added separately) | Yes (edge + regional WAF) |
| DDoS protection | Azure DDoS Protection Standard | Built into Front Door |
| Global load balancing | No (single region) | Yes (multi-region) |
| TLS offloading layers | 1 (Router) | 3 (Front Door → App Gateway → Router) |
| Suitable for | Internal APIs, B2B, limited internet exposure | Customer-facing portals, mobile apps, regulatory-mandated WAF |
Recommendation for regulated environments: Start with Option 1 (Public LB) for B2B APIs and services with limited internet exposure. Use Option 2 (Front Door + App Gateway) for customer-facing apps that require WAF and DDoS protection.
Separating Internal and External Traffic
By default, ARO creates a single default IngressController (router) with one Internal Load Balancer (on a private cluster). To separate internal and external traffic, deploy an additional IngressController with its own domain and Load Balancer.
| IngressController | Domain | Load Balancer | Serves |
|---|---|---|---|
| default | *.apps.internal.example.com | Internal LB (Private IP) | Internal apps, corporate access only |
| external | *.apps.example.com | Public LB (for Option 1) or Internal LB fronted by App Gateway (for Option 2) | Internet-facing apps |
# Example: additional IngressController for external apps (Option 1: Public LB)
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: external
  namespace: openshift-ingress-operator
spec:
  domain: apps.example.com
  replicas: 2
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External  # creates a Public Azure LB
  routeSelector:
    matchLabels:
      exposure: external
  nodePlacement:  # optional: place on infra nodes if created
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
    tolerations:
      - key: node-role.kubernetes.io/infra
        effect: NoSchedule
# Example: additional IngressController for external apps (Option 2: Internal LB behind App Gateway)
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: external
  namespace: openshift-ingress-operator
spec:
  domain: apps.example.com
  replicas: 2
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal  # Internal LB; App Gateway forwards to this
  routeSelector:
    matchLabels:
      exposure: external
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
    tolerations:
      - key: node-role.kubernetes.io/infra
        effect: NoSchedule
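Application teams label their Routes to select the appropriate IngressController. Note that an IngressController without a routeSelector admits every Route, including labelled ones; to keep external Routes off the internal router, the default IngressController also needs a selector that excludes the exposure: external label.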
# External-facing route
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: customer-api
  labels:
    exposure: external  # picked up by 'external' IngressController
spec:
  host: api.apps.example.com
  to:
    kind: Service
    name: customer-api-svc
  tls:
    termination: reencrypt
---
# Internal-only route (no exposure label → handled by 'default' IngressController)
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: staff-portal
spec:
  host: staff.apps.internal.example.com
  to:
    kind: Service
    name: staff-portal-svc
  tls:
    termination: edge
When to Use a Custom Ingress Controller
The default OpenShift Router (HAProxy-based IngressController) handles the vast majority of use cases. However, there are scenarios where deploying a custom ingress controller (NGINX, Traefik, Kong, etc.) alongside or instead of the default router is warranted:
| Scenario | Why Custom Ingress | Recommended Controller |
|---|---|---|
| gRPC / HTTP/2 full support | Default HAProxy router has limited gRPC streaming support; custom controllers provide native gRPC load balancing and header-based routing | NGINX Ingress, Envoy (via Gateway API) |
| Advanced traffic management | Canary releases with weighted traffic splitting (e.g., 90/10), A/B testing by header/cookie, circuit breaking, retry policies | Traefik, NGINX Ingress, Istio Gateway |
| API Gateway features | Rate limiting per API key, OAuth2/JWT validation at the ingress level, request/response transformation, API versioning | Kong Ingress, APISIX |
| Multi-tenant isolation | Dedicated ingress per tenant with independent TLS, rate limits, and WAF policies, beyond what routeSelector offers | NGINX Ingress (one instance per tenant namespace) |
| Kubernetes Gateway API | Organization wants to adopt the newer Gateway API standard instead of OpenShift Routes or Ingress resources | Envoy Gateway, Istio, Traefik |
| TCP/UDP passthrough | Non-HTTP protocols (databases, MQTT, custom TCP) that need L4 load balancing directly into the cluster | NGINX Ingress (TCP/UDP ConfigMap), MetalLB (bare-metal only; not applicable to ARO) |
| Mutual TLS (mTLS) at ingress | Client certificate authentication required at the edge (e.g., B2B API, mutual authentication compliance) | NGINX Ingress, Istio IngressGateway |
When NOT to use a custom ingress controller:
- Standard HTTP/HTTPS routing with path/host-based rules → default Router handles this
- TLS termination (edge, re-encrypt, passthrough) → default Router supports all modes
- Internal-only services → default Router with Internal LB is sufficient
- Basic rate limiting → use Azure Front Door or Application Gateway WAF instead
- If the only reason is “familiarity with NGINX” → the default Router works the same way via Routes
Deployment considerations for custom ingress on ARO:
- Custom ingress controllers run as regular pods on worker (or infra) nodes
- They create their own Azure Load Balancer (Internal or External) — on a private cluster, use Internal LB
- ARO’s default Router remains active and manages OpenShift console and OAuth routes — do not disable it
- Resource overhead: each custom ingress controller consumes CPU/memory; avoid deploying multiple controllers unless justified
- If using both the default Router and a custom controller, clearly partition which domains/routes each handles to avoid conflicts
3. Access Control & Identity
Authentication
| Method | Description | Recommendation |
|---|---|---|
| Microsoft Entra ID (Azure AD) | OIDC integration with Entra ID | Primary recommendation: SSO, MFA, Conditional Access |
| OpenShift built-in (htpasswd) | Local user/password store | Break-glass only: emergency admin access |
| LDAP | Direct LDAP/AD bind | Alternative if Entra ID OIDC is not feasible |
| kubeadmin | Default cluster admin account | Disable after Entra ID is configured, or restrict and store credentials securely |
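The Entra ID integration in the first row is configured on the cluster OAuth resource as an OpenID identity provider. The sketch below is one plausible configuration, not a definitive one: the provider name, the secret name openid-client-secret (which would live in the openshift-config namespace), and the angle-bracket placeholders are assumptions to be replaced with values from your own app registration.

```yaml
# Sketch only: provider name, secret name, and <placeholders> are assumptions.
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: EntraID
      mappingMethod: claim
      type: OpenID
      openID:
        clientID: <application-client-id>
        clientSecret:
          name: openid-client-secret   # Secret in openshift-config
        issuer: https://login.microsoftonline.com/<tenant-id>/v2.0
        extraScopes:
          - email
          - profile
        claims:
          preferredUsername:
            - preferred_username
          name:
            - name
          email:
            - email
          groups:
            - groups                   # requires the groups claim in the token
```

If an htpasswd break-glass provider is also needed, it is added as a second entry in the same identityProviders list rather than as a separate resource.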
Authorization (RBAC)
| Level | Mechanism | Notes |
|---|---|---|
| Cluster RBAC | OpenShift ClusterRoles & ClusterRoleBindings | Map Entra ID groups to OpenShift roles |
| Project/Namespace RBAC | OpenShift Roles & RoleBindings | Per-project access control |
| Azure RBAC | Azure role assignments on ARO resource | Controls who can manage the ARO resource in Azure (not in-cluster access) |
Recommended RBAC Design
| Role | Entra ID Group | OpenShift Role | Description |
|---|---|---|---|
| Platform Admin | SG-ARO-PlatformAdmin | cluster-admin | Full cluster access (limited members) |
| Platform Operator | SG-ARO-PlatformOps | Custom platform-operator | Manage nodes, storage, monitoring; no app access |
| Application Admin | SG-ARO-AppTeam-<name> | admin (namespace-scoped) | Full control within assigned namespaces |
| Developer | SG-ARO-Dev-<name> | edit (namespace-scoped) | Deploy and manage apps in assigned namespaces |
| Viewer | SG-ARO-Viewer | view (namespace-scoped) | Read-only access for audit/compliance |
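The group-to-role mapping in this design can be expressed as ordinary Kubernetes bindings. The sketch below covers two of the rows, assuming a hypothetical namespace payments and a hypothetical team suffix. Depending on how the groups claim is configured in Entra ID, the group name OpenShift sees may be the group's object ID rather than its display name, so the subject names here are illustrative.

```yaml
# Sketch only: namespace and group names are hypothetical; the group
# name must match what the identity provider actually emits.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-admins
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: SG-ARO-PlatformAdmin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-developers
  namespace: payments
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: SG-ARO-Dev-payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit          # namespace-scoped because it is bound via a RoleBinding
```

Binding the built-in edit ClusterRole through a RoleBinding keeps the permission set standard while limiting it to one namespace.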
Break-Glass Account
- Create a dedicated break-glass htpasswd user with cluster-admin role
- Store credentials in Azure Key Vault with access auditing enabled
- Use only when Entra ID is unavailable
- Monitor usage via OpenShift audit logs — alert on any break-glass login
Service Accounts & Workload Identity
- Use OpenShift service accounts for pod-to-pod and pod-to-API communication
- Azure Workload Identity for pods that need to access Azure services (Key Vault, Storage, SQL, etc.) without storing credentials:
  - Federated identity credential: links a Kubernetes service account to an Azure Managed Identity
  - Pod mounts a projected service account token and exchanges it for a Microsoft Entra ID token via OIDC
  - Eliminates stored secrets for Azure service access, which is critical for enterprise security posture
- Configure per application namespace: one Managed Identity per application team
- Use User-Assigned Managed Identities (not system-assigned) for better lifecycle management
4. Scalability & High Availability
Multi-AZ Deployment
- Deploy ARO across 3 Availability Zones (where supported) for both control plane and worker nodes
- Control plane nodes are automatically distributed across AZs by ARO
- Worker nodes: specify --worker-count and ensure there is one MachineSet per AZ for even distribution
Node Sizing & Types
| Node Type | Suggested VM Size | Count | Notes |
|---|---|---|---|
| Control plane | Standard_D8s_v3 (8 vCPU, 32 GiB) | 3 | Managed by ARO; not configurable post-creation. SRE will resize if overutilized, so keep 2x vCPU quota available |
| Worker (general) | Standard_D16s_v5 (16 vCPU, 64 GiB) | 6-18 | Application workloads. Size based on enterprise reference (16 vCPU, 64 GB per node) |
| Worker (infra) | Standard_D8s_v5 (8 vCPU, 32 GiB) | 3 | Not created by default; must create post-install via new MachineSets. Hosts router, monitoring, registry. Label as node-role.kubernetes.io/infra to avoid OCP subscription costs |
| Worker (infra-logging) | Standard_E8s_v5 (8 vCPU, 64 GiB) | 3 | Not created by default; optional dedicated nodes for logging stack (Loki/EFK) if high log volume. Memory-optimized |
| Worker (GPU) | Standard_NC series | As needed | For AI/ML workloads |
Reference sizing from enterprise deployment:
- Non-Prod: ~230 vCPU, ~456 GB memory → 18 worker nodes (D16, 16 vCPU, 64 GB) + 3 infra + 3 infra-logging
- Prod: ~125 vCPU, ~403 GB memory → 12 worker nodes + dedicated DB workers + 3 infra + 3 infra-logging
- Add 1 extra worker per AZ (3 total) as failover capacity
Infrastructure Nodes
Important: ARO does not create infrastructure nodes by default. All worker nodes created at provisioning time are general-purpose workers. Infrastructure nodes must be manually created post-installation by creating new MachineSets with the infra label and taints.
Infrastructure nodes run platform services and do not count toward OpenShift subscription costs:
| Component | Move to Infra Nodes | Notes |
|---|---|---|
| OpenShift Router (HAProxy) | Recommended | Ingress controller; runs on worker nodes by default |
| Prometheus / AlertManager | Recommended | Platform monitoring; runs on worker nodes by default |
| Grafana | Recommended | Dashboards |
| Loki / Elasticsearch | Recommended (or dedicated infra-logging) | Log aggregation; memory intensive |
| OpenShift Image Registry | Recommended | Internal registry; runs on worker nodes by default |
Steps to create infrastructure nodes:
- Create new MachineSets (one per AZ) whose machine template carries the infra role label and taint, so that every node they provision is labelled and tainted automatically:
```yaml
# MachineSet fragment (other required fields omitted)
spec:
  template:
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/infra: ""
      taints:
        - key: node-role.kubernetes.io/infra
          effect: NoSchedule
```
- For a node that was provisioned without the label, label it manually: oc label node <node> node-role.kubernetes.io/infra=
- Likewise, apply the taint manually to prevent application pods from scheduling: oc adm taint nodes <node> node-role.kubernetes.io/infra:NoSchedule
- Move platform components (router, monitoring, registry, logging) to infra nodes by updating their operator configs with nodeSelector and tolerations
- Verify no OCP subscription cost applies: confirm with Red Hat support that nodes are correctly labelled
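The "move platform components" step can be sketched for two of the components. The first document sets node placement on the default IngressController, and the second places Prometheus through the cluster monitoring ConfigMap. Both are illustrations under the assumption that infra nodes carry the label and taint described above; which platform settings may be changed on a managed ARO cluster should be checked against the ARO support policies.

```yaml
# Sketch only: desired-state fragments, typically applied with oc patch or oc edit.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
    tolerations:
      - key: node-role.kubernetes.io/infra
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
        - key: node-role.kubernetes.io/infra
          operator: Exists
          effect: NoSchedule
```

Other monitoring components such as alertmanagerMain accept the same nodeSelector and tolerations keys in config.yaml.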
Auto-Scaling
| Type | Mechanism | Notes |
|---|---|---|
| Cluster Autoscaler | Automatically adds/removes worker nodes based on pending pods | Configure min/max per MachineSet |
| Horizontal Pod Autoscaler (HPA) | Scales pod replicas based on CPU/memory metrics | Per-deployment configuration |
| Vertical Pod Autoscaler (VPA) | Adjusts pod resource requests based on actual usage | Use in recommendation mode first |
| KEDA (Event-Driven Autoscaling) | Scales based on external event sources (queue length, Kafka lag, Prometheus metrics, cron schedules) | Install via OperatorHub; useful for enterprise batch processing and message-driven workloads |
| Machine Health Check | Automatically replaces unhealthy nodes | Configure for each MachineSet |
Autoscaler Recommendations
# Example: Cluster Autoscaler
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 30
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    unneededTime: 5m
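The ClusterAutoscaler only sets cluster-wide behaviour; the per-MachineSet minimum and maximum are defined by a MachineAutoscaler for each MachineSet. A minimal sketch for one zone follows. The MachineSet name is a placeholder (list the real names with oc get machineset -n openshift-machine-api), and the replica bounds are illustrative.

```yaml
# Sketch only: MachineSet name and replica bounds are placeholders.
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-zone1
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <worker-machineset-zone1>
```

Creating one MachineAutoscaler per AZ keeps scale-out balanced across zones, and the sum of the maxReplicas values should stay within maxNodesTotal.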
Capacity Planning
- Reserve 10-20% headroom for burst workloads
- Set resource requests and limits on all workloads; the Cluster Autoscaler acts on pending pods, and whether a pod is pending depends on its resource requests
- Use PodDisruptionBudgets (PDB) for critical workloads during node scale-down, upgrades, or maintenance
  - Configure minAvailable or maxUnavailable to ensure service continuity
  - Avoid overly aggressive PDBs that block node drains during upgrades
- Apply ResourceQuotas per namespace to prevent any single team from consuming excessive cluster resources
- Apply LimitRanges per namespace to set default requests/limits for pods that don't specify them
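The three guardrails in this list map to three small resources. The sketch below assumes a hypothetical namespace payments and a hypothetical app label customer-api; the numeric values are illustrative starting points rather than sizing guidance.

```yaml
# Sketch only: namespace, labels, and quantities are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: customer-api-pdb
  namespace: payments
spec:
  maxUnavailable: 1          # still lets node drains proceed one pod at a time
  selector:
    matchLabels:
      app: customer-api
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

Using maxUnavailable rather than a minAvailable equal to the replica count avoids the "overly aggressive PDB" case, in which a drain can never evict a pod.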
Pod Health & Self-Healing
- Configure liveness probes on all containers; the kubelet automatically restarts a container whose liveness probe fails
- Configure readiness probes, which prevent traffic from being routed to pods not yet ready to serve
- Configure startup probes for slow-starting applications (e.g., Java/Spring Boot) to avoid premature restarts
- ARO automatically repairs unhealthy nodes via Machine Health Checks
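The three probe types can be combined on a single container. The sketch below assumes a hypothetical application listening on port 8080 with /healthz and /ready endpoints; the image reference, paths, and timings are placeholders to adapt per workload.

```yaml
# Sketch only: image, port, paths, and timings are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staff-portal
spec:
  replicas: 2
  selector:
    matchLabels:
      app: staff-portal
  template:
    metadata:
      labels:
        app: staff-portal
    spec:
      containers:
        - name: app
          image: contosoacr.azurecr.io/staff-portal:1.0.0
          ports:
            - containerPort: 8080
          startupProbe:              # allows up to 30 x 10s = 300s to start
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:             # container is restarted after 3 failures
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:            # pod is removed from endpoints while failing
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```

Liveness and readiness checks do not begin until the startup probe has succeeded, which is what protects slow-starting applications from premature restarts.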
5. Storage Integration
Azure Storage Options
| Storage Type | Azure Service | CSI Driver | Access Mode | Use Case |
|---|---|---|---|---|
| Block storage | Azure Managed Disks (Premium SSD, Ultra) | disk.csi.azure.com | RWO | Databases, stateful apps |
| File storage | Azure Files (Premium) | file.csi.azure.com | RWX | Shared config, CMS, logs |
| Blob storage | Azure Blob Storage | blob.csi.azure.com | RWX (via NFS/FUSE) | Large unstructured data, ML datasets |
| Object storage | Azure Blob (S3-compatible via MinIO) | N/A | API-based | Backups (Velero), image registry |
StorageClass Configuration
| StorageClass | Provisioner | Reclaim Policy | Volume Binding | Notes |
|---|---|---|---|---|
| managed-premium (default) | disk.csi.azure.com | Delete | WaitForFirstConsumer | Premium SSD, recommended. On recent ARO versions the default CSI-backed class is typically named managed-csi; confirm with oc get storageclass |
| managed-csi-encrypted | disk.csi.azure.com | Retain | WaitForFirstConsumer | With customer-managed encryption key |
| azurefile-csi-premium | file.csi.azure.com | Delete | Immediate | For RWX workloads |
Recommendations for Regulated / Enterprise Environments
- Use Premium SSD v2 or Ultra Disk for database workloads requiring high IOPS
- Enable Customer-Managed Keys (CMK) for disk encryption via Azure Key Vault
- Set reclaim policy to Retain for critical data volumes
- Use Azure Files Premium with private endpoints for shared storage
- For backup storage: Azure Blob with immutable storage for compliance
- Private Endpoints for all storage accounts — no public access
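The CMK and Retain recommendations can be combined in a custom StorageClass for the Azure Disk CSI driver. This is a minimal sketch, assuming a Disk Encryption Set already exists and that the cluster's identity has been granted access to it; the resource ID segments in angle brackets are placeholders.

```yaml
# Sketch only: the Disk Encryption Set resource ID is a placeholder.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-encrypted
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  diskEncryptionSetID: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/diskEncryptionSets/<des-name>
reclaimPolicy: Retain                    # volume survives PVC deletion
volumeBindingMode: WaitForFirstConsumer  # disk is created in the pod's zone
allowVolumeExpansion: true
```

With Retain, deleting a PersistentVolumeClaim leaves both the PersistentVolume and the Azure disk in place, so cleanup becomes an explicit operational step.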
Internal Image Registry
- ARO’s built-in image registry uses Azure Blob Storage by default
- For regulated environments: configure to use a dedicated storage account with private endpoint and CMK encryption
6. Security
Data Encryption
| Layer | Mechanism | Notes |
|---|---|---|
| etcd encryption | AES-CBC encryption at rest | According to the ARO FAQ, etcd-layer encryption is not enabled by default on ARO 4.x and must be turned on through the cluster APIServer resource; the underlying node disks are still encrypted at rest by Azure |
| Persistent volume encryption | Azure Managed Disk encryption (SSE) | Default: platform-managed key. Recommend: Customer-Managed Key (CMK) via Key Vault |
| Azure Files encryption | SSE with CMK | Configure via storage account encryption settings |
| In-transit encryption | TLS 1.2+ for all API and ingress traffic | Default; enforce via ingress controller config |
| Image registry | Blob storage encryption with CMK | Configure dedicated storage account |
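The etcd encryption setting lives on the cluster-scoped APIServer resource. The sketch below shows the AES-CBC setting, which can be used either to enable encryption or to compare against the current state returned by oc get apiserver cluster -o yaml.

```yaml
# Sketch: etcd encryption type on the cluster APIServer resource.
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aescbc   # encrypts Secrets, ConfigMaps, Routes, and OAuth tokens in etcd
```

After the change, existing resources are re-encrypted in the background, which can take some time on a large cluster.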
Communication Encryption
| Communication Path | Encryption | Notes |
|---|---|---|
| Client → Ingress | TLS 1.2+ (enterprise cert) | Configure custom TLS certificate on ingress controller |
| Ingress → Pod | TLS (optional) | Enable re-encryption or passthrough routes |
| Pod → Pod | mTLS via Service Mesh (optional) | Deploy OpenShift Service Mesh for zero-trust networking |
| API client → API server | TLS 1.2+ | Custom certificate recommended |
| Node → Control plane | TLS (managed) | Handled by ARO |
Secret Management
| Approach | Description | Recommendation |
|---|---|---|
| OpenShift Secrets | Base64-encoded in etcd (encrypted at rest) | Acceptable for non-sensitive config |
| Azure Key Vault CSI Provider | Mount Key Vault secrets as volumes in pods | Primary recommendation: secrets never stored in etcd |
| External Secrets Operator | Sync secrets from Key Vault to OpenShift Secrets | Alternative if CSI mount is not suitable |
| Sealed Secrets | GitOps-friendly encrypted secrets | For secrets managed in Git |
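For the Key Vault CSI approach, each namespace declares which vault objects to mount through a SecretProviderClass. The following is a sketch under several assumptions: the Secrets Store CSI driver and its Azure provider are already installed, authentication uses a workload identity whose client ID is supplied, and the vault name, object name, and namespace are hypothetical.

```yaml
# Sketch only: vault, object, namespace, and <placeholders> are hypothetical.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payments-keyvault
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: <managed-identity-client-id>   # identity used via workload identity
    keyvaultName: kv-payments-prod
    tenantId: <tenant-id>
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
```

A pod consumes it through a csi volume with driver secrets-store.csi.k8s.io and a volumeAttributes.secretProviderClass entry naming this resource; the secret then appears as a file and is not written to etcd.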
Governance & Admission Control
| Control | Implementation | Notes |
| --- | --- | --- |
| Azure Policy | Azure Policy for ARO/AKS (limited preview) | Enforce Azure-level governance on cluster resources |
| OPA Gatekeeper / ConstraintTemplates | Install via OperatorHub | Enforce custom admission policies (e.g., deny privileged containers, enforce labels, restrict host paths) |
| Container Image Governance | Allowed registries policy | Only permit images from authorized registries (ACR, registry.redhat.io); deny pulls from Docker Hub or unapproved sources |
| Resource label enforcement | Gatekeeper constraint | Require cost-center and owner labels on all namespaces |
| Namespace isolation | Gatekeeper + Network Policies | Prevent cross-namespace resource access |
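The allowed-registries rule can be prototyped outside the cluster before it is encoded as a Gatekeeper constraint. The sketch below is plain Python, not Rego, and only illustrates the decision logic. The ACR host `example.azurecr.io` is a hypothetical name.

```python
# Registries permitted by policy; the ACR host is a hypothetical example.
ALLOWED_REGISTRIES = ("registry.redhat.io", "example.azurecr.io")


def registry_of(image: str) -> str:
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, sep, _ = image.partition("/")
    # The first path component is a registry only if it looks like a host name.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_allowed(image: str, allowed=ALLOWED_REGISTRIES) -> bool:
    """True when the image is pulled from one of the authorized registries."""
    return registry_of(image) in allowed


if __name__ == "__main__":
    for ref in ("registry.redhat.io/ubi9/ubi:latest", "nginx:1.25"):
        print(ref, "->", "allow" if is_allowed(ref) else "deny")
```

Unqualified references such as `nginx:1.25` resolve to Docker Hub and are therefore denied, which matches the intent of the policy row above.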
Additional Security Controls
| Control | Implementation | Notes |
| --- | --- | --- |
| FIPS compliance | --fips flag at cluster creation | Required for regulatory compliance; cannot be changed after creation |
| Pod Security | Pod Security Admission (PSA) / Security Context Constraints (SCC) | Enforce restricted SCC by default; only elevate for verified workloads |
| Network Policies | OVN-Kubernetes NetworkPolicy | Enforce micro-segmentation between namespaces; highly recommended for regulated environments |
| Image security | Red Hat Quay + Clair scanning, or Microsoft Defender for Containers | Scan all images before deployment; enforce image signing with Cosign/Sigstore |
| Vulnerability scanning | Microsoft Defender for Containers | Enable on the ARO cluster |
| Compliance scanning | OpenShift Compliance Operator | CIS benchmark profile, daily scan at 3 AM, 7-day report retention, auto-remediation disabled initially |
| Advanced Cluster Security (ACS) | Red Hat ACS (StackRox) | Runtime threat detection, network segmentation visibility, vulnerability management |
| Audit logging | OpenShift API audit logs | Forward to Azure Log Analytics for retention and alerting |
| Confidential Containers | OpenShift Sandboxed Containers (Kata) | GA since Nov 2025; secure enclave isolation for sensitive workloads |
| NSG / Private Link | ARO-managed | Do not modify NSGs or remove Private Link; both are required for SRE access |
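The compliance scanning row (CIS profile, daily at 3 AM, 7-day retention, no auto-remediation) could be expressed with the Compliance Operator's ScanSetting and ScanSettingBinding resources. This is a sketch under assumptions: the resource names are hypothetical, `ocp4-cis` is assumed to be the desired profile, and `rotation: 7` is used to approximate seven days of retention for a daily scan.

```python
import json

API = "compliance.openshift.io/v1alpha1"
NAMESPACE = "openshift-compliance"


def cis_scan_setting(name="daily-cis"):
    """ScanSetting: daily scan at 03:00, keep 7 raw results, never auto-remediate."""
    return {
        "apiVersion": API,
        "kind": "ScanSetting",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "schedule": "0 3 * * *",  # cron: every day at 03:00
        "rawResultStorage": {"rotation": 7, "size": "1Gi"},
        "autoApplyRemediations": False,
        "roles": ["worker", "master"],
    }


def cis_binding(setting_name="daily-cis", profile="ocp4-cis"):
    """ScanSettingBinding: attach the CIS profile to the ScanSetting above."""
    return {
        "apiVersion": API,
        "kind": "ScanSettingBinding",
        "metadata": {"name": f"{profile}-binding", "namespace": NAMESPACE},
        "profiles": [{"name": profile, "kind": "Profile", "apiGroup": API}],
        "settingsRef": {"name": setting_name, "kind": "ScanSetting", "apiGroup": API},
    }


if __name__ == "__main__":
    print(json.dumps([cis_scan_setting(), cis_binding()], indent=2))
```

Keeping `autoApplyRemediations` false means findings are reported for review first, in line with the "disabled initially" guidance.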
Compliance Certifications
ARO inherits Azure compliance certifications:
| Certification | Status |
| --- | --- |
| SOC 2 Type 2 | Yes |
| SOC 3 | Yes |
| ISO 27001 / 27017 / 27018 | Yes |
| PCI DSS | Yes (via Azure) |
| HIPAA | Yes |
| FedRAMP High | Yes |
Break-Glass Account (Security)
- Separate from regular admin access
- Stored in Azure Key Vault with:
  - Access logging enabled
  - Alerts on secret read events
  - Rotation policy (quarterly)
- Only used when Entra ID / OIDC is down
- Documented procedure for break-glass use
7. Logging & Monitoring
Monitoring Architecture
| Layer | Tool | Data Collected |
| --- | --- | --- |
| Platform health | ARO SRE (Microsoft + Red Hat Geneva) | Cluster health, node status, API availability; automatic, no user config needed |
| Cluster metrics | Built-in Prometheus + Grafana | CPU, memory, pod metrics, etcd, API server latency |
| Azure-level monitoring | Azure Monitor Container Insights | Node/pod performance, container logs, Kubernetes events |
| Application metrics | User Workload Monitoring (Prometheus) | Custom application metrics via ServiceMonitor |
| Resource Health | Azure Resource Health alerts | Cluster maintenance events, API unreachable alerts |
Logging Architecture
| Log Source | Default Destination | Recommended Integration |
| --- | --- | --- |
| Container stdout/stderr | Cluster logging (Loki/Elasticsearch) | Forward to Azure Log Analytics via OpenShift Cluster Logging + Azure plugin |
| Audit logs (API server) | Local storage | Forward to Azure Log Analytics; critical for compliance |
| Infrastructure logs | Cluster logging | Forward to Azure Log Analytics |
| Security logs (OAuth, SCC violations) | Cluster logging | Forward to Microsoft Sentinel (formerly Azure Sentinel) for SIEM |
| Node logs (journald) | Local node | Forward to Azure Log Analytics |
Recommended Data Export
| Data Type | Export To | Retention | Notes |
| --- | --- | --- | --- |
| Audit logs | Azure Log Analytics | 365 days (regulatory) | API audit events, authentication events |
| Container logs | Azure Log Analytics | 90 days hot, archive to Blob | Application logs, error tracking |
| Platform metrics | Azure Monitor Metrics | 93 days (default) | CPU, memory, network metrics |
| Security events | Microsoft Sentinel (formerly Azure Sentinel) | 365 days | OAuth events, policy violations, SCC violations |
| Alert history | Azure Monitor Alerts | 30 days (default) | Extend via Action Groups + Log Analytics |
Logging Storage Sizing (Reference)
Reference logging storage sizing:
| Component | Storage | Notes |
| --- | --- | --- |
| Loki log storage | 3x 500 GB disks on infra-logging nodes | Adjust based on log volume |
| Loki S3/Blob backend | 1.5 TB | Long-term log storage |
| Prometheus PVC | 200 GB | Metrics retention |
| Thanos Ruler PVC | 200 GB | Multi-cluster metrics |
| Metrics retention | 90 days | Configurable |
| Log retention | 90 days (hot), archive to Blob | Regulatory: audit logs 365 days |
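Because the reference figures should be adjusted to actual log volume, a rough capacity estimate helps check whether 1.5 TB is enough. The sketch below is simple arithmetic: the 10 GB/day ingest rate and the 20% headroom are hypothetical inputs, not measurements from this design.

```python
def required_log_storage_gb(daily_ingest_gb, retention_days,
                            replication_factor=1, headroom=0.2):
    """Estimate log storage: ingest x retention x replicas, plus free-space headroom."""
    if daily_ingest_gb < 0 or retention_days < 0:
        raise ValueError("ingest and retention must be non-negative")
    raw_gb = daily_ingest_gb * retention_days * replication_factor
    return raw_gb * (1 + headroom)


if __name__ == "__main__":
    # Assumed 10 GB/day kept for the 90-day hot retention from the table above.
    needed = required_log_storage_gb(daily_ingest_gb=10, retention_days=90)
    print(f"Estimated hot log storage: {needed:.0f} GB (reference backend: 1500 GB)")
```

With these assumed inputs the estimate is 10 × 90 × 1.2 = 1080 GB, which fits inside the 1.5 TB reference backend; doubling the ingest rate would not.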
Network Observability
- Enable OVN-Kubernetes flow logging for network traffic visibility between pods and namespaces
- Use OpenShift Network Observability Operator (eBPF-based) to collect flow logs without sidecar overhead
- Forward network flow data to Loki for querying and Grafana for dashboards
- Key use cases for regulated environments: detect unexpected cross-namespace traffic, identify external communication patterns, audit network policy effectiveness
Recommended Alerts
| Alert | Condition | Severity |
| --- | --- | --- |
| Node not ready | Node status != Ready for > 5 min | Critical |
| Pod crash loop | RestartCount > 5 in 10 min | High |
| etcd leader changes | > 3 leader changes in 1 hour | Critical |
| API server latency | p99 > 1s for > 5 min | High |
| PV usage | > 85% capacity | Warning |
| Certificate expiry | < 30 days to expiry | Warning |
| Break-glass login | Any htpasswd admin login | Critical |
| Cluster maintenance | Azure Resource Health signal | Info |
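Two of the alert conditions above (node not ready, PV usage) map naturally onto Prometheus alerting rules. The sketch below builds a PrometheusRule manifest for them. The rule name and the `platform-alerts` namespace are hypothetical, and the PromQL expressions assume the standard kube-state-metrics and kubelet volume metrics are being scraped.

```python
import json


def alert(name, expr, duration, severity, summary):
    """One Prometheus alerting rule entry."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def platform_alert_rules(namespace="platform-alerts"):
    """PrometheusRule covering the node-readiness and PV-usage alerts."""
    rules = [
        alert(
            "NodeNotReady",
            'kube_node_status_condition{condition="Ready",status="true"} == 0',
            "5m", "critical", "Node has not been Ready for more than 5 minutes",
        ),
        alert(
            "PersistentVolumeUsageHigh",
            "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85",
            "10m", "warning", "Persistent volume is above 85% capacity",
        ),
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "platform-alerts", "namespace": namespace},
        "spec": {"groups": [{"name": "platform.rules", "rules": rules}]},
    }


if __name__ == "__main__":
    print(json.dumps(platform_alert_rules(), indent=2))
```

The 10-minute hold on the PV alert is an assumption added to avoid flapping; the table itself only specifies the 85% threshold.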
8. Disaster Recovery & Backup
DR Strategy
| Scenario | Strategy | RPO | RTO |
| --- | --- | --- | --- |
| Single AZ failure | Multi-AZ deployment (3 AZs) | 0 | Automatic failover |
| Full region failure | Active-Passive in secondary region | < 1 hour | 2-4 hours |
| Data corruption / accidental deletion | Backup and restore | < 1 hour | 1-2 hours |
| Cluster rebuild | Infrastructure-as-Code (Terraform) + GitOps | < 4 hours | 4-8 hours |
Backup Architecture
| Component | Backup Tool | Storage Target | Schedule | Retention |
| --- | --- | --- | --- | --- |
| Kubernetes resources (deployments, configmaps, secrets) | Velero + Azure Blob plugin | Azure Blob (RA-GRS) with immutable storage | Every 6 hours | 30 days |
| Persistent volumes | Velero CSI snapshots / Azure Disk snapshots | Azure Managed Disk snapshots | Daily | 30 days |
| etcd | ARO managed (SRE) | Automatic | Automatic | Managed by SRE |
| GitOps state (desired state) | Git repository | Azure DevOps / GitHub | Every commit | Indefinite |
| Container images | Azure Container Registry (geo-replicated) | ACR Premium with geo-replication | Continuous | Indefinite |
| Secrets | Azure Key Vault (soft delete + purge protection) | Key Vault | Continuous | 90 days (soft delete) |
DR Design — Active-Passive
Primary Region (e.g., Southeast Asia)        Secondary Region (e.g., East Asia)
┌───────────────────────┐                    ┌────────────────────────┐
│ ARO Cluster (Active)  │                    │ ARO Cluster (Standby)  │
│ - Multi-AZ            │                    │ - Minimal workers      │
│ - Full workload       │                    │ - Scale up on failover │
│                       │    Replication     │                        │
│ Azure Blob ───────────┼──── RA-GRS ───────►│ Azure Blob             │
│ ACR        ───────────┼──── Geo-rep ──────►│ ACR                    │
│ Key Vault  ───────────┼──── Backup ───────►│ Key Vault              │
└───────────────────────┘                    └────────────────────────┘
            │                                             │
            └───── Azure Front Door / Traffic Manager ────┘
OADP (OpenShift API for Data Protection)
OADP is the Red Hat-supported backup operator for OpenShift, based on Velero and installed from OperatorHub:
- Backs up Kubernetes resources and internal images at namespace granularity
- PV backup via CSI snapshots or Restic (file-level backup)
- Schedule: recommend daily at 4 AM during maintenance window
- Storage: Azure Blob with RA-GRS for cross-region durability
- Limitations: currently only supports Azure Managed Disk-based PVs for CSI snapshots
- For comprehensive DR beyond OADP, consider enterprise solutions: Veeam Kasten, Trilio, Portworx PX-Backup
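The daily 4 AM schedule could be declared as a Velero Schedule resource in the OADP namespace. The sketch below generates such a manifest. The schedule name and namespace list are hypothetical, and the 30-day TTL is borrowed from the retention column of the Backup Architecture table rather than stated for OADP specifically.

```python
import json


def oadp_schedule(name, namespaces, cron="0 4 * * *", ttl_days=30):
    """Velero Schedule: back up the given namespaces daily and expire after ttl_days."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "openshift-adp"},
        "spec": {
            "schedule": cron,  # cron: every day at 04:00
            "template": {
                "includedNamespaces": list(namespaces),
                "snapshotVolumes": True,
                # Velero expects a Go duration string, e.g. 720h0m0s for 30 days.
                "ttl": f"{ttl_days * 24}h0m0s",
            },
        },
    }


if __name__ == "__main__":
    # Namespace names are placeholders.
    print(json.dumps(oadp_schedule("daily-apps", ["payments-prod", "lending-prod"]), indent=2))
```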
Key DR Decisions
| Decision | Options | Recommendation |
| --- | --- | --- |
| DR topology | Active-Active / Active-Passive / Pilot Light | Active-Passive: cost-effective for most enterprises; Active-Active for mission-critical APIs |
| State management | Velero backup-restore / GitOps + DB replication | GitOps for stateless (rebuild from Git); Velero + DB replication for stateful |
| Failover trigger | Manual / Automated (Azure Front Door health probe) | Manual with automated detection; enterprises prefer controlled failover |
| DR testing | Quarterly / Bi-annually | Quarterly; regulatory requirement |
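"Manual with automated detection" means health probes raise an alarm but a person decides whether to fail over. The sketch below models that split as a small state tracker. The class name, the three-failure threshold, and the returned status strings are all hypothetical; real probe results would come from Azure Front Door or another monitor.

```python
from dataclasses import dataclass


@dataclass
class FailoverDetector:
    """Counts consecutive failed probes and pages an operator; never fails over itself."""
    threshold: int = 3
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> str:
        if healthy:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Detection is automated; the failover decision stays with a human.
            return "page-operator"
        return "degraded"


if __name__ == "__main__":
    detector = FailoverDetector()
    for probe in (True, False, False, False):
        print(detector.record(probe))
```

Requiring several consecutive failures before paging reduces false alarms from a single dropped probe.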
9. Day 2 Operations
Cluster Lifecycle
| Activity | Responsibility | Frequency |
| --- | --- | --- |
| Cluster upgrades (control plane + workers) | Customer-initiated: upgrades the entire cluster (control plane and workers together, cannot be separated). No rollback once started. ARO manages the rolling process. | Schedule maintenance window; test in non-prod first |
| Certificate rotation | ARO SRE (automatic) | Automatic |
| Node scaling | Customer (manual or autoscaler) | As needed |
| Operator updates | Customer | Review and approve in Operator Hub |
GitOps — Recommended for Enterprise
- Use OpenShift GitOps (ArgoCD) for declarative, auditable deployments
- All cluster configuration and application manifests stored in Git
- Changes require pull request review and approval
- Full audit trail for regulatory compliance
Cluster Bootstrapping
After cluster creation, a bootstrapping process prepares the cluster for workloads:
- Day 0 bootstrapping (via Terraform/IaC):
  - Identity provider (Entra ID) configuration
  - Infrastructure node MachineSets
  - Cluster Autoscaler and Machine Health Checks
  - Custom TLS certificates for ingress and API server
- Day 1 bootstrapping (via GitOps/ArgoCD):
  - Install operators (Logging, Monitoring, Compliance, ACS, OADP)
  - Create namespaces with quotas and network policies
  - Deploy ingress controllers and storage classes
  - Configure Gatekeeper constraints
- Validation: Run smoke tests to verify cluster health before onboarding workloads
Use GitOps to manage bootstrapping — this ensures new clusters (or DR rebuilds) reach operational state automatically.
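One way to have new or rebuilt clusters converge automatically is a root Argo CD Application that points at the bootstrap directory in Git. The sketch below builds such a manifest. The repository URL, path, and application name are hypothetical, and it assumes OpenShift GitOps runs in its default `openshift-gitops` namespace.

```python
import json


def bootstrap_application(name, repo_url, path, revision="main"):
    """Argo CD Application that continuously syncs cluster bootstrap manifests from Git."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "openshift-gitops"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": revision},
            # Deploy into the same cluster that Argo CD runs in.
            "destination": {"server": "https://kubernetes.default.svc"},
            # Automated sync: remove drifted resources and revert manual changes.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # Repository URL and path are placeholders.
    app = bootstrap_application(
        "cluster-bootstrap", "https://git.example.com/platform/aro-config.git", "bootstrap/day1"
    )
    print(json.dumps(app, indent=2))
```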
CI/CD Pipeline Strategy
| Pipeline | Tool | Notes |
| --- | --- | --- |
| Cluster infrastructure | Terraform + Azure DevOps / GitHub Actions | IaC for cluster provisioning and day 0 config |
| Cluster configuration | OpenShift GitOps (ArgoCD) | Operators, policies, namespaces; synced from Git |
| Application workloads | OpenShift Pipelines (Tekton) or Azure DevOps | Build, test, scan, deploy container images |
| Image build | OpenShift Builds or Azure DevOps | Build from source, push to ACR |
| Image scanning | ACS (StackRox) or Defender for Containers | Gate deployment on scan results |
- Separate pipelines for cluster infra, cluster config, and application workloads
- Promote images across environments (dev → uat → prod) via image tags or ACR repository promotion
- Enforce pipeline gates: code review, image scan pass, compliance check
Namespace Management
- Standard namespace naming: <team>-<env> (e.g., payments-prod, lending-uat)
- Resource quotas per namespace to prevent noisy neighbor issues
- Network policies per namespace for micro-segmentation
- Label standards for cost allocation and monitoring
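The team-then-environment naming standard is easy to check automatically, for example in a GitOps pull request gate. The sketch below is one possible validator. The allowed environment names (`dev`, `uat`, `prod`) are an assumption drawn from the environments mentioned elsewhere in this document.

```python
import re

# Team part: lowercase DNS-label style; environment part: one of the assumed names.
_NAMESPACE = re.compile(
    r"^(?P<team>[a-z][a-z0-9]*(?:-[a-z0-9]+)*)-(?P<env>dev|uat|prod)$"
)


def parse_namespace(name: str):
    """Return (team, env) if the name follows the standard, otherwise None."""
    if len(name) > 63:  # Kubernetes namespace names are limited to 63 characters
        return None
    match = _NAMESPACE.match(name)
    if match is None:
        return None
    return match.group("team"), match.group("env")


if __name__ == "__main__":
    for candidate in ("payments-prod", "lending-uat", "Payments-Prod", "payments"):
        print(candidate, "->", parse_namespace(candidate))
```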
Change Management
| Change Type | Process | Approval |
| --- | --- | --- |
| Cluster upgrade | Schedule maintenance window, test in non-prod first | Change Advisory Board (CAB) |
| New namespace | Request via ServiceNow / GitOps PR | Team lead + Platform Admin |
| New operator | Security review + non-prod testing | Platform Admin + Security |
| Firewall rule change | Submit to network team | Network + Security team |
| Storage class change | Impact assessment | Platform Admin |
Maintenance Windows & Upgrade Strategy
Important: ARO cluster upgrades are customer-initiated and upgrade the entire cluster as a whole — control plane and worker nodes together. You cannot upgrade control plane and workers separately. There is no rollback once an upgrade is started. This makes pre-upgrade validation critical.
- Schedule cluster upgrades during off-peak hours (e.g., weekends, 2-6 AM)
- Always test upgrades in non-prod cluster first — allow minimum 1-week soak time before upgrading prod
- Upgrade process: customer initiates from the OpenShift web console or with oc adm upgrade (optionally scheduled through the Managed Upgrade Operator) → ARO upgrades control plane first, then rolls workers one-by-one (respecting PDBs)
- Upgrade cadence: align with OpenShift minor releases (~quarterly); apply z-stream patches within 2 weeks of release
- Node OS patches: RHCOS updates are applied as part of cluster upgrades
- Since there is no rollback, mitigate risk by:
  - Taking OADP/Velero backups immediately before upgrading
  - Reviewing OpenShift release notes and known issues for the target version
  - Verifying operator compatibility with the target OpenShift version
  - Having a DR cluster available as fallback in case the upgrade causes critical issues
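Since an upgrade cannot be rolled back, the pre-upgrade checks can be turned into an explicit go/no-go gate. The sketch below encodes the 1-week soak time plus two of the mitigations listed above. The function name and the boolean inputs are hypothetical; in practice they would be fed from backup status and operator compatibility reviews.

```python
from datetime import date, timedelta


def prod_upgrade_allowed(nonprod_upgraded_on, today, soak_days=7,
                         backup_taken=False, operators_compatible=False):
    """Return (allowed, reasons); every unmet precondition adds a blocking reason."""
    reasons = []
    if today - nonprod_upgraded_on < timedelta(days=soak_days):
        reasons.append(f"non-prod soak time is below {soak_days} days")
    if not backup_taken:
        reasons.append("no OADP/Velero backup taken before the upgrade")
    if not operators_compatible:
        reasons.append("operator compatibility with the target version not verified")
    return (not reasons, reasons)


if __name__ == "__main__":
    # Dates are illustrative only.
    allowed, why = prod_upgrade_allowed(
        date(2025, 1, 1), date(2025, 1, 5), backup_taken=True, operators_compatible=True
    )
    print("go" if allowed else "no-go", why)
```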
Resilience Testing
- Conduct chaos testing to validate cluster resilience:
  - Simulate node failures (cordon/drain random nodes)
  - Simulate AZ failure (scale down MachineSet in one AZ)
  - Simulate pod failures (kill pods, test PDB behavior)
  - Simulate network partitions (network policy changes)
- Use tools like Kraken (Red Hat chaos testing for OpenShift) or Azure Chaos Studio
- Schedule chaos tests quarterly alongside DR drills
- Document runbooks for each failure scenario
10. Shared Responsibility Model
| Area | Microsoft + Red Hat | Customer |
| --- | --- | --- |
| Cluster creation & management | ✅ | - |
| Control plane & worker node management | ✅ | - |
| Platform monitoring (Geneva) | ✅ | - |
| Platform software/security updates | ✅ | - |
| Certificate rotation (platform) | ✅ | - |
| Network infrastructure (LB, NSG, Private Link) | ✅ | - |
| Identity provider configuration | - | ✅ |
| User & RBAC management | - | ✅ |
| Project & quota management | - | ✅ |
| Application lifecycle (deploy, scale, update) | - | ✅ |
| Application data & backups | - | ✅ |
| Application logging & monitoring | Shared | Shared |
| Application networking (routes, network policies) | Shared | Shared |
| Virtual networking (VNET, peering, firewall) | Shared | Shared |
| Capacity management (worker node sizing) | Shared | Shared |
Incident management flow: SRE first responder → incident lead → communication/coordination → resolution summary in support ticket. A draft root cause analysis (RCA) is provided within 7 business days, and the full RCA within 30 business days.
11. Cost Management
Cost Components
| Component | Billing | Notes |
| --- | --- | --- |
| ARO infrastructure (VMs, disks, network) | Azure bill | Standard Azure VM pricing |
| OpenShift subscription | Included in ARO pricing | Per-worker-node hourly fee |
| Infrastructure nodes | No OCP subscription cost | Label correctly to qualify |
| Storage (Managed Disks, Azure Files, Blob) | Azure bill | Per-GB pricing |
| Network egress | Azure bill | Cross-region and internet egress charged |
| Azure Firewall | Azure bill | Per-hour + per-GB processing |
| ExpressRoute | Azure bill | Per-circuit + per-GB |
| Red Hat ACS / Quay (optional) | Separate Red Hat subscription | If not using built-in alternatives |
Cost Optimization Strategies
- Create infrastructure nodes post-install for platform services (router, monitoring, logging) — saves OCP subscription cost; not provisioned by default
- Right-size worker nodes — start with reference sizing, adjust based on actual utilization
- Cluster Autoscaler — scale down unused capacity during off-hours
- Azure Reserved Instances — 1-year or 3-year RI for predictable worker node costs
- Azure Savings Plans — flexible compute commitment across VM families
- MACC eligibility — ARO spend counts toward Microsoft Azure Consumption Commitment
Cost Monitoring & Analysis
- Use Azure Cost Management dashboards filtered by ARO resource group to track spend
- Use Azure Advisor for right-sizing recommendations on underutilized worker VMs
- Set Azure Budgets with alerts at 80% and 100% thresholds to prevent overspend
- Use OpenShift namespace-level resource consumption reports (via Prometheus) for internal chargeback
- Review cost monthly: compare actual vs. reserved capacity, identify idle resources
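The 80% and 100% budget thresholds can be illustrated with a few lines of arithmetic. This sketch only mirrors the threshold logic; Azure Budgets evaluates it natively, and the spend and budget figures in the example are hypothetical.

```python
def budget_alerts(spend, budget, thresholds=(0.8, 1.0)):
    """Return the thresholds (as fractions of budget) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend >= budget * t]


if __name__ == "__main__":
    # Hypothetical monthly budget of 10,000 with 8,500 spent so far.
    crossed = budget_alerts(spend=8500, budget=10000)
    print("Thresholds reached:", [f"{t:.0%}" for t in crossed])
```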
Cost Tagging
Apply consistent Azure tags and OpenShift labels for cost tracking:
| Tag/Label | Example | Purpose |
| --- | --- | --- |
| cost-center | CC-1234 | Financial allocation |
| project-id | PRJ-LENDING | Project-level tracking |
| department | IT-Platform | Department attribution |
| environment | prod / non-prod | Environment separation |
| owner | team-platform | Ownership |
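Consistent tagging is easier to sustain when missing tags are detected automatically, for instance in a pipeline check. The sketch below reports which of the five tags from the table are absent or empty on a resource. The function name is hypothetical and the input is assumed to be a plain dictionary of tag names to values.

```python
# Required keys, taken from the cost tagging table above.
REQUIRED_TAGS = ("cost-center", "project-id", "department", "environment", "owner")


def missing_tags(tags: dict) -> list:
    """Return the required tag names that are absent or have an empty value."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


if __name__ == "__main__":
    resource_tags = {"cost-center": "CC-1234", "environment": "prod", "owner": ""}
    print("Missing:", missing_tags(resource_tags))
```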
Appendix A: Checklist
| # | Design Area | Decision | Status |
| --- | --- | --- | --- |
| 1 | Deployment model | Private without public IP | ☐ |
| 2 | Custom domain | Custom domain with enterprise TLS | ☐ |
| 3 | VNET design | Hub-spoke with Azure Firewall | ☐ |
| 4 | Subnet sizing | /23 for control plane, /23 for workers | ☐ |
| 5 | Pod/Service CIDR | Non-overlapping with existing networks | ☐ |
| 6 | On-prem connectivity | ExpressRoute | ☐ |
| 7 | Identity provider | Microsoft Entra ID (OIDC) | ☐ |
| 8 | Break-glass account | htpasswd in Key Vault | ☐ |
| 9 | RBAC model | Entra ID groups mapped to OCP roles | ☐ |
| 10 | Availability zones | 3 AZs for control plane + workers | ☐ |
| 11 | Autoscaling | Cluster Autoscaler + HPA | ☐ |
| 12 | Storage | Premium SSD with CMK, Azure Files Premium | ☐ |
| 13 | etcd encryption | Enable explicitly (not on by default in ARO) | ☐ |
| 14 | FIPS compliance | Enable at creation | ☐ |
| 15 | Secret management | Azure Key Vault CSI Provider | ☐ |
| 16 | Logging | Forward to Azure Log Analytics | ☐ |
| 17 | Audit log retention | 365 days | ☐ |
| 18 | Monitoring | Azure Monitor Container Insights + built-in Prometheus | ☐ |
| 19 | DR strategy | Active-Passive in secondary region | ☐ |
| 20 | Backup | Velero to Azure Blob (RA-GRS) | ☐ |
| 21 | GitOps | OpenShift GitOps (ArgoCD) | ☐ |
| 22 | Deployment IaC | Terraform | ☐ |
| 23 | Egress control | Azure Firewall with required endpoints whitelisted | ☐ |
| 24 | Image scanning | Microsoft Defender for Containers | ☐ |
| 25 | Compliance scanning | OpenShift Compliance Operator (CIS benchmark) | ☐ |
| 26 | Subscription model | Separate subscriptions for Prod / Non-Prod | ☐ |
| 27 | Bastion / jump box | Azure Bastion or jump box VM in Hub VNET | ☐ |
| 28 | Infra nodes | Create post-install: 3x infra nodes for router/monitoring/logging (not default) | ☐ |
| 29 | Cost tagging | Azure tags + OpenShift labels for cost allocation | ☐ |
| 30 | OADP backup | Daily backup at 4 AM to Azure Blob (RA-GRS) | ☐ |
| 31 | ACS (StackRox) | Runtime threat detection and vulnerability management | ☐ |
| 32 | Custom TLS certificates | Enterprise CA for ingress and API server | ☐ |
| 33 | DNS forwarding | CoreDNS → on-prem DNS for internal domains | ☐ |
| 34 | MACC eligibility | Confirm ARO spend counts toward Azure commitment | ☐ |
| 35 | Reserved Instances | 1-year or 3-year RI for worker nodes | ☐ |
| 36 | Private Link endpoints | Dedicated subnet for all Azure PaaS private endpoints | ☐ |
| 37 | Governance (Gatekeeper) | OPA Gatekeeper policies for image sources, labels, pod security | ☐ |
| 38 | Container image governance | Allow only authorized registries (ACR, registry.redhat.io) | ☐ |
| 39 | Workload Identity | Azure Workload Identity for pod-to-Azure-service authentication | ☐ |
| 40 | KEDA | Event-driven autoscaling for message/queue workloads | ☐ |
| 41 | Cluster bootstrapping | GitOps-based bootstrapping for operators, namespaces, policies | ☐ |
| 42 | CI/CD pipelines | Separate pipelines for infra, cluster config, and app workloads | ☐ |
| 43 | Maintenance windows | Scheduled upgrade windows; non-prod first, 1-week soak | ☐ |
| 44 | Chaos / resilience testing | Quarterly chaos tests alongside DR drills | ☐ |
| 45 | Pod health probes | Liveness, readiness, and startup probes on all containers | ☐ |
| 46 | Network observability | Network Observability Operator for flow logging | ☐ |
| 47 | Azure Budgets | Cost alerts at 80% and 100% thresholds | ☐ |
| 48 | Application Gateway + WAF | WAF for external-facing applications | ☐ |