Configure Space-level observability

important

This feature is GA as of v1.14.0, requires Spaces v1.6.0 or later, and is off by default. To enable it, set observability.enabled=true (features.alpha.observability.enabled=true before v1.14.0) when installing Spaces:

up space init --token-file="${SPACES_TOKEN_PATH}" "v${SPACES_VERSION}" \
...
--set "observability.enabled=true" \

This guide explains how to configure Space-level observability. The feature applies only to self-hosted Spaces and lets Space administrators observe the cluster infrastructure where the Space software is installed.

When you enable observability in a Space, Upbound deploys a single OpenTelemetry Collector to collect and export metrics, logs, and traces to your configured observability backends.

Prerequisites

This feature requires the OpenTelemetry Operator on the Space cluster. Install this now if you haven't already:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.116.0/opentelemetry-operator.yaml

If you run Spaces v1.11 or later, use OpenTelemetry Operator v0.110.0 or later to account for breaking changes in the operator.

Configuration

To configure how Upbound exports telemetry, set the spacesCollector value in your Space installation Helm chart. Below is an example configuration for an otlphttp-compatible endpoint.

observability:
  spacesCollector:
    config:
      exporters:
        otlphttp:
          endpoint: "<your-endpoint>"
          headers:
            api-key: YOUR_API_KEY
      exportPipeline:
        logs:
          - otlphttp
        metrics:
          - otlphttp
        traces:
          - otlphttp

You can export metrics, logs, and traces from your Crossplane installation, Spaces infrastructure (controller, API, router, etc.), provider-helm, and provider-kubernetes.

Router metrics

The Spaces router component uses Envoy as a reverse proxy and exposes detailed metrics about request handling, circuit breakers, and connection pooling. Upbound collects these metrics in your Space after you enable Space-level observability.

Envoy metrics in Upbound include:

  • Upstream cluster metrics - Request status codes, timeouts, retries, and latency for traffic to control planes and services
  • Circuit breaker metrics - Connection and request circuit breaker state for both DEFAULT and HIGH priority levels
  • Downstream listener metrics - Client connections and requests received
  • HTTP connection manager metrics - End-to-end HTTP request processing and latency

For a complete list of available router metrics and example PromQL queries, see the Router metrics reference.

Router tracing

The Spaces router generates distributed traces through OpenTelemetry integration, providing end-to-end visibility into request flow across the system. Use these traces to debug latency issues, understand request paths, and correlate errors across services.

The router uses:

  • Protocol: OTLP (OpenTelemetry Protocol) over gRPC
  • Service name: spaces-router
  • Transport: TLS-encrypted connection to telemetry collector

Trace configuration

Enable tracing and configure the sampling rate with the following Helm values:

observability:
  enabled: true
  tracing:
    enabled: true
    sampling:
      rate: 0.1 # Sample 10% of new traces (0.0-1.0)

The sampling behavior depends on whether a parent trace context exists:

  • With parent context: If a traceparent header is present, the parent's sampling decision is respected, enabling proper distributed tracing across services.
  • Root spans: For new traces without a parent, Envoy samples based on x-request-id hashing. The default sampling rate is 10%.

TLS configuration for external collectors

To send traces to an external OTLP collector, configure the endpoint and TLS settings:

observability:
  enabled: true
  tracing:
    enabled: true
    endpoint: "otlp-gateway.example.com"
    port: 443
    tls:
      caBundleSecretRef: "custom-ca-secret"

If caBundleSecretRef is set, the router uses the CA bundle from the referenced Kubernetes secret. The secret must contain a key named ca.crt with the PEM-encoded CA bundle. If not set, the router uses the Spaces CA for the in-cluster collector.
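
As a minimal sketch, the referenced secret could look like the following manifest; the namespace is an assumption and should match your Spaces installation namespace:

apiVersion: v1
kind: Secret
metadata:
  name: custom-ca-secret
  namespace: upbound-system # assumption: replace with your Spaces installation namespace
type: Opaque
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    <PEM-encoded CA bundle>
    -----END CERTIFICATE-----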

Custom trace tags

The router adds custom tags to every span to enable filtering and grouping by control plane:

Tag | Source | Description
controlplane.id | x-upbound-mxp-id header | Control plane UUID
controlplane.name | x-upbound-mxp-host header | Internal vcluster hostname
hostcluster.id | x-upbound-hostcluster-id header | Host cluster identifier

These tags enable queries like "show all slow requests to control plane X" or "find errors for control planes in host cluster Y".
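
For example, assuming a tracing backend that supports TraceQL (such as Grafana Tempo), a query for slow requests to a specific control plane might look like the following sketch, with <control-plane-id> as a placeholder:

{ .controlplane.id = "<control-plane-id>" && duration > 500ms }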

Example trace

The following example shows the attributes from a successful GET request:

Span: ingress
├─ Service: spaces-router
├─ Duration: 8.025ms
├─ Attributes:
│ ├─ http.method: GET
│ ├─ http.status_code: 200
│ ├─ upstream_cluster: ctp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-api-cluster
│ ├─ controlplane.id: b2b37aaa-ee55-492c-ba0c-4d561a6325fa
│ ├─ controlplane.name: vcluster.mxp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-system
│ └─ response_size: 1827

Available metrics

Space-level observability collects metrics from multiple infrastructure components:

Infrastructure component metrics

  • Crossplane controller metrics
  • Spaces controller, API, and router metrics
  • Provider metrics (provider-helm, provider-kubernetes)

Router metrics

The router component exposes Envoy proxy metrics for monitoring traffic flow and service health. Key metric categories include:

  • envoy_cluster_upstream_rq_* - Upstream request metrics (status codes, timeouts, retries, latency)
  • envoy_cluster_circuit_breakers_* - Circuit breaker state and capacity
  • envoy_listener_downstream_* - Client connection and request metrics
  • envoy_http_downstream_* - HTTP request processing metrics

Example query to monitor total request rate:

sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))

Example query for P95 latency:

histogram_quantile(
  0.95,
  sum by (le) (
    rate(envoy_cluster_upstream_rq_time_bucket{job="spaces-router-envoy"}[5m])
  )
)

For detailed router metrics documentation and more query examples, see the Router metrics reference.

OpenTelemetryCollector image

Control plane observability (SharedTelemetry) and Space-level observability deploy the same custom OpenTelemetry Collector image. The image supports the otlphttp, datadog, and debug exporters.
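
For example, a sketch of a spacesCollector configuration that uses the datadog exporter might look like the following; the api.key and api.site fields follow the upstream OpenTelemetry Collector datadog exporter schema, and your values will differ:

observability:
  spacesCollector:
    config:
      exporters:
        datadog:
          api:
            key: YOUR_DATADOG_API_KEY # assumption: standard datadog exporter settings
            site: datadoghq.com
      exportPipeline:
        logs:
          - datadog
        metrics:
          - datadog
        traces:
          - datadog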

For more information on observability configuration, review the Helm chart reference.

Observability in control planes

Read the observability documentation to learn about the features Upbound offers for collecting telemetry from control planes.

Router metrics reference

To avoid overwhelming observability tools with hundreds of Envoy metrics, the router applies an allow-list that limits exported metrics to the following metric families.

Upstream cluster metrics

Metrics tracking requests sent from Envoy to configured upstream clusters. Individual control planes, spaces-api, and other services are each considered an upstream cluster. Use these metrics to monitor service health, identify upstream errors, and measure backend latency.

Metric | Description
envoy_cluster_upstream_rq_xx_total | HTTP status codes (2xx, 3xx, 4xx, 5xx) with label envoy_response_code_class
envoy_cluster_upstream_rq_timeout_total | Requests that timed out waiting for upstream
envoy_cluster_upstream_rq_retry_limit_exceeded_total | Requests that exhausted retry attempts
envoy_cluster_upstream_rq_total | Total upstream requests
envoy_cluster_upstream_rq_time_bucket | Latency histogram (for P50/P95/P99 calculations)
envoy_cluster_upstream_rq_time_sum | Sum of request durations
envoy_cluster_upstream_rq_time_count | Count of requests
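
For example, a query for the upstream 5xx error ratio might look like the following; it assumes the envoy_response_code_class label values are the leading status digit (for example "5" for 5xx) and reuses the job label from the earlier examples:

# Assumption: envoy_response_code_class="5" selects 5xx responses
sum(rate(envoy_cluster_upstream_rq_xx_total{job="spaces-router-envoy", envoy_response_code_class="5"}[5m]))
  /
sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))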

Circuit breaker metrics

Metrics tracking circuit breaker state and remaining capacity. Circuit breakers prevent cascading failures by limiting connections and concurrent requests to unhealthy upstreams. Two priority levels exist: DEFAULT for watch requests and HIGH for API requests.

Name | Description
envoy_cluster_circuit_breakers_default_cx_open | DEFAULT priority connection circuit breaker open (gauge)
envoy_cluster_circuit_breakers_default_rq_open | DEFAULT priority request circuit breaker open (gauge)
envoy_cluster_circuit_breakers_default_remaining_cx | Available DEFAULT priority connections (gauge)
envoy_cluster_circuit_breakers_default_remaining_rq | Available DEFAULT priority request slots (gauge)
envoy_cluster_circuit_breakers_high_cx_open | HIGH priority connection circuit breaker open (gauge)
envoy_cluster_circuit_breakers_high_rq_open | HIGH priority request circuit breaker open (gauge)
envoy_cluster_circuit_breakers_high_remaining_cx | Available HIGH priority connections (gauge)
envoy_cluster_circuit_breakers_high_remaining_rq | Available HIGH priority request slots (gauge)
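
For example, to surface any upstream whose DEFAULT priority request circuit breaker is open, a query might look like the following sketch; the envoy_cluster_name grouping label is an assumption and depends on how your scrape labels the metrics:

# Assumption: envoy_cluster_name identifies the upstream cluster
max by (envoy_cluster_name) (
  envoy_cluster_circuit_breakers_default_rq_open{job="spaces-router-envoy"}
) > 0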

Downstream listener metrics

Metrics tracking requests received from clients such as kubectl and API consumers. Use these metrics to monitor client connection patterns, overall request volume, and responses sent to external users.

Name | Description
envoy_listener_downstream_rq_xx_total | HTTP status codes for responses sent to clients
envoy_listener_downstream_rq_total | Total requests received from clients
envoy_listener_downstream_cx_total | Total connections from clients
envoy_listener_downstream_cx_active | Currently active client connections (gauge)
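
For example, the following queries (reusing the job label from the earlier examples) track active client connections and the rate of incoming client requests:

# Currently active client connections
sum(envoy_listener_downstream_cx_active{job="spaces-router-envoy"})

# Requests received from clients over the last 5 minutes
sum(rate(envoy_listener_downstream_rq_total{job="spaces-router-envoy"}[5m]))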

HTTP connection manager metrics

Metrics from Envoy's HTTP connection manager tracking end-to-end request processing. These metrics provide a comprehensive view of the HTTP request lifecycle including status codes and client-perceived latency.

Name | Description
envoy_http_downstream_rq_xx | HTTP status codes (note: no _total suffix for this metric family)
envoy_http_downstream_rq_total | Total HTTP requests received
envoy_http_downstream_rq_time_bucket | Downstream request latency histogram
envoy_http_downstream_rq_time_sum | Sum of downstream request durations
envoy_http_downstream_rq_time_count | Count of downstream requests
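
For example, client-perceived P95 latency can be computed from the downstream latency histogram, mirroring the upstream latency query shown earlier:

histogram_quantile(
  0.95,
  sum by (le) (
    rate(envoy_http_downstream_rq_time_bucket{job="spaces-router-envoy"}[5m])
  )
)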