Configure Space-level observability

important

This feature is GA as of v1.14.0, requires Spaces v1.6.0 or later, and is off by default. To enable it, set observability.enabled=true (features.alpha.observability.enabled=true before v1.14.0) when installing Spaces:

up space init --token-file="${SPACES_TOKEN_PATH}" "v${SPACES_VERSION}" \
...
--set "observability.enabled=true" \

This guide explains how to configure Space-level observability. The feature applies only to self-hosted Spaces and lets Space administrators observe the cluster infrastructure where the Space software is installed.

When you enable observability in a Space, Upbound deploys a single OpenTelemetry Collector to collect and export metrics, logs, and traces to your configured observability backends.

Prerequisites

This feature requires the OpenTelemetry Operator on the Space cluster. Install this now if you haven't already:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.116.0/opentelemetry-operator.yaml

If you run Spaces v1.11 or later, use OpenTelemetry Operator v0.110.0 or later to account for breaking changes in the operator.

Configuration

To configure how Upbound exports telemetry, set the spacesCollector value in your Space installation Helm chart. Below is an example configuration for an otlphttp-compatible endpoint.

observability:
  spacesCollector:
    config:
      exporters:
        otlphttp:
          endpoint: "<your-endpoint>"
          headers:
            api-key: YOUR_API_KEY
      exportPipeline:
        logs:
          - otlphttp
        metrics:
          - otlphttp
        traces:
          - otlphttp

You can export metrics, logs, and traces from your Crossplane installation, Spaces infrastructure (controller, API, router, etc.), provider-helm, and provider-kubernetes.

Router metrics

The Spaces router component uses Envoy as a reverse proxy and exposes detailed metrics about request handling, circuit breakers, and connection pooling. Upbound collects these metrics in your Space after you enable Space-level observability.

Envoy metrics in Upbound include:

  • Upstream cluster metrics - Request status codes, timeouts, retries, and latency for traffic to control planes and services
  • Circuit breaker metrics - Connection and request circuit breaker state for both DEFAULT and HIGH priority levels
  • Downstream listener metrics - Client connections and requests received
  • HTTP connection manager metrics - End-to-end HTTP request processing and latency

For a complete list of available router metrics and example PromQL queries, see the Router metrics reference.

Router tracing

The Spaces router generates distributed traces through OpenTelemetry integration, providing end-to-end visibility into request flow across the system. Use these traces to debug latency issues, understand request paths, and correlate errors across services.

The router uses:

  • Protocol: OTLP (OpenTelemetry Protocol) over gRPC
  • Service name: spaces-router
  • Transport: TLS-encrypted connection to telemetry collector

Trace configuration

Enable tracing and configure the sampling rate with the following Helm values:

observability:
  enabled: true
  tracing:
    enabled: true
    sampling:
      rate: 0.1 # Sample 10% of new traces (0.0-1.0)

The sampling behavior depends on whether a parent trace context exists:

  • With parent context: If a traceparent header is present, the parent's sampling decision is respected, enabling proper distributed tracing across services.
  • Root spans: For new traces without a parent, Envoy samples based on x-request-id hashing. The default sampling rate is 10%.

TLS configuration for external collectors

To send traces to an external OTLP collector, configure the endpoint and TLS settings:

observability:
  enabled: true
  tracing:
    enabled: true
    endpoint: "otlp-gateway.example.com"
    port: 443
    tls:
      caBundleSecretRef: "custom-ca-secret"

If caBundleSecretRef is set, the router uses the CA bundle from the referenced Kubernetes secret. The secret must contain a key named ca.crt with the PEM-encoded CA bundle. If not set, the router uses the Spaces CA for the in-cluster collector.
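
As a minimal sketch, the referenced secret could look like the following manifest; the namespace is an assumption and should match your Spaces installation namespace:

apiVersion: v1
kind: Secret
metadata:
  name: custom-ca-secret
  namespace: upbound-system # assumption: replace with your Spaces installation namespace
type: Opaque
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    <PEM-encoded CA bundle>
    -----END CERTIFICATE-----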

Custom trace tags

The router adds custom tags to every span to enable filtering and grouping by control plane:

Tag | Source | Description
controlplane.id | x-upbound-mxp-id header | Control plane UUID
controlplane.name | x-upbound-mxp-host header | Internal vcluster hostname
hostcluster.id | x-upbound-hostcluster-id header | Host cluster identifier

These tags enable queries like "show all slow requests to control plane X" or "find errors for control planes in host cluster Y".
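
For example, assuming a tracing backend that supports TraceQL (such as Grafana Tempo), a query for slow requests to a specific control plane might look like the following sketch, with <control-plane-id> as a placeholder:

{ .controlplane.id = "<control-plane-id>" && duration > 500ms }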

Example trace

The following example shows the attributes from a successful GET request:

Span: ingress
├─ Service: spaces-router
├─ Duration: 8.025ms
├─ Attributes:
│ ├─ http.method: GET
│ ├─ http.status_code: 200
│ ├─ upstream_cluster: ctp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-api-cluster
│ ├─ controlplane.id: b2b37aaa-ee55-492c-ba0c-4d561a6325fa
│ ├─ controlplane.name: vcluster.mxp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-system
│ └─ response_size: 1827

Available metrics

Space-level observability collects metrics from multiple infrastructure components:

Infrastructure component metrics

  • Crossplane controller metrics
  • Spaces controller, API, and router metrics
  • Provider metrics (provider-helm, provider-kubernetes)

Router metrics

The router component exposes Envoy proxy metrics for monitoring traffic flow and service health. Key metric categories include:

  • envoy_cluster_upstream_rq_* - Upstream request metrics (status codes, timeouts, retries, latency)
  • envoy_cluster_circuit_breakers_* - Circuit breaker state and capacity
  • envoy_listener_downstream_* - Client connection and request metrics
  • envoy_http_downstream_* - HTTP request processing metrics

Example query to monitor total request rate:

sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))

Example query for P95 latency:

histogram_quantile(
  0.95,
  sum by (le) (
    rate(envoy_cluster_upstream_rq_time_bucket{job="spaces-router-envoy"}[5m])
  )
)

For detailed router metrics documentation and more query examples, see the Router metrics reference.

OpenTelemetryCollector image

Control plane observability (SharedTelemetry) and Space-level observability deploy the same custom OpenTelemetry Collector image. The image supports the otlphttp, datadog, and debug exporters.
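
For example, a sketch of a spacesCollector configuration that uses the datadog exporter might look like the following; the api.key and api.site fields follow the upstream OpenTelemetry Collector datadog exporter schema, and your values will differ:

observability:
  spacesCollector:
    config:
      exporters:
        datadog:
          api:
            key: YOUR_DATADOG_API_KEY # assumption: standard datadog exporter settings
            site: datadoghq.com
      exportPipeline:
        logs:
          - datadog
        metrics:
          - datadog
        traces:
          - datadog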

For more information on observability configuration, review the Helm chart reference.

Observability in control planes

Read the observability documentation to learn about the features Upbound offers for collecting telemetry from control planes.

Router metrics reference

To avoid overwhelming observability tools with hundreds of Envoy metrics, the router applies an allow-list that limits exported metrics to the following metric families.

Upstream cluster metrics

Metrics tracking requests sent from Envoy to configured upstream clusters. Individual control planes, spaces-api, and other services are each considered an upstream cluster. Use these metrics to monitor service health, identify upstream errors, and measure backend latency.

Metric | Description
envoy_cluster_upstream_rq_xx_total | HTTP status codes (2xx, 3xx, 4xx, 5xx) with label envoy_response_code_class
envoy_cluster_upstream_rq_timeout_total | Requests that timed out waiting for upstream
envoy_cluster_upstream_rq_retry_limit_exceeded_total | Requests that exhausted retry attempts
envoy_cluster_upstream_rq_total | Total upstream requests
envoy_cluster_upstream_rq_time_bucket | Latency histogram (for P50/P95/P99 calculations)
envoy_cluster_upstream_rq_time_sum | Sum of request durations
envoy_cluster_upstream_rq_time_count | Count of requests
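
For example, a query for the upstream 5xx error ratio might look like the following; it assumes the envoy_response_code_class label values are the leading status digit (for example "5" for 5xx) and reuses the job label from the earlier examples:

# Assumption: envoy_response_code_class="5" selects 5xx responses
sum(rate(envoy_cluster_upstream_rq_xx_total{job="spaces-router-envoy", envoy_response_code_class="5"}[5m]))
  /
sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))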

Circuit breaker metrics

Metrics tracking circuit breaker state and remaining capacity. Circuit breakers prevent cascading failures by limiting connections and concurrent requests to unhealthy upstreams. Two priority levels exist: DEFAULT for watch requests and HIGH for API requests.

Name | Description
envoy_cluster_circuit_breakers_default_cx_open | DEFAULT priority connection circuit breaker open (gauge)
envoy_cluster_circuit_breakers_default_rq_open | DEFAULT priority request circuit breaker open (gauge)
envoy_cluster_circuit_breakers_default_remaining_cx | Available DEFAULT priority connections (gauge)
envoy_cluster_circuit_breakers_default_remaining_rq | Available DEFAULT priority request slots (gauge)
envoy_cluster_circuit_breakers_high_cx_open | HIGH priority connection circuit breaker open (gauge)
envoy_cluster_circuit_breakers_high_rq_open | HIGH priority request circuit breaker open (gauge)
envoy_cluster_circuit_breakers_high_remaining_cx | Available HIGH priority connections (gauge)
envoy_cluster_circuit_breakers_high_remaining_rq | Available HIGH priority request slots (gauge)
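
For example, to surface any upstream whose DEFAULT priority request circuit breaker is open, a query might look like the following sketch; the envoy_cluster_name grouping label is an assumption and depends on how your scrape labels the metrics:

# Assumption: envoy_cluster_name identifies the upstream cluster
max by (envoy_cluster_name) (
  envoy_cluster_circuit_breakers_default_rq_open{job="spaces-router-envoy"}
) > 0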

Downstream listener metrics

Metrics tracking requests received from clients such as kubectl and API consumers. Use these metrics to monitor client connection patterns, overall request volume, and responses sent to external users.

Name | Description
envoy_listener_downstream_rq_xx_total | HTTP status codes for responses sent to clients
envoy_listener_downstream_rq_total | Total requests received from clients
envoy_listener_downstream_cx_total | Total connections from clients
envoy_listener_downstream_cx_active | Currently active client connections (gauge)
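
For example, the following queries (reusing the job label from the earlier examples) track active client connections and the rate of incoming client requests:

# Currently active client connections
sum(envoy_listener_downstream_cx_active{job="spaces-router-envoy"})

# Requests received from clients over the last 5 minutes
sum(rate(envoy_listener_downstream_rq_total{job="spaces-router-envoy"}[5m]))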

HTTP connection manager metrics

Metrics from Envoy's HTTP connection manager tracking end-to-end request processing. These metrics provide a comprehensive view of the HTTP request lifecycle including status codes and client-perceived latency.

Name | Description
envoy_http_downstream_rq_xx | HTTP status codes (note: no _total suffix for this metric family)
envoy_http_downstream_rq_total | Total HTTP requests received
envoy_http_downstream_rq_time_bucket | Downstream request latency histogram
envoy_http_downstream_rq_time_sum | Sum of downstream request durations
envoy_http_downstream_rq_time_count | Count of downstream requests
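
For example, client-perceived P95 latency can be computed from the downstream latency histogram, mirroring the upstream latency query shown earlier:

histogram_quantile(
  0.95,
  sum by (le) (
    rate(envoy_http_downstream_rq_time_bucket{job="spaces-router-envoy"}[5m])
  )
)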