Configure Space-level observability
This feature is GA since v1.14.0, requires Spaces v1.6.0, and is off by default. To enable, set observability.enabled=true (features.alpha.observability.enabled=true before v1.14.0) when installing Spaces:
up space init --token-file="${SPACES_TOKEN_PATH}" "v${SPACES_VERSION}" \
...
--set "observability.enabled=true" \
This guide explains how to configure Space-level observability. This feature is only applicable to self-hosted Space administrators. This lets Space administrators observe the cluster infrastructure where the Space software gets installed.
When you enable observability in a Space, Upbound deploys a single OpenTelemetry Collector to collect and export metrics, logs, and traces to your configured observability backends.
Prerequisites
This feature requires the OpenTelemetry Operator on the Space cluster. Install this now if you haven't already:
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.116.0/opentelemetry-operator.yaml
If running Spaces v1.11 or later, use OpenTelemetry Operator v0.110.0 or later due to breaking changes in the OpenTelemetry Operator.
Configuration
To configure how Upbound exports, review the spacesCollector value in your Space installation Helm chart. Below is an example of an otlphttp compatible endpoint.
observability:
spacesCollector:
config:
exporters:
otlphttp:
endpoint: "<your-endpoint>"
headers:
api-key: YOUR_API_KEY
exportPipeline:
logs:
- otlphttp
metrics:
- otlphttp
traces:
- otlphttp
You can export metrics, logs, and traces from your Crossplane installation, Spaces infrastructure (controller, API, router, etc.), provider-helm, and provider-kubernetes.
Router metrics
The Spaces router component uses Envoy as a reverse proxy and exposes detailed metrics about request handling, circuit breakers, and connection pooling. Upbound collects these metrics in your Space after you enable Space-level observability.
Envoy metrics in Upbound include:
- Upstream cluster metrics - Request status codes, timeouts, retries, and latency for traffic to control planes and services
- Circuit breaker metrics - Connection and request circuit breaker state for both
DEFAULTandHIGHpriority levels - Downstream listener metrics - Client connections and requests received
- HTTP connection manager metrics - End-to-end HTTP request processing and latency
For a complete list of available router metrics and example PromQL queries, see the Router metrics reference.
Router tracing
The Spaces router generates distributed traces through OpenTelemetry integration, providing end-to-end visibility into request flow across the system. Use these traces to debug latency issues, understand request paths, and correlate errors across services.
The router uses:
- Protocol: OTLP (OpenTelemetry Protocol) over gRPC
- Service name:
spaces-router - Transport: TLS-encrypted connection to telemetry collector
Trace configuration
Enable tracing and configure the sampling rate with the following Helm values:
observability:
enabled: true
tracing:
enabled: true
sampling:
rate: 0.1 # Sample 10% of new traces (0.0-1.0)
The sampling behavior depends on whether a parent trace context exists:
- With parent context: If a
traceparentheader is present, the parent's sampling decision is respected, enabling proper distributed tracing across services. - Root spans: For new traces without a parent, Envoy samples based on
x-request-idhashing. The default sampling rate is 10%.
TLS configuration for external collectors
To send traces to an external OTLP collector, configure the endpoint and TLS settings:
observability:
enabled: true
tracing:
enabled: true
endpoint: "otlp-gateway.example.com"
port: 443
tls:
caBundleSecretRef: "custom-ca-secret"
If caBundleSecretRef is set, the router uses the CA bundle from the referenced
Kubernetes secret. The secret must contain a key named ca.crt with the
PEM-encoded CA bundle. If not set, the router uses the Spaces CA for the
in-cluster collector.
Custom trace tags
The router adds custom tags to every span to enable filtering and grouping by control plane:
| Tag | Source | Description |
|---|---|---|
controlplane.id | x-upbound-mxp-id header | Control plane UUID |
controlplane.name | x-upbound-mxp-host header | Internal vcluster hostname |
hostcluster.id | x-upbound-hostcluster-id header | Host cluster identifier |
These tags enable queries like "show all slow requests to control plane X" or "find errors for control planes in host cluster Y".
Example trace
The following example shows the attributes from a successful GET request:
Span: ingress
├─ Service: spaces-router
├─ Duration: 8.025ms
├─ Attributes:
│ ├─ http.method: GET
│ ├─ http.status_code: 200
│ ├─ upstream_cluster: ctp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-api-cluster
│ ├─ controlplane.id: b2b37aaa-ee55-492c-ba0c-4d561a6325fa
│ ├─ controlplane.name: vcluster.mxp-b2b37aaa-ee55-492c-ba0c-4d561a6325fa-system
│ └─ response_size: 1827
Available metrics
Space-level observability collects metrics from multiple infrastructure components:
Infrastructure component metrics
- Crossplane controller metrics
- Spaces controller, API, and router metrics
- Provider metrics (provider-helm, provider-kubernetes)
Router metrics
The router component exposes Envoy proxy metrics for monitoring traffic flow and service health. Key metric categories include:
envoy_cluster_upstream_rq_*- Upstream request metrics (status codes, timeouts, retries, latency)envoy_cluster_circuit_breakers_*- Circuit breaker state and capacityenvoy_listener_downstream_*- Client connection and request metricsenvoy_http_downstream_*- HTTP request processing metrics
Example query to monitor total request rate:
sum(rate(envoy_cluster_upstream_rq_total{job="spaces-router-envoy"}[5m]))
Example query for P95 latency:
histogram_quantile(
0.95,
sum by (le) (
rate(envoy_cluster_upstream_rq_time_bucket{job="spaces-router-envoy"}[5m])
)
)
For detailed router metrics documentation and more query examples, see the Router metrics reference.
OpenTelemetryCollector image
Control plane (SharedTelemetry) and Space observability deploy the same custom
OpenTelemetry Collector image. The OpenTelemetry Collector image supports
otlhttp, datadog, and debug exporters.
For more information on observability configuration, review the Helm chart reference.
Observability in control planes
Read the observability documentation to learn about the features Upbound offers for collecting telemetry from control planes.
Router metrics reference
To avoid overwhelming observability tools with hundreds of Envoy metrics, an allow-list filters metrics to only the following metric families.
Upstream cluster metrics
Metrics tracking requests sent from Envoy to configured upstream clusters. Individual control planes, spaces-api, and other services are each considered an upstream cluster. Use these metrics to monitor service health, identify upstream errors, and measure backend latency.
| Metric | Description |
|---|---|
envoy_cluster_upstream_rq_xx_total | HTTP status codes (2xx, 3xx, 4xx, 5xx) with label envoy_response_code_class |
envoy_cluster_upstream_rq_timeout_total | Requests that timed out waiting for upstream |
envoy_cluster_upstream_rq_retry_limit_exceeded_total | Requests that exhausted retry attempts |
envoy_cluster_upstream_rq_total | Total upstream requests |
envoy_cluster_upstream_rq_time_bucket | Latency histogram (for P50/P95/P99 calculations) |
envoy_cluster_upstream_rq_time_sum | Sum of request durations |
envoy_cluster_upstream_rq_time_count | Count of requests |
Circuit breaker metrics
Metrics tracking circuit breaker state and remaining capacity. Circuit breakers
prevent cascading failures by limiting connections and concurrent requests to
unhealthy upstreams. Two priority levels exist: DEFAULT for watch requests and
HIGH for API requests.
| Name | Description |
|---|---|
envoy_cluster_circuit_breakers_default_cx_open | DEFAULT priority connection circuit breaker open (gauge) |
envoy_cluster_circuit_breakers_default_rq_open | DEFAULT priority request circuit breaker open (gauge) |
envoy_cluster_circuit_breakers_default_remaining_cx | Available DEFAULT priority connections (gauge) |
envoy_cluster_circuit_breakers_default_remaining_rq | Available DEFAULT priority request slots (gauge) |
envoy_cluster_circuit_breakers_high_cx_open | HIGH priority connection circuit breaker open (gauge) |
envoy_cluster_circuit_breakers_high_rq_open | HIGH priority request circuit breaker open (gauge) |
envoy_cluster_circuit_breakers_high_remaining_cx | Available HIGH priority connections (gauge) |
envoy_cluster_circuit_breakers_high_remaining_rq | Available HIGH priority request slots (gauge) |
Downstream listener metrics
Metrics tracking requests received from clients such as kubectl and API consumers. Use these metrics to monitor client connection patterns, overall request volume, and responses sent to external users.
| Name | Description |
|---|---|
envoy_listener_downstream_rq_xx_total | HTTP status codes for responses sent to clients |
envoy_listener_downstream_rq_total | Total requests received from clients |
envoy_listener_downstream_cx_total | Total connections from clients |
envoy_listener_downstream_cx_active | Currently active client connections (gauge) |
HTTP connection manager metrics
Metrics from Envoy's HTTP connection manager tracking end-to-end request processing. These metrics provide a comprehensive view of the HTTP request lifecycle including status codes and client-perceived latency.
| Name | Description |
|---|---|
envoy_http_downstream_rq_xx | HTTP status codes (note: no _total suffix for this metric family) |
envoy_http_downstream_rq_total | Total HTTP requests received |
envoy_http_downstream_rq_time_bucket | Downstream request latency histogram |
envoy_http_downstream_rq_time_sum | Sum of downstream request durations |
envoy_http_downstream_rq_time_count | Count of downstream requests |