Expert Rust Error Handling in Large Distributed Systems

1. The Challenge of Distributed Errors

Failures can occur at any layer of a distributed system. To handle them well, you need to distinguish transient, permanent, and policy errors, propagate context across layer and service boundaries, and maintain observability.
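The three-way distinction can be made explicit in the type system. The following is a minimal sketch, not a prescribed design: the names ErrorClass, ClassifiedError, and classify_status are hypothetical, and the status-code mapping is an assumption chosen only for illustration.

```rust
use std::fmt;

/// Hypothetical three-way classification of failures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    /// May succeed if retried (dropped connection, timeout).
    Transient,
    /// Will fail the same way every time (malformed input, missing key).
    Permanent,
    /// Rejected by a rule rather than a fault (auth, quota, rate limit).
    Policy,
}

/// An error that carries its class and the operation that produced it.
#[derive(Debug)]
pub struct ClassifiedError {
    pub class: ErrorClass,
    pub operation: &'static str,
    pub message: String,
}

impl fmt::Display for ClassifiedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} failed ({:?}): {}", self.operation, self.class, self.message)
    }
}

impl std::error::Error for ClassifiedError {}

/// Map a non-success HTTP-style status code to a class.
/// The table below is an assumption; each service should define its own.
pub fn classify_status(status: u16) -> ErrorClass {
    match status {
        401 | 403 | 429 => ErrorClass::Policy,
        408 | 502 | 503 | 504 => ErrorClass::Transient,
        _ => ErrorClass::Permanent,
    }
}
```

Keeping the class next to the operation name means a log line or metric label can be derived from the error itself instead of being reconstructed at each call site.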

2. Recoverable vs Unrecoverable

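Avoid unwrap() and expect() in library code: a panic takes the decision away from the caller, whereas returning a Result lets the caller choose to retry, fall back, or degrade gracefully.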
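As a small illustration of that split, the sketch below keeps the fallible step in a library-style function and leaves the fallback policy to the caller. The function names parse_port and port_or_default are hypothetical.

```rust
use std::num::ParseIntError;

/// Library-style function: reports bad input to the caller instead of panicking.
pub fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    raw.trim().parse::<u16>()
}

/// Caller-side policy: degrade to a default port when the value is unusable.
pub fn port_or_default(raw: &str, default: u16) -> u16 {
    parse_port(raw).unwrap_or(default)
}
```

Had parse_port called unwrap() internally, the caller would have no opportunity to apply the default.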

3. Designing Your Error Types

3.1. Per-crate Error Enums

use thiserror::Error;

#[derive(Debug, Error)]
pub enum StorageError {
    #[error("network failure: {0}")]
    Network(#[from] reqwest::Error),

    #[error("timeout after {0:?}")]
    Timeout(std::time::Duration),

    #[error("invalid key: {key}")]
    InvalidKey { key: String },

    #[error("unknown error: {0}")]
    Other(String),
}

Use thiserror to reduce boilerplate, and avoid exposing anyhow::Error from library APIs so that callers can still match on concrete variants.
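To see what the derive saves, here is roughly the hand-written equivalent of such an enum using only the standard library. It is a sketch under two assumptions: std::io::Error stands in for reqwest::Error so the example has no external dependencies, and the Other variant is omitted for brevity.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::time::Duration;

#[derive(Debug)]
pub enum StorageError {
    /// std::io::Error is a stand-in for reqwest::Error in this sketch.
    Network(io::Error),
    Timeout(Duration),
    InvalidKey { key: String },
}

// Corresponds to the #[error("...")] attributes.
impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StorageError::Network(e) => write!(f, "network failure: {e}"),
            StorageError::Timeout(d) => write!(f, "timeout after {d:?}"),
            StorageError::InvalidKey { key } => write!(f, "invalid key: {key}"),
        }
    }
}

// #[from] also marks the field as the error's source.
impl Error for StorageError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            StorageError::Network(e) => Some(e),
            _ => None,
        }
    }
}

// Corresponds to #[from]; this is what lets `?` convert automatically.
impl From<io::Error> for StorageError {
    fn from(e: io::Error) -> Self {
        StorageError::Network(e)
    }
}
```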

3.2. Contextual Wrapping

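use anyhow::{Context, Result};

// Request, Response, Storage, and parse_key are application types defined elsewhere.
fn handle_request(storage: &Storage, req: Request) -> Result<Response> {
    // context() attaches a static message to the underlying error.
    let key = parse_key(&req).context("parsing cache key")?;
    // with_context() builds the message lazily, only on the error path.
    let value = storage
        .get(&key)
        .with_context(|| format!("reading storage for key {}", key))?;
    Ok(Response::new(value))
}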

Reserve anyhow for application-level code, such as binaries and request handlers, where rich context matters more than a matchable error type.
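The mechanism behind contextual wrapping is the source() chain of std::error::Error. The sketch below is a hand-rolled analogue built on the standard library only; it is not how anyhow is implemented, and Contextual, ResultExt, and render_chain are hypothetical names.

```rust
use std::error::Error;
use std::fmt;

/// A context message plus the underlying cause.
#[derive(Debug)]
pub struct Contextual {
    context: String,
    source: Box<dyn Error + Send + Sync + 'static>,
}

impl fmt::Display for Contextual {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.context)
    }
}

impl Error for Contextual {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Extension trait that adds a context message to any error in a Result.
pub trait ResultExt<T> {
    fn context(self, msg: &str) -> Result<T, Contextual>;
}

impl<T, E> ResultExt<T> for Result<T, E>
where
    E: Error + Send + Sync + 'static,
{
    fn context(self, msg: &str) -> Result<T, Contextual> {
        self.map_err(|e| Contextual {
            context: msg.to_string(),
            source: Box::new(e),
        })
    }
}

/// Render an error and all of its causes as "outer: inner: innermost".
pub fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Each layer adds one link to the chain, so the final message reads from the highest-level operation down to the root cause.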

4. Propagation & the ? Operator

fn do_work() -> Result<(), StorageError> {
    let data = fetch_data()?;
    process(data)?;
    Ok(())
}

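Convert errors at each layer boundary into that layer's own error type, typically through From impls that the ? operator applies automatically, and avoid Box<dyn Error> internally, since it erases the variants that callers need for classification.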
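A minimal sketch of reshaping at a layer boundary follows. The two layers, their error enums (StoreError, ServiceError), and the load and profile functions are hypothetical; Display and Error impls are omitted to keep the example short.

```rust
/// Error type of the lower (storage) layer.
#[derive(Debug, PartialEq, Eq)]
pub enum StoreError {
    NotFound(String),
    Unavailable,
}

/// Error type of the upper (service) layer.
#[derive(Debug, PartialEq, Eq)]
pub enum ServiceError {
    MissingProfile(String),
    Backend(StoreError),
}

// The conversion is where the upper layer decides what a storage failure means.
impl From<StoreError> for ServiceError {
    fn from(e: StoreError) -> Self {
        match e {
            StoreError::NotFound(key) => ServiceError::MissingProfile(key),
            other => ServiceError::Backend(other),
        }
    }
}

/// Stand-in for a real storage lookup.
fn load(key: &str) -> Result<String, StoreError> {
    match key {
        "" => Err(StoreError::Unavailable),
        "alice" => Ok("alice@example.com".to_string()),
        other => Err(StoreError::NotFound(other.to_string())),
    }
}

/// `?` calls From::from, so the StoreError is reshaped without explicit map_err.
pub fn profile(key: &str) -> Result<String, ServiceError> {
    let raw = load(key)?;
    Ok(raw.to_uppercase())
}
```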

5. Classifying Errors for Retry & Circuit Breaking

pub trait Retriable {
    fn is_retriable(&self) -> bool;
}

impl Retriable for StorageError {
    fn is_retriable(&self) -> bool {
        matches!(self, StorageError::Network(_) | StorageError::Timeout(_))
    }
}

Combine this trait with Tower's Retry and Timeout layers, and use exponential backoff with jitter so that clients do not retry in lockstep (the thundering-herd problem).
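The backoff arithmetic itself is independent of any framework. The sketch below is a blocking, std-only retry loop with capped exponential backoff and full jitter; it does not use Tower's API. The names backoff, full_jitter, retry, and DemoError are hypothetical, and the jitter source (a freshly seeded RandomState hasher) is a dependency-free stand-in for a real random number generator.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::thread;
use std::time::Duration;

pub trait Retriable {
    fn is_retriable(&self) -> bool;
}

/// Demo error used only to exercise the retry loop.
#[derive(Debug, PartialEq, Eq)]
pub enum DemoError {
    Transient,
    Permanent,
}

impl Retriable for DemoError {
    fn is_retriable(&self) -> bool {
        matches!(self, DemoError::Transient)
    }
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Full jitter: a pseudo-random duration in [0, delay].
pub fn full_jitter(delay: Duration) -> Duration {
    let nanos = u64::try_from(delay.as_nanos()).unwrap_or(u64::MAX);
    if nanos == 0 {
        return Duration::ZERO;
    }
    // Each RandomState carries fresh random keys; not cryptographic, fine for jitter.
    let random = RandomState::new().build_hasher().finish();
    Duration::from_nanos(random % nanos.saturating_add(1))
}

/// Run `op` up to `max_attempts` times, sleeping between retriable failures.
pub fn retry<T, E, F>(max_attempts: u32, base: Duration, max: Duration, mut op: F) -> Result<T, E>
where
    E: Retriable,
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if e.is_retriable() && attempt + 1 < max_attempts => {
                thread::sleep(full_jitter(backoff(base, max, attempt)));
                attempt += 1;
            }
            // Non-retriable error, or attempts exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

In async code the thread::sleep call would be replaced by the runtime's own sleep so the executor thread is not blocked.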

6. Instrumentation & Context Propagation

6.1. Structured Logging

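use std::path::Path;
use tracing::{error, instrument};

// Storage and StorageError are the application's own types.
#[instrument(name = "upload_file", skip(storage))]
async fn upload(storage: &Storage, path: &Path) -> Result<(), StorageError> {
    storage.put(path).await.map_err(|e| {
        // Path does not implement Display, so format it through display().
        error!(path = %path.display(), error = %e, "upload failed");
        e
    })
}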

6.2. Distributed Tracing

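use opentelemetry::global;
use opentelemetry::trace::{Span, Tracer};

// Assumes a tracer provider was installed at startup; the setup call
// depends on the opentelemetry / opentelemetry_sdk version in use.
let tracer = global::tracer("my-service");
let mut span = tracer.start("process_request");
// ...
span.end(); // end() takes &mut self, hence `let mut span`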

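Inject W3C Trace Context headers (traceparent, tracestate) into outgoing requests and extract them from incoming ones, so that spans from different services join the same trace and give end-to-end visibility.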
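In practice an OpenTelemetry propagator handles injection and extraction, but the header format is simple enough to show directly. The sketch below parses and formats a version-00 traceparent value (version, 32-hex-digit trace id, 16-hex-digit parent id, 2-hex-digit flags). The TraceParent type is hypothetical, and rejecting every version other than 00 is a simplifying assumption.

```rust
/// Parsed form of a W3C `traceparent` header, version 00.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub flags: u8,
}

impl TraceParent {
    /// Inject: render the value to place in an outgoing `traceparent` header.
    pub fn to_header(&self) -> String {
        format!("00-{:032x}-{:016x}-{:02x}", self.trace_id, self.parent_id, self.flags)
    }

    /// Extract: parse an incoming header value, returning None if it is invalid.
    pub fn parse(header: &str) -> Option<Self> {
        let parts: Vec<&str> = header.trim().split('-').collect();
        if parts.len() != 4 || parts[0] != "00" {
            return None;
        }
        if parts[1].len() != 32 || parts[2].len() != 16 || parts[3].len() != 2 {
            return None;
        }
        // Only lowercase hex is valid; this also rejects a leading '+' sign.
        let is_lower_hex = |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !parts[1..].iter().all(|p| is_lower_hex(p)) {
            return None;
        }
        let trace_id = u128::from_str_radix(parts[1], 16).ok()?;
        let parent_id = u64::from_str_radix(parts[2], 16).ok()?;
        let flags = u8::from_str_radix(parts[3], 16).ok()?;
        // All-zero ids are defined as invalid.
        if trace_id == 0 || parent_id == 0 {
            return None;
        }
        Some(TraceParent { trace_id, parent_id, flags })
    }

    /// The lowest flag bit marks the trace as sampled.
    pub fn sampled(&self) -> bool {
        self.flags & 0x01 == 0x01
    }
}
```

A service that receives this header keeps the trace id, generates a new span id for its own work, and sends that span id as the parent id on its outgoing calls.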

7. Idempotency & Saga Patterns

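// `inventory` and `payment` are assumed to be service handles in scope.
async fn checkout(order: Order) -> Result<(), CheckoutError> {
    inventory.reserve(&order).await?;
    // .await is not allowed inside a non-async closure such as map_err's,
    // so the failure branch is handled with an explicit `if let`.
    if let Err(e) = payment.charge(&order).await {
        // Compensating action: undo the reservation; its own failure is ignored here.
        inventory.release(&order).await.ok();
        return Err(e.into());
    }
    Ok(())
}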

Design idempotent APIs and use sagas for multi-step workflows with compensating actions on failure.
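Idempotency is commonly achieved by having the client send a unique key with each logical operation, so that a retried request returns the original result instead of repeating the side effect. The sketch below keeps the keys in memory; PaymentService, Receipt, and the key format are hypothetical, and a real service would persist the keys durably and expire them.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Receipt {
    pub charge_id: u64,
    pub amount_cents: u64,
}

/// In-memory payment service that deduplicates charges by idempotency key.
#[derive(Default)]
pub struct PaymentService {
    completed: HashMap<String, Receipt>,
    next_charge_id: u64,
}

impl PaymentService {
    /// Charge once per key; a repeated key returns the stored receipt.
    pub fn charge(&mut self, idempotency_key: &str, amount_cents: u64) -> Receipt {
        if let Some(existing) = self.completed.get(idempotency_key) {
            return existing.clone();
        }
        self.next_charge_id += 1;
        let receipt = Receipt {
            charge_id: self.next_charge_id,
            amount_cents,
        };
        self.completed
            .insert(idempotency_key.to_string(), receipt.clone());
        receipt
    }

    /// Number of charges actually performed.
    pub fn charges_made(&self) -> u64 {
        self.next_charge_id
    }
}
```

With this in place, a retry layer can resend a charge after a timeout without risking a double charge, which is what makes retries and compensating actions safe to combine.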

8. Chaos Engineering & Testing
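
Fault injection can start small: wrap a dependency behind a trait and substitute an implementation that fails on purpose, then assert that the calling code recovers. The sketch below is one such arrangement; KeyValue, FailFirst, Healthy, and get_with_retries are hypothetical names, and the retry loop is deliberately simplistic.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq, Eq)]
pub enum FetchError {
    /// Failure produced deliberately by the test wrapper.
    Injected,
}

pub trait KeyValue {
    fn get(&self, key: &str) -> Result<String, FetchError>;
}

/// A dependency that always succeeds.
pub struct Healthy;

impl KeyValue for Healthy {
    fn get(&self, key: &str) -> Result<String, FetchError> {
        Ok(format!("value-for-{key}"))
    }
}

/// Wrapper that fails the first `failures` calls, then delegates to `inner`.
pub struct FailFirst<S> {
    inner: S,
    remaining: Cell<u32>,
}

impl<S> FailFirst<S> {
    pub fn new(inner: S, failures: u32) -> Self {
        FailFirst {
            inner,
            remaining: Cell::new(failures),
        }
    }
}

impl<S: KeyValue> KeyValue for FailFirst<S> {
    fn get(&self, key: &str) -> Result<String, FetchError> {
        let left = self.remaining.get();
        if left > 0 {
            self.remaining.set(left - 1);
            return Err(FetchError::Injected);
        }
        self.inner.get(key)
    }
}

/// Code under test: retries immediately, up to `max_attempts` calls in total.
pub fn get_with_retries(
    store: &dyn KeyValue,
    key: &str,
    max_attempts: u32,
) -> Result<String, FetchError> {
    let mut attempt = 1;
    loop {
        match store.get(key) {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => attempt += 1,
        }
    }
}
```

Because the fault is deterministic, the same test can pin down both the recovery case and the exhaustion case.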

9. Observability & SLOs

Export metrics to Prometheus using the metrics facade crate together with metrics-exporter-prometheus.
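Once request and failure counts are exported, an SLO can be checked with simple arithmetic: the error budget is the failure ratio the target allows, and the observed failure ratio divided by that allowance shows how much of the budget has been used. The sketch below assumes the counts come from such counters; budget_consumed is a hypothetical helper, not part of any metrics crate.

```rust
/// Fraction of the error budget consumed over a window.
///
/// `slo_target` is the success-ratio objective, e.g. 0.999 for "three nines".
/// Returns None when there is no traffic or the target leaves no budget.
pub fn budget_consumed(slo_target: f64, total: u64, failed: u64) -> Option<f64> {
    if total == 0 || !(0.0..1.0).contains(&slo_target) {
        return None;
    }
    let allowed_ratio = 1.0 - slo_target;
    let observed_ratio = failed as f64 / total as f64;
    Some(observed_ratio / allowed_ratio)
}
```

For example, with a 99.9% target, 500 failures out of 1,000,000 requests is an observed ratio of 0.05% against an allowance of 0.1%, so about half of the budget is spent; a value above 1.0 means the SLO is violated for the window.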

10. Putting It All Together

  1. Define clear error enums per crate with thiserror.
  2. Propagate rich context at the application boundary.
  3. Classify errors for retries, timeouts, and circuit breakers.
  4. Instrument with tracing & OpenTelemetry.
  5. Embrace idempotency and sagas.
  6. Automate fault injection and measure via metrics.

Consistent application of these patterns yields predictable, observable, and resilient error handling in Rust-powered distributed ecosystems.