Failures can occur at any layer in a distributed system. You need to distinguish transient, permanent, and policy errors; propagate context; and maintain observability.
- Use `panic!()` only at the application boundary.
- Return `Result<T, E>`, where `E` implements `std::error::Error`.
- Avoid `unwrap()` and `expect()` in library code to preserve graceful degradation.
## Error Types

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum StorageError {
    #[error("network failure: {0}")]
    Network(#[from] reqwest::Error),
    #[error("timeout after {0:?}")]
    Timeout(std::time::Duration),
    #[error("invalid key: {key}")]
    InvalidKey { key: String },
    #[error("unknown error: {0}")]
    Other(String),
}
```
Use `thiserror` to reduce boilerplate, and avoid `anyhow::Error` in libraries.
```rust
use anyhow::{Context, Result};

fn handle_request(req: Request) -> Result<Response> {
    let key = parse_key(&req).context("parsing cache key")?;
    let value = storage
        .get(&key)
        .with_context(|| format!("reading storage for key {}", key))?;
    Ok(Response::new(value))
}
```
Use `anyhow` sparingly, for application-level rich context.
## The `?` Operator

```rust
fn do_work() -> Result<(), StorageError> {
    let data = fetch_data()?;
    process(data)?;
    Ok(())
}
```
Reshape errors at each layer into that layer's own error type, and avoid `Box<dyn Error>` internally.
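A minimal sketch of this reshaping, using a hypothetical `ServiceError` and a simplified stand-in for the `StorageError` defined above: the service layer converts storage failures into its own vocabulary via `From`, so `?` does the translation automatically at the boundary.

```rust
use std::time::Duration;

// Simplified stand-in for the StorageError enum defined earlier.
#[derive(Debug)]
pub enum StorageError {
    Timeout(Duration),
    InvalidKey { key: String },
}

// The service layer's own error type (hypothetical): it classifies
// lower-layer failures instead of leaking them upward.
#[derive(Debug)]
pub enum ServiceError {
    Unavailable,        // transient: the caller may retry
    BadRequest(String), // permanent: the caller must fix the input
}

impl From<StorageError> for ServiceError {
    fn from(e: StorageError) -> Self {
        match e {
            StorageError::Timeout(_) => ServiceError::Unavailable,
            StorageError::InvalidKey { key } => {
                ServiceError::BadRequest(format!("invalid key: {key}"))
            }
        }
    }
}

fn lookup(key: &str) -> Result<String, ServiceError> {
    // `?` applies the From<StorageError> conversion automatically.
    let raw = storage_get(key)?;
    Ok(raw.to_uppercase())
}

// Toy storage call standing in for a real backend.
fn storage_get(key: &str) -> Result<String, StorageError> {
    if key.is_empty() {
        return Err(StorageError::InvalidKey { key: key.into() });
    }
    Ok(format!("value-for-{key}"))
}
```

Because the conversion lives in one `From` impl, every call site in the service layer stays a plain `?` rather than an ad-hoc `map_err`.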
```rust
pub trait Retriable {
    fn is_retriable(&self) -> bool;
}

impl Retriable for StorageError {
    fn is_retriable(&self) -> bool {
        matches!(self, StorageError::Network(_) | StorageError::Timeout(_))
    }
}
```
Combine this with Tower's `Retry` and `Timeout` layers, and use exponential backoff plus jitter to avoid thundering herds.
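The retry policy can be sketched dependency-free as a loop that consults `is_retriable` and doubles a backoff delay with jitter. This is a synchronous, std-only sketch (a real service would use Tower's retry layer or `tokio::time::sleep`, and a proper RNG for the jitter); the simplified `StorageError` here stands in for the enum above.

```rust
use std::time::Duration;

// Simplified stand-in for the StorageError defined earlier.
#[derive(Debug)]
enum StorageError {
    Timeout(Duration),
    InvalidKey { key: String },
}

impl StorageError {
    fn is_retriable(&self) -> bool {
        matches!(self, StorageError::Timeout(_))
    }
}

/// Retry `op` up to `max_attempts` times, doubling the backoff after each
/// retriable failure. Permanent errors are returned immediately.
fn retry_with_backoff<T>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, StorageError>,
) -> Result<T, StorageError> {
    let mut delay = base;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retriable() && attempt + 1 < max_attempts => {
                // Full jitter would sleep a uniformly random duration in
                // [0, delay); this cheap deterministic stand-in keeps the
                // sketch free of an RNG dependency.
                let jitter_ms = (attempt as u64 + 1).wrapping_mul(2654435761)
                    % delay.as_millis().max(1) as u64;
                std::thread::sleep(Duration::from_millis(jitter_ms));
                delay *= 2; // exponential backoff
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

Note that the loop gives up immediately on `InvalidKey`: retrying a permanent error only adds load and latency, which is exactly why the transient/permanent distinction matters.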
```rust
use tracing::{error, instrument};

#[instrument(name = "upload_file", skip(storage))]
async fn upload(storage: &Storage, path: &Path) -> Result<(), StorageError> {
    storage.put(path).await.map_err(|e| {
        // Path implements Debug but not Display, so log it via display().
        error!(path = %path.display(), error = %e, "upload failed");
        e
    })
}
```
```rust
use opentelemetry::global;
use opentelemetry::trace::{Span, Tracer};

let tracer = global::tracer("my-service");
let mut span = tracer.start("process_request");
// ...
span.end();
```
Inject/extract W3C Trace Context for end-to-end visibility.
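W3C Trace Context travels in the `traceparent` HTTP header, formatted as `version-traceid-parentid-flags`. In practice you would let the `opentelemetry` propagator handle this, but a minimal std-only parser shows what the extract side validates (struct and function names here are illustrative):

```rust
/// Parsed W3C `traceparent` header, e.g.
/// "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01".
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: String, // 32 lowercase hex chars
    span_id: String,  // 16 lowercase hex chars
    sampled: bool,    // low bit of the flags byte
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let span_id = parts.next()?;
    let flags = parts.next()?;
    // Enforce field widths per the spec.
    if version.len() != 2 || trace_id.len() != 32 || span_id.len() != 16 || flags.len() != 2 {
        return None;
    }
    // The spec forbids all-zero trace and span ids.
    if trace_id.chars().all(|c| c == '0') || span_id.chars().all(|c| c == '0') {
        return None;
    }
    let sampled = u8::from_str_radix(flags, 16).ok()? & 0x01 == 1;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        span_id: span_id.to_string(),
        sampled,
    })
}
```

On the inject side, the service writes its own span id into a fresh `traceparent` while keeping the trace id, so every hop shares one trace.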
```rust
async fn checkout(order: Order) -> Result<(), CheckoutError> {
    inventory.reserve(&order).await?;
    // `.await` is not allowed inside a map_err closure, so handle the
    // failure explicitly and run the compensating action before returning.
    if let Err(e) = payment.charge(&order).await {
        inventory.release(&order).await.ok(); // compensating action
        return Err(e.into());
    }
    Ok(())
}
```
Design idempotent APIs and use sagas for multi-step workflows with compensating actions on failure.
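Idempotency is what makes the retries above safe: if a charge request is delivered twice, the second delivery must replay the first outcome rather than charge again. A minimal sketch, assuming a hypothetical `PaymentService` that dedupes on a client-supplied idempotency key (a real system would persist this map and bound its retention):

```rust
use std::collections::HashMap;

/// Records the outcome of each charge by idempotency key, so a retried
/// request returns the stored result instead of executing a second charge.
struct PaymentService {
    completed: HashMap<String, Result<u64, String>>, // key -> receipt id or error
    charges_executed: u32,
}

impl PaymentService {
    fn new() -> Self {
        Self { completed: HashMap::new(), charges_executed: 0 }
    }

    fn charge(&mut self, idempotency_key: &str, amount_cents: u64) -> Result<u64, String> {
        if let Some(prev) = self.completed.get(idempotency_key) {
            // Duplicate delivery (e.g. a client retry): replay the recorded result.
            return prev.clone();
        }
        let _ = amount_cents; // a real service would submit this to the processor
        self.charges_executed += 1;
        let receipt = 1000 + self.charges_executed as u64; // stand-in receipt id
        let result = Ok(receipt);
        self.completed.insert(idempotency_key.to_string(), result.clone());
        result
    }
}
```

Storing failures as well as successes matters: replaying a recorded permanent failure keeps retried requests from being re-executed against the payment processor.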
- `fail` crate for CI error injection.
- `cargo-fuzz` for serialization boundaries.
- `error_total` counters, latency histograms, saturation gauges.
- Prometheus via `metrics` + `metrics-exporter-prometheus`.
- `thiserror` for typed, low-boilerplate errors.
- `tracing` and OpenTelemetry for observability.

Consistent application of these patterns yields predictable, observable, and resilient error handling in Rust-powered distributed ecosystems.