Concurrent Counting and Resilience: Sharded Metrics and a Circuit Breaker

Under load, one global lock is the bottleneck. Gortex’s metrics collector shards the lock into 16 and counts with atomics; the circuit breaker is a textbook three-state machine. This post walks through both and explains why channels are the wrong tool for either.

Prerequisites: A4 goroutines/channels, A5 sync primitives.

Lock sharding: one lock into sixteen

Under load, every goroutine contending for one lock over one metrics map makes that lock the bottleneck. Gortex’s ShardedCollector splits business metrics into 16 independent shards, each with its own lock:

type ShardedCollector struct {
    httpRequestCount int64 // atomic counter
    shardCount       int   // fixed at 16
    shards           []*metricShard
}
type metricShard struct {
    mu      sync.RWMutex
    metrics map[string]float64
    lruList *list.List // per-shard LRU
    // ...
}

To record a metric, an FNV hash decides which shard it lands in, and only that shard is locked:

shardIndex := c.hashKey(metricKey) // fnv.New32a() % 16
shard := c.shards[shardIndex]
shard.mu.Lock()
defer shard.mu.Unlock()
// ...touch only this shard

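Sixteen is fixed (the source comment says “for predictable performance”), not tied to CPU count. Assuming keys hash evenly across shards, two goroutines writing different metrics now meet on the same lock only about 1 time in 16, instead of every time as with one global lock.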
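To make the shard lookup concrete, here is a minimal sketch of the idea, not Gortex's actual code. The names shardedCollector, newShardedCollector, Add and Get are hypothetical, and the LRU is left out; only the FNV-1a hash, the modulo 16 and the per-shard lock come from the description above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 16 // fixed, as in the post; not derived from CPU count

type metricShard struct {
	mu      sync.RWMutex
	metrics map[string]float64
}

// shardedCollector is a hypothetical, stripped-down collector for illustration.
type shardedCollector struct {
	shards [shardCount]*metricShard
}

func newShardedCollector() *shardedCollector {
	c := &shardedCollector{}
	for i := range c.shards {
		c.shards[i] = &metricShard{metrics: make(map[string]float64)}
	}
	return c
}

// hashKey maps a metric key to a shard index using 32-bit FNV-1a.
func hashKey(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % shardCount)
}

// Add increments a metric while holding only its own shard's write lock.
func (c *shardedCollector) Add(key string, delta float64) {
	s := c.shards[hashKey(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[key] += delta
}

// Get reads a metric under the shard's read lock.
func (c *shardedCollector) Get(key string) float64 {
	s := c.shards[hashKey(key)]
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.metrics[key]
}

func main() {
	c := newShardedCollector()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Add("orders.created", 1)
		}()
	}
	wg.Wait()
	fmt.Println(c.Get("orders.created")) // 100
}
```

The same key always hashes to the same shard, so writers to one hot metric still serialise on that shard's lock; sharding only helps when load is spread over many keys.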

Atomic counters and per-shard LRU

Not everything needs a lock. The highest-frequency counters — HTTP request count, WebSocket connections — use atomic directly:

atomic.AddInt64(&c.httpRequestCount, 1) // lock-free

Only when a map must be updated alongside the counter (breakdowns by status, by method) does the httpMu lock come into play. That’s the A5 choice: a single counter gets an atomic, a block of related state gets a Mutex.
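The split between the two primitives can be sketched as follows. This is an illustration under assumptions, not the collector's real layout: httpMetrics, Record, Total and ByStatus are hypothetical names, and mu stands in for httpMu.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// httpMetrics is a hypothetical stand-in for the HTTP part of the collector.
type httpMetrics struct {
	requestCount int64 // only ever touched through sync/atomic

	mu       sync.Mutex // plays the role of httpMu: guards byStatus
	byStatus map[int]int64
}

func newHTTPMetrics() *httpMetrics {
	return &httpMetrics{byStatus: make(map[int]int64)}
}

// Record bumps the total lock-free, then updates the breakdown under the mutex.
func (m *httpMetrics) Record(status int) {
	atomic.AddInt64(&m.requestCount, 1)
	m.mu.Lock()
	m.byStatus[status]++
	m.mu.Unlock()
}

// Total reads the counter without taking any lock.
func (m *httpMetrics) Total() int64 { return atomic.LoadInt64(&m.requestCount) }

// ByStatus reads one breakdown entry under the mutex.
func (m *httpMetrics) ByStatus(status int) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.byStatus[status]
}

func main() {
	m := newHTTPMetrics()
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			status := 200
			if i%10 == 0 {
				status = 500
			}
			m.Record(status)
		}(i)
	}
	wg.Wait()
	fmt.Println(m.Total(), m.ByStatus(200), m.ByStatus(500)) // 1000 900 100
}
```

One consequence of this design: the total and the breakdown are not updated as one atomic step, so a reader can briefly see a total that is ahead of the sum of the breakdown. For monitoring counters that is usually acceptable.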

Each shard runs its own LRU eviction: when its metric count exceeds the per-shard cap (maxCardinality / 16), the least-recently-used entry is evicted from that shard’s list.List. The LRU is per-shard — a global LRU would itself become the new bottleneck, undoing the sharding.
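A per-shard LRU over container/list might look like the sketch below. It is a simplified illustration: lruShard, entry, Set and Has are hypothetical names, locking is omitted on the assumption that the caller already holds the shard's lock, and whether Gortex also treats reads as "use" is not something this post states.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value float64
}

// lruShard is one shard's metrics plus its own recency list.
// Locking is omitted: the caller is assumed to hold the shard's lock.
type lruShard struct {
	limit int                      // per-shard cap, e.g. maxCardinality / 16
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> its node in order
}

func newLRUShard(limit int) *lruShard {
	return &lruShard{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// Set stores a value, marks the key as most recently used, and evicts the
// least recently used key if the shard has grown past its cap.
func (s *lruShard) Set(key string, value float64) {
	if el, ok := s.items[key]; ok {
		el.Value.(*entry).value = value
		s.order.MoveToFront(el)
		return
	}
	s.items[key] = s.order.PushFront(&entry{key: key, value: value})
	if s.order.Len() > s.limit {
		oldest := s.order.Back()
		s.order.Remove(oldest)
		delete(s.items, oldest.Value.(*entry).key)
	}
}

// Has reports whether the key is still resident in this shard.
func (s *lruShard) Has(key string) bool {
	_, ok := s.items[key]
	return ok
}

func main() {
	s := newLRUShard(2)
	s.Set("a", 1)
	s.Set("b", 1)
	s.Set("a", 2) // touching "a" makes "b" the least recently used
	s.Set("c", 1) // over the cap: "b" is evicted
	fmt.Println(s.Has("a"), s.Has("b"), s.Has("c")) // true false true
}
```

Because each shard evicts independently, eviction order is only approximately global: a key can be evicted from a full shard while an older key survives in a less crowded one.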

The circuit breaker’s three states

A circuit breaker stops requests hitting a downstream that’s already broken, giving it room to recover. Gortex’s is a textbook three-state machine:

  • Closed: requests pass; failures accumulate, and once ReadyToTrip fires (default “more than 10 requests and a failure ratio above 0.5”) it trips to Open.
  • Open: requests get ErrCircuitOpen immediately, sparing the downstream; after Timeout (default 60s) it moves to Half-Open.
  • Half-Open: only MaxRequests probe requests are admitted; it returns to Closed only after that many succeed, and a single failure throws it back to Open.
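The three transitions above can be sketched as one small mutex-guarded state machine. This is a simplified illustration, not Gortex's implementation: it keeps the state in a plain field instead of an atomic.Value, ignores generations, and the names breaker, Allow, Report, setState and fakeClock are hypothetical. Only the default trip rule, the 60s timeout and the MaxRequests gate are taken from the list.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrCircuitOpen = errors.New("circuit breaker is open")
var ErrTooManyRequests = errors.New("too many half-open requests")

// fakeClock is a hypothetical hand-stepped clock, so the demo needs no sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time { return c.t }

type breaker struct {
	mu                  sync.Mutex
	state               State
	requests, failures  int       // outcomes seen while Closed
	admitted, successes int       // probes admitted / succeeded while Half-Open
	expiry              time.Time // when Open may move to Half-Open
	timeout             time.Duration
	maxRequests         int
	now                 func() time.Time
}

// readyToTrip mirrors the quoted default: more than 10 requests, ratio above 0.5.
func readyToTrip(requests, failures int) bool {
	return requests > 10 && float64(failures)/float64(requests) > 0.5
}

// setState switches state and clears the counters; the caller must hold mu.
func (b *breaker) setState(s State) {
	b.state = s
	b.requests, b.failures, b.admitted, b.successes = 0, 0, 0, 0
	if s == Open {
		b.expiry = b.now().Add(b.timeout)
	}
}

func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open {
		if b.now().Before(b.expiry) {
			return ErrCircuitOpen
		}
		b.setState(HalfOpen) // timeout elapsed: start probing
	}
	if b.state == HalfOpen {
		if b.admitted >= b.maxRequests {
			return ErrTooManyRequests
		}
		b.admitted++
	}
	return nil
}

func (b *breaker) Report(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case Closed:
		b.requests++
		if !ok {
			b.failures++
		}
		if readyToTrip(b.requests, b.failures) {
			b.setState(Open)
		}
	case HalfOpen:
		if !ok {
			b.setState(Open) // one failed probe reopens the circuit
		} else if b.successes++; b.successes >= b.maxRequests {
			b.setState(Closed)
		}
	}
}

func main() {
	clk := &fakeClock{}
	b := &breaker{timeout: 60 * time.Second, maxRequests: 1, now: clk.Now}
	for i := 0; i < 11; i++ { // 11 failed outcomes: more than 10 requests, ratio 1.0
		b.Report(false)
	}
	fmt.Println(b.Allow()) // circuit breaker is open
	clk.t = clk.t.Add(60 * time.Second)
	fmt.Println(b.Allow(), b.Allow()) // <nil> too many half-open requests
	b.Report(true)
	fmt.Println(b.state == Closed) // true
}
```

Without a generation check, a result that arrives after a transition is counted against whatever state the breaker is in by then. That gap is what the generation number closes.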

Two things drive the concurrency control. First, the state lives in an atomic.Value, while counts and expiry are guarded by a sync.Mutex. Second, every request carries a generation: expiry.UnixNano() serves as the generation number, and afterRequest compares the generation captured before the request with the current one, discarding results that belong to a stale generation.

func (cb *CircuitBreaker) onBeforeRequestHalfOpen() (uint64, error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if cb.halfOpen >= cb.config.MaxRequests {
        return 0, ErrTooManyRequests
    }
    cb.halfOpen++
    return uint64(cb.expiry.UnixNano()), nil
}

Note that the half-open admission counter halfOpen is guarded by mu, not an atomic — the gate has to stay consistent with the state/expiry transitions that run under the same lock, or concurrent probes would breach MaxRequests.
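The generation check itself is small enough to show in isolation. The sketch below is an assumption-laden illustration rather than the real afterRequest: genBreaker, beforeRequest, transition and the counted return value are hypothetical, and a state transition is reduced to "replace expiry and clear the counter".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// genBreaker keeps only the fields needed to show the generation check.
type genBreaker struct {
	mu       sync.Mutex
	expiry   time.Time // replaced on every state transition
	failures int
}

func newGenBreaker() *genBreaker {
	return &genBreaker{expiry: time.Unix(1000, 0)} // arbitrary starting expiry
}

// generation derives the generation number from expiry; the caller must hold mu.
func (b *genBreaker) generation() uint64 { return uint64(b.expiry.UnixNano()) }

// beforeRequest hands the caller the generation its request belongs to.
func (b *genBreaker) beforeRequest() uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.generation()
}

// transition stands in for a state change: new expiry, counters cleared.
func (b *genBreaker) transition(timeout time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.expiry = b.expiry.Add(timeout)
	b.failures = 0
}

// afterRequest counts the outcome only if no transition happened in between.
func (b *genBreaker) afterRequest(gen uint64, ok bool) (counted bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if gen != b.generation() {
		return false // stale generation: drop the result
	}
	if !ok {
		b.failures++
	}
	return true
}

func main() {
	b := newGenBreaker()

	stale := b.beforeRequest()     // a request starts
	b.transition(60 * time.Second) // the breaker changes state while it is in flight
	counted := b.afterRequest(stale, false)
	fmt.Println(counted, b.failures) // false 0

	fresh := b.beforeRequest()
	counted = b.afterRequest(fresh, false)
	fmt.Println(counted, b.failures) // true 1
}
```

The slow failure that started before the transition never touches the new generation's counters, so it cannot help trip a breaker that has only just reset.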

Why not channels here

This is where A4’s setup pays off. Follow the actor mindset and funnel every count through a channel into one serialising goroutine, and that goroutine becomes the new global bottleneck — as bad as one global lock, with extra latency on top.

High-frequency counting has the shape of “many goroutines each bumping a number”, best served by atomics (lock-free single value) plus lock sharding (contention spread out). The breaker has the shape of “one small shared state machine”, clearest when guarded by a mutex. Channels earn their place in B6’s “own a piece of state, serialise an event stream” shape. Within the same project, B5 uses atomics and mutexes while B6 uses a channel: pick the tool for the shape of the data rather than reflexively reaching for a channel.
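For comparison, here are the two shapes side by side in a minimal sketch; countViaChannel and countViaAtomic are hypothetical names, not Gortex functions. Both arrive at the same total. The difference is that the channel version pushes every increment through a single receiving goroutine, while the atomic version lets each worker update the value directly. No timings are claimed here; measuring on your own hardware with go test -bench is the honest way to compare.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countViaChannel funnels every increment through one goroutine that owns
// the counter, actor style. All workers queue up on the same channel.
func countViaChannel(workers, perWorker int) int64 {
	incs := make(chan struct{})
	done := make(chan int64)
	go func() {
		var total int64
		for range incs {
			total++
		}
		done <- total
	}()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				incs <- struct{}{}
			}
		}()
	}
	wg.Wait()
	close(incs) // no more increments: lets the owner goroutine finish
	return <-done
}

// countViaAtomic lets every worker bump the shared counter directly.
func countViaAtomic(workers, perWorker int) int64 {
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(countViaChannel(8, 1000), countViaAtomic(8, 1000)) // 8000 8000
}
```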

Takeaways

  • Lock sharding: an FNV hash spreads metrics across 16 shards, each with its own lock, cutting contention to about 1/16.
  • The highest-frequency counters use atomics (lock-free); a Mutex only joins when a map must update too; the LRU is per-shard to avoid a global bottleneck.
  • The breaker’s three states (Closed / Open / Half-Open): state in atomic.Value, counts under a mutex, a generation to discard stale requests; the half-open gate is deliberately mu-guarded for consistency.
  • Why not channels: high-frequency counting and a shared state machine are faster and more direct with atomics/mutexes; channels are for B6’s hub.
  • It’s A4’s “don’t fetishise channels” and A5’s “pick the right primitive” made concrete (B5 locks/atomic ↔ B6 channel, gortex-websocket-actor-hub).

Source: yshengliao/gortex.


Outline by Sheng, drafted with Claude · Go 1.25 (gortex go.mod) · compiled retroactively · part of the 2026-06-13 blog renovation, paint still drying.