Resilience: Circuit Breaker, Retry Policy, and Dead Letter Queue

MIH is built with resilience as a first-class concern. This document explains the three mechanisms that protect Moodle from external service failures.

Overview

Mechanism	Purpose	Scope
Circuit Breaker	Prevents calling services that are known to be down	Per-service
Retry Policy	Automatically retries transient failures	Per-request
Dead Letter Queue	Stores permanently failed events for review	Per event rule

These three mechanisms work together in sequence:

mih::request()
    │
    ├─ Circuit Breaker: Is the service available?
    │   └─ OPEN + cooldown not expired → fail immediately (no network call)
    │
    ├─ Retry Policy: Execute with retries
    │   ├─ Attempt 1 → transport.execute()
    │   ├─ Failure → wait backoff → Attempt 2
    │   ├─ Failure → wait backoff*2 → Attempt 3
    │   └─ ...up to max_retries
    │
    ├─ Circuit Breaker: Record outcome
    │   ├─ Success → record_success() (may close circuit)
    │   └─ Failure → record_failure() (may open circuit)
    │
    └─ Return mih_response

For Event Bridge tasks, if the MIH API call still fails after all retries, Moodle's cron re-runs the task. After 5 total attempts, the event is moved to the DLQ.
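The flow above can be sketched in a few lines. This is an illustrative outline, not MIH's actual code: `guarded_request` and its callable parameters are hypothetical stand-ins for the circuit breaker, the transport, and the outcome recorder, and the sketch throws on an open circuit where the real API may return a failed mih_response instead.

```php
<?php
// Hypothetical sketch of the request flow; none of these names are MIH's real API.

/**
 * @param callable $isavailable Returns false while the circuit is OPEN and cooling down.
 * @param callable $send        Performs one transport call; throws on transport failure.
 * @param callable $record      Receives true on success, false once all attempts failed.
 * @param int      $maxretries  Additional attempts after the first one.
 * @param int      $backoff     Seconds to wait before the first retry.
 * @param callable $wait        Receives the delay in seconds (sleep() in production).
 */
function guarded_request(callable $isavailable, callable $send, callable $record,
        int $maxretries, int $backoff, callable $wait) {
    // Step 1: circuit breaker gate. No network call is made when it rejects.
    if (!$isavailable()) {
        throw new RuntimeException('Circuit open: request rejected');
    }

    // Step 2: retry loop with exponential backoff.
    $last = null;
    for ($attempt = 1; $attempt <= $maxretries + 1; $attempt++) {
        try {
            $result = $send($attempt);
            $record(true);   // Step 3a: success may close the circuit.
            return $result;
        } catch (Exception $e) {
            $last = $e;
            if ($attempt <= $maxretries) {
                $wait(min($backoff * (2 ** ($attempt - 1)), 60));
            }
        }
    }

    $record(false);          // Step 3b: failure may open the circuit.
    throw $last;
}
```

In this sketch the outcome is recorded once per request rather than once per attempt, which is how the diagram orders the steps. Passing the wait function in as a parameter keeps the flow testable without real sleeping.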

Circuit Breaker

States

The circuit breaker has three states:

         [threshold reached]
CLOSED ─────────────────────► OPEN
  ▲                             │
  │ [success]                   │ [cooldown expired]
  │                             ▼
HALFOPEN ◄───────────────────────

State	Behavior
CLOSED	Normal. All requests pass through. Failure counter increments on each failure.
OPEN	Tripped. All requests fail immediately without a network call. Protects the system from piling up requests to a downed service.
HALFOPEN	Recovery probe. One request is allowed through. If it succeeds → CLOSED. If it fails → back to OPEN.

State Transitions

From	To	Condition
CLOSED	OPEN	`failure_count >= cb_failure_threshold`
OPEN	HALFOPEN	`time() - last_failure >= cb_cooldown`
HALFOPEN	CLOSED	Next request succeeds
HALFOPEN	OPEN	Next request fails

Configuration

Per-service settings (configurable in the dashboard):

Setting	DB Column	Default	Description
Failure Threshold	`cb_failure_threshold`	`5`	Consecutive failures before opening
Cooldown	`cb_cooldown`	`30`	Seconds before attempting recovery

Implementation Details

The circuit breaker state is stored in local_integrationhub_cb:

SELECT state, failure_count, last_failure
FROM local_integrationhub_cb
WHERE serviceid = ?

is_available() logic:

public function is_available(): bool {
    if ($this->state === 'closed') {
        return true;
    }
    if ($this->state === 'open') {
        // Check if cooldown has expired
        if (time() - $this->last_failure >= $this->cooldown) {
            $this->transition_to('halfopen');
            return true; // Allow one probe request
        }
        return false; // Still cooling down
    }
    // halfopen: let the probe through (this snippet does not itself limit concurrent probes)
    return true;
}

record_failure() logic:

public function record_failure(): void {
    $this->failure_count++;
    $this->last_failure = time();

    if ($this->failure_count >= $this->threshold) {
        $this->transition_to('open');
    }
    $this->save();
}

record_success() logic:

public function record_success(): void {
    $this->failure_count = 0;
    if ($this->state !== 'closed') {
        $this->transition_to('closed');
    }
    $this->save();
}
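To see the three methods interact over time, here is a small in-memory walk-through. It is a simplified stand-in, not the real class: `demo_circuit_breaker` is a hypothetical name, the clock is passed in explicitly so the example is deterministic, and nothing is persisted to local_integrationhub_cb.

```php
<?php
// In-memory stand-in for illustration only; the class name is hypothetical.
final class demo_circuit_breaker {
    public string $state = 'closed';
    private int $failurecount = 0;
    private int $lastfailure = 0;
    private int $threshold;
    private int $cooldown;

    public function __construct(int $threshold, int $cooldown) {
        $this->threshold = $threshold;
        $this->cooldown = $cooldown;
    }

    public function is_available(int $now): bool {
        if ($this->state === 'open' && $now - $this->lastfailure >= $this->cooldown) {
            $this->state = 'halfopen'; // Cooldown expired: allow a probe.
        }
        return $this->state !== 'open';
    }

    public function record_failure(int $now): void {
        $this->failurecount++;
        $this->lastfailure = $now;
        if ($this->failurecount >= $this->threshold) {
            $this->state = 'open';
        }
    }

    public function record_success(): void {
        $this->failurecount = 0;
        $this->state = 'closed';
    }
}

$cb = new demo_circuit_breaker(2, 30);   // threshold 2, cooldown 30s
$cb->record_failure(100);
$cb->record_failure(101);                // Second consecutive failure reaches the threshold.
$trace = [$cb->state];                                       // 'open'
$trace[] = $cb->is_available(110) ? 'allowed' : 'rejected';  // 9s elapsed: rejected
$trace[] = $cb->is_available(131) ? 'allowed' : 'rejected';  // 30s elapsed: probe allowed
$trace[] = $cb->state;                                       // 'halfopen'
$cb->record_success();
$trace[] = $cb->state;                                       // 'closed'
```

Note that the cooldown is measured from the most recent failure, so a failed probe in the half-open state restarts the full cooldown.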

Manual Reset

From the dashboard, click Reset Circuit to force a service back to CLOSED:

public function reset(): void {
    $this->state         = 'closed';
    $this->failure_count = 0;
    $this->last_failure  = 0;
    $this->save();
}

Use this when:

You have confirmed the external service has recovered
You want to test a service without waiting for the cooldown
A false positive tripped the circuit (e.g., a one-time network blip)

Retry Policy

Algorithm

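MIH uses exponential backoff: each retry waits twice as long as the previous one, with each individual delay capped at 60 seconds. Here attempt is the number of the attempt that just failed, so the first retry waits backoff seconds:

delay(attempt) = min(backoff * 2^(attempt-1), 60)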

With max_retries = 3 and retry_backoff = 1:

Attempt	Delay Before This Attempt
1 (initial)	0s (immediate)
2 (retry 1)	1s
3 (retry 2)	2s
4 (retry 3)	4s

Total maximum wait between attempts: 7 seconds (1 + 2 + 4), not counting the time each attempt itself takes.

With max_retries = 5 and retry_backoff = 2:

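Attempt	Delay Before This Attempt
1 (initial)	0s (immediate)
2 (retry 1)	2s
3 (retry 2)	4s
4 (retry 3)	8s
5 (retry 4)	16s
6 (retry 5)	32s

Total maximum wait between attempts: 62 seconds (2 + 4 + 8 + 16 + 32). The 60-second cap does not come into play here because no single delay exceeds it.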
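The tables above can be reproduced with a short helper. This is a sketch for checking your own settings, assuming integer values for both parameters; `retry_delays` is a hypothetical function, not part of MIH.

```php
<?php
/**
 * Delays (in seconds) slept before each retry, following
 * min(backoff * 2^(n-1), cap) for retry n = 1..maxretries.
 */
function retry_delays(int $maxretries, int $backoff, int $cap = 60): array {
    $delays = [];
    for ($retry = 1; $retry <= $maxretries; $retry++) {
        $delays[] = min($backoff * (2 ** ($retry - 1)), $cap);
    }
    return $delays;
}

$default = retry_delays(3, 1);   // [1, 2, 4], total 7
$longer  = retry_delays(5, 2);   // [2, 4, 8, 16, 32], total 62
$capped  = retry_delays(7, 2);   // [2, 4, 8, 16, 32, 60, 60]: the cap applies from retry 6
```

The last call shows where the cap starts to matter: with a backoff of 2, the sixth retry would otherwise wait 64 seconds.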

Configuration

Per-service settings:

Setting	DB Column	Default	Description
Max Retries	`max_retries`	`3`	Additional attempts after the first failure
Initial Backoff	`retry_backoff`	`1`	Seconds before the first retry

What Triggers a Retry

The retry policy retries on any exception thrown by the transport driver. This includes:

Network timeouts (cURL timeout)
Connection refused
DNS resolution failure
AMQP connection errors

It does not retry based on HTTP status codes. By default, a 500 Internal Server Error is returned as a failed mih_response and does not trigger a retry, because the transport returned a result rather than throwing an exception.

Design note: This is intentional. HTTP 5xx errors may indicate a permanent server-side issue (e.g., a bug in the external API). Retrying them blindly could cause duplicate processing on the external side. If you need to retry on 5xx, implement that logic in your calling code.
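If you do decide to retry on 5xx in your calling code, one possible shape is shown below. Everything here is an assumption made for illustration: `request_with_5xx_retry` is a hypothetical helper, `$send` stands in for whatever wraps mih::request(), and the response is modeled as a plain array with a `status` key because the real mih_response accessors are not described in this document.

```php
<?php
/**
 * Caller-side retry on HTTP 5xx (hypothetical helper).
 *
 * @param callable $send        Performs one request and returns ['status' => int, ...].
 * @param int      $maxattempts Total attempts, including the first one.
 * @param callable $wait        Receives the delay in seconds (sleep() in production).
 */
function request_with_5xx_retry(callable $send, int $maxattempts, callable $wait): array {
    $response = ['status' => 0];
    for ($attempt = 1; $attempt <= $maxattempts; $attempt++) {
        $response = $send($attempt);
        $status = $response['status'];
        if ($status < 500 || $status > 599) {
            return $response; // Success, or a status such as 4xx that a retry will not fix.
        }
        if ($attempt < $maxattempts) {
            $wait(min(2 ** ($attempt - 1), 60)); // Doubling schedule with a 1s base.
        }
    }
    return $response; // Still 5xx after the last attempt.
}
```

Only use a wrapper like this for requests that are safe to repeat, since the external service may have processed the request before returning the error.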

Implementation

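public function execute(callable $operation): mixed {
    $lastexception = null;

    // For the tables above to hold, maxattempts must equal max_retries + 1
    // (the initial attempt plus the configured retries).
    for ($attempt = 1; $attempt <= $this->maxattempts; $attempt++) {
        try {
            return $operation($attempt);
        } catch (\Exception $e) {
            $lastexception = $e;

            // Back off before the next attempt; skip the wait after the last one.
            if ($attempt < $this->maxattempts) {
                $delay = min($this->backoff * (2 ** ($attempt - 1)), 60);
                sleep((int) $delay);
            }
        }
    }

    // If maxattempts < 1 the loop never ran, so there is no exception to rethrow.
    throw $lastexception ?? new \LogicException('Retry policy requires at least one attempt.');
}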

Dead Letter Queue (DLQ)

Purpose

The DLQ is a safety net for the Event Bridge. When an event cannot be delivered after all retry attempts, it is stored in the DLQ instead of being silently dropped.

When Events Go to the DLQ

1. dispatch_event_task fails (an exception is thrown).
2. Moodle retries the task (up to Moodle's own retry limit).
3. MIH tracks its own attempt counter in custom_data.
4. After 5 total attempts, the task calls move_to_dlq() and returns without rethrowing.
5. Moodle marks the task as complete (no more retries).
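The sequence above can be simulated without Moodle. The sketch below is illustrative only: `run_dispatch`, the `attempts` key, and the callables are hypothetical, and how MIH actually persists its counter between runs is not shown in this document. The simulation returns a status where the real task would either rethrow or return normally.

```php
<?php
// Hypothetical simulation; these names are not MIH's actual task internals.
const MAX_TOTAL_ATTEMPTS = 5; // The limit stated above.

/**
 * One run of the task. Returns [status, customdata] where status is
 * 'done', 'retry' (the real task would rethrow so Moodle re-runs it),
 * or 'dlq' (the real task returns normally so Moodle marks it complete).
 */
function run_dispatch(array $customdata, callable $dispatch, callable $movetodlq): array {
    $customdata['attempts'] = ($customdata['attempts'] ?? 0) + 1;
    try {
        $dispatch($customdata['payload']);
        return ['done', $customdata];
    } catch (Exception $e) {
        if ($customdata['attempts'] >= MAX_TOTAL_ATTEMPTS) {
            $movetodlq($customdata['payload'], $e->getMessage());
            return ['dlq', $customdata];
        }
        return ['retry', $customdata];
    }
}

// Simulate a service that never recovers.
$dlq = [];
$alwaysfails = function (string $payload): void {
    throw new RuntimeException('Connection refused');
};
$movetodlq = function (string $payload, string $error) use (&$dlq): void {
    $dlq[] = ['payload' => $payload, 'error_message' => $error];
};

$data = ['payload' => '{"userid":42}'];
$runs = 0;
do {
    [$status, $data] = run_dispatch($data, $alwaysfails, $movetodlq);
    $runs++;
} while ($status === 'retry');
// $runs is 5, $status is 'dlq', and $dlq holds one entry with the last error.
```

Keeping the counter alongside the payload means the limit applies per event, so one persistently failing event does not affect the attempt budget of any other.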

DLQ Table Structure

local_integrationhub_dlq:
  id            INT          -- Primary key
  eventname     VARCHAR(255) -- Event class name
  serviceid     INT          -- Target service ID
  payload       TEXT         -- JSON payload that failed
  error_message TEXT         -- Last error message
  timecreated   INT          -- Timestamp of failure

Reviewing the DLQ

Navigate to /local/integrationhub/queue.php:

View all failed events with their error messages
See the exact payload that was attempted
Identify patterns (e.g., all failures for one service = service is down)

Replaying DLQ Events

Click Replay on any DLQ entry to re-queue it as a new adhoc task. The task will go through the full dispatch flow again (template interpolation, MIH API call, retries).

Use replay when:

The external service has recovered
You fixed a bug in the payload template
A network issue was temporary

Deleting DLQ Events

Click Delete to permanently remove a DLQ entry. Use this when:

The event is no longer relevant
The external service no longer exists
You have processed the event manually

Tuning Recommendations

High-Traffic Production

cb_failure_threshold = 10    (more tolerance for occasional failures)
cb_cooldown          = 60    (longer recovery window)
max_retries          = 2     (fewer retries to avoid blocking cron)
retry_backoff        = 1     (fast retries)
timeout              = 3     (short timeout to fail fast)

Low-Traffic / Development

cb_failure_threshold = 3     (trip quickly to catch issues)
cb_cooldown          = 10    (recover quickly for testing)
max_retries          = 3     (standard retries)
retry_backoff        = 1
timeout              = 10    (more lenient for slow dev servers)

Critical Integrations (Must Not Lose Events)

max_retries          = 5     (more retries before DLQ)
retry_backoff        = 2     (longer backoff)
cb_failure_threshold = 20    (very tolerant circuit)
cb_cooldown          = 120   (long cooldown)

Also monitor the DLQ regularly to catch any events that still end up there.
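When choosing values, it helps to estimate how long a single request can block its caller in the worst case. The helper below is a rough sketch under a stated assumption: every attempt runs into the full timeout and every retry delay is slept in full. `worst_case_blocking_seconds` is a hypothetical function, not part of MIH.

```php
<?php
/**
 * Rough upper bound, in seconds, on how long one request can block,
 * assuming each attempt takes the full timeout before failing.
 */
function worst_case_blocking_seconds(int $maxretries, int $backoff, int $timeout): int {
    $total = ($maxretries + 1) * $timeout; // One timeout per attempt.
    for ($retry = 1; $retry <= $maxretries; $retry++) {
        $total += min($backoff * (2 ** ($retry - 1)), 60); // Capped backoff delay.
    }
    return $total;
}

$hightraffic = worst_case_blocking_seconds(2, 1, 3);   // 3*3 + (1+2) = 12
$development = worst_case_blocking_seconds(3, 1, 10);  // 4*10 + (1+2+4) = 47
```

The critical-integrations profile above does not list a timeout; with max_retries = 5 and retry_backoff = 2 its bound is 62 seconds of backoff plus six times whatever timeout you choose.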
