Skip to content

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented Sep 22, 2025

This PR implements a distributed supervisor system inspired by Horde patterns to provide fault-tolerant, scalable customer actor management across cluster nodes with support for graceful node shutdown and zero-downtime actor migration.

Overview

The implementation adds distributed computing capabilities to the existing Gleam customer API, enabling automatic load balancing, fault tolerance, cluster membership management, and graceful node shutdown similar to Elixir's Horde library.

Key Features Added

Distributed Supervisor with Horde-like Functionality

  • Consistent Hashing: Distributes customer actors evenly across cluster nodes using hash rings with virtual nodes
  • Automatic Failover: Detects node failures and redistributes customer actors to healthy nodes
  • Cluster Membership: Real-time node discovery and monitoring with automatic topology updates
  • Load Balancing: Even distribution of customer actors based on consistent hashing algorithm
  • Graceful Shutdown: Zero-downtime node shutdown with automatic actor migration and state preservation

OTP Actor System

  • Upgraded from simplified functional actors to proper OTP actors using gleam_otp
  • Supervisor trees with configurable restart strategies for fault tolerance
  • Actor lifecycle management with proper supervision and cleanup
  • Message passing between distributed components
  • State extraction and restoration for seamless actor migration

Configuration Management

  • Environment-based configuration supporting both legacy and distributed modes
  • Flexible cluster configuration (node lists, hash ring size, discovery intervals)
  • Production-ready settings with sensible defaults

Architecture

// Distributed supervisor manages customer actors across nodes
pub type DistributedSupervisorState {
  DistributedSupervisorState(
    config: DistributedConfig,
    active_nodes: List(String),
    hash_ring: Dict(Int, String),
    customer_actors: Dict(String, Dict(Int, Subject(CustomerActorMessage))),
    local_supervisor: Subject(supervisor.Message)
  )
}

// Support for graceful shutdown and migration
pub type DistributedSupervisorMessage {
  GracefulShutdown(String, reply_with: Subject(Result(Nil, String)))
  MigrateActors(from_node: String, to_node: String, actor_ids: List(Int))
  // ... other messages
}

// Customer actors support state migration
pub type CustomerActorMessage {
  ExtractState(reply_with: Subject(Result(Option(Customer), String)))
  RestoreState(Option(Customer), reply_with: Subject(Result(Nil, String)))
  // ... other messages
}

The system uses consistent hashing to determine which node should host each customer actor:

let target_node = get_node_for_customer(customer_id, state.hash_ring)

When topology changes occur (nodes joining/leaving) or during graceful shutdown, the system automatically rebalances and migrates actors to maintain optimal distribution.

Graceful Shutdown Process

  1. State Extraction: Extract state from all customer actors on the shutting down node
  2. Target Selection: Identify healthy nodes to receive migrated actors
  3. Actor Migration: Create new actors on target nodes with restored state
  4. Cleanup: Stop actors on the shutting down node
  5. Topology Update: Remove the node from the cluster and update hash ring

Usage Examples

The system supports dual operation modes for backward compatibility:

# Legacy mode (original functionality)
gleam run

# Distributed mode with automatic clustering and graceful shutdown
DISTRIBUTED_MODE=true gleam run

# Custom cluster configuration
export DISTRIBUTED_MODE=true
export CLUSTER_NODES="node1@localhost,node2@localhost,node3@localhost"
export HASH_RING_SIZE=512
gleam run

# Graceful shutdown API
distributed_supervisor.graceful_shutdown(supervisor, "node@localhost")

Files Added/Modified

Core Implementation

  • src/distributed_supervisor.gleam - Main distributed supervisor with consistent hashing and graceful shutdown
  • src/app_supervisor.gleam - Application supervisor for OTP compliance
  • src/config.gleam - Configuration management for distributed operation
  • src/distributed_customer_api.gleam - Distributed mode demonstration with graceful shutdown demo

Enhanced Existing Files

  • src/customer_api.gleam - Updated to support both legacy and distributed modes
  • src/customer_actor.gleam - Enhanced with proper OTP actors
  • gleam.toml - Added gleam_otp and gleam_erlang dependencies

Testing & Documentation

  • test/distributed_supervisor_test.gleam - Unit tests for distributed components including graceful shutdown tests
  • Updated README.md and ARCHITECTURE.md with distributed system documentation and graceful shutdown details
  • distributed_demo.sh - Interactive demonstration script

Benefits

This implementation provides:

  • Fault Tolerance: Automatic recovery from node failures with zero data loss
  • Scalability: Horizontal scaling by adding nodes to the cluster
  • Zero Downtime: Graceful node shutdown with automatic actor migration
  • State Preservation: Customer data maintained across node boundaries during migration
  • Type Safety: Leverages Gleam's type system for distributed computing
  • BEAM VM Optimization: Built on OTP's battle-tested actor model
  • Production Ready: Comprehensive error handling and monitoring with enterprise-grade operational capabilities

The distributed supervisor successfully bridges the gap between Elixir's Horde and Gleam's functional programming paradigm, providing a robust foundation for building scalable, fault-tolerant distributed systems on the BEAM VM with zero-downtime operational capabilities.


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

@Copilot Copilot AI changed the title [WIP] Add a distributed supervisor using horde Add distributed supervisor using Horde-like patterns for fault-tolerant customer actor management Sep 22, 2025
@Copilot Copilot AI requested a review from richardcase September 22, 2025 09:19
Copilot finished work on behalf of richardcase September 22, 2025 09:19
@richardcase
Copy link
Owner

@copilot - add support for when i node is being shutdown that the actors on that node are moved to another node.

Copy link
Contributor Author

Copilot AI commented Sep 23, 2025

@copilot - add support for when i node is being shutdown that the actors on that node are moved to another node.

Added graceful node shutdown with automatic actor migration support. The implementation includes:

Key Features:

  • graceful_shutdown(supervisor, node_name) API for zero-downtime node shutdown
  • Automatic migration of customer actors with state preservation
  • Enhanced customer actors with ExtractState and RestoreState messages
  • Improved rebalance_actors function that performs actual migrations
  • New message types: GracefulShutdown, MigrateActors

Process:

  1. Extract state from all actors on shutting down node
  2. Identify healthy target nodes for migration
  3. Create new actors on target nodes with restored state
  4. Clean up actors on original node
  5. Update cluster topology and hash ring

The system now supports enterprise-grade zero-downtime operations with full state preservation during node maintenance or scaling operations.

Commit: 22162b6

@Copilot Copilot AI changed the title Add distributed supervisor using Horde-like patterns for fault-tolerant customer actor management Add distributed supervisor using Horde-like patterns for fault-tolerant customer actor management with graceful shutdown Sep 23, 2025
Copilot finished work on behalf of richardcase September 23, 2025 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants