Persistent Worker Policy¶
This document records durable workflow contracts for SIMPLE's persistent-worker queue-system overlay. It is policy, not a line-by-line implementation map.
This document is the single policy authority for the persistent-worker backend.
1. Core Model¶
The persistent-worker backend is a dispatch overlay on the existing qsys system.
Standard qsys backends still generate and run SIMPLE bash scripts. The persistent-worker overlay changes how those scripts are dispatched:
qsys_envdetects a worker qsys name such aslocal_workerorslurm_worker- the underlying base backend launches long-lived
simple_persistent_workerprocesses - generated SIMPLE job scripts are queued to a TCP worker server instead of being submitted directly to the base backend
- workers poll the server by heartbeat and execute script paths in local pthread slots
- the existing SIMPLE scheduler still observes completion through filesystem sentinel files
The persistent-worker backend does not change scientific workflow semantics. It only changes the transport used for generated distributed scripts.
2. Ownership¶
simple_qsys_env.f90 owns worker-overlay activation, base-backend creation,
dispatch-backend creation, worker-server reuse checks, worker-process launch,
and qsys environment teardown.
simple_qsys_factory.f90 owns construction of the internal
persistent_worker qsys backend.
simple_qsys_persistent_worker.f90 owns the qsys backend type used for script
dispatch after the persistent pool is running. It is a qsys submission object,
not the worker-server owner.
simple_qsys_ctrl.f90 owns script generation, scheduler loops, filesystem
completion polling, and dispatching generated script paths to the persistent
worker server.
simple_persistent_worker_server.f90 owns the runtime singleton, TCP listener,
worker registry, task queues, worker heartbeats, task selection, and
termination replies.
simple_persistent_worker_message_*.f90 owns the fixed wire-message types and
their serialisation contracts.
production/simple_persistent_worker.f90 owns the compute-node worker process:
argument parsing, heartbeat loop, local thread-slot management, script
execution, and worker-process cleanup.
Top-level executables such as simple_exec, simple_stream, and single_exec
own final worker-server shutdown when they launched a persistent pool.
3. Backend Selection¶
A qsys name containing _worker activates the persistent-worker overlay. The
local qsys name is stripped before constructing the base backend. The dispatch
backend is then constructed with the internal factory name persistent_worker.
Examples:
local_workerlaunches workers throughlocalslurm_workerlaunches workers throughslurmlsf_worker,pbs_worker, andsge_workerfollow the same overlay pattern
qsys_env keeps the two roles separate:
base_qsysandbase_qscriptslaunch persistent worker processes through the underlying schedulerdispatch_qsysandqscriptsqueue generated SIMPLE scripts to the running persistent-worker server
Do not collapse these roles back into one ambiguous qsys object.
4. Runtime and Configuration¶
The module-level persistent_worker runtime lives in
simple_persistent_worker_server.f90. It records:
- the owning
persistent_worker_serverpointer - the launch backend name
- the number of persistent worker processes launched
- the thread capacity per worker process
Worker concurrency is resolved in qsys_env%new():
nthr_per_worker = params%nthr
if worker_nthr > 0, use worker_nthr
if qsys_nthr is present, use qsys_nthr
n_workers = max(1, params%ncunits)
if workers > 0, use workers
worker_nthr controls advertised thread capacity per worker process.
workers controls how many persistent worker processes are launched. The
existing qsys_ctrl scheduler still controls in-flight generated scripts using
its normal ncunits bookkeeping and filesystem completion files.
An existing worker server may be reused only when it is running and compatible:
- same launch backend
- existing
nthr_per_workeris at least the requested value - existing worker count is at least the requested value
Otherwise the code throws rather than silently reusing the wrong pool.
5. Worker Server¶
persistent_worker_server%new(nthr_workers) allocates listener state, allocates
shared worker data, initialises the listener mutex, propagates the debug flag,
starts the TCP listener, and records the bound port and host IP list.
persistent_worker_data is the only server-side data written by the listener
pthread. It holds:
- worker heartbeat registry
- high, normal, and low priority task queues
- termination and debug flags
All mutable persistent_worker_data fields are protected by the listener
mutex. queue_task also increments job_count and assigns task%job_id inside
that mutex.
The listener thread must not perform blocking network replies while holding the mutex. It copies the selected task to local storage, releases the mutex, and then sends exactly one reply.
Task selection order is high, normal, low. A task is dispatchable only when the
worker heartbeat reports enough free thread capacity for task%nthr.
Production qsys submissions currently use normal priority. The high and low
queues are implemented and tested through the server API, but they are not used
by qsys_ctrl.
6. Message Protocol¶
The wire protocol has four fixed message types:
| Message | Value | Direction | Purpose |
|---|---|---|---|
WORKER_TERMINATE_MSG |
1 | server to worker | orderly shutdown |
WORKER_HEARTBEAT_MSG |
2 | worker to server | liveness and thread-load report |
WORKER_TASK_MSG |
3 | server to worker | script dispatch |
WORKER_STATUS_MSG |
4 | server to worker | idle or error reply |
Every concrete message type must override serialise and allocate the buffer
with sizeof(self) where self is declared as the concrete type. Calling the
base serialise for derived messages truncates the payload to the base-message
layout.
TCP_BUFSZ = 1460 is the shared socket buffer size and must be large enough for
every message type.
7. Script Dispatch and Completion¶
Generated SIMPLE bash scripts remain the unit of distributed work.
When the active qsys object is qsys_persistent_worker,
qsys_ctrl calls dispatch_task_to_persistent_worker, creates a
qsys_persistent_worker_message_task, fills nthr and script_path, and
queues the task as normal priority.
If the worker server is missing or the queue is full, the current dispatch path logs the failure. It does not currently propagate enqueue failure back to the scheduler as a failed submission.
The existing scheduler remains the source of truth for completion:
- batch scheduling watches
JOB_FINISHED* - streaming scheduling watches
EXIT_CODE_JOB_*and done/fail stacks
The worker server is a dispatch transport. It is not currently a completion authority.
For the normal-priority production path, the listener marks a dispatched normal task inactive immediately before queue compaction. This is the current ownership-transfer model: after dispatch, filesystem completion files decide whether the job finished. High and low priority queue entries do not currently use that same immediate-removal path and should be reviewed before production use.
8. Worker Process¶
simple_persistent_worker accepts only key-value command-line arguments:
nthrportserverworker_id
port and server are mandatory. Missing or invalid values terminate the
worker with stop 1. nthr < 1 is forced to one thread slot.
Each worker computes a stable process identity:
worker_uid = HOSTNAME_PID
The UID is included in every heartbeat. If the server sees a different UID for
an occupied worker_id, it sends WORKER_TERMINATE_MSG so a restarted process
does not inherit stale identity.
The heartbeat loop sends:
worker_idworker_uid- epoch heartbeat time
- total local thread slots
- currently used thread slots
The worker accepts at most one task per heartbeat reply. Each accepted task is run as:
bash <script_path>
The script does not need executable permissions because bash is explicit.
9. Thread Slots¶
Each worker slot has its own mutex and local state:
- task message
- pthread handle
startedcompletenthrexit_code
nthr == 0 is the authoritative idle signal. get_nthr_used sums the nthr
fields while locking one slot at a time.
A slot must be claimed before pthread_create by writing the task, setting
nthr, clearing complete, and setting started under the slot mutex. If a
previous pthread handle still exists, it is joined before the slot is reused.
On normal task exit, the worker thread records the script exit code, sets
complete = .true., and sets nthr = 0 under the slot mutex.
On worker-process cleanup, incomplete started threads are cancelled, started threads are joined, slot state is reset, and each slot mutex is destroyed.
10. Current Limits¶
These are current implementation facts, not desired end-state contracts:
- workers do not send completion acknowledgements back to the server
- task timing fields in
qsys_persistent_worker_message_taskexist, but the production path does not currently maintain them end-to-end - enqueue failure is logged but not returned to
qsys_ctrlas a failed submission qsys_env%kill()does not kill the persistent-worker server; final shutdown is done by top-level executables- top-level shutdown kills the server but does not deallocate and nullify the module-level server pointer
simple_persistent_workercontainsis_safe_script_path, but the task path currently bypasses it withsafe_path = .true.qsys_envstrips_workerin the local backend-selection variable; code that readsqdescr('qsys_name')still needs to be checked for_workernames
Do not write future policy as though these limitations have already been removed.
11. Invariants¶
- Persistent workers execute generated scripts; they do not understand SIMPLE task graphs or scientific algorithms.
- The base qsys backend launches persistent worker processes.
- The
persistent_workerqsys backend dispatches generated scripts to the TCP worker server. - Worker-server task dispatch must preserve existing filesystem completion semantics until a real completion-acknowledgement protocol replaces them.
- The listener mutex is initialised by
persistent_worker_server%newand destroyed bypersistent_worker_server%killafter the listener thread has been joined. job_countincrement and taskjob_idassignment stay inside the server mutex.- Blocking network replies stay outside the server mutex.
- Concrete wire messages override
serialisewith concretesizeof(self). - Worker slots use
nthr == 0as the idle signal. - Worker task execution remains explicit
bash <script_path>. - Production use must not claim script-path validation is active while
safe_path = .true.remains in the worker.
12. Review Checklist¶
For persistent-worker changes, check:
- Does
_workerstill select a base backend and a separate dispatch backend? - Are worker-server reuse checks still preventing incompatible pool reuse?
- Does
qsys_ctrlstill have a valid completion path if worker dispatch succeeds? - Can enqueue failure be observed by the scheduler, or is it still only logged?
- Are task-queue ownership rules explicit for every priority path in use?
- Are server mutex boundaries free of blocking socket writes?
- Are worker thread slots claimed before pthread creation and joined before reuse or destruction?
- Are top-level executables still shutting down any launched worker server?
- Is the document still honest about script-path validation and completion acknowledgement state?