Refactoring: Merge Projects With Distinct CTF Models
Implementation Status
The first implementation phase is in place:
- src/main/commanders/simple/simple_commanders_project_core.f90 routes merge_projects through the dedicated project table parameter projtab and requires projfile_merged as the explicit output project.
- src/main/ui/simple/simple_ui_project.f90 exposes the file-table-only interface and no longer treats merge_projects as an in-place two-project project command.
- src/fileio/simple_projfile_utils.f90 has a reusable merge_selected_project_files helper, following the same file-array merge shape as merge_chunk_projfiles.
- Input .simple files are inspected through existing sp_project segment-info and segment-read methods, avoiding the full project read path that rewrites projinfo.
- If a listed project file contains only metadata, the helper fails with the offending filename and, when possible, a hint pointing at a nearby data-bearing numbered-stage project file.
The helper is now a generic project-field merger. It accepts any data-bearing project shape whose populated data segments match across all inputs:
- movie/micrograph-only projects through os_mic
- stack-only projects through os_stk
- particle projects through os_ptcl2D and/or os_ptcl3D
- class/output/optics-bearing projects when those segments are present in all inputs
Top-level project files that contain only metadata segments such as projinfo
and compenv are rejected because there is no data segment to merge.
The helper currently:
- validates all populated mergeable data segments as all-or-none across inputs
- validates stack box and sampling distance when stack rows carry those fields
- validates micrograph sampling distance when mic-only rows carry smpd
- validates row-level CTF-model fields on os_mic and os_stk when CTF is enabled
- validates particle stkind when particles are merged with stacks
- preserves complete rows with transfer_ori
- precomputes output row offsets and parallelizes independent row transfer/remap loops for mic, stack, optics, ptcl2D, and ptcl3D segments
- parallelizes large particle validation scans and row-level ogid discovery while reporting deterministic first failing rows
- preserves exact os_ptcl2D%state values when ptcl2D exists
- remaps stack fromp/top only when particle rows are present, remaps particle stkind and row-level ogid, and clears stale ptcl2D clustering fields while preserving ptcl2D%state
- remaps ogid independently of whether os_optics exists
- copies/remaps os_optics only when the optics segment is part of the matching input field set
- rebuilds os_ptcl3D from merged os_ptcl2D only for the historical ptcl2D-only input shape, then removes 2D-clustering fields from ptcl3D; this synthetic ptcl3D row construction is also parallelized
No os_optics backfill path was added. sp_project%get_ctfparams already
reads smpd, ctf, kv, cs, fraca, and phaseplate from os_stk for
particle analysis and particle-specific defocus from particle rows, so the
remaining resolver-related work is an audit of direct CTF construction outside
that project API plus dedicated merge tests.
Goal
Support N-project merging for SIMPLE projects that may come from different microscopes or CTF models but have compatible analysis data. The immediate use case is merging independently processed projects that have already been scaled to the same sampling distance, but the merge utility should not be hard-coded to a 2D-selected particle project shape.
The general contract is:
- Every input has the same populated project data segments.
- Row counts may differ.
- Segment rows are preserved verbatim except for required merge-local rewrites.
- Cross-project local namespaces and foreign keys are remapped.
- CTF-model values come from authoritative row-level fields, not from
os_optics.
For CTF-aware rows, the merger must preserve:
- smpd
- kv
- cs
- fraca
- ctf
- phaseplate
fraca is deliberately included. It is not a microscope hardware constant, but
it changes the CTF and therefore must be merged and resolved with kv, cs,
and smpd.
Non-Goals
Do not change heterogeneous movie, micrograph, or particle import in this refactor. The current scenario assumes data have already been imported and processed in separate internally consistent projects when needed.
Do not make os_optics required. Some datasets do not have optics-group
metadata, depending on collection and import path.
Do not use os_optics as a source of truth or a repair source for missing
authoritative row-level CTF fields.
Do not rescale or re-extract during merge. For stack/particle projects, the current target requires input stacks to have already been scaled/extracted to the same sampling distance and compatible box size. Mismatches should fail validation with an actionable message.
Do not attempt to scientifically reconcile per-project class averages,
reconstructions, or output artifacts. For merge_projects, cls2D, cls3D,
and out are intentionally ignored on input and left empty in the merged
output.
Target Policy
The authoritative CTF-model representation is the row-level bundle used by each processing stage:
- os_mic: source of truth for movie/micrograph CTF fitting history.
- os_stk: source of truth for particle-stack-level CTF constants used by CTF-aware 2D/3D analysis.
- os_ptcl2D / os_ptcl3D: source of truth for particle-level CTF values such as defocus, astigmatism angle, phase shift, stkind, and selection state.
os_optics rows are optional metadata. Row-level ogid values, when present,
are optics-group assignments and must be remapped during merge even if the input
project has no os_optics segment. The merge must still work from row-level
CTF-model fields when no durable optics rows exist.
Because this generic merger requires matching populated data segments, mixed
presence of os_optics is treated as a project-shape mismatch in this phase. If
none of the input projects have os_optics, the merge is still valid and any
row-level ogid values are remapped. If future workflows need mixed optional
optics metadata, that should be an explicit extension rather than an implicit
backfill from optics rows.
The merge must not use os_optics to repair missing CTF-model fields. For this
workflow the authoritative mic, stk, and particle rows are expected to
already contain the values needed for CTF fitting and downstream analysis. A
missing authoritative field is a validation error unless CTF is explicitly
disabled for that row.
All active stacks entering a merged 2D/3D project should have one effective analysis sampling distance. Different sampling distances should be handled by an explicit scaling or re-extraction step before this merge.
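The common-sampling and common-box rule can be sketched as a pure check over stack rows. This is an illustration in Python rather than the project's Fortran; the StackRow type, the function name, and the smpd tolerance are assumptions made for the example, not names or values from the code base.

```python
from dataclasses import dataclass


@dataclass
class StackRow:
    smpd: float  # sampling distance, Angstrom per pixel
    box: int     # particle box size, pixels


def check_common_smpd_and_box(projects, smpd_tol=1e-3):
    """projects maps project filename -> list of StackRow.

    Returns a list of actionable error strings; an empty list means all
    stack rows agree with the first row seen. smpd_tol is an assumed
    tolerance, not a value taken from the project.
    """
    errors = []
    ref = None  # (filename, row number, row) of the first stack row seen
    for fname, rows in projects.items():
        for irow, row in enumerate(rows, start=1):
            if ref is None:
                ref = (fname, irow, row)
                continue
            ref_fname, ref_irow, ref_row = ref
            if abs(row.smpd - ref_row.smpd) > smpd_tol:
                errors.append(
                    f"{fname} stack row {irow}: smpd {row.smpd} differs from "
                    f"{ref_row.smpd} in {ref_fname} stack row {ref_irow}; "
                    "rescale or re-extract before merging"
                )
            if row.box != ref_row.box:
                errors.append(
                    f"{fname} stack row {irow}: box {row.box} differs from "
                    f"{ref_row.box} in {ref_fname} stack row {ref_irow}; "
                    "re-extract to a common box before merging"
                )
    return errors
```

Each message names both the offending row and the reference row, so the user can see which project needs rescaling without the merge attempting it.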
Original Code Findings
Before this refactor, the merge_projects commander in
src/main/commanders/simple/simple_commanders_project_core.f90 was not the
right abstraction:
- It supported exactly two projects by hard-coding nprojs = 2.
- It read projfile and projfile_target, appended the target into the first project with sp_project%append_project, and wrote back to projfile.
- It detected different box or sampling distance and then ran reextract. Heterogeneous CTF-model merging should instead require compatible data for the current scenario.
- It delegated row merging to sp_project%append_project, which mixed merge policy with project mutation.
- It remapped ogid only inside the os_optics branch, missing projects that carry row-level optics-group assignments without an os_optics table.
The stream cluster2D_subsets path has a better framework for project
aggregation:
- src/main/stream/simple_stream_cluster2D_subsets.f90 tracks project files in rec_list/chunk_rec and passes arrays of project filenames into a merge utility.
- src/fileio/simple_projfile_utils.f90::merge_chunk_projfiles accepts an array of project files, allocates output project segments, transfers orientations with transfer_ori, remaps stack and particle indices, and optionally writes the merged project.
- That framework keeps commanders thin and puts reusable project-file merge mechanics outside the commander.
The refactor reuses this framework shape: merge_projects builds an ordered
list of source project files from projtab, then calls a project-file merge
utility in simple_projfile_utils.
Project Field Contract
The merger should determine the populated project data segments for every input project:
- os_mic
- os_stk
- os_ptcl2D
- os_ptcl3D
- os_optics
Every segment must be either populated in all inputs or empty in all inputs. This is the meaning of "the individual fields match" for this phase. A stack-only project can merge with another stack-only project; a movie-only project can merge with another movie-only project; a particle project can merge with another particle project that has the same populated particle/support segments. A stack-only project should not be silently merged with a stack-plus-particle project.
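A minimal sketch of the all-or-none shape check, written in Python for brevity. The function names and the dictionary-of-row-counts input are assumptions made for the example; the real helper inspects segments through sp_project.

```python
DATA_SEGMENTS = ("os_mic", "os_stk", "os_ptcl2D", "os_ptcl3D", "os_optics")


def project_shape(row_counts):
    """Return the set of data segments that hold at least one row."""
    return frozenset(seg for seg in DATA_SEGMENTS if row_counts.get(seg, 0) > 0)


def validate_field_shape(projects):
    """projects is a list of (filename, {segment: row count}) pairs.

    Returns the common shape, or raises ValueError naming the offending file.
    Row counts may differ between inputs; only population must match.
    """
    if not projects:
        raise ValueError("no input projects")
    ref_name, ref_counts = projects[0]
    ref_shape = project_shape(ref_counts)
    for name, counts in projects:
        shape = project_shape(counts)
        if not shape:
            raise ValueError(f"{name}: metadata only, no data segment to merge")
        if shape != ref_shape:
            raise ValueError(
                f"{name}: populated segments {sorted(shape)} do not match "
                f"{sorted(ref_shape)} in {ref_name}"
            )
    return ref_shape
```

Comparing sets of populated segments, rather than row counts, is what lets a 3-stack project merge with a 7-stack project while still rejecting a stack-only input paired with a stack-plus-particle input.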
merge_projects is an acquisition/particle merge for heterogeneous CTF models,
not an analysis-product merge. The helper should ignore os_cls2D, os_cls3D,
and os_out when reading input shape and must not populate those fields in the
merged output. Existing class-average and output rows are considered stale after
cross-microscope merge; users should re-run 2D/3D analysis from the merged
particle/stack state.
Metadata-only segments such as projinfo, jobproc, and compenv are not a
data shape. The merged project should copy them from the first input and update
the output project filename.
Complete source rows should be transferred with transfer_ori. The merger
should mutate only fields whose values are local to the source project and
therefore cannot remain verbatim in the merged namespace.
Particle State Contract
When os_ptcl2D is present, state is an authoritative input field. The
merger must preserve the state value for every copied ptcl2D row verbatim
after append and index remapping. It must not reset all particles to active,
infer ptcl2D state from class rows, drop deselected rows, or normalize
positive state labels unless a separate explicit pruning/relabeling command is
requested.
If os_ptcl2D is present and os_ptcl3D is absent, the helper may preserve the
historical stream behavior of creating os_ptcl3D from the merged ptcl2D
rows and deleting 2D-clustering fields. If both particle segments are present,
both should be transferred and remapped as independent authoritative segments.
Index Contracts
stkind is not a value to preserve verbatim across projects. It is a foreign
key from each particle row into that project's local os_stk table. During an
N-project merge, stack rows are concatenated into one output os_stk table, so
every copied particle row with stacks present must have stkind rewritten to
the corresponding output stack index.
For stack/particle projects, the merger must:
- Transfer each source stack row.
- Rewrite stack fromp/top into the output particle index range when those fields are present.
- Require valid particle stkind values when particles are merged with stacks.
- Rewrite each transferred particle stkind from the source stack index to the output stack index.
For stack-only projects, stack rows are transferred as-is. The absence of
particles should not make the stack field unmergeable, and stack-only merges do
not rewrite fromp / top.
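The index rewrites above can be sketched as follows. This Python illustration assumes 1-based indices, rows modeled as dictionaries, and fromp/top ranges expressed in each source project's own particle numbering; the function name is hypothetical.

```python
def merge_stacks_and_particles(projects):
    """projects is a list of {"stk": [...], "ptcl": [...]} dictionaries.

    Stack rows carry fromp/top, particle rows carry stkind, all 1-based
    and local to their source project. Returns (out_stk, out_ptcl).
    """
    out_stk, out_ptcl = [], []
    for iproj, proj in enumerate(projects, start=1):
        stk_offset = len(out_stk)    # stack rows already written
        ptcl_offset = len(out_ptcl)  # particle rows already written
        nstk = len(proj["stk"])
        for row in proj["stk"]:
            new_row = dict(row)      # transfer the complete row
            if proj["ptcl"]:         # stack-only merges leave fromp/top alone
                new_row["fromp"] = row["fromp"] + ptcl_offset
                new_row["top"] = row["top"] + ptcl_offset
            out_stk.append(new_row)
        for iptcl, row in enumerate(proj["ptcl"], start=1):
            if not 1 <= row["stkind"] <= nstk:
                raise ValueError(
                    f"project {iproj} particle {iptcl}: stkind {row['stkind']} "
                    f"is outside 1..{nstk}"
                )
            new_row = dict(row)
            new_row["stkind"] = row["stkind"] + stk_offset
            out_ptcl.append(new_row)
    return out_stk, out_ptcl
```

Because both offsets depend only on the row counts of earlier projects, they can be precomputed, which is what makes each output row independent of the others.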
When class segments are present, class rows and particle class assignments
must be offset into the merged class namespace. The helper should not infer
selection from class rows.
Optics Assignment Contract
ogid is a row-level assignment namespace. Values from different source
projects can collide, so the merged project must rewrite positive ogid values
into a single output namespace.
The remapping trigger is the presence of row-level ogid values on any copied
segment, not the presence of os_optics. For each source project, collect the
maximum positive ogid namespace used by its rows, allocate a collision-free
output offset, and rewrite every copied row that has an ogid field.
If os_optics is part of the matching input field set, append/remap those rows
with the same output ogid namespace and update ogname when appropriate. If
no source project has os_optics, no durable optics rows need to be synthesized.
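One way to allocate the collision-free namespace is a running offset equal to the sum of the maximum positive ogid values of earlier projects. The Python sketch below assumes that non-positive ogid values mean "unassigned" and are left untouched; both function names are hypothetical.

```python
def build_ogid_offsets(projects_ogids):
    """projects_ogids holds one list of row-level ogid values per project.

    Returns one additive offset per project such that shifted positive
    ogid values from different projects never collide.
    """
    offsets = []
    next_offset = 0
    for ogids in projects_ogids:
        offsets.append(next_offset)
        max_ogid = max((g for g in ogids if g > 0), default=0)
        next_offset += max_ogid
    return offsets


def remap_ogid(ogid, offset):
    """Shift a positive ogid into the output namespace; keep others as-is."""
    return ogid + offset if ogid > 0 else ogid
```

For two projects whose rows carry ogid values [1, 1, 2] and [1, 0, 1], the offsets are [0, 2], so the second project's rows become [3, 0, 3]. Nothing here reads os_optics, which is the point of the contract.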
Core Refactor
Add or finish a project-level CTF-model resolver in src/main/project, with
behavior roughly equivalent to:
- Identify the relevant row for oritype and particle index.
- Read CTF-model parameters from the stage's primary row: os_mic for micrographs, os_stk for stacks and particles.
- Do not consult os_optics to fill missing CTF-model values.
- Require the returned model to be complete unless CTF is explicitly disabled; missing authoritative fields should fail validation rather than be inferred from metadata.
- Read image- or particle-specific defocus from the current particle-row locations.
- Return a complete ctfparams.
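The resolver steps above can be sketched as a lookup that never touches os_optics. This Python illustration models a project as a dictionary of row lists. The defocus field names (dfx, dfy, angast) and the "no" value for a disabled CTF flag are assumptions made for the example and should be checked against the project code.

```python
MODEL_KEYS = ("smpd", "kv", "cs", "fraca", "phaseplate")
DEFOCUS_KEYS = ("dfx", "dfy", "angast")  # assumed field names


def resolve_ctfparams(project, oritype, index):
    """Resolve CTF parameters for a 1-based row index of the given oritype."""
    if oritype == "mic":
        model_row = project["os_mic"][index - 1]
        defocus_row = model_row
    elif oritype == "stk":
        model_row = project["os_stk"][index - 1]
        defocus_row = None
    elif oritype in ("ptcl2D", "ptcl3D"):
        defocus_row = project["os_" + oritype][index - 1]
        # stkind is the foreign key into the stack table
        model_row = project["os_stk"][defocus_row["stkind"] - 1]
    else:
        raise ValueError(f"unsupported oritype: {oritype}")

    ctf_flag = model_row.get("ctf")
    if ctf_flag is None:
        raise ValueError(f"{oritype} {index}: missing ctf flag")
    if ctf_flag == "no":  # assumed spelling of "CTF disabled"
        return {"ctf": "no"}

    missing = [k for k in MODEL_KEYS if k not in model_row]
    if missing:
        # Deliberately no fallback to project.get("os_optics")
        raise ValueError(f"{oritype} {index}: missing CTF-model fields {missing}")
    params = {"ctf": ctf_flag}
    params.update({k: model_row[k] for k in MODEL_KEYS})
    if defocus_row is not None:
        missing = [k for k in DEFOCUS_KEYS if k not in defocus_row]
        if missing:
            raise ValueError(f"{oritype} {index}: missing defocus fields {missing}")
        params.update({k: defocus_row[k] for k in DEFOCUS_KEYS})
    return params
```

With two stack rows that differ in kv, particles pointing at each stack resolve to different kv values, which is the behavior the merged project relies on.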
Update sp_project%get_ctfparams to call this resolver if direct duplicated
logic remains. That change covers the major CTF-aware 2D/3D consumers because
polar-FT CTF matrix generation, class averaging, and reconstruction helper
paths already call get_ctfparams.
Add validation helpers:
- validate_project_field_shape: ensure all inputs have matching populated data segments.
- validate_ctf_model_rows: ensure every active mic or stk row that needs CTF has complete row-level smpd, kv, cs, fraca, ctf, and phase-plate state.
- validate_particle_ctf_rows: ensure particles with CTF enabled through their stack have defocus/astigmatism/phase-shift fields required by their CTF mode.
- validate_ptcl2D_state_rows: when ptcl2D exists, record the state vector so the merger can assert that copied rows are unchanged.
- build_ogid_remap: collect row-level ogid assignments per source project and allocate a collision-free output ogid namespace independent of os_optics.
- validate_common_analysis_smpd: enforce or report the common sampling distance expected by current 2D/3D parameter derivation when stack/mic rows carry smpd.
- validate_common_particle_box: enforce identical particle image dimensions when stack rows carry box.
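As one example of these helpers, the ptcl2D state check can be sketched as "record before, compare after". This is a Python illustration with hypothetical function names and rows modeled as dictionaries.

```python
def record_states(projects_ptcl2d):
    """Record the state value of every ptcl2D row, one list per project."""
    return [[row["state"] for row in rows] for rows in projects_ptcl2d]


def check_states_preserved(recorded, merged_rows):
    """Raise ValueError if merged ptcl2D states differ from the recorded ones.

    The expected vector is the per-project state lists concatenated in
    input order, matching how rows are appended during the merge.
    """
    expected = [s for states in recorded for s in states]
    actual = [row["state"] for row in merged_rows]
    if len(expected) != len(actual):
        raise ValueError(
            f"merged ptcl2D has {len(actual)} rows, expected {len(expected)}"
        )
    for irow, (exp, act) in enumerate(zip(expected, actual), start=1):
        if exp != act:
            raise ValueError(
                f"merged ptcl2D row {irow}: state {act}, expected {exp}"
            )
```

The comparison is exact, so a 0 stays 0 and a positive label such as 2 is not normalized to 1.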
Project Merge Changes
merge_projects should remain a thin commander over the reusable N-project
merge utility:
- Accept only a project file table for N input projects through projtab. Do not preserve the legacy two-project projfile plus projfile_target interface for this refactor.
- Require an explicit output project name, projfile_merged, so the merged project is not written over one of the inputs.
- Resolve a relative projfile_merged against the execution directory without calling an existence-based canonicalizer; the output file may not exist yet.
- Build an ordered project_fnames(:) array, analogous to the stream cluster2D_subsets / merge_chunk_projfiles call sites.
- Read each source through existing sp_project segment-info and segment-read methods so merge_projects can inspect arbitrary populated fields without rewriting source project metadata.
- Call the shared project-file merge helper: merge_selected_project_files(project_fnames, projfile_out, merged_proj, ...).
- Keep the commander responsible only for CLI validation, input normalization, writing, and simple_end.
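The output-path rule in the list above amounts to purely lexical path handling. The Python sketch below uses the standard posixpath module, whose join and normpath functions operate on strings and never consult the filesystem; the function name is hypothetical and POSIX-style paths are assumed.

```python
import posixpath


def resolve_output_path(projfile_merged, exec_dir):
    """Resolve the output project path without requiring it to exist.

    Relative paths are joined to exec_dir; "." and ".." components are
    collapsed lexically, so symlinks are not followed and a missing file
    is not an error.
    """
    if posixpath.isabs(projfile_merged):
        return posixpath.normpath(projfile_merged)
    return posixpath.normpath(posixpath.join(exec_dir, projfile_merged))
```

For example, resolving merged/all.simple against /work/run1 gives /work/run1/merged/all.simple even when the merged directory has no project file yet.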
The shared helper should:
- Read all source projects and determine their populated project field shape.
- Fail if the populated data segments do not match across inputs.
- Fail if no mergeable data segments exist beyond metadata.
- Validate row-level CTF-model data where CTF is enabled.
- Validate identical analysis sampling distance and particle box dimensions when those fields exist. Do not run reextract in this merge path.
- Allocate every populated mergeable output segment up front, excluding os_cls2D, os_cls3D, and os_out.
- Copy source rows with transfer_ori.
- Remap only merge-local fields: stack fromp/top when particles are present, particle stkind, and row-level ogid.
- Preserve row-level CTF-model values by transferring complete os_mic and os_stk rows. Do not overwrite kv, cs, fraca, ctf, phaseplate, or smpd from global parameters.
- Preserve particle-level CTF values and exact os_ptcl2D%state flags by transferring complete particle rows. Clear stale ptcl2D clustering fields after transfer because class rows are intentionally not carried forward.
- Remap row-level ogid assignments for every copied segment that carries ogid, independent of whether the source project has os_optics.
- Preserve/remap os_optics only when it is part of the matching input field set.
- If a source project has incomplete authoritative row-level CTF fields, fail validation even if corresponding values exist in os_optics.
Performance policy:
- The merge is allowed to keep all input and output metadata rows in memory; this refactor is not optimized for memory-constrained machines.
- The merge must not rewrite movie, micrograph, stack, or particle image data.
- The hot row-transfer/remap loops should be OpenMP-parallel where each iteration writes to a precomputed, distinct output row.
- Large validation and namespace-discovery scans should also use OpenMP when they can reduce to deterministic scalar results, such as the first invalid particle row or the maximum row-level ogid in a source project.
- The final project metadata write is still a single output-file serialization step through the existing project writer.
This keeps merge behavior predictable and avoids accidental regrouping.
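The "deterministic first failing row" requirement in the performance policy maps onto a minimum reduction: each worker reports the first invalid row in its own chunk, and the overall answer is the smallest reported index, regardless of which worker finishes first. The Python sketch below uses a thread pool only to show the shape of that reduction. It stands in for an OpenMP min reduction and is not meant to demonstrate a speedup; the function name and chunking are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def first_invalid_row(stkinds, nstk, nchunks=4):
    """Return the 1-based index of the first stkind outside 1..nstk, or None."""
    n = len(stkinds)
    if n == 0:
        return None
    chunk = -(-n // nchunks)  # ceiling division

    def scan(start):
        # First invalid row inside this chunk only
        for i in range(start, min(start + chunk, n)):
            if not 1 <= stkinds[i] <= nstk:
                return i + 1
        return None

    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        hits = [h for h in pool.map(scan, range(0, n, chunk)) if h is not None]
    # The min reduction makes the result independent of scheduling order
    return min(hits) if hits else None
```

The same pattern with a max reduction covers discovery of the largest row-level ogid in a source project.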
Optional Optics Metadata
When os_optics exists for every source project, it should be kept consistent
with row-level values and remapped with the same ogid namespace used for the
rows. When no source project has os_optics, the row-level ogid assignment
should still be remapped and the project should still be complete and
scientifically valid.
os_optics is not a repair source for this refactor. It may be preserved and
remapped as metadata, but it should not be read to fill missing smpd, kv,
cs, fraca, CTF flag, or phase-plate state on authoritative rows.
If future STAR export needs an optics table from a merged project that lacks
os_optics, the STAR adapter can synthesize a temporary export-only optics
table from row-level CTF-model groups. It should not require or persist
os_optics just to export.
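An export-only optics table could be derived by grouping rows on their CTF-model bundle. The Python sketch below is one possible shape for such an adapter, not an existing one: the function name is hypothetical, and grouping by exact equality of the field values is an assumption (a real adapter may want a tolerance on floating-point fields).

```python
CTF_MODEL_KEYS = ("smpd", "kv", "cs", "fraca", "ctf", "phaseplate")


def synthesize_optics_groups(stk_rows):
    """Group stack rows by their row-level CTF-model values.

    Returns (table, assignment): table has one row per distinct bundle
    with a temporary 1-based ogid, and assignment gives the ogid for each
    input row. Nothing is written back to the project.
    """
    groups = {}      # CTF-model bundle -> temporary ogid, in first-seen order
    assignment = []
    for row in stk_rows:
        key = tuple(row[k] for k in CTF_MODEL_KEYS)
        gid = groups.setdefault(key, len(groups) + 1)
        assignment.append(gid)
    table = [dict(zip(CTF_MODEL_KEYS, key), ogid=gid) for key, gid in groups.items()]
    return table, assignment
```

Because the table is built on demand from authoritative rows, the merged project stays valid without a persisted os_optics segment.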
2D and 3D Analysis
Once get_ctfparams uses the resolver consistently, most CTF-aware 2D/3D
analysis should pick up mixed kv, cs, and fraca values without broad
algorithm changes.
Still audit these areas explicitly:
- src/main/pftc/simple_polarft_ctf.f90
- src/main/class/simple_classaverager_restore.f90
- src/main/volume reconstruction helpers that apply CTF
- any direct construction of ctf(params%smpd, params%kv, params%cs, ...) outside simulation or validation code
Global params%kv and params%cs should not be used for real project CTF
evaluation. They can remain defaults for simulation, validation, and
homogeneous import.
Tests
Add project-level and commander-level tests for matching field shapes:
- merge_projects accepts an N-project file table and routes through the shared project-file merge helper rather than an in-commander append loop.
- Stack-only projects merge when all inputs have os_stk and no particle segments.
- Movie/micrograph-only projects merge when all inputs have os_mic and no stack/particle segments.
- Particle projects merge when all populated data segments match.
- Metadata-only projects fail with a clear error that identifies the offending projtab entry and hints at a nearby data-bearing stage project when one is present.
- Projects with mismatched populated data segments fail with a clear field-shape error.
- Input projects can have identical analysis smpd and compatible boxes but different kv, cs, and/or fraca on their os_stk rows.
- merge_projects preserves row-level mic/stack CTF-model values.
- merge_projects preserves the exact os_ptcl2D%state vector from each input project in the corresponding output row range, including 0 entries and any positive labels.
- merge_projects preserves particle-level defocus fields and remaps stkind correctly after append.
- For every output stack row, all output particles in that stack's fromp/top range have stkind equal to the output stack index.
- Class rows and particle class values are offset when class segments are part of the matching input field set.
- sp_project%get_ctfparams('ptcl2D', iptcl) returns different kv/cs for particles originating from different projects.
- Cluster2D and refine3D CTF matrix construction call through the resolver.
- The merger works when no source project has os_optics.
- The merger remaps row-level ogid assignments even when no source project has os_optics.
- If two source projects both use ogid=1, the output rows from those projects receive distinct output ogid values even when neither source has os_optics.
- A project with incomplete row-level CTF fields fails validation even if os_optics contains matching values.
- A mixed-sampling or mixed-box project fails validation for this merge path, with instructions to rescale/re-extract before merging.
- The old two-project projfile/projfile_target interface is not supported by this refactor.
Suggested Migration Plan
- Add resolver and validation helpers with no behavior change.
- Route get_ctfparams through the resolver without introducing an os_optics path for missing authoritative fields.
- Add a reusable project-file merge helper alongside merge_chunk_projfiles.
- Refactor merge_projects to build a project filename array from projtab and call the shared helper.
- Generalize the helper from a 2D-selected particle-only merge to a matching project-field merge.
- Make stack/particle merge paths require identical smpd and box size when those fields are present; leave scaling/re-extraction as an explicit pre-merge step.
- Preserve row-level CTF-model fields and exact os_ptcl2D%state flags during row transfer.
- Remap local indices and namespaces during row transfer independently of whether os_optics exists.
- Add merge tests for stack-only, mic-only, particle, and heterogeneous CTF particle projects.
- Audit direct CTF-field lookups in 2D/3D paths and route them through the resolver where needed.
- Leave heterogeneous import support for a separate future refactor.
Open Questions
- Should mixed optional os_optics presence be supported by explicitly dropping optics rows from the merged output, or should matching field shape remain the rule?
- Should phase-plate state be required to be stack-level for merged particle projects, or should per-particle phase-plate state be allowed?
- Should selected class provenance be retained as optional annotations after class/output segments are concatenated?