model_cavgs_rejection¶

Scope¶

model_cavgs_rejection is the SIMPLE class-average quality selection program. It evaluates cls2D class averages, applies hard validity rejects, extracts a fixed scalar feature bank, normalizes those features within the dataset, and applies a named quality model to partition class averages into accepted and rejected sets. The model can also abstain from additional soft rejection and report that the final decision was hard preselection only.

This is a learned feature-vector model. It should be read alongside the streaming microchunk rejector, which is a cumulative rule engine. The two systems share several image-processing primitives, but they use them differently: microchunking applies fixed sequential criteria and rejects a class as soon as any criterion fires; model_cavgs_rejection uses hard rejects only for validity failures, then learns a weighted scalar quality score and dataset-specific thresholding behavior for the remaining class averages.

The implementation lives in src/main/cavg_quality. The command entry point is exec_model_cavgs_rejection in src/main/commanders/simple/simple_commanders_cavgs.f90.

Implementation Files¶

simple_cavg_quality_types.f90: shared constants and derived types.
simple_cavg_quality_feats.f90: feature definitions, feature extraction, hard rejects, and robust normalization.
simple_cavg_quality_stats.f90: binary metrics, AUC, distance-matrix normalization, and statistical helpers.
simple_cavg_quality_model.f90: built-in presets, model file I/O, scoring, clustering, thresholding, and promotion snippets.
simple_cavg_quality_analysis.f90: evaluation, analysis reports, and feature tables.
simple_cavg_quality_learn.f90: analysis-table reader, feature-policy search, model search, and learn reports.

Microchunk Comparison¶

The stream path in simple_microchunked2D currently calls cluster2D_rejector. That engine is deliberately deterministic and rule based. model_cavgs_rejection is a separate backend that uses a learned feature vector over class averages.

Aspect	Microchunk rejection	`model_cavgs_rejection`
Decision type	Cumulative rule engine.	Learned feature-vector model with hard validity gates.
Primary owner	`simple_cluster2D_rejector` called from `simple_microchunked2D`.	`src/main/cavg_quality` called by `model_cavgs_rejection`.
Unit of evidence	Individual scalar rules applied one after another.	Normalized feature vector plus model weights, clustering, and score thresholding.
Rejection semantics	A class rejected by any rule remains rejected.	Hard rejects are final; remaining classes are scored and partitioned by the model.
Threshold source	Fixed constants, with tier-specific population and local-variance overrides.	Built-in or learned model specification; learn mode chooses feature policy, weights, and threshold controls from analysis datasets.
Adaptation to dataset	Robust local-variance z-scores are dataset-relative, but thresholds are fixed.	Robust feature normalization, k-medoids distances, Otsu thresholds, and selected model controls are dataset-relative.
Training data	None.	Manual selections from `quality_mode=analyze` files.
Output role	Online stream cleanup and particle deselection.	Batch/chunk/pool class-average selection, analysis, training, and model promotion.

Evidence Comparison¶

The model feature bank keeps the microchunk-style image-processing evidence as a feature family and appends optional evidence families. The current chunk_default_v2 preset uses the microchunk_plus_score policy: the microchunk-family features plus corr_frc_proxy.

Evidence	Microchunk rule engine	`model_cavgs_rejection` feature-vector model	Current chunk role
Class population	Hard rule: reject below a tier-specific fraction of total population.	`log_pop` feature plus hard reject below `0.0035` of total population.	Active feature and hard gate.
Resolution	Hard rule: reject when `res > 40.0`.	`neg_log_res` feature plus hard reject when `res > 40.0`.	Active feature and hard gate.
Foreground centering	Otsu foreground connected components; reject if any centroid lies outside the mask radius.	`centered` feature is negative normalized centroid displacement; centroid outside the mask is a hard reject.	Active feature and hard gate.
Foreground outside mask	Reject when the largest valid foreground component has more than 10 pixels outside the mask disc.	`cc_area_frac` measures largest component area relative to the mask disc; outside-mask excess is a hard reject.	Active feature and hard gate.
Full-image component cleanup	Connected components spanning the full image are removed before mask tests.	Same concept is applied during feature extraction before geometry features and hard mask rejection.	Hard-gate preprocessing.
Foreground local variance	Robust-z rule on foreground/background local variance after 10 A low-pass and Otsu masking.	`log_locvar_fg` feature; only both variances non-positive is a hard reject.	Active feature.
Background local variance	Robust-z rule paired with foreground local variance.	`log_locvar_bg` feature; only both variances non-positive is a hard reject.	Active feature.
Stored score evidence	Not used by the rule engine.	`corr_frc_proxy` reads stored class correlation or FRC-like score when present.	Active feature.
Center/edge signal	Not used by the rule engine.	`log_center_edge_snr` feature in the `signal` family.	Available, zero weight in current chunk default.
Presence	Not used by the rule engine.	`presence` feature in the `signal` family.	Available, zero weight in current chunk default.
Invalid pixels	Not a named microchunk rule.	Image values containing invalid pixels are hard rejected.	Hard gate.

Hard-Reject Comparison¶

Hard rejects in model_cavgs_rejection are intended to mirror the non-negotiable validity parts of microchunking while leaving ordinary quality variation to the learned score. The main difference is local variance: microchunking uses local variance as a direct rejection rule, while model_cavgs_rejection keeps local variance as learned scalar evidence except for degenerate zero-variance cases.

Criterion	Microchunk rule engine	`model_cavgs_rejection`
Population	Rejects `pop < ceiling(sum(pop) * fraction)`. Fractions are tier-specific.	Rejects `pop <= 0` and `pop < ceiling(sum(pop) * 0.0035)`.
Resolution	Rejects `res > 40.0`.	Rejects `res > 40.0`.
Empty or mismatched class-average stack	Stream chunk is marked `REJECTION_FAILED` and `COMPLETE`.	Apply/analyze require a valid project, `cls2D` entries, matching class-average stack, and `mskdiam`; invalid inputs stop the command.
Invalid pixels	No separate named rule.	Hard rejected.
No valid foreground component	Rejected after full-image connected components are pruned.	Hard rejected after the same style of foreground-component pruning.
Foreground centroid outside mask	Rejected.	Hard rejected.
Largest foreground component outside mask	Rejected when outside pixels exceed 10.	Hard rejected when outside pixels exceed 10.
Local variance exactly degenerate	Rejected when both inside/outside scores are near zero.	Hard rejected when both foreground/background local variances are non-positive.
Local variance low but not degenerate	Rejected by fixed robust-z thresholds.	Not a hard reject; encoded as `log_locvar_fg` and `log_locvar_bg` features.

Microchunk tier thresholds:

Tier	Population fraction	Local-variance strong threshold	Local-variance weak threshold
Pass 1	`0.0050`	`-0.5`	`-0.1`
Pass 2	`0.0035`	`-1.0`	`-1.0`
Reference chunk	`0.0025`	`-2.0`	`-2.0`
Match chunk	`0.0025`	`-2.0`	`-2.0`

model_cavgs_rejection uses a fixed population hard-reject fraction of 0.0035 for the current feature extractor, independent of model context.

Learning Model¶

The rejection model is a low-dimensional, monotone feature-vector classifier with explicit hard constraints. In learning-theory terms, the hard rejects are prior validity constraints: they remove classes the model is not asked to rescue or fit. The remaining rows form a supervised training set from manual selections.

The learned artifact has three parts:

a feature policy, which selects a cumulative family set;
non-negative feature weights, which define a linear scalar quality score;
thresholding controls, which govern how the per-dataset score boundary is chosen after k-medoids clustering and optional Otsu thresholding.

The apply-time classifier is partly dataset-adaptive. Features are robustly normalized inside the current dataset. K-medoids and Otsu operate on the current score/feature distribution. The learned parameters provide the inductive bias: which features are active, how the score is weighted, and how aggressively the cluster-derived boundary is shifted or replaced.

The learning objective is empirical risk minimization over manually annotated quality_mode=analyze runs. Feature-weight candidates are learned only from datasets where both manual states remain present after hard rejects, because those are the datasets with a learnable soft boundary. Candidate models are scored over all scoreable datasets using only non-hard-rejected rows. Two-class trainable datasets contribute balanced accuracy. Datasets where only manually good rows remain after hard rejects contribute recall, so they teach the model not to reject good classes that passed the hard gates. Datasets where only manually bad rows remain contribute specificity only when no manually good classes were removed by hard rejects. If manually good classes were entirely removed by hard rejects, the dataset is reported as hard-gate blocked and skipped for soft-model scoring. Hard-rejected rows remain visible in diagnostics, including counts of manually good classes lost to hard gates, but they do not participate in fitting the learned boundary.

Feature-weight search is ab initio for the supplied training data. The learner first constructs an AUC-derived seed:

weight(feature) = max(0, pooled_auc(feature, manual_state) - 0.5)

The weights are normalized after the active feature policy is applied. Each feature policy contributes one deterministic AUC-derived weight vector to the full classifier search. There is no base-weight blending in the current learner. This means each training run derives the scalar model from the current analysis files rather than nudging an inherited default.

Feature-family policies act as a small structural model-selection layer:

Policy	Active families	Purpose
`microchunk`	`microchunk`	Tests whether scalar analogues of the stream rejector evidence are sufficient.
`microchunk_plus_score`	`microchunk`, `score`	Adds stored class correlation/FRC-like evidence. This is the current chunk default.
`microchunk_plus_signal`	`microchunk`, `signal`	Adds center/edge and presence evidence without stored score evidence.
`microchunk_plus_score_signal`	`microchunk`, `score`, `signal`	Uses the full current feature bank.

Learn mode reports feature signal, feature-drop diagnostics, and leave-one-dataset-out feature-policy diagnostics so that the selected model can be interpreted as a compact scientific statement about which evidence family and weight structure generalized on the supplied training set.

Command Modes¶

quality_mode=apply computes quality features, applies the selected model, writes cavgs_quality_features.txt, writes quality_selected_cavgs.mrc and quality_rejected_cavgs.mrc, maps the selected class averages into particle states, annotates cls2D with quality, quality_cluster, and accept, annotates cls3D when its class count matches cls2D, optionally prunes particles with prune=yes, and writes the project.

quality_mode=analyze computes the same model output but treats the existing cls2D state as the manual reference. It writes cavgs_quality_analysis.txt and the selected/rejected stacks. The project selection is left unchanged.

quality_mode=learn reads a file table of cavgs_quality_analysis.txt files and searches for a model specification. It writes a learned model file controlled by fname= and writes cavgs_quality_learn_report.txt.

quality_mode=promote reads a model file from infile= and writes a Fortran promotion snippet controlled by fname=.

apply and analyze require projfile and mskdiam. learn requires filetab. promote requires infile. The commander sets oritype=cls2D, defaults mkdir=yes, and defaults prune=no.

Model Selection¶

quality_model selects a built-in preset. The default is chunk_default_v2. The available built-ins are:

chunk_default_v2: default chunk/stream-style model.
pool_default_v2: pool/batch-style model with minimum accepted fraction enforcement.

When infile is supplied, the model file is treated as a complete model and wins over the built-in preset.

Model files use model_version=5 and explicit key-value fields:

name
context
feature_policy
feature_weights
boundary_margin
min_score_separation
otsu_min_offset
otsu_max_offset
cluster_rescue_margin
min_accept_frac
use_lowsep_otsu
use_otsu_window
use_cluster_rescue
enforce_min_accept_frac

Unknown model-file keys are rejected.

Feature Bank¶

CAVG_QUALITY_NFEATS is 9. All feature directions are higher_is_better. The model consumes normalized z_* values; raw values are written for diagnostics.

Index	Feature	Family	Source	Meaning
1	`log_pop`	`microchunk`	`cls2D` `pop`	Log class population.
2	`neg_log_res`	`microchunk`	`cls2D` `res`	Negative log class resolution estimate.
3	`centered`	`microchunk`	Otsu foreground connected components	Negative normalized centroid displacement of segmented signal.
4	`log_locvar_fg`	`microchunk`	Foreground-mask local variance	Log local variance in the foreground Otsu mask.
5	`log_locvar_bg`	`microchunk`	Background-mask local variance	Log local variance outside the foreground Otsu mask.
6	`corr_frc_proxy`	`score`	`cls2D` `corr` when present	Stored class correlation or FRC-like quality proxy.
7	`log_center_edge_snr`	`signal`	Center/edge variance statistics	Log central variance relative to edge variance.
8	`cc_area_frac`	`microchunk`	Foreground connected-component area	Largest foreground component area divided by expected mask area.
9	`presence`	`signal`	Center/border statistics	Central signal excess relative to border background noise.

Feature families are used by learn mode. Every policy starts with the microchunk family and then appends optional evidence families:

microchunk
microchunk_plus_score
microchunk_plus_signal
microchunk_plus_score_signal

Features outside the selected policy are encoded by zero weights.

Extraction Constants¶

FOREGROUND_SEG_LP = 30.0
SIGNAL_METRIC_LP = 10.0
CAVG_RES_HARD_REJECT_A = 40.0
POP_FRACTION_HARD_REJECT = 0.0035
LOCVAR_WINDOW = 10
MASK_HARD_OUTSIDE_PIXELS = 10
LOG_EPS = 1.0e-12
CLIP_Z = 4.0

Hard Rejects¶

Hard rejects are validity gates applied before model fitting, clustering, and training-score calculation. A class average is hard rejected when any of these conditions hold:

pop <= 0;
pop < ceiling(sum(pop) * 0.0035);
res > 40.0;
the image contains invalid pixels;
foreground segmentation has no valid component after full-image connected components are pruned;
a segmented foreground component centroid lies outside the mask radius;
the largest foreground component has more than 10 pixels outside the mask disc;
foreground and background local variance are both non-positive.

Hard-rejected classes receive rejected state directly. Their normalized features are set to -CLIP_Z, their model scores are set to -CLIP_Z, and they remain visible in analysis and learning reports.

Normalization¶

normalize_cavg_quality_features computes per-feature robust statistics over non-hard-rejected rows only. Each feature is centered by the median and scaled by the Gaussian MAD. Values are clipped to [-CLIP_Z, CLIP_Z]. If a feature has no robust spread, its normalized value is set to zero for trainable rows.

Built-In Presets¶

chunk_default_v2 has context chunk, feature policy microchunk_plus_score, cluster rescue disabled, and minimum accepted fraction disabled.

log_pop             1.355581E-01
neg_log_res         1.808395E-01
centered            4.942806E-02
log_locvar_fg       1.635655E-01
log_locvar_bg       1.726983E-01
corr_frc_proxy      2.108072E-01
log_center_edge_snr 0.000000E+00
cc_area_frac        8.710331E-02
presence            0.000000E+00

boundary_margin         0.15
min_score_separation    0.15
otsu_min_offset         0.35
otsu_max_offset         0.50
cluster_rescue_margin   0.20
min_accept_frac         0.00
use_lowsep_otsu         true
use_otsu_window         true
use_cluster_rescue      false
enforce_min_accept_frac false

pool_default_v2 has context pool, feature policy microchunk_plus_score_signal, cluster rescue enabled, Otsu windowing disabled, and minimum accepted fraction enforcement enabled.

log_pop             1.826303E-01
neg_log_res         2.379928E-01
centered            0.000000E+00
log_locvar_fg       7.328372E-02
log_locvar_bg       6.126274E-02
corr_frc_proxy      1.257789E-01
log_center_edge_snr 1.572098E-01
cc_area_frac        2.139509E-02
presence            1.404466E-01

boundary_margin         1.70
min_score_separation    0.20
otsu_min_offset         0.15
otsu_max_offset         0.50
cluster_rescue_margin   0.20
min_accept_frac         0.95
use_lowsep_otsu         true
use_otsu_window         false
use_cluster_rescue      true
enforce_min_accept_frac true

All model weights are clipped to non-negative values and normalized to sum to one when a model is initialized or read.

Classification¶

For non-hard-rejected class averages:

Compute a linear quality score with matmul(normalized_features, weights).
Build a pairwise Euclidean distance matrix in normalized feature space using nonzero-weight features.
Robustly normalize the distance matrix.
Run two-cluster k-medoids clustering.
Choose the good cluster as the cluster with the higher mean model score. If the score means tie, choose the larger cluster; if cluster sizes also tie, choose the cluster with the higher medoid score.
Place the raw threshold halfway between the good-cluster and bad-cluster mean scores.
If cluster separation is below min_score_separation, use the Otsu score threshold only when use_lowsep_otsu is enabled and Otsu separation is sufficient. Otherwise, accept all trainable class averages as a single cluster.
If cluster separation is sufficient, start from raw_threshold - boundary_margin.
If use_otsu_window is enabled, replace the threshold with Otsu when the Otsu threshold is at least otsu_min_offset and at most otsu_max_offset above the raw threshold, and Otsu separation is sufficient.
If use_cluster_rescue is enabled, also accept good-cluster members within cluster_rescue_margin below the threshold.
If enforce_min_accept_frac is enabled, accept enough top-scoring trainable rows to satisfy min_accept_frac.

If there are fewer than four trainable rows, the distance matrix is degenerate, clustering fails, or the low-separation branch has no acceptable Otsu threshold, all non-hard-rejected rows are accepted as a single cluster.

threshold_offset is reported as raw_threshold - threshold. A positive value means the effective threshold is lower than the cluster midpoint.

The classifier reports its soft decision explicitly:

soft_decision=soft_threshold: a learned score threshold was used to reject at least one non-hard-rejected class average.
soft_decision=hard_only: no additional soft rejection was made after hard preselection.

Common hard-only reasons are too_few_trainable, flat_feature_distances, invalid_two_cluster_result, low_score_separation, soft_threshold_accepts_all, and no_trainable_after_hard. This is the real-world abstention path: if the post-hard feature distribution does not contain a credible low-quality partition, the model stops after hard preselection.

Analysis Output¶

cavgs_quality_analysis.txt starts with # model_cavgs_rejection_analysis_version=4. It includes:

dataset and model identity;
selected/rejected counts;
soft_decision, soft_reason, and the number of trainable rows rejected by the soft model;
confusion metrics against the manual cls2D state;
score AUC;
raw threshold, threshold offset, and effective threshold;
best balanced-accuracy and F1 thresholds against the manual reference;
model fields as comments;
feature inventory;
per-feature AUC, robust separation, current weight, and suggested weight;
threshold scan;
one row per class average containing state, hard-reject flag, quality cluster, score, manual state, raw features, and normalized features.

cavgs_quality_features.txt uses the same per-class row format without the analysis summary comments.

Learning¶

quality_mode=learn reads analysis files generated by the current analysis table format. Required columns are:

dataset_id
hard_reject
manual_state
all z_* feature columns for the 9-feature bank

Hard-rejected rows are kept in reports but excluded from feature-weight estimation, candidate scoring, feature-signal diagnostics, feature-drop diagnostics, and feature-policy diagnostics.

Learn mode assigns each dataset an automatic role:

balanced: both manual good and manual bad rows remain after hard rejects; used for feature weights and scored by balanced accuracy.
trainable_good_only: only manual good rows remain after hard rejects; not used for feature weights, scored by guarded recall.
trainable_bad_only: only manual bad rows remain after hard rejects and no manually good rows were hard rejected; not used for feature weights, scored by specificity.
skip_hard_gate_blocked: only manual bad rows remain because all manually good rows were hard rejected; reported but skipped for soft-model fitting and scoring.

Learn mode derives candidate feature weights from the training data. The starting candidate is:

weight(feature) = max(0, pooled_auc(feature, manual_state) - 0.5)

The weights are normalized to sum to one. If all active weights are zero for a policy, uniform weights are assigned over the active policy features. The full classifier search chooses among the AUC-derived candidate vectors for the available feature policies.

Good-only datasets use a recall guard in the learn score. Raw recall is still reported in the dataset table, but the learn score applies an additional shortfall penalty below the recall floor. This prevents a candidate model from improving balanced datasets by rejecting a large number of classes from datasets that contain only trainable manual positives.

Learn mode searches:

feature policies: microchunk, microchunk_plus_score, microchunk_plus_signal, microchunk_plus_score_signal;
feature weights: one AUC-derived candidate for each feature policy;
min_score_separation: 0.05, 0.10, 0.15, 0.20, 0.30;
chunk boundary_margin: -0.60, -0.50, -0.40, -0.30, -0.25, -0.15, -0.05, 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50;
pool boundary_margin: -0.60, -0.50, -0.40, -0.30, -0.25, -0.15, -0.05, 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00, 1.10, 1.20, 1.30, 1.40, 1.50, 1.60, 1.70, 1.80, 1.90, 2.00, 2.25, 2.50, 2.75, 3.00, 3.50, 4.00;
use_lowsep_otsu: false, true;
use_otsu_window: false, true;
otsu_min_offset: 0.05, 0.10, 0.15, 0.25, 0.35 when the Otsu window is enabled;
otsu_max_offset: 0.40, 0.50, 0.65 when the Otsu window is enabled;
min_accept_frac: 0.50, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90, 0.925, 0.95, 0.975, 1.00 for pool-context training.

Each candidate is evaluated by running the full classifier on every scoreable training dataset and averaging the role-specific learn score. If multiple candidates tie, the selected candidate is the one closest to the starting model in the searched threshold controls.

The learn report includes the search grid, macro_learn_score, suggested weights, selected model, dataset-role diagnostics, Otsu ablation diagnostics, feature-signal diagnostics, feature-drop diagnostics, leave-one-dataset-out feature-policy diagnostics, top candidates, best ties, and per-dataset confusion metrics.

Promotion¶

quality_mode=promote writes a Fortran snippet for adding a learned model as a built-in preset in simple_cavg_quality_model.f90. The snippet also lists the UI/help files that need the model name added when a new built-in is introduced.

Example Commands¶

Analyze a manually selected project:

simple_exec prg=model_cavgs_rejection \
  quality_mode=analyze \
  projfile=my_project.simple \
  mskdiam=180 \
  mkdir=yes

Apply the default chunk model:

simple_exec prg=model_cavgs_rejection \
  quality_mode=apply \
  projfile=my_project.simple \
  mskdiam=180 \
  mkdir=yes

Train from a file table of analysis outputs:

simple_exec prg=model_cavgs_rejection \
  quality_mode=learn \
  filetab=analysis_files.txt \
  fname=cavgs_quality_model_chunk_learned.txt \
  mkdir=yes

Apply a learned model file:

simple_exec prg=model_cavgs_rejection \
  quality_mode=apply \
  infile=cavgs_quality_model_chunk_learned.txt \
  projfile=my_project.simple \
  mskdiam=180 \
  mkdir=yes

Promote a learned model:

simple_exec prg=model_cavgs_rejection \
  quality_mode=promote \
  infile=cavgs_quality_model_chunk_learned.txt \
  fname=cavgs_quality_model_chunk_builtin_code.txt \
  mkdir=yes