model_cavgs_rejection¶
Scope¶
model_cavgs_rejection is the SIMPLE class-average quality selection program. It evaluates cls2D class averages, applies hard validity rejects, extracts a fixed scalar feature bank, normalizes those features within the dataset, and applies a named quality model to partition class averages into accepted and rejected sets. The model can also abstain from additional soft rejection and report that the final decision was hard preselection only.
This is a learned feature-vector model. It should be read alongside the streaming microchunk rejector, which is a cumulative rule engine. The two systems share several image-processing primitives, but they use them differently: microchunking applies fixed sequential criteria and rejects a class as soon as any criterion fires; model_cavgs_rejection uses hard rejects only for validity failures, then learns a weighted scalar quality score and dataset-specific thresholding behavior for the remaining class averages.
The implementation lives in src/main/cavg_quality. The command entry point is exec_model_cavgs_rejection in src/main/commanders/simple/simple_commanders_cavgs.f90.
Implementation Files¶
- simple_cavg_quality_types.f90: shared constants and derived types.
- simple_cavg_quality_feats.f90: feature definitions, feature extraction, hard rejects, and robust normalization.
- simple_cavg_quality_stats.f90: binary metrics, AUC, distance-matrix normalization, and statistical helpers.
- simple_cavg_quality_model.f90: built-in presets, model file I/O, scoring, clustering, thresholding, and promotion snippets.
- simple_cavg_quality_analysis.f90: evaluation, analysis reports, and feature tables.
- simple_cavg_quality_learn.f90: analysis-table reader, feature-policy search, model search, and learn reports.
Microchunk Comparison¶
The stream path in simple_microchunked2D currently calls cluster2D_rejector. That engine is deliberately deterministic and rule based. model_cavgs_rejection is a separate backend that uses a learned feature vector over class averages.
| Aspect | Microchunk rejection | model_cavgs_rejection |
|---|---|---|
| Decision type | Cumulative rule engine. | Learned feature-vector model with hard validity gates. |
| Primary owner | simple_cluster2D_rejector called from simple_microchunked2D. | src/main/cavg_quality called by model_cavgs_rejection. |
| Unit of evidence | Individual scalar rules applied one after another. | Normalized feature vector plus model weights, clustering, and score thresholding. |
| Rejection semantics | A class rejected by any rule remains rejected. | Hard rejects are final; remaining classes are scored and partitioned by the model. |
| Threshold source | Fixed constants, with tier-specific population and local-variance overrides. | Built-in or learned model specification; learn mode chooses feature policy, weights, and threshold controls from analysis datasets. |
| Adaptation to dataset | Robust local-variance z-scores are dataset-relative, but thresholds are fixed. | Robust feature normalization, k-medoids distances, Otsu thresholds, and selected model controls are dataset-relative. |
| Training data | None. | Manual selections from quality_mode=analyze files. |
| Output role | Online stream cleanup and particle deselection. | Batch/chunk/pool class-average selection, analysis, training, and model promotion. |
Evidence Comparison¶
The model feature bank keeps the microchunk-style image-processing evidence as a feature family and appends optional evidence families. The current chunk_default_v2 preset uses the microchunk_plus_score policy: the microchunk-family features plus corr_frc_proxy.
| Evidence | Microchunk rule engine | model_cavgs_rejection feature-vector model | Current chunk role |
|---|---|---|---|
| Class population | Hard rule: reject below a tier-specific fraction of total population. | log_pop feature plus hard reject below 0.0035 of total population. | Active feature and hard gate. |
| Resolution | Hard rule: reject when res > 40.0. | neg_log_res feature plus hard reject when res > 40.0. | Active feature and hard gate. |
| Foreground centering | Otsu foreground connected components; reject if any centroid lies outside the mask radius. | centered feature is negative normalized centroid displacement; centroid outside the mask is a hard reject. | Active feature and hard gate. |
| Foreground outside mask | Reject when the largest valid foreground component has more than 10 pixels outside the mask disc. | cc_area_frac measures largest component area relative to the mask disc; outside-mask excess is a hard reject. | Active feature and hard gate. |
| Full-image component cleanup | Connected components spanning the full image are removed before mask tests. | Same concept is applied during feature extraction before geometry features and hard mask rejection. | Hard-gate preprocessing. |
| Foreground local variance | Robust-z rule on foreground/background local variance after 10 A low-pass and Otsu masking. | log_locvar_fg feature; only both variances non-positive is a hard reject. | Active feature. |
| Background local variance | Robust-z rule paired with foreground local variance. | log_locvar_bg feature; only both variances non-positive is a hard reject. | Active feature. |
| Stored score evidence | Not used by the rule engine. | corr_frc_proxy reads stored class correlation or FRC-like score when present. | Active feature. |
| Center/edge signal | Not used by the rule engine. | log_center_edge_snr feature in the signal family. | Available, zero weight in current chunk default. |
| Presence | Not used by the rule engine. | presence feature in the signal family. | Available, zero weight in current chunk default. |
| Invalid pixels | Not a named microchunk rule. | Image values containing invalid pixels are hard rejected. | Hard gate. |
Hard-Reject Comparison¶
Hard rejects in model_cavgs_rejection are intended to mirror the non-negotiable validity parts of microchunking while leaving ordinary quality variation to the learned score. The main difference is local variance: microchunking uses local variance as a direct rejection rule, while model_cavgs_rejection keeps local variance as learned scalar evidence except for degenerate zero-variance cases.
| Criterion | Microchunk rule engine | model_cavgs_rejection |
|---|---|---|
| Population | Rejects pop < ceiling(sum(pop) * fraction). Fractions are tier-specific. | Rejects pop <= 0 or pop < ceiling(sum(pop) * 0.0035). |
| Resolution | Rejects res > 40.0. | Rejects res > 40.0. |
| Empty or mismatched class-average stack | Stream chunk is marked REJECTION_FAILED and COMPLETE. | Apply/analyze require a valid project, cls2D entries, matching class-average stack, and mskdiam; invalid inputs stop the command. |
| Invalid pixels | No separate named rule. | Hard rejected. |
| No valid foreground component | Rejected after full-image connected components are pruned. | Hard rejected after the same style of foreground-component pruning. |
| Foreground centroid outside mask | Rejected. | Hard rejected. |
| Largest foreground component outside mask | Rejected when outside pixels exceed 10. | Hard rejected when outside pixels exceed 10. |
| Local variance exactly degenerate | Rejected when both inside/outside scores are near zero. | Hard rejected when both foreground/background local variances are non-positive. |
| Local variance low but not degenerate | Rejected by fixed robust-z thresholds. | Not a hard reject; encoded as log_locvar_fg and log_locvar_bg features. |
Microchunk tier thresholds:
| Tier | Population fraction | Local-variance strong threshold | Local-variance weak threshold |
|---|---|---|---|
| Pass 1 | 0.0050 | -0.5 | -0.1 |
| Pass 2 | 0.0035 | -1.0 | -1.0 |
| Reference chunk | 0.0025 | -2.0 | -2.0 |
| Match chunk | 0.0025 | -2.0 | -2.0 |
model_cavgs_rejection uses a fixed population hard-reject fraction of 0.0035 for the current feature extractor, independent of model context.
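As a worked illustration of the population rule, the following minimal sketch computes the smallest surviving class population, ceiling(sum(pop) * fraction), for each fraction quoted above. The dictionary keys, the function name `min_population`, and the class populations are hypothetical and chosen only for this example; they are not identifiers from the SIMPLE source.

```python
import math

# Population fractions quoted in the tier table and the hard-reject text above.
# The dictionary keys are illustrative labels, not SIMPLE identifiers.
POP_FRACTIONS = {
    "pass_1": 0.0050,
    "pass_2": 0.0035,
    "reference_chunk": 0.0025,
    "match_chunk": 0.0025,
    "model_cavgs_rejection": 0.0035,
}


def min_population(class_pops, fraction):
    """Smallest class population that is not rejected by the population rule."""
    return math.ceil(sum(class_pops) * fraction)


# Hypothetical class populations (12345 particles in total), for illustration only.
pops = [5000, 4000, 3000, 300, 45, 0]
thresholds = {name: min_population(pops, frac) for name, frac in POP_FRACTIONS.items()}

for name, value in thresholds.items():
    print(f"{name}: classes with pop < {value} are rejected")
```

With these made-up numbers, the class holding 45 particles would fall below the Pass 1 threshold but clear the fixed 0.0035 fraction used by model_cavgs_rejection.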
Learning Model¶
The rejection model is a low-dimensional, monotone feature-vector classifier with explicit hard constraints. In learning-theory terms, the hard rejects are prior validity constraints: they remove classes the model is not asked to rescue or fit. The remaining rows form a supervised training set from manual selections.
The learned artifact has three parts:
- a feature policy, which selects a cumulative family set;
- non-negative feature weights, which define a linear scalar quality score;
- thresholding controls, which govern how the per-dataset score boundary is chosen after k-medoids clustering and optional Otsu thresholding.
The apply-time classifier is partly dataset-adaptive. Features are robustly normalized inside the current dataset. K-medoids and Otsu operate on the current score/feature distribution. The learned parameters provide the inductive bias: which features are active, how the score is weighted, and how aggressively the cluster-derived boundary is shifted or replaced.
The learning objective is empirical risk minimization over manually annotated quality_mode=analyze runs. Feature-weight candidates are learned only from datasets where both manual states remain present after hard rejects, because those are the datasets with a learnable soft boundary. Candidate models are scored over all scoreable datasets using only non-hard-rejected rows. Two-class trainable datasets contribute balanced accuracy. Datasets where only manually good rows remain after hard rejects contribute recall, so they teach the model not to reject good classes that passed the hard gates. Datasets where only manually bad rows remain contribute specificity only when no manually good classes were removed by hard rejects. If manually good classes were entirely removed by hard rejects, the dataset is reported as hard-gate blocked and skipped for soft-model scoring. Hard-rejected rows remain visible in diagnostics, including counts of manually good classes lost to hard gates, but they do not participate in fitting the learned boundary.
Feature-weight search is ab initio for the supplied training data. The learner first constructs an AUC-derived seed:
weight(feature) = max(0, pooled_auc(feature, manual_state) - 0.5)
The weights are normalized after the active feature policy is applied. Each feature policy contributes one deterministic AUC-derived weight vector to the full classifier search. There is no base-weight blending in the current learner. This means each training run derives the scalar model from the current analysis files rather than nudging an inherited default.
Feature-family policies act as a small structural model-selection layer:
| Policy | Active families | Purpose |
|---|---|---|
| microchunk | microchunk | Tests whether scalar analogues of the stream rejector evidence are sufficient. |
| microchunk_plus_score | microchunk, score | Adds stored class correlation/FRC-like evidence. This is the current chunk default. |
| microchunk_plus_signal | microchunk, signal | Adds center/edge and presence evidence without stored score evidence. |
| microchunk_plus_score_signal | microchunk, score, signal | Uses the full current feature bank. |
Learn mode reports feature signal, feature-drop diagnostics, and leave-one-dataset-out feature-policy diagnostics so that the selected model can be interpreted as a compact scientific statement about which evidence family and weight structure generalized on the supplied training set.
Command Modes¶
quality_mode=apply computes quality features, applies the selected model, writes cavgs_quality_features.txt, writes quality_selected_cavgs.mrc and quality_rejected_cavgs.mrc, maps the selected class averages into particle states, annotates cls2D with quality, quality_cluster, and accept, annotates cls3D when its class count matches cls2D, optionally prunes particles with prune=yes, and writes the project.
quality_mode=analyze computes the same model output but treats the existing cls2D state as the manual reference. It writes cavgs_quality_analysis.txt and the selected/rejected stacks. The project selection is left unchanged.
quality_mode=learn reads a file table of cavgs_quality_analysis.txt files and searches for a model specification. It writes a learned model file controlled by fname= and writes cavgs_quality_learn_report.txt.
quality_mode=promote reads a model file from infile= and writes a Fortran promotion snippet controlled by fname=.
apply and analyze require projfile and mskdiam. learn requires filetab. promote requires infile. The commander sets oritype=cls2D, defaults mkdir=yes, and defaults prune=no.
Model Selection¶
quality_model selects a built-in preset. The default is chunk_default_v2. The available built-ins are:
- chunk_default_v2: default chunk/stream-style model.
- pool_default_v2: pool/batch-style model with minimum accepted fraction enforcement.
When infile is supplied, the model file is treated as a complete model and wins over the built-in preset.
Model files use model_version=5 and explicit key-value fields:
- name
- context
- feature_policy
- feature_weights
- boundary_margin
- min_score_separation
- otsu_min_offset
- otsu_max_offset
- cluster_rescue_margin
- min_accept_frac
- use_lowsep_otsu
- use_otsu_window
- use_cluster_rescue
- enforce_min_accept_frac
Unknown model-file keys are rejected.
Feature Bank¶
CAVG_QUALITY_NFEATS is 9. All feature directions are higher_is_better. The model consumes normalized z_* values; raw values are written for diagnostics.
| Index | Feature | Family | Source | Meaning |
|---|---|---|---|---|
| 1 | log_pop | microchunk | cls2D pop | Log class population. |
| 2 | neg_log_res | microchunk | cls2D res | Negative log class resolution estimate. |
| 3 | centered | microchunk | Otsu foreground connected components | Negative normalized centroid displacement of segmented signal. |
| 4 | log_locvar_fg | microchunk | Foreground-mask local variance | Log local variance in the foreground Otsu mask. |
| 5 | log_locvar_bg | microchunk | Background-mask local variance | Log local variance outside the foreground Otsu mask. |
| 6 | corr_frc_proxy | score | cls2D corr when present | Stored class correlation or FRC-like quality proxy. |
| 7 | log_center_edge_snr | signal | Center/edge variance statistics | Log central variance relative to edge variance. |
| 8 | cc_area_frac | microchunk | Foreground connected-component area | Largest foreground component area divided by expected mask area. |
| 9 | presence | signal | Center/border statistics | Central signal excess relative to border background noise. |
Feature families are used by learn mode. Every policy starts with the microchunk family and then appends optional evidence families:
- microchunk
- microchunk_plus_score
- microchunk_plus_signal
- microchunk_plus_score_signal
Features outside the selected policy are encoded by zero weights.
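The policy-to-feature relationship can be sketched as a simple mask over the nine-feature bank. The feature order and family assignments below are copied from the feature table; the helper names `active_features` and `apply_policy` are hypothetical and are not claimed to match the Fortran implementation.

```python
# Feature order and families as listed in the feature bank table.
FEATURES = [
    ("log_pop", "microchunk"),
    ("neg_log_res", "microchunk"),
    ("centered", "microchunk"),
    ("log_locvar_fg", "microchunk"),
    ("log_locvar_bg", "microchunk"),
    ("corr_frc_proxy", "score"),
    ("log_center_edge_snr", "signal"),
    ("cc_area_frac", "microchunk"),
    ("presence", "signal"),
]

# Every policy starts with the microchunk family and appends optional families.
POLICY_FAMILIES = {
    "microchunk": {"microchunk"},
    "microchunk_plus_score": {"microchunk", "score"},
    "microchunk_plus_signal": {"microchunk", "signal"},
    "microchunk_plus_score_signal": {"microchunk", "score", "signal"},
}


def active_features(policy):
    """Names of the features whose family is enabled by the policy."""
    families = POLICY_FAMILIES[policy]
    return [name for name, family in FEATURES if family in families]


def apply_policy(weights, policy):
    """Zero the weights of features that fall outside the selected policy."""
    families = POLICY_FAMILIES[policy]
    return [w if family in families else 0.0
            for w, (_, family) in zip(weights, FEATURES)]


print(active_features("microchunk_plus_score"))
print(apply_policy([1.0] * 9, "microchunk"))
```

Under microchunk_plus_score the two signal-family features are masked out, which matches the zero weights for log_center_edge_snr and presence in the chunk_default_v2 preset.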
Extraction Constants¶
- FOREGROUND_SEG_LP = 30.0
- SIGNAL_METRIC_LP = 10.0
- CAVG_RES_HARD_REJECT_A = 40.0
- POP_FRACTION_HARD_REJECT = 0.0035
- LOCVAR_WINDOW = 10
- MASK_HARD_OUTSIDE_PIXELS = 10
- LOG_EPS = 1.0e-12
- CLIP_Z = 4.0
Hard Rejects¶
Hard rejects are validity gates applied before model fitting, clustering, and training-score calculation. A class average is hard rejected when any of these conditions holds:
- pop <= 0;
- pop < ceiling(sum(pop) * 0.0035);
- res > 40.0;
- the image contains invalid pixels;
- foreground segmentation has no valid component after full-image connected components are pruned;
- a segmented foreground component centroid lies outside the mask radius;
- the largest foreground component has more than 10 pixels outside the mask disc;
- foreground and background local variance are both non-positive.
Hard-rejected classes receive rejected state directly. Their normalized features are set to -CLIP_Z, their model scores are set to -CLIP_Z, and they remain visible in analysis and learning reports.
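The gate logic can be sketched as a pure function over precomputed per-class measurements. This is an illustrative sketch only: the `ClassSummary` fields, the reason strings, and the example values are hypothetical, and the image-processing steps that would produce the measurements (segmentation, component pruning, local variance) are assumed to have run already. The three constants are taken from the extraction constants listed above.

```python
import math
from dataclasses import dataclass

# Constants taken from the extraction constants section.
CAVG_RES_HARD_REJECT_A = 40.0
POP_FRACTION_HARD_REJECT = 0.0035
MASK_HARD_OUTSIDE_PIXELS = 10


@dataclass
class ClassSummary:
    """Hypothetical per-class measurements; field names are illustrative."""
    pop: int
    res: float
    has_invalid_pixels: bool
    n_valid_components: int       # foreground components left after pruning
    centroid_outside_mask: bool   # any component centroid beyond the mask radius
    outside_pixels: int           # largest-component pixels outside the mask disc
    locvar_fg: float
    locvar_bg: float


def hard_reject_reasons(c, total_pop):
    """Return the list of validity gates that fire; an empty list means the class passes."""
    reasons = []
    if c.pop <= 0:
        reasons.append("pop_nonpositive")
    if c.pop < math.ceil(total_pop * POP_FRACTION_HARD_REJECT):
        reasons.append("pop_below_fraction")
    if c.res > CAVG_RES_HARD_REJECT_A:
        reasons.append("res_above_limit")
    if c.has_invalid_pixels:
        reasons.append("invalid_pixels")
    if c.n_valid_components == 0:
        reasons.append("no_valid_foreground")
    if c.centroid_outside_mask:
        reasons.append("centroid_outside_mask")
    if c.outside_pixels > MASK_HARD_OUTSIDE_PIXELS:
        reasons.append("foreground_outside_mask")
    if c.locvar_fg <= 0.0 and c.locvar_bg <= 0.0:
        reasons.append("locvar_degenerate")
    return reasons


# Made-up classes in a dataset of 12345 particles (minimum population 44).
good = ClassSummary(5000, 8.0, False, 1, False, 0, 0.3, 0.1)
bad = ClassSummary(30, 45.0, False, 1, False, 12, 0.3, 0.1)
flat = ClassSummary(400, 12.0, False, 1, False, 0, 0.0, 0.0)
for c in (good, bad, flat):
    print(hard_reject_reasons(c, total_pop=12345))
```

Collecting every firing gate rather than stopping at the first is a choice made here for diagnostics; the text above only requires that any single condition is enough to reject.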
Normalization¶
normalize_cavg_quality_features computes per-feature robust statistics over non-hard-rejected rows only. Each feature is centered by the median and scaled by the Gaussian MAD. Values are clipped to [-CLIP_Z, CLIP_Z]. If a feature has no robust spread, its normalized value is set to zero for trainable rows.
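A minimal sketch of this normalization for a single feature column follows. It assumes that "Gaussian MAD" means the median absolute deviation scaled by the usual Gaussian consistency factor of about 1.4826; that factor, the function name `robust_normalize`, and the example values are assumptions, not details confirmed by the source. Hard-rejected rows are set to -CLIP_Z, as stated in the hard-reject section.

```python
import statistics

CLIP_Z = 4.0
# Assumption: "Gaussian MAD" is the MAD times the Gaussian consistency factor.
MAD_TO_SIGMA = 1.4826


def robust_normalize(values, hard_rejected):
    """Median/MAD z-scores for one feature, using only non-hard-rejected rows."""
    kept = [v for v, hr in zip(values, hard_rejected) if not hr]
    if not kept:
        return [-CLIP_Z] * len(values)
    med = statistics.median(kept)
    mad = MAD_TO_SIGMA * statistics.median([abs(v - med) for v in kept])
    out = []
    for v, hr in zip(values, hard_rejected):
        if hr:
            out.append(-CLIP_Z)          # hard-rejected rows are pinned at the floor
        elif mad <= 0.0:
            out.append(0.0)              # no robust spread: feature carries no signal
        else:
            out.append(max(-CLIP_Z, min(CLIP_Z, (v - med) / mad)))
    return out


# Made-up raw feature values; the last row is hard rejected.
z = robust_normalize([1.0, 2.0, 3.0, 4.0, 100.0, 2.5],
                     [False, False, False, False, False, True])
print(z)
```

The outlier at 100.0 is clipped to CLIP_Z instead of dominating the scale, which is the practical reason to prefer median and MAD over mean and standard deviation here.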
Built-In Presets¶
chunk_default_v2 has context chunk, feature policy microchunk_plus_score, cluster rescue disabled, and minimum accepted fraction disabled.
log_pop 1.355581E-01
neg_log_res 1.808395E-01
centered 4.942806E-02
log_locvar_fg 1.635655E-01
log_locvar_bg 1.726983E-01
corr_frc_proxy 2.108072E-01
log_center_edge_snr 0.000000E+00
cc_area_frac 8.710331E-02
presence 0.000000E+00
boundary_margin 0.15
min_score_separation 0.15
otsu_min_offset 0.35
otsu_max_offset 0.50
cluster_rescue_margin 0.20
min_accept_frac 0.00
use_lowsep_otsu true
use_otsu_window true
use_cluster_rescue false
enforce_min_accept_frac false
pool_default_v2 has context pool, feature policy microchunk_plus_score_signal, cluster rescue enabled, Otsu windowing disabled, and minimum accepted fraction enforcement enabled.
log_pop 1.826303E-01
neg_log_res 2.379928E-01
centered 0.000000E+00
log_locvar_fg 7.328372E-02
log_locvar_bg 6.126274E-02
corr_frc_proxy 1.257789E-01
log_center_edge_snr 1.572098E-01
cc_area_frac 2.139509E-02
presence 1.404466E-01
boundary_margin 1.70
min_score_separation 0.20
otsu_min_offset 0.15
otsu_max_offset 0.50
cluster_rescue_margin 0.20
min_accept_frac 0.95
use_lowsep_otsu true
use_otsu_window false
use_cluster_rescue true
enforce_min_accept_frac true
All model weights are clipped to non-negative values and normalized to sum to one when a model is initialized or read.
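The clip-and-normalize step can be sketched as follows, together with a check that the printed chunk_default_v2 weights already sum to one within rounding. The function name `normalize_weights` is hypothetical, and raising an error for an all-zero vector is an assumption of this sketch; the behavior of the real reader in that case is not described here.

```python
def normalize_weights(weights):
    """Clip negative weights to zero, then rescale so the weights sum to one."""
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped)
    if total <= 0.0:
        # Assumption: this sketch refuses an all-zero vector instead of guessing a fallback.
        raise ValueError("no positive weights to normalize")
    return [w / total for w in clipped]


# chunk_default_v2 weights, copied from the preset listing above.
CHUNK_DEFAULT_V2 = [
    1.355581e-01,  # log_pop
    1.808395e-01,  # neg_log_res
    4.942806e-02,  # centered
    1.635655e-01,  # log_locvar_fg
    1.726983e-01,  # log_locvar_bg
    2.108072e-01,  # corr_frc_proxy
    0.000000e+00,  # log_center_edge_snr
    8.710331e-02,  # cc_area_frac
    0.000000e+00,  # presence
]

print(sum(CHUNK_DEFAULT_V2))
print(normalize_weights([2.0, -1.0, 2.0]))
```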
Classification¶
For non-hard-rejected class averages:
- Compute a linear quality score with matmul(normalized_features, weights).
- Build a pairwise Euclidean distance matrix in normalized feature space using nonzero-weight features.
- Robustly normalize the distance matrix.
- Run two-cluster k-medoids clustering.
- Choose the good cluster as the cluster with the higher mean model score. If the score means tie, choose the larger cluster; if cluster sizes also tie, choose the cluster with the higher medoid score.
- Place the raw threshold halfway between the good-cluster and bad-cluster mean scores.
- If cluster separation is below min_score_separation, use the Otsu score threshold only when use_lowsep_otsu is enabled and Otsu separation is sufficient. Otherwise, accept all trainable class averages as a single cluster.
- If cluster separation is sufficient, start from raw_threshold - boundary_margin.
- If use_otsu_window is enabled, replace the threshold with Otsu when the Otsu threshold is at least otsu_min_offset and at most otsu_max_offset above the raw threshold, and Otsu separation is sufficient.
- If use_cluster_rescue is enabled, also accept good-cluster members within cluster_rescue_margin below the threshold.
- If enforce_min_accept_frac is enabled, accept enough top-scoring trainable rows to satisfy min_accept_frac.
If there are fewer than four trainable rows, the distance matrix is degenerate, clustering fails, or the low-separation branch has no acceptable Otsu threshold, all non-hard-rejected rows are accepted as a single cluster.
threshold_offset is reported as raw_threshold - threshold. A positive value means the effective threshold is lower than the cluster midpoint.
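The scoring and midpoint-threshold steps can be sketched as below, taking the two-cluster labels as given. Several things are assumptions of this sketch rather than confirmed behavior: cluster separation is taken to be the difference between the cluster mean scores, acceptance is taken to be score >= threshold, the tie-break rules are omitted, and the Otsu, cluster-rescue, and minimum-fraction branches are left out. The function names and example scores are hypothetical.

```python
def linear_scores(z_rows, weights):
    """Weighted sum of normalized features for each class average."""
    return [sum(z * w for z, w in zip(row, weights)) for row in z_rows]


def midpoint_threshold(scores, labels, boundary_margin, min_score_separation):
    """Cluster-midpoint threshold shifted by the margin, or None to accept all rows.

    labels holds a 0/1 cluster id per row, e.g. from a two-cluster k-medoids run.
    """
    groups = {0: [], 1: []}
    for s, lab in zip(scores, labels):
        groups[lab].append(s)
    if not groups[0] or not groups[1]:
        return None                      # not a valid two-cluster result
    means = {lab: sum(v) / len(v) for lab, v in groups.items()}
    good = 0 if means[0] >= means[1] else 1   # tie-breaks omitted in this sketch
    bad = 1 - good
    # Assumption: separation is the gap between the two cluster mean scores.
    if means[good] - means[bad] < min_score_separation:
        return None                      # low separation, no Otsu fallback here
    raw = 0.5 * (means[good] + means[bad])
    threshold = raw - boundary_margin
    return {"raw_threshold": raw, "threshold": threshold,
            "threshold_offset": raw - threshold}


# Made-up scores and cluster labels, with the chunk_default_v2 controls.
scores = [1.0, 0.8, 0.6, -0.9, -1.1]
labels = [0, 0, 0, 1, 1]
result = midpoint_threshold(scores, labels, boundary_margin=0.15,
                            min_score_separation=0.15)
accepted = [s >= result["threshold"] for s in scores]
print(result, accepted)
```

With a positive boundary_margin the effective threshold sits below the cluster midpoint, so threshold_offset comes out positive, in line with the sign convention stated above.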
The classifier reports its soft decision explicitly:
- soft_decision=soft_threshold: a learned score threshold was used to reject at least one non-hard-rejected class average.
- soft_decision=hard_only: no additional soft rejection was made after hard preselection.
Common hard-only reasons are too_few_trainable, flat_feature_distances, invalid_two_cluster_result, low_score_separation, soft_threshold_accepts_all, and no_trainable_after_hard. This is the real-world abstention path: if the post-hard feature distribution does not contain a credible low-quality partition, the model stops after hard preselection.
Analysis Output¶
cavgs_quality_analysis.txt starts with # model_cavgs_rejection_analysis_version=4. It includes:
- dataset and model identity;
- selected/rejected counts;
- soft_decision, soft_reason, and the number of trainable rows rejected by the soft model;
- confusion metrics against the manual cls2D state;
- score AUC;
- raw threshold, threshold offset, and effective threshold;
- best balanced-accuracy and F1 thresholds against the manual reference;
- model fields as comments;
- feature inventory;
- per-feature AUC, robust separation, current weight, and suggested weight;
- threshold scan;
- one row per class average containing state, hard-reject flag, quality cluster, score, manual state, raw features, and normalized features.
cavgs_quality_features.txt uses the same per-class row format without the analysis summary comments.
Learning¶
quality_mode=learn reads analysis files generated by the current analysis table format. Required columns are:
- dataset_id
- hard_reject
- manual_state
- all z_* feature columns for the 9-feature bank
Hard-rejected rows are kept in reports but excluded from feature-weight estimation, candidate scoring, feature-signal diagnostics, feature-drop diagnostics, and feature-policy diagnostics.
Learn mode assigns each dataset an automatic role:
- balanced: both manual good and manual bad rows remain after hard rejects; used for feature weights and scored by balanced accuracy.
- trainable_good_only: only manual good rows remain after hard rejects; not used for feature weights, scored by guarded recall.
- trainable_bad_only: only manual bad rows remain after hard rejects and no manually good rows were hard rejected; not used for feature weights, scored by specificity.
- skip_hard_gate_blocked: only manual bad rows remain because all manually good rows were hard rejected; reported but skipped for soft-model fitting and scoring.
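The role assignment can be sketched as a small decision function over per-row flags. The boolean encoding of manual_state and hard_reject, the function name `dataset_role`, and the None return for a dataset with no surviving rows are assumptions made for this example; the role names are the ones listed above.

```python
def dataset_role(rows):
    """Assign a learn-mode role from (manual_good, hard_reject) flags per class average."""
    good_kept = sum(1 for good, hr in rows if good and not hr)
    bad_kept = sum(1 for good, hr in rows if not good and not hr)
    good_lost = sum(1 for good, hr in rows if good and hr)
    if good_kept and bad_kept:
        return "balanced"
    if good_kept:
        return "trainable_good_only"
    if bad_kept:
        # Only bad rows remain: blocked if the hard gates removed the good ones.
        return "skip_hard_gate_blocked" if good_lost else "trainable_bad_only"
    return None  # assumption: nothing survives the hard gates, role left undefined


# Made-up datasets: each tuple is (manual_good, hard_reject).
print(dataset_role([(True, False), (False, False), (False, True)]))
print(dataset_role([(True, False), (True, False), (False, True)]))
print(dataset_role([(False, False), (False, False)]))
print(dataset_role([(True, True), (False, False)]))
```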
Learn mode derives candidate feature weights from the training data. The starting candidate is:
weight(feature) = max(0, pooled_auc(feature, manual_state) - 0.5)
The weights are normalized to sum to one. If all active weights are zero for a policy, uniform weights are assigned over the active policy features. The full classifier search chooses among the AUC-derived candidate vectors for the available feature policies.
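The seed-weight rule can be sketched with a direct pairwise AUC. This sketch assumes that "pooled" means the rows from all contributing datasets are concatenated before the AUC is computed, and that ties count as half a win; both are assumptions, as are the function names `auc` and `seed_weights` and the toy data.

```python
def auc(values, labels):
    """Probability that a manually good row outscores a manually bad row (ties count 0.5)."""
    pos = [v for v, good in zip(values, labels) if good]
    neg = [v for v, good in zip(values, labels) if not good]
    if not pos or not neg:
        return 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def seed_weights(columns, labels, active):
    """weight = max(0, AUC - 0.5) on active features, normalized to sum to one."""
    raw = [max(0.0, auc(col, labels) - 0.5) if on else 0.0
           for col, on in zip(columns, active)]
    total = sum(raw)
    if total > 0.0:
        return [w / total for w in raw]
    # All active weights are zero: fall back to uniform weights over active features.
    n_active = sum(1 for on in active if on)
    return [1.0 / n_active if on else 0.0 for on in active]


# Toy data: three normalized feature columns over four pooled rows.
labels = [True, True, False, False]
columns = [
    [2.0, 1.0, 0.0, -1.0],   # separates good from bad (AUC 1.0)
    [0.0, 1.0, 0.0, 1.0],    # uninformative (AUC 0.5)
    [-1.0, 0.0, 1.0, 2.0],   # anti-correlated (AUC 0.0, clipped to zero weight)
]
print(seed_weights(columns, labels, active=[True, True, True]))
```

Because the seed is max(0, AUC - 0.5), a feature that ranks bad classes above good ones gets zero weight instead of a negative one, which keeps the score monotone in every feature.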
Good-only datasets use a recall guard in the learn score. Raw recall is still reported in the dataset table, but the learn score applies an additional shortfall penalty below the recall floor. This prevents a candidate model from improving balanced datasets by rejecting a large number of classes from datasets that contain only trainable manual positives.
Learn mode searches:
- feature policies: microchunk, microchunk_plus_score, microchunk_plus_signal, microchunk_plus_score_signal;
- feature weights: one AUC-derived candidate for each feature policy;
- min_score_separation: 0.05, 0.10, 0.15, 0.20, 0.30;
- chunk boundary_margin: -0.60, -0.50, -0.40, -0.30, -0.25, -0.15, -0.05, 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50;
- pool boundary_margin: -0.60, -0.50, -0.40, -0.30, -0.25, -0.15, -0.05, 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00, 1.10, 1.20, 1.30, 1.40, 1.50, 1.60, 1.70, 1.80, 1.90, 2.00, 2.25, 2.50, 2.75, 3.00, 3.50, 4.00;
- use_lowsep_otsu: false, true;
- use_otsu_window: false, true;
- otsu_min_offset: 0.05, 0.10, 0.15, 0.25, 0.35 when the Otsu window is enabled;
- otsu_max_offset: 0.40, 0.50, 0.65 when the Otsu window is enabled;
- min_accept_frac: 0.50, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90, 0.925, 0.95, 0.975, 1.00 for pool-context training.
Each candidate is evaluated by running the full classifier on every scoreable training dataset and averaging the role-specific learn score. If multiple candidates tie, the selected candidate is the one closest to the starting model in the searched threshold controls.
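The role-specific averaging can be sketched as follows. This is a simplified illustration: it uses raw recall for good-only datasets and leaves out the recall-guard shortfall penalty, whose exact form is not given here. The function names and the example predictions are hypothetical, and each dataset is assumed to contain only non-hard-rejected rows.

```python
def role_score(role, pred_accept, manual_good):
    """Role-specific score for one dataset, or None when the dataset is skipped."""
    tp = sum(1 for p, g in zip(pred_accept, manual_good) if p and g)
    fn = sum(1 for p, g in zip(pred_accept, manual_good) if not p and g)
    tn = sum(1 for p, g in zip(pred_accept, manual_good) if not p and not g)
    fp = sum(1 for p, g in zip(pred_accept, manual_good) if p and not g)
    if role == "balanced":
        return 0.5 * (tp / (tp + fn) + tn / (tn + fp))   # balanced accuracy
    if role == "trainable_good_only":
        return tp / (tp + fn)                            # raw recall, guard omitted
    if role == "trainable_bad_only":
        return tn / (tn + fp)                            # specificity
    return None                                          # e.g. skip_hard_gate_blocked


def macro_learn_score(datasets):
    """Unweighted mean of the role-specific scores over scoreable datasets."""
    scores = [role_score(role, pred, manual) for role, pred, manual in datasets]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


# Made-up candidate predictions on three datasets.
datasets = [
    ("balanced", [True, True, False, True], [True, True, False, False]),
    ("trainable_good_only", [True, True, True, False], [True, True, True, True]),
    ("skip_hard_gate_blocked", [False, False], [False, False]),
]
print(macro_learn_score(datasets))
```

Averaging per dataset rather than per row keeps a single large dataset from dominating the candidate ranking.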
The learn report includes the search grid, macro_learn_score, suggested weights, selected model, dataset-role diagnostics, Otsu ablation diagnostics, feature-signal diagnostics, feature-drop diagnostics, leave-one-dataset-out feature-policy diagnostics, top candidates, best ties, and per-dataset confusion metrics.
Promotion¶
quality_mode=promote writes a Fortran snippet for adding a learned model as a built-in preset in simple_cavg_quality_model.f90. The snippet also lists the UI/help files that need the model name added when a new built-in is introduced.
Example Commands¶
Analyze a manually selected project:
simple_exec prg=model_cavgs_rejection \
quality_mode=analyze \
projfile=my_project.simple \
mskdiam=180 \
mkdir=yes
Apply the default chunk model:
simple_exec prg=model_cavgs_rejection \
quality_mode=apply \
projfile=my_project.simple \
mskdiam=180 \
mkdir=yes
Train from a file table of analysis outputs:
simple_exec prg=model_cavgs_rejection \
quality_mode=learn \
filetab=analysis_files.txt \
fname=cavgs_quality_model_chunk_learned.txt \
mkdir=yes
Apply a learned model file:
simple_exec prg=model_cavgs_rejection \
quality_mode=apply \
infile=cavgs_quality_model_chunk_learned.txt \
projfile=my_project.simple \
mskdiam=180 \
mkdir=yes
Promote a learned model:
simple_exec prg=model_cavgs_rejection \
quality_mode=promote \
infile=cavgs_quality_model_chunk_learned.txt \
fname=cavgs_quality_model_chunk_builtin_code.txt \
mkdir=yes