Final Report

Class: CEN 4935 Senior Software Engineering Projects II

Team Members: John Holik (Team Lead), Christopher Foster, Claiton Pinto, Jesse Prieto

Sponsor: Dr. Ahmed Elshall

Project: Red Tide Reanalysis and Forecast Uncertainty Quantification

1. Introduction

This project developed an integrated data assimilation and machine learning pipeline for forecasting harmful algal blooms caused by Karenia brevis, commonly known as red tide, along the Gulf coast of Florida. Red tide events have substantial consequences for public health, coastal economies, fisheries, and marine ecosystems, and improved early warning is a recognized priority for regional water resource managers. The project integrates a physics-based hydrological forward model with four ensemble-based uncertainty quantification methods and a pre-trained Random Forest bloom classifier to produce calibrated probabilistic forecasts of red tide conditions.

The work was conducted in support of Dr. Ahmed Elshall’s research program and builds on prior hydrological modeling and machine learning work developed through that group. The project does not aim to prevent red tide events, which forecasting cannot do. Instead, it aims to improve the reliability, calibration, and interpretability of red tide probability forecasts so that decision-makers have quantified uncertainty alongside point predictions.

2. Problem Statement

Red tide events are driven by a combination of hydrological, meteorological, and biogeochemical factors that are imperfectly observed and imperfectly modeled. Traditional deterministic forecasts report a single point estimate of bloom probability without any indication of forecast confidence. For operational decisions such as public health advisories, beach closures, and fisheries management, a point estimate without uncertainty is insufficient: a 40 percent bloom probability accompanied by narrow, well-calibrated confidence bounds is a fundamentally different decision input than the same 40 percent point estimate with broad or miscalibrated bounds.

This project addresses that gap by developing a reanalysis pipeline that fuses Watershed Assessment Model (WAM) simulated hydrological ensembles with sparse observational data using four distinct uncertainty quantification methods, then passes the resulting probabilistic reanalysis products into a Random Forest bloom classifier to produce calibrated bloom probability forecasts.

Technologies learned and applied. The following technologies were learned or applied during this project: Python 3 and the scientific stack (NumPy, pandas, SciPy, scikit-learn) for the complete analysis pipeline; Jupyter notebooks for exploratory data analysis, data curation, and visualization; the Watershed Assessment Model (WAM) as the physics-based forward model producing hydrological ensemble trajectories for the Peace River watershed; the Ensemble Kalman Filter (EnKF) for sequential Bayesian data assimilation of sparse observations; Generalized Likelihood Uncertainty Estimation (GLUE) for behavioral-parameter-based uncertainty quantification; Residual Bootstrap (IID and AR(1) variants) for resampling-based uncertainty quantification; Linear Propagation of Uncertainty (LPU) for first-order Jacobian-based uncertainty propagation; the scikit-learn RandomForestClassifier as the pre-trained bloom prediction model; HydroErr for goodness-of-fit metrics including NSE, KGE, RMSE, and percent bias; pytest and coverage for unit, integration, and smoke testing; joblib for model persistence; and the USGS NWIS, FDEP, and Manasota Regional Water Supply data portals for observational hydrological and water quality data.

The four UQ methods were selected to span the principal families of hydrological uncertainty quantification: a resampling method (Bootstrap), a likelihood-filtered Monte Carlo method (GLUE), a sequential Bayesian data assimilation method (EnKF), and a first-order analytical method (LPU). This design supports direct comparison of method behavior under identical input conditions.

3. Requirements

Functional requirements. The system must ingest daily hydrological and water quality observations from multiple source stations; load WAM-simulated ensemble trajectories; run each of the four UQ methods on the combined dataset to produce per-method ensemble reanalysis products of 200 members each; compute deterministic and probabilistic performance metrics on each reanalysis product; pass the reanalysis products into the pre-trained Random Forest classifier member-by-member to produce ensemble bloom probability forecasts; and write all outputs as standardized CSV files with consistent naming conventions.

Non-functional requirements. The pipeline must be reproducible given a fixed random seed. Method implementations must be independently testable via a unit test suite. Ensemble outputs must preserve member ordering across methods to allow pairwise comparison. The system must clearly distinguish between the calibration window where observations are available and the forecast extrapolation period where observations are absent.

User needs. The primary user is Dr. Ahmed Elshall’s research group, which requires reanalysis products that can serve as input features for downstream red tide forecasting research. Secondary users include environmental and public health agencies that require calibrated uncertainty information to support advisories, beach closure decisions, and shellfish harvesting closures.

Constraints. Observational data at the primary monitoring location, USGS Station 02296750 at Arcadia, Florida, is limited to 41 aligned total nitrogen and total phosphorus grab samples between 6 January 1999 and 29 September 2004. This observational sparsity is the binding constraint on UQ method calibration and is the principal limitation of the current results. Computational constraints included running 200-member ensembles across four methods on development workstations within typical project timelines. Time constraints imposed by the two-semester capstone sequence limited the scope of observational data acquisition and the breadth of method configurations that could be tested.

4. Design

System architecture. The pipeline is organized into five layers. A data ingestion layer loads observational data from multiple portals and WAM simulation output. A quality assurance and quality control layer validates incoming observations against checks for negatives, gaps, duplicates, stuck sensors, and rate-of-change spikes. A UQ method layer runs each of the four methods on the combined dataset to produce 200-member ensemble reanalysis products. A metrics and scoring layer computes deterministic and probabilistic performance metrics. An ML inference layer passes each ensemble member independently through the pre-trained Random Forest bloom classifier, producing an ensemble of bloom probability trajectories rather than a single point prediction.
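The five QAQC checks named above can be illustrated with a minimal NumPy sketch. The function name `qaqc_flags`, the integer day index, and every threshold are assumptions made for illustration; they are not taken from the project's QAQC module.

```python
import numpy as np


def qaqc_flags(days, values, max_gap_days=1, stuck_run=5, max_step=10.0):
    """Flag common defects in a daily observation series.

    days   : integer day indices, sorted ascending (hypothetical encoding)
    values : observed values aligned with `days`
    Thresholds are illustrative assumptions, not the project's settings.
    Returns a dict of boolean masks, one entry per check.
    """
    days = np.asarray(days, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    if values.size == 0 or values.size != days.size:
        raise ValueError("days and values must be non-empty and equal length")

    # Spacing to the previous record; the first record gets a spacing of 1.
    step_days = np.diff(days, prepend=days[0] - 1)

    # Length of the run of identical consecutive values ending at each record.
    run_len = np.ones(values.size, dtype=np.int64)
    for i in range(1, values.size):
        if values[i] == values[i - 1]:
            run_len[i] = run_len[i - 1] + 1

    # Absolute change from the previous record; zero for the first record.
    delta = np.abs(np.diff(values, prepend=values[0]))

    return {
        "negative": values < 0.0,
        "duplicate": step_days == 0,
        "gap": step_days > max_gap_days,
        "stuck": run_len >= stuck_run,
        "spike": delta > max_step,
    }


days = [0, 1, 2, 2, 5, 6, 7, 8, 9, 10]
values = [1.0, -0.5, 2.0, 2.0, 2.0, 2.0, 2.0, 50.0, 3.0, 3.0]
flags = qaqc_flags(days, values)
flagged = {name: np.flatnonzero(mask).tolist() for name, mask in flags.items()}
```

In this sketch the rate-of-change check flags both the jump into an outlier and the jump back out of it, so a single bad value produces two spike flags.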

Data sources. The project integrates data from the following sources, all beginning 1 January 1999: USGS Stations 02296750 (Peace River at Arcadia, Florida), 02297330, and 270318081593100; Florida Department of Environmental Protection Station 3556; Manasota Regional Water Supply Stations PR14 and PR18; a weekly-interpolated dataset published by the Elshall research group at https://aselshall.github.io/redtides/machineLearning/data.html; and WAM model output for the Peace River watershed. The train/test split for the Random Forest bloom classifier is temporal, with 1 January 2019 as the cutoff. The UQ method calibration window is 1999 through 2004, constrained by observation availability at Station 02296750.

Forward model. The Watershed Assessment Model produces daily simulated trajectories of hydrological and water quality variables for the Peace River watershed. Within this project, WAM output is treated as a fixed background forecast rather than as a callable dynamical model. The 787 co-located daily timesteps between WAM output and observations form the basis for assimilation and scoring.

Uncertainty quantification methods. Each of the four methods produces a 200-member ensemble reanalysis from the same WAM background and the same observational dataset. Bootstrap (IID and AR(1) variants) resamples residuals between WAM output and observations to generate ensemble members, with the AR(1) path selected automatically when the Durbin-Watson statistic of the residual series falls below 1.5. GLUE generates a Monte Carlo parameter ensemble and filters it by a likelihood threshold to produce a behavioral parameter set, whose model runs constitute the reanalysis ensemble. EnKF applies sequential Kalman updates to assimilate each available observation into the WAM ensemble, with member-level perturbations and optional inflation applied within the forecast step. LPU uses first-order propagation of uncertainty based on Jacobian and parameter covariance, with SVD fallback for rank-deficient Jacobians.
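The Bootstrap path selection described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the lag-1 least-squares estimate of the AR(1) coefficient, and the clipping of members at zero are all assumptions. Only the 200-member size, the Durbin-Watson threshold of 1.5, and the IID/AR(1) split come from the text.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 for uncorrelated residuals, lower
    values indicate positive lag-1 autocorrelation."""
    r = np.asarray(residuals, dtype=np.float64)
    return float(np.sum(np.diff(r) ** 2) / np.sum(r ** 2))


def residual_bootstrap(background, residuals, n_members=200,
                       dw_threshold=1.5, seed=42):
    """Build an ensemble by adding resampled residuals to a fixed background.

    Returns (members, variant, dw) where members has shape (n_members, T).
    """
    rng = np.random.default_rng(seed)
    background = np.asarray(background, dtype=np.float64)
    r = np.asarray(residuals, dtype=np.float64)
    T = background.size
    dw = durbin_watson(r)

    if dw < dw_threshold:
        # AR(1) variant: estimate phi, resample centered innovations,
        # then rebuild autocorrelated noise recursively.
        phi = float(np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2))
        innovations = r[1:] - phi * r[:-1]
        innovations = innovations - innovations.mean()
        eps = rng.choice(innovations, size=(n_members, T), replace=True)
        noise = np.empty((n_members, T))
        noise[:, 0] = rng.choice(r, size=n_members, replace=True)
        for t in range(1, T):
            noise[:, t] = phi * noise[:, t - 1] + eps[:, t]
        variant = "ar1"
    else:
        # IID variant: resample centered residuals independently per timestep.
        noise = rng.choice(r - r.mean(), size=(n_members, T), replace=True)
        variant = "iid"

    # Assumption: concentrations and flows cannot be negative, so clip at zero.
    members = np.clip(background + noise, 0.0, None)
    return members, variant, dw
```

Because the random generator is created from the seed inside the function, two calls with the same inputs return identical ensembles, which matches the reproducibility requirement in Section 3.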

Figure 1. Representative Bootstrap ensemble for a single variable over the calibration window.

Bloom classifier. The Random Forest classifier is a pre-trained scikit-learn RandomForestClassifier with 100 estimators, balanced class weights, fixed random seed of 42, and 15 input features. The feature set includes Karenia brevis concentration lags, sea surface height (zos), salinity, water temperature, wind forcing, Peace River discharge and nutrient loading (TN, TP) with lags, and a four-week rolling discharge average. The classifier was trained on weekly-interpolated data spanning 1993 through 2018 (1,354 training weeks) with a held-out test set of 259 weeks beginning 2019, achieving a balanced accuracy of 0.887. Feature scaling uses a fitted RobustScaler persisted alongside the model artifact. The classifier is applied independently to each of the 200 ensemble members, producing an ensemble of bloom probability trajectories.

Engineering, science, and mathematics principles. The pipeline draws on principles from all three fields across its major components. From science, the project uses the Watershed Assessment Model as a physics-based forward operator, and the ensemble data assimilation framework is grounded in Bayesian inference, which provides the theoretical basis for combining prior model information with sparse observations. The selection of four methods spanning different UQ families (resampling, likelihood-filtered Monte Carlo, sequential Bayesian assimilation, and first-order analytical propagation) reflects a systematic scientific approach to comparing methodological options rather than committing to one a priori.

From mathematics, each method rests on a specific and well-documented mathematical foundation. Bootstrap applies resampling theory, with an AR(1) variant selected automatically via the Durbin-Watson statistic when residuals exhibit autocorrelation. GLUE filters a Monte Carlo parameter ensemble by a likelihood threshold to produce a behavioral subset. EnKF applies sequential Kalman updates derived from Bayesian filtering theory, using perturbed observations. LPU uses first-order Taylor expansion with parameter covariance estimated via the Jacobian pseudo-inverse, with SVD fallback for rank-deficient cases. The scoring framework relies on established statistical quantities (NSE, KGE, CRPS, coverage probability, spread-skill ratio), each with standard mathematical definitions in hydrology and ensemble forecast verification.
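The perturbed-observation Kalman update mentioned above can be sketched for a generic linear observation operator. The function name `enkf_analysis`, the operator `H`, and the observation error covariance `R` are illustrative assumptions; the project's state layout and error settings are not given in this report.

```python
import numpy as np


def enkf_analysis(ensemble, y_obs, H, R, rng):
    """One EnKF analysis step with perturbed observations.

    ensemble : (n_members, n_state) forecast ensemble
    y_obs    : (n_obs,) observation vector
    H        : (n_obs, n_state) linear observation operator
    R        : (n_obs, n_obs) observation error covariance
    """
    X = np.asarray(ensemble, dtype=np.float64)
    n_members = X.shape[0]

    # Sample forecast covariance from ensemble anomalies.
    A = X - X.mean(axis=0)
    P = A.T @ A / (n_members - 1)

    # Kalman gain K = P H^T (H P H^T + R)^-1, computed with a linear solve.
    S = H @ P @ H.T + R
    K = np.linalg.solve(S, H @ P).T

    # Each member assimilates its own noisy copy of the observation.
    perturbations = rng.multivariate_normal(np.zeros(len(y_obs)), R,
                                            size=n_members)
    innovations = (y_obs + perturbations) - X @ H.T
    return X + innovations @ K.T


rng = np.random.default_rng(42)
prior = rng.normal(10.0, 2.0, size=(200, 1))   # toy scalar state
posterior = enkf_analysis(prior, np.array([12.0]),
                          H=np.array([[1.0]]), R=np.array([[1.0]]), rng=rng)
```

In the toy call the posterior mean moves from the prior mean toward the observation and the ensemble variance shrinks. Repeated updates of this kind without any spread maintenance are one route to the underdispersion discussed in Section 7.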

From engineering, the project applied principles of modular software design, with UQ methods implemented as independently testable submodules sharing a common BaseUQMethod interface. Reproducibility is maintained through fixed random seeds and standardized CSV output formats. Systematic quality assurance is implemented as an automated QAQC layer applied to all ingested observations. The pipeline is structured as a tested Python package rather than as ad-hoc scripts, enabling repeated execution and systematic comparison across methods and over future input data.
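A common interface of this kind might look like the sketch below. Only the names `BaseUQMethod` and `EnsembleResult` and the idea of a method registry come from this report; every field, method signature, the `METHOD_REGISTRY` dictionary, and the toy subclass are hypothetical and are shown only to illustrate the design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class EnsembleResult:
    """Container for one method's output (fields are hypothetical)."""
    method: str
    members: np.ndarray  # shape (n_members, T)


class BaseUQMethod(ABC):
    """Shared contract: every method turns a background into an ensemble."""
    name = "base"

    def __init__(self, n_members=200, seed=42):
        self.n_members = n_members
        self.seed = seed

    @abstractmethod
    def _generate(self, background, rng):
        """Return an array of shape (n_members, T)."""

    def run(self, background):
        background = np.asarray(background, dtype=np.float64)
        # A fresh generator per run keeps results reproducible for a seed.
        rng = np.random.default_rng(self.seed)
        members = np.asarray(self._generate(background, rng), dtype=np.float64)
        expected = (self.n_members, background.size)
        if members.shape != expected:
            raise ValueError(f"expected shape {expected}, got {members.shape}")
        return EnsembleResult(method=self.name, members=members)


METHOD_REGISTRY = {}  # hypothetical registry keyed by method name


def register(cls):
    METHOD_REGISTRY[cls.name] = cls
    return cls


@register
class ToyGaussianMethod(BaseUQMethod):
    """Placeholder method used only to exercise the interface."""
    name = "toy_gaussian"

    def _generate(self, background, rng):
        noise = rng.normal(0.0, 1.0, size=(self.n_members, background.size))
        return background + noise


result = METHOD_REGISTRY["toy_gaussian"](n_members=200, seed=42).run(
    np.linspace(5.0, 10.0, 30))
```

Putting the shape check and seeding in the base class means each method's unit tests can focus on method-specific behavior, while the shape and reproducibility checks are enforced in one place.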

Broader context: public health, safety, welfare, and global/cultural/social/environmental/economic factors. The project’s motivation and design reflect considerations inherent to harmful algal bloom forecasting on the Florida Gulf coast. Red tide events cause respiratory symptoms in exposed populations through airborne brevetoxins, create the need for timely beach closures and shellfish harvest advisories, and impose substantial economic costs on coastal tourism and commercial fishing. The affected communities depend directly on healthy marine ecosystems that red tide events threaten. The emphasis on calibrated probabilistic forecasts, rather than confident point predictions, reflects a design response to how forecast products are used in public-interest contexts: decision-makers responsible for public health advisories and economic mitigation require quantified uncertainty to support risk-based actions, and overconfident point predictions provide an insufficient basis for precautionary decision-making. The explicit separation between the calibration window and the extrapolation period in the Results section reflects the same design value, communicating clearly where the reanalysis products are supported by observations and where they are not.

5. Implementation

Implementation approach. The pipeline is implemented as a Python package named red_tide_reanalysis with modular submodules for each of the four UQ methods, shared core interfaces (BaseUQMethod, EnsembleResult, a method registry), data ingestion and QAQC modules, metrics computation, and CSV writers. A separate ML subpackage wraps the pre-trained Random Forest classifier and provides feature construction, inference, and integration with the reanalysis outputs. Jupyter notebooks handle exploratory analysis and data curation upstream of the packaged pipeline.

Programming languages, frameworks, and tools. The implementation is written in Python 3, using pandas for data frame operations, NumPy and SciPy for numerical routines including the Jacobian-based LPU formulation, scikit-learn for the Random Forest classifier, joblib for model persistence, and HydroErr for KGE and NSE computation. Version control is via Git, dependency management uses pyproject.toml, and testing is via pytest with coverage configured against the src directory.

Major features implemented. The completed implementation includes the four UQ methods, each producing 200-member ensembles; the data ingestion pipeline for the full set of observational stations identified in Section 4; deterministic and probabilistic scoring routines (NSE, KGE, RMSE, CRPS, CRPSS, coverage probability, spread-skill ratio, and rank histograms); CSV writers for ensemble outputs and bloom probabilities; and the ML inference pipeline that runs the classifier on each ensemble member.

Team functionality. John Holik served as team lead. Christopher Foster, Claiton Pinto, and Jesse Prieto collaborated on weekly tasks coordinated through team meetings and a dedicated Discord channel. The team established objectives on a weekly cadence and distributed tasks among members based on availability and familiarity with specific pipeline components.

Professional and ethical responsibilities. The pipeline’s design and reporting reflect several practices consistent with professional responsibility norms in scientific software. The transparent reporting of method limitations, including the identification of EnKF ensemble collapse and LPU spread failure, communicates negative results honestly rather than selecting favorable comparisons. The explicit acknowledgment that the 1999 to 2004 calibration window limits the validity of post-2004 uncertainty bounds for Bootstrap, GLUE, and LPU reflects scientific integrity over marketability: the report does not claim validation where observational support is absent. The automated QAQC layer applied to ingested observations reflects recognition that downstream users relying on pipeline outputs deserve data that has been systematically screened for common defects. The adoption of test-driven development in later phases, with failing tests written before implementation, reflects an engineering ethic of specifying behavior by executable verification rather than informal promise. These practices are collectively responsive to the public-interest context described above: a pipeline producing inputs to future red tide forecasting work should be held to standards of reproducibility, transparency, and honest reporting of limitations, because downstream users cannot independently verify choices made upstream.

6. Testing

Testing framework and coverage. The pipeline is tested using pytest, with coverage configured over the src/red_tide_reanalysis package. The test suite comprises 17 test files and approximately 2,156 lines of test code, covering every module in the package. UQ method tests validate output shape of form (n_members, T), non-negativity, finiteness, seed reproducibility, and method-specific internals: the Durbin-Watson-driven AR(1) path in Bootstrap, divergence event tracking and inflation logic in EnKF, and SVD fallback in LPU. I/O tests cover the WAM loader, observation loader, and CSV writers. QAQC tests validate detection of negatives, gaps, duplicates, stuck-sensor patterns, and rate-of-change spikes. Metrics tests cover deterministic scores (NSE, KGE, RMSE), probabilistic scores (CRPS, rank histograms, spread-skill ratio), and the metrics writer. The ML subpackage is tested at three levels: feature builder unit tests, inference unit tests, and an end-to-end integration test that exercises the full ML pipeline when model artifacts are available.

Test-driven corrections and engineering judgment. Several defects were identified by the test suite or by running analysis scripts, and each led to a specific engineering correction grounded in evidence rather than speculation.

A float32 quantile aggregation bug was identified when run_inference returned a float32 array and, with only three ensemble members available during a test scenario, np.quantile(p, 0.05) on float32 data rounded above p.mean(), violating the invariant that the 5th percentile must not exceed the mean. The correction cast the probability array to float64 and clamped the 5th percentile to the minimum of the computed quantile and the mean (commit c80b3d0).
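A minimal sketch of the described correction is shown below. The function name `summarize_bloom_probability` and the returned dictionary are assumptions; only the float64 cast and the clamp of the 5th percentile to the mean are taken from the description above.

```python
import numpy as np


def summarize_bloom_probability(member_probs):
    """Aggregate per-member bloom probabilities of shape (n_members, T).

    The array is cast to float64 before aggregation, and the 5th percentile
    is clamped so that it can never exceed the ensemble mean.
    """
    p = np.asarray(member_probs, dtype=np.float64)
    mean = p.mean(axis=0)
    p05 = np.minimum(np.quantile(p, 0.05, axis=0), mean)
    return {"mean": mean, "p05": p05}


# Three members, two timesteps, stored as float32 as in the failing scenario.
probs = np.array([[0.3, 0.1],
                  [0.3, 0.2],
                  [0.3, 0.9]], dtype=np.float32)
summary = summarize_bloom_probability(probs)
```

The clamp is a guard for the invariant rather than a change to the statistic: whenever the computed quantile is already at or below the mean, the output is unchanged.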

A station ID suffix misalignment was identified when the Phase 09 CLI’s glob-based prefix extraction failed to group all four methods’ ensemble CSVs. The root cause was inconsistent use of the _peace_river suffix across the GLUE, EnKF, and LPU notebook outputs. The correction standardized the suffix across all four methods (quick fix 260403-p2x, commit cd34db4).

A notebook filepath, dependency, and documentation consistency issue was corrected by replacing a space with an underscore in a filename, adding a ../ path prefix, pinning scipy>=1.13 in pyproject.toml, and correcting PROJECT.md to reflect the classifier’s actual 15-feature count (previously mis-stated as 17) (quick fix 260403-p6a, commits 6805a34 and 15d9c4a).

Phases 10 and 11 adopted a test-driven development pattern in which failing tests were written before implementation for the score_ensemble and score_baseline functions (commits fcb93b0 and ab2bfc8), ensuring scoring behavior was specified by executable tests before being implemented.

Known testing gaps. The exploratory scripts in the Scripts/ directory, including analyze_station_bias.py, analyze_temporal_overlap.py, and several plotting utilities, are not covered by the automated test suite. Defects in these scripts were caught by interactive execution rather than by pytest, and the corrective fixes did not add regression tests. This is an acknowledged gap in testing coverage and is identified as future work in Section 10.

7. Results

Uncertainty quantification benchmark. The four UQ methods were compared on the 1999-2004 calibration window using the 200-member ensembles produced by each method against the 41 aligned observations at USGS Station 02296750. Deterministic accuracy is reported as Nash-Sutcliffe Efficiency (NSE) and Kling-Gupta Efficiency (KGE). Probabilistic performance is reported as Continuous Ranked Probability Score (CRPS), Continuous Ranked Probability Skill Score (CRPSS), coverage probability, and spread-skill ratio. Values are from notebooks/data/outputs/stats/method_comparison.csv.

Method	NSE	KGE	CRPS	CRPSS	Coverage	Spread-Skill
Bootstrap	0.851	0.870	7.85	0.718	0.895	0.772
GLUE	0.861	0.874	7.63	0.726	0.941	1.021
EnKF	0.963	0.938	3.89	0.861	0.378	0.440
LPU	0.864	0.900	9.10	0.674	0.085	0.014

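All four methods use 200 ensemble members. The best value in each column is EnKF for NSE, KGE, CRPS, and CRPSS, and GLUE for coverage (closest to the nominal 95 percent level) and spread-skill ratio (closest to 1.0).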
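For reference, the deterministic scores and the CRPS in the table follow standard definitions, sketched below in plain NumPy. The project reports using HydroErr for NSE and KGE, so these functions are stand-ins for illustration. The ensemble CRPS estimator shown and the generic reference forecast in `crpss` are assumptions, since the report does not state which estimator or reference the pipeline uses.

```python
import numpy as np


def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return float(1.0 - np.sum((obs - sim) ** 2)
                 / np.sum((obs - obs.mean()) ** 2))


def kge(sim, obs):
    """Kling-Gupta Efficiency (Gupta et al., 2009 form)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]          # linear correlation
    alpha = sim.std() / obs.std()            # variability ratio
    beta = sim.mean() / obs.mean()           # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2
                               + (beta - 1) ** 2))


def crps_ensemble(members, obs):
    """Per-timestep CRPS from an ensemble of shape (n_members, T),
    using the estimator E|X - y| - 0.5 * E|X - X'|."""
    members, obs = np.asarray(members, float), np.asarray(obs, float)
    term_obs = np.abs(members - obs).mean(axis=0)
    term_pair = np.abs(members[:, None, :]
                       - members[None, :, :]).mean(axis=(0, 1))
    return term_obs - 0.5 * term_pair


def crpss(crps, crps_reference):
    """Skill score relative to a reference forecast (reference is assumed)."""
    return float(1.0 - np.mean(crps) / np.mean(crps_reference))
```

For a deterministic forecast the pairwise term vanishes and the CRPS reduces to the absolute error, which is why CRPS rewards both accuracy and appropriate spread.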

Interpretation. The benchmark reveals a meaningful trade-off between point accuracy and probabilistic calibration that is obscured when only deterministic metrics are reported.

EnKF achieves the best point-prediction accuracy of the four methods, with NSE of 0.963, KGE of 0.938, and the lowest CRPS at 3.89. However, its coverage probability of 37.8 percent and spread-skill ratio of 0.44 indicate severe underdispersion. A well-calibrated ensemble should produce coverage near the nominal confidence level and spread-skill ratio near 1.0. EnKF’s ensemble is collapsing toward the mean, producing predictions that are overconfident relative to the true forecast error. This is a recognized failure mode of EnKF implementations without active ensemble maintenance, consistent with insufficient member perturbation or absent covariance inflation.
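The two calibration diagnostics used in this interpretation can be sketched as follows, assuming a central 95 percent interval for coverage and the ratio of root-mean ensemble variance to the RMSE of the ensemble mean for spread-skill. These are common definitions; the exact formulas in the project's metrics module are not reproduced in this report. The synthetic data below exist only to show how a collapsed ensemble appears in both scores.

```python
import numpy as np


def coverage_probability(members, obs, level=0.95):
    """Fraction of observations inside the central `level` ensemble interval."""
    lo, hi = np.quantile(members, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return float(np.mean((obs >= lo) & (obs <= hi)))


def spread_skill_ratio(members, obs):
    """Root-mean ensemble variance divided by RMSE of the ensemble mean."""
    spread = np.sqrt(np.mean(members.var(axis=0, ddof=1)))
    skill = np.sqrt(np.mean((members.mean(axis=0) - obs) ** 2))
    return float(spread / skill)


rng = np.random.default_rng(0)
T = 2000
obs = rng.normal(0.0, 1.0, size=T)                 # synthetic observations
calibrated = rng.normal(0.0, 1.0, size=(200, T))   # spread matches the error
collapsed = rng.normal(0.0, 0.1, size=(200, T))    # spread far too small

cov_ok = coverage_probability(calibrated, obs)
ssr_ok = spread_skill_ratio(calibrated, obs)
cov_bad = coverage_probability(collapsed, obs)
ssr_bad = spread_skill_ratio(collapsed, obs)
```

Both synthetic ensembles have the same mean error, so a deterministic score would not separate them; only the coverage and the spread-skill ratio reveal the collapse.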

GLUE is the best-calibrated method in the benchmark. Its coverage of 94.1 percent is close to the nominal 95 percent level, and its spread-skill ratio of 1.02 is near ideal. GLUE’s point accuracy (NSE of 0.861, KGE of 0.874) is meaningfully below EnKF’s, but its probabilistic reliability is the highest of the four methods tested.
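The likelihood-filtering step that gives GLUE its spread can be illustrated on a toy problem. Everything below is an assumption made for illustration: the linear forward model, the uniform priors, NSE as the informal likelihood, the 0.5 threshold, and the likelihood-weighted resampling used to reach 200 members. The project's actual forward model is WAM, and its likelihood and threshold are not stated in this report.

```python
import numpy as np


def nse(sim, obs):
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def glue_ensemble(forward, sample_prior, obs, n_samples=5000,
                  threshold=0.5, n_members=200, seed=42):
    """Monte Carlo sampling followed by a behavioral (likelihood) filter."""
    rng = np.random.default_rng(seed)
    params = sample_prior(rng, n_samples)               # (n_samples, n_params)
    sims = np.array([forward(p) for p in params])       # (n_samples, T)
    scores = np.array([nse(s, obs) for s in sims])

    keep = scores > threshold                           # behavioral subset
    if not keep.any():
        raise ValueError("no behavioral parameter sets; relax the threshold")

    # Assumption: draw members with probability proportional to the margin
    # by which each behavioral run exceeds the threshold.
    weights = scores[keep] - threshold
    weights = weights / weights.sum()
    idx = rng.choice(int(keep.sum()), size=n_members, replace=True, p=weights)
    return sims[keep][idx], scores[keep][idx]


# Toy setup: a straight line with two unknown parameters.
x = np.linspace(0.0, 1.0, 30)
obs = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0.0, 0.05, size=x.size)


def forward(p):
    return p[0] * x + p[1]


def sample_prior(rng, n):
    return np.column_stack([rng.uniform(0.0, 4.0, n), rng.uniform(0.0, 2.0, n)])


members, member_scores = glue_ensemble(forward, sample_prior, obs)
```

In a sketch like this the ensemble spread is set by the threshold: a looser threshold admits more diverse parameter sets and widens the bounds, so the threshold acts as a calibration control.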

Bootstrap is acceptably calibrated, with coverage of 89.5 percent and spread-skill ratio of 0.77, indicating mild underdispersion. Its point accuracy is comparable to GLUE’s.

LPU produces the lowest-quality probabilistic forecasts by a wide margin. Its coverage of 8.5 percent and spread-skill ratio of 0.014 indicate that its uncertainty bounds are effectively zero-width. The method is producing near-deterministic predictions despite nominally providing a 200-member ensemble. Its point accuracy (NSE of 0.864, KGE of 0.900) is reasonable, but the probabilistic output is not operationally useful in the current configuration.

Critical caveat on scope. These benchmark results apply to the 1999-2004 calibration window, which is the only period with observational data available for scoring. Bootstrap, GLUE, and LPU reanalysis outputs from 2005 onward are extrapolations without observational support, and uncertainty bounds in that period must be interpreted as unvalidated. Only EnKF extends meaningfully across the full 1999-2023 record via sparse-observation assimilation at the same 41 dates. Any operational use of these reanalysis products beyond 2004 requires either additional observational data or explicit acknowledgment of the extrapolation risk.

Bloom classifier results. The pre-trained Random Forest bloom classifier achieves a balanced accuracy of 0.887 on its held-out 2019+ test set of 259 weeks. When applied member-by-member to the UQ ensemble reanalysis products, the four methods produce nearly identical mean bloom probabilities, with no timesteps in the test period exceeding the 0.5 probability threshold. This result admits two possible interpretations that the current analysis cannot distinguish: either the test period genuinely contained no bloom events that the classifier would be expected to flag, or the classifier is insensitive to differences across the reanalysis inputs at this threshold. Resolving this ambiguity is identified as future work in Section 10.

8. Product Delivery

The deliverable transferred to Dr. Elshall comprises two connected components. The first is the reanalysis pipeline itself, implemented as a Python package (red_tide_reanalysis) with four independently usable UQ methods, shared data ingestion and QAQC layers, standardized CSV writers, deterministic and probabilistic scoring utilities, and an accompanying pytest-based test suite. The second is the ensemble datasets produced by running the four UQ methods on the Peace River observational and WAM data, delivered as CSV files with consistent member-ordering and naming conventions across methods.

The purpose of the deliverable is methodological rather than operational. The goal of the project was to test which UQ methods produce usable ensemble data, and to provide the research group with comparative ensembles produced under identical conditions by four distinct methods. The ensemble datasets are intended for use as input to downstream machine learning research, both for training new models and for evaluating how existing models respond to different characterizations of forecast uncertainty. The Random Forest bloom classifier analysis reported in Section 7 is one example of this kind of downstream evaluation; the ensembles are expected to support additional analyses of this form.

The comparative nature of the deliverable is itself a feature. Rather than committing the research group to a single UQ method, the delivered datasets allow downstream users to select the method whose characteristics best match their application: GLUE for well-calibrated probabilistic training data, EnKF for ensembles anchored tightly to observations at known dates, and Bootstrap as a general-purpose baseline. The LPU outputs are included for completeness and reproducibility but are not currently recommended for downstream use, given the spread collapse documented in Section 7, whose diagnosis is identified as future work in Section 10.

9. Conclusion

This project produced two connected deliverables: a Python pipeline implementing four ensemble-based uncertainty quantification methods on a common interface, and the ensemble datasets generated by that pipeline for the Peace River watershed over the 1999 to 2023 period. Both are transferred to Dr. Elshall’s research group for use as input to downstream machine learning research, with the goal of supporting future work that trains and evaluates red tide forecasting models against ensemble inputs constructed under identical conditions.

The central methodological finding is that deterministic accuracy alone is an insufficient basis for evaluating probabilistic forecasting methods, and that the four methods exhibit distinctly different trade-offs between point accuracy and ensemble calibration within the 1999 to 2004 calibration window. EnKF achieves the best point accuracy but is severely underdispersed. GLUE produces the best-calibrated uncertainty bounds at lower point accuracy (NSE of 0.861 against 0.963 for EnKF). Bootstrap is acceptably calibrated and accurate. LPU requires reconfiguration before its outputs are useful. These findings directly inform which ensembles are suitable for which downstream uses.

The project also identified a binding constraint on the scope of the current conclusions. With observations available only through 2004, uncertainty bounds from 2005 onward must be treated as unvalidated extrapolation for Bootstrap, GLUE, and LPU. Extending the observational record is the highest-priority improvement identified in Section 10. The Random Forest bloom classifier analysis, in which all four reanalysis inputs produced nearly identical sub-threshold bloom probabilities on the 2019+ test period, illustrates the intended use of the delivered datasets: comparative ensemble inputs support input-sensitivity analyses that would not be possible with a single UQ method’s output. This kind of downstream analysis is the direct motivation for the deliverable.

10. Future Work

Several directions for continued development are identified by the current analysis.

Expanded observational record. The binding constraint on the current results is the 41 aligned observation dates at USGS Station 02296750. Acquiring additional observational data, either through expanded grab-sample campaigns or by incorporating other stations with longer aligned records, would extend the calibration window beyond 2004 and allow validation of the reanalysis products in the 2005-2023 period.

EnKF ensemble collapse mitigation. The EnKF implementation exhibits severe underdispersion in the current configuration. Standard mitigations include multiplicative or additive covariance inflation, localization of the analysis covariance, and increased perturbation of ensemble members during the forecast step. Implementing and benchmarking these mitigations is a direct next step.
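Of the mitigations listed, multiplicative inflation is the simplest to sketch. The function name and the factor of 1.5 below are illustrative assumptions; an appropriate factor for this pipeline would have to be found by benchmarking.

```python
import numpy as np


def inflate_multiplicative(ensemble, factor=1.5):
    """Scale ensemble anomalies about the mean by `factor`.

    The ensemble mean is unchanged; the variance grows by factor ** 2.
    ensemble has shape (n_members, n_state).
    """
    ensemble = np.asarray(ensemble, dtype=np.float64)
    mean = ensemble.mean(axis=0, keepdims=True)
    return mean + factor * (ensemble - mean)


rng = np.random.default_rng(42)
collapsed = rng.normal(10.0, 0.2, size=(200, 3))
inflated = inflate_multiplicative(collapsed, factor=1.5)
```

Because the mean is preserved, inflation of this form leaves NSE and KGE of the ensemble mean untouched at the step where it is applied, and targets only the spread-related scores.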

LPU reconfiguration. The LPU implementation produces near-zero-width uncertainty bounds despite reasonable point accuracy, indicating that the parameter covariance estimate or Jacobian scaling is misconfigured. Diagnosing the source of the spread collapse and reconfiguring the method is required before LPU can be reported as a meaningful comparator to the other three methods.
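As a starting point for that diagnosis, the first-order propagation can be sketched using the covariance form cited in the References, pcov = σ²(JᵀJ)⁻¹, with a pseudo-inverse fallback for rank-deficient Jacobians. The function names, the residual-based estimate of σ², and the straight-line toy model are assumptions; this is not the project's LPU code.

```python
import numpy as np


def parameter_covariance(J, residuals):
    """pcov = sigma^2 (J^T J)^-1, with an SVD-based pseudo-inverse fallback."""
    J = np.asarray(J, dtype=np.float64)
    r = np.asarray(residuals, dtype=np.float64)
    n_obs, n_params = J.shape
    sigma2 = float(r @ r) / max(n_obs - n_params, 1)
    JTJ = J.T @ J
    if np.linalg.matrix_rank(J) < n_params:
        JTJ_inv = np.linalg.pinv(JTJ)      # rank-deficient: pseudo-inverse
    else:
        JTJ_inv = np.linalg.inv(JTJ)
    return sigma2 * JTJ_inv


def propagated_std(J_pred, pcov):
    """Per-timestep std from parameter uncertainty: sqrt(diag(J pcov J^T))."""
    var = np.einsum("ij,jk,ik->i", J_pred, pcov, J_pred)
    return np.sqrt(np.clip(var, 0.0, None))


# Toy model y = a * x + b, so the Jacobian columns are [x, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
J = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)
theta, *_ = np.linalg.lstsq(J, y, rcond=None)
residuals = y - J @ theta

pcov = parameter_covariance(J, residuals)
band = propagated_std(J, pcov)
rmse = float(np.sqrt(np.mean(residuals ** 2)))
ratio = float(band.mean() / rmse)   # parameter-only spread relative to error
```

In this toy case the parameter-only band is several times narrower than the residual RMSE, because J pcov Jᵀ describes uncertainty in the fitted mean and excludes the residual variance. Whether the same effect contributes to the project's LPU spread collapse is not established here; the ratio is offered only as one diagnostic to try.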

Bloom threshold interpretation. No test-period timesteps exceeded the 0.5 bloom probability threshold across any of the four reanalysis inputs. Determining whether this reflects true absence of bloom conditions in 2019 onward or insensitivity of the classifier to input variation is necessary before the combined reanalysis and classification pipeline can be deployed operationally. This analysis requires comparison against independent bloom occurrence records for the test period.

Regression testing for exploratory scripts. The Scripts/ directory is currently covered only by interactive execution. Adding pytest-based regression tests for the analysis and plotting utilities, particularly for the datetime parsing and flow-regime binning logic where defects have previously been identified and patched, would extend the testing safety net and prevent regression.

Real-time forecasting deployment. The current pipeline operates on historical data with a fixed train/test split. Adapting the system to ingest operational observations on a near-real-time basis, with appropriate handling of data latency and missing values, would be a necessary step toward operational forecast deployment.

11. References

No formal bibliography was maintained during the development of this project. The following are the foundational references for the methods used and are included to establish the standard citation context.

Beven, K., and Binley, A. (1992). The future of distributed models: Model calibration and uncertainty prediction. Hydrological Processes, 6(3), 279-298.

Bottcher, A. B., Hiscock, J. G., Pickering, N. B., and Jacobson, B. M. (2012). WAM: Watershed Assessment Model. Soil and Water Engineering Technology, Inc. [Reference requires confirmation with SWET or with Dr. Elshall for the current canonical citation.]

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Burgers, G., van Leeuwen, P. J., and Evensen, G. (1998). Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6), 1719-1724.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1-26.

Evensen, G. (1994). Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans, 99(C5), 10143-10162.

Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378.

Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of Hydrology, 377(1-2), 80-91.

Hersbach, H. (2000). Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5), 559-570.

Nash, J. E., and Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I: A discussion of principles. Journal of Hydrology, 10(3), 282-290.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. [Chapter 15 is the relevant reference for parameter covariance estimation via pcov = σ²(JᵀJ)⁻¹, as used in the LPU implementation.]

U.S. Geological Survey. National Water Information System (NWIS). Station 02296750: Peace River at Arcadia, FL. Retrieved [date required]. https://waterdata.usgs.gov/nwis/