Why Yuma County?
Yuma County, Arizona is one of the most agriculturally productive counties in the United States and one of the most water-stressed. It sits at the confluence of the Colorado and Gila Rivers, relies almost entirely on irrigation for its $3 billion annual crop output, and has experienced some of the most severe prolonged drought conditions in recorded history.
That combination — economic importance, extreme climate exposure, and rich historical data — made it an ideal study area for testing whether machine learning models can reliably distinguish drought from non-drought periods using observable climate variables, and more ambitiously, whether they can forecast drought conditions several weeks in advance.
The data pipeline
I integrated three federal datasets covering the same geographic footprint:
- NOAA climate records — daily temperature, precipitation, and humidity readings from weather stations in and around Yuma County, spanning multiple decades.
- USGS groundwater data — well-depth measurements at monitoring sites, used as a lagged indicator of sustained dry conditions that surface precipitation measures can miss.
- USCRN soil moisture and temperature — sub-surface readings at multiple depths, providing a direct measure of water availability to crops rather than relying purely on atmospheric proxies.
Merging these datasets required careful handling of different temporal resolutions (daily vs. weekly), missing values during equipment outages, and spatial alignment across monitoring sites that did not share coordinates. The final feature set comprised 40+ engineered features including rolling means, lag variables, and cross-sensor interaction terms.
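A minimal sketch of that merge step in pandas, using hypothetical column names (`tmax_c`, `precip_mm`, `well_depth_m`) and toy data in place of the real station feeds — the daily series is resampled to weekly to match the coarser groundwater cadence, the weekly series is joined with an as-of merge, and short gaps are forward-filled:

```python
import pandas as pd

# Hypothetical frames: daily NOAA readings and weekly USGS well depths.
noaa = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=28, freq="D"),
    "tmax_c": 25.0,
    "precip_mm": 0.0,
})
usgs = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=4, freq="W-MON"),
    "well_depth_m": [12.1, 12.3, 12.6, 12.9],
})

# Resample daily data to weekly means to match the coarser resolution,
# then as-of join the groundwater series and forward-fill short gaps.
weekly = noaa.set_index("date").resample("W-MON").mean().reset_index()
merged = pd.merge_asof(weekly.sort_values("date"),
                       usgs.sort_values("date"),
                       on="date", direction="backward")
merged["well_depth_m"] = merged["well_depth_m"].ffill()

# Engineered features: rolling means and lags of the weekly series.
merged["tmax_roll4"] = merged["tmax_c"].rolling(4, min_periods=1).mean()
merged["well_depth_lag1"] = merged["well_depth_m"].shift(1)
```

The real pipeline also had to reconcile spatial offsets between stations, which this sketch omits.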
The class imbalance problem
Extreme drought is, by definition, a rare event. In the historical record, drought conditions meeting the D3/D4 threshold (severe and exceptional drought on the USDM scale) occur in a relatively small fraction of weeks. A naive model that predicts "no drought" for every observation would achieve high accuracy simply by exploiting this imbalance — while being completely useless for the actual forecasting task.
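The failure mode is easy to demonstrate with synthetic labels (the 5% positive rate here is illustrative, not the actual D3/D4 frequency in the Yuma record):

```python
import numpy as np

# Illustrative labels: suppose ~5% of weeks meet the D3/D4 threshold.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)

# A "classifier" that always predicts "no drought" (class 0).
naive_pred = np.zeros_like(y)

accuracy = (naive_pred == y).mean()         # ~0.95, looks impressive
drought_recall = naive_pred[y == 1].mean()  # 0.0: catches no droughts
```

High accuracy, zero recall on the class that actually matters — which is why the evaluation below emphasizes minority-class precision and recall rather than accuracy.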
I addressed class imbalance using SMOTE (Synthetic Minority Over-sampling Technique) on the training set, combined with class-weight adjustments in the model training step. I evaluated all models on held-out data without any resampling to ensure performance estimates reflected real-world conditions.
Models compared
- PCA — used to explore feature structure and reduce dimensionality before supervised modeling. Revealed two dominant climate regimes corresponding to wet and dry seasons.
- Random Forest — strong baseline. Handled non-linear interactions between soil moisture and temperature well. Feature importance scores pointed to groundwater depth as the single most predictive variable.
- SVM — competitive on precision but sensitive to hyperparameter choices. Required a grid search over kernel type and regularization parameter C to avoid overfitting on the resampled training set.
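A sketch of that hyperparameter search with scikit-learn's `GridSearchCV`, again on synthetic stand-in data — the actual grid and scoring choices in the project may have differed, but scaling inside a pipeline and scoring on F1 (rather than accuracy) are the essential parts for an imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9], random_state=0)

# Scale features inside the pipeline so the kernel sees comparable ranges
# and scaling is re-fit within each cross-validation fold.
pipe = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Search kernel type and regularization strength C; F1 scoring keeps
# the search honest on the minority class.
param_grid = {"svc__kernel": ["linear", "rbf"],
              "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```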
Random Forest provided the strongest overall performance on the minority class (extreme drought weeks), with higher recall than SVM at comparable precision thresholds. The feature importance analysis was a useful secondary output: groundwater depth emerged as the top predictor, followed by a 30-day rolling soil moisture deficit and minimum daily temperature anomaly. This ordering makes physical sense — groundwater responds to sustained dry conditions over weeks and months, making it a more reliable signal than short-term precipitation gaps.
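Extracting that importance ranking is a one-liner on a fitted forest. The feature names below are hypothetical stand-ins echoing the predictors described above, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names mirroring the write-up's candidate predictors.
feature_names = ["well_depth", "soil_moisture_deficit_30d",
                 "tmin_anomaly", "precip_7d", "tmax"]

X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, random_state=1)
clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Impurity-based importances, sorted most to least predictive.
order = np.argsort(clf.feature_importances_)[::-1]
for i in order:
    print(f"{feature_names[i]}: {clf.feature_importances_[i]:.3f}")
```

Impurity-based importances can be biased toward high-cardinality features; scikit-learn's `permutation_importance` is a common cross-check.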
What PCA contributed beyond dimensionality reduction
Beyond its use as a preprocessing step, the PCA analysis surfaced something interpretively valuable. The first two principal components together explained roughly 65% of variance in the climate feature set, and plotting county-weeks in PC space revealed clear seasonal structure — a predictable annual arc — with drought periods clustering in a distinct region of that space, separated from normal conditions along the second principal component.
This told me that the supervised problem was geometrically tractable: drought periods are not randomly scattered through climate feature space. They occupy a consistent region that a linear or moderately non-linear boundary can capture. That finding increased my confidence in the supervised results and helped explain why Random Forest (which partitions feature space with axis-aligned splits) performed well without needing deep trees.
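The projection itself is straightforward: standardize, fit PCA, and inspect the explained-variance ratios before plotting the scores. This sketch uses random data with one artificially correlated pair standing in for the real climate matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized climate feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
X[:, 0] = 3 * X[:, 1] + rng.normal(scale=0.1, size=500)  # correlated pair

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)           # county-weeks in PC space

print(pca.explained_variance_ratio_)    # share of variance per component
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by drought label, is what revealed the seasonal arc and the drought cluster described above.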
What I would do differently
The most significant gap in this analysis is temporal leakage. Splitting by random sample rather than time means that features from weeks adjacent to a drought event (which share correlated climate patterns) may appear in both training and test sets. A proper evaluation requires a strict chronological split — train on years 1–N, evaluate on years N+1 and beyond. I would implement this as the first change in a revised version of the project.
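The fix is mechanically simple — split on the time axis rather than at random. A minimal version with a hypothetical `year` column (scikit-learn's `TimeSeriesSplit` generalizes this to rolling multi-fold evaluation):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly frame: a year column plus engineered features.
df = pd.DataFrame({
    "year": np.repeat(np.arange(2000, 2020), 52),
    "feature": np.random.default_rng(0).normal(size=20 * 52),
})

# Strict chronological split: train on 2000-2015, evaluate on 2016 onward.
cutoff = 2016
train = df[df["year"] < cutoff]
test = df[df["year"] >= cutoff]

assert train["year"].max() < test["year"].min()  # no temporal overlap
```

Note that lagged and rolling features must also be computed without looking past the cutoff, or the leakage simply re-enters through feature engineering.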
I would also experiment with time-series-specific models — LSTMs or temporal convolutional networks — that can explicitly leverage the sequential structure of climate data rather than relying on manually engineered lag features to approximate temporal dependencies.