Research Summary: Predicting E. coli Levels for Beaches at Presque Isle State Park

by Michael Rutter
— May 7, 2014

Presque Isle State Park (PISP) is located in Erie, Pa., and entertains more than 4.5 million visitors annually with seven miles of beaches along Lake Erie. One potential danger to swimmers at PISP and other Great Lakes beaches is the presence of bacteria in the water that can cause illness in humans. To protect swimmers from getting sick, PISP rangers monitor water quality, issue warnings and even close beaches when bacteria have the potential to reach dangerous levels. The bacteria that actually cause illnesses are varied and hard to detect; however, studies have shown that dangerous bacteria levels are highly correlated with levels of Escherichia coli, commonly known as E. coli. E. coli is commonly found in the digestive systems of humans and other warm-blooded organisms, as well as in soil and water. While E. coli is usually not the cause of illness in swimmers at beaches, it can be easily detected, and when E. coli levels are high, it is assumed that levels of other bacteria are also dangerous.

Beach at Presque Isle State Park near Erie, Pennsylvania. (Credit: Wikimedia Commons User Kraus via Creative Commons)

The most cost-effective current method of determining E. coli levels in water samples has the drawback of taking 24 hours for E. coli colony-forming units (CFU) to become numerous enough to count on an agar plate. While rapid methods for determining E. coli levels have been developed, the costs of implementation can be prohibitive, and other approaches are being explored to predict when dangerous levels of bacteria may occur. One such approach is the use of data-driven prediction models that use readily available weather and lake condition data to determine when conditions at the beach are favorable for high levels of E. coli. These models can make predictions before swimmers arrive at PISP, thus increasing the ability to safeguard human health by warning swimmers and closing beaches when necessary.

Methods

The presence of E. coli at PISP beaches is sometimes observed to correlate with heavy rain events, due to increased runoff and stream flow from the predominantly agricultural area to the west of PISP. Not every rain event results in high levels of E. coli, however, and high levels have also been observed after high wind events during hot, dry periods. To create a data-driven statistical model, weather-related data are automatically collected from Erie International Airport via the internet. Information about temperature, rainfall, wind speed, wind direction, barometric pressure and humidity is collected not only for the day before the prediction needs to be made, but for multiple days. From these data, additional variables of potential interest are calculated, such as the number of days since it last rained and the number of days since a high wind event. To monitor changes in stream flow, data from USGS stream gauges to the west of PISP are included in the data set. Additional data about current conditions in Lake Erie are taken from the Great Lakes Coastal Forecasting System (GLCFS), provided by the Great Lakes Environmental Research Lab. From the GLCFS, information about water temperature, lake water level, wave height, wave direction and current direction is used as input to the model. To calibrate the model, E. coli levels from plate counts collected since 2006 are used as the response variable, with counts of more than 235 colony-forming units per 100 ml of water (cfu/100 ml) considered a dangerous level.
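As a rough illustration of the derived variables and the response variable described above, the following Python sketch computes a "days since the last event" series and applies the 235 cfu/100 ml cutoff. The study itself used R; Python is used here only for illustration. The function names, the rain threshold of 0.0 inches and the sample rainfall values are assumptions made for this example, not details from the study.

```python
from typing import List, Optional

# Cutoff stated in the text: more than 235 cfu/100 ml is considered dangerous.
DANGER_THRESHOLD_CFU = 235


def days_since_event(daily_values: List[float], threshold: float) -> List[Optional[int]]:
    """For each day, count the days elapsed since a value last exceeded `threshold`.

    Days before the first recorded event get None, because the gap is unknown.
    (Hypothetical helper; the study does not describe its exact calculation.)
    """
    result: List[Optional[int]] = []
    last_event_index: Optional[int] = None
    for i, value in enumerate(daily_values):
        if value > threshold:
            last_event_index = i
        result.append(None if last_event_index is None else i - last_event_index)
    return result


def is_dangerous(cfu_per_100ml: float) -> bool:
    """Binary response variable: True when the plate count exceeds the cutoff."""
    return cfu_per_100ml > DANGER_THRESHOLD_CFU


# Made-up daily rainfall totals in inches, oldest day first.
rain_inches = [0.0, 0.4, 0.0, 0.0, 1.2, 0.0]
days_since_rain = days_since_event(rain_inches, threshold=0.0)
print(days_since_rain)
```

The same helper could be reused for high wind events by passing daily wind speeds and a wind speed threshold of the analyst's choosing.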

To predict when high levels of E. coli are a possibility at PISP beaches, a model based on classification trees is used. Similar to the dichotomous keys used to identify plants and animals, a classification tree is a series of computer-generated “yes/no” questions that can be used to determine whether there is a risk of dangerous levels of E. coli. For example, the first question asked may be “has the average wave height this morning exceeded 1.1 feet?” If yes, then a question about wind direction is asked; if not, a question about current direction is asked. The variable used in each question, the value the variable is compared to, and the number of questions used are determined by the statistical software package R to minimize the number of classification errors; that is, to minimize the number of days incorrectly classified as having low or dangerous levels of E. coli.

An example prediction tree diagram. (Credit: Michael Rutter)
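The chain of yes/no questions described above can be written out as nested conditions. The Python sketch below is hand-written for illustration only. The 1.1-foot wave height question comes from the example in the text, while the wind and current direction ranges, the field names and the outcome labels are hypothetical. In the actual model these splits are chosen by R, not by hand.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """Morning conditions for one beach day (field names are illustrative)."""
    wave_height_ft: float
    wind_direction_deg: float     # compass degrees, hypothetical convention
    current_direction_deg: float  # compass degrees, hypothetical convention


def classify(c: Conditions) -> str:
    """Walk a tiny hand-built classification tree and return 'dangerous' or 'low'."""
    # First question, taken from the example in the text.
    if c.wave_height_ft > 1.1:
        # Follow-up question on wind direction; the range is a made-up cutoff.
        if 180 <= c.wind_direction_deg <= 270:
            return "dangerous"
        return "low"
    # Otherwise a question on current direction; the range is a made-up cutoff.
    if 225 <= c.current_direction_deg <= 315:
        return "dangerous"
    return "low"


print(classify(Conditions(wave_height_ft=1.5, wind_direction_deg=200, current_direction_deg=0)))
print(classify(Conditions(wave_height_ft=0.5, wind_direction_deg=200, current_direction_deg=0)))
```

A fitted tree would typically ask more questions than this, but each path through it has the same shape: a sequence of comparisons ending in a class label.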

The probability that E. coli are present at dangerous levels is calculated using a random forest technique. Rather than creating one classification tree, 5,000 trees are created, each based on a randomly selected subset of the possible predictor variables. For a day on which a prediction needs to be made, weather and lake conditions are obtained from the internet, and the predicted level of E. coli is determined for each of the 5,000 trees. The percentage of the 5,000 trees classifying the observed conditions as those that will produce E. coli levels greater than 235 cfu/100 ml is taken as the probability that the observed conditions will result in dangerous levels of E. coli. This information is provided to PISP managers multiple times a day, and the results are used to help decide when warnings or beach closings are issued at Presque Isle State Park.
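The voting step can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the study's R model: each "tree" here asks only one question about one randomly chosen predictor, and each is fit to a bootstrap resample of the days, which is standard random forest practice but is not spelled out in the text. The toy data, column meanings and function names are all invented for the example.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Tree = Callable[[Row], bool]  # True means "predicted above 235 cfu/100 ml"


def fit_stump(rows: List[Row], labels: List[bool], feature: int) -> Tree:
    """Fit a one-question tree: keep the cutoff with the fewest training errors."""
    best_errors = len(rows) + 1
    best_cutoff = 0.0
    best_label_if_above = True
    for cutoff in sorted({row[feature] for row in rows}):
        for label_if_above in (True, False):
            errors = sum(
                ((row[feature] > cutoff) == label_if_above) != label
                for row, label in zip(rows, labels)
            )
            if errors < best_errors:
                best_errors, best_cutoff, best_label_if_above = errors, cutoff, label_if_above
    return lambda row: (row[feature] > best_cutoff) == best_label_if_above


def fit_forest(rows: List[Row], labels: List[bool], n_trees: int, rng: random.Random) -> List[Tree]:
    """Build many stumps, each on a bootstrap sample and one random predictor."""
    forest: List[Tree] = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # resample days with replacement
        feature = rng.randrange(len(rows[0]))           # random predictor subset of size one
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def probability_dangerous(forest: List[Tree], row: Row) -> float:
    """Fraction of trees voting 'dangerous' for the observed conditions."""
    return sum(tree(row) for tree in forest) / len(forest)


# Made-up calibration days: (wave height in feet, rainfall in inches).
rows = [(0.3, 0.0), (0.5, 0.1), (0.8, 0.0), (1.4, 0.9), (1.8, 1.2), (2.1, 0.7)]
labels = [False, False, False, True, True, True]

forest = fit_forest(rows, labels, n_trees=5000, rng=random.Random(0))
p_rough = probability_dangerous(forest, (2.0, 1.0))
p_calm = probability_dangerous(forest, (0.2, 0.0))
print(f"rough, rainy day: {p_rough:.2f}; calm, dry day: {p_calm:.2f}")
```

The reported probability is simply the share of trees that vote for the dangerous class, so it can be compared against whatever cutoff managers choose for issuing a warning.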

Implementation

Currently, the data-driven, random forest-based prediction model is being used by managers at Presque Isle State Park as one of many tools to help monitor beaches for public safety. Because the model is based entirely on data available from the internet, it can produce predictions earlier in the day than methods that require input derived from observations or samples. One drawback of random forest-based models is that a large number of observations is needed to properly calibrate the model. Even though data collected since 2006 are currently being used, the number of observations used in calibration is fewer than 325. After each summer swimming season, the E. coli plating results for the season are added to the model in an effort to increase accuracy during the next swimming season.

To test the accuracy of the model, leave-one-out cross-validation is applied. One day’s worth of data is removed from the data set used to calibrate the model, and the model is re-fit. Using the observed weather conditions for the removed day, the model predicts the E. coli level for that day. This is repeated for every day in the data set, and the accuracy of the model can then be determined. Based on this cross-validation method, the model currently has a 94% negative predictive value. That is, when the model predicts E. coli levels below the dangerous level (235 cfu/100 ml), it is correct 94% of the time, and the remaining 6% of those predictions are false negatives. On the flip side, the positive predictive value is only 50%; that is, half of the model’s predictions of dangerous levels are false positives. Additional research is ongoing in an effort to reduce the number of false positives, but PISP managers are able to use the model as an early-warning system, relying on other methods to determine whether conditions truly represent a danger to human health. As part of a suite of tools available to managers, the data-driven prediction model is helping make the beaches at Presque Isle State Park safer for all visitors.
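The leave-one-out procedure and the two predictive values can be sketched in Python as follows. To keep the example short, the model that is re-fit each round is a simple stand-in (a single wave height cutoff halfway between the class averages) and not the random forest used in the study. The data and all function names are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Model = Callable[[Row], bool]  # True means "predicted dangerous"


def fit_mean_cutoff(rows: List[Row], labels: List[bool]) -> Model:
    """Stand-in model: cutoff halfway between the class means of the first column."""
    high = [r[0] for r, y in zip(rows, labels) if y]
    low = [r[0] for r, y in zip(rows, labels) if not y]
    cutoff = (sum(high) / len(high) + sum(low) / len(low)) / 2
    return lambda row: row[0] > cutoff


def leave_one_out(rows: List[Row], labels: List[bool],
                  fit: Callable[[List[Row], List[bool]], Model]) -> List[bool]:
    """Hold out each day in turn, re-fit on the rest, and predict the held-out day."""
    predictions: List[bool] = []
    for i in range(len(rows)):
        model = fit(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(model(rows[i]))
    return predictions


def predictive_values(labels: List[bool], predictions: List[bool]) -> Tuple[float, float]:
    """Return (negative predictive value, positive predictive value)."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    tn = sum(not p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return npv, ppv


# Made-up days: wave height in feet, and whether the plate count exceeded 235 cfu/100 ml.
rows = [(0.3,), (0.5,), (0.8,), (1.5,), (1.4,), (1.8,), (2.1,)]
labels = [False, False, False, False, True, True, True]

predictions = leave_one_out(rows, labels, fit_mean_cutoff)
npv, ppv = predictive_values(labels, predictions)
print(f"NPV = {npv:.2f}, PPV = {ppv:.2f}")
```

In this toy data set, the one rough-water day with a low plate count is flagged as dangerous when it is held out, which is the kind of false positive that lowers the positive predictive value.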

Full study presented at the Tom Ridge Environmental Center 9th Annual Research Symposium, November 2013.