San Diego Beachwatch Data, With Features

Water quality data for San Diego County beaches from CEDEN, with added features for log transformation, quantiles and group codes.




This dataset rebuilds the source data with constant and null columns removed and many features added. It also breaks out station information into a separate dataset and enumerates the many different combinations of methodname/analyte/unit, adding a code for each combination to the dataset in measure_code. The measure code identifies sets of records that have compatible measurements.
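As a rough illustration of how a code per methodname/analyte/unit combination can be assigned, here is a minimal pandas sketch. The sample rows are made up, and numbering the combinations with `ngroup()` is an assumption; the codes in the published dataset (such as 24) were not necessarily assigned this way.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns and records.
df = pd.DataFrame({
    "methodname": ["Enterolert", "Enterolert", "SM 9222 B", "Enterolert"],
    "analyte": ["Enterococcus", "Enterococcus", "Coliform, Total", "Enterococcus"],
    "unit": ["MPN/100 mL", "MPN/100 mL", "cfu/100mL", "MPN/100 mL"],
    "result": [10.0, 20.0, 200.0, 5.0],
})

group_cols = ["methodname", "analyte", "unit"]

# ngroup() numbers each distinct combination 0, 1, 2, ... in sorted key order.
df["measure_code"] = df.groupby(group_cols).ngroup()

# Lookup table with one row per code, useful for decoding measure_code later.
measure_codes = (
    df[group_cols + ["measure_code"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(measure_codes)
```

Records that share a measure_code use the same method, analyte and unit, so their result values can be compared directly.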

The dataset adds the count, mean, median and quantiles for each station_code/measure_code group: rows are grouped by station and measure code, and the statistics are computed within each group. The procedure is performed for both result and lresult, the log of result.
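The grouping step can be sketched with pandas `groupby` and `transform`, which attach each group's statistic to every row of that group. This is only a sketch: the sample values, the output column names (`result_count`, `result_q25`, and so on), the choice of the 25th and 75th percentiles, and the use of the natural log for lresult are all assumptions, not taken from the package's build code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two stations sharing one measure code.
df = pd.DataFrame({
    "station_code": ["A", "A", "A", "A", "B", "B"],
    "measure_code": [24, 24, 24, 24, 24, 24],
    "result": [1.0, 10.0, 100.0, 1000.0, 10.0, 1000.0],
})

# Assumption: natural log. Zero results would need special handling (log(0) is -inf).
df["lresult"] = np.log(df["result"])

keys = ["station_code", "measure_code"]

for col in ["result", "lresult"]:
    g = df.groupby(keys)[col]
    # transform() returns a value for every row, aligned with the original index.
    df[f"{col}_count"] = g.transform("count")
    df[f"{col}_mean"] = g.transform("mean")
    df[f"{col}_median"] = g.transform("median")
    df[f"{col}_q25"] = g.transform(lambda s: s.quantile(0.25))
    df[f"{col}_q75"] = g.transform(lambda s: s.quantile(0.75))

print(df[["station_code", "result", "result_median", "result_q25", "result_q75"]])
```

Keeping the group statistics on every row, rather than in a separate summary table, makes it straightforward to compare each measurement with its own group's summary values.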

After computing the group summary statistics, the processing creates dichotomous features that compare result and lresult to each summary value, including:

  • Greater than the median
  • Greater than the mean
  • Less than or equal to the 25th percentile
  • Greater than or equal to the 75th percentile

These variables are particularly useful for doing logistic regressions across the measure code groups or stations.
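The dichotomous features can be sketched as 0/1 flags derived from the group statistics. The flag names below are hypothetical, the data is made up, and the upper cutoff is assumed to be the 75th percentile; only result is shown, but the same steps would apply to lresult.

```python
import pandas as pd

# Hypothetical sample data: two stations sharing one measure code.
df = pd.DataFrame({
    "station_code": ["A", "A", "A", "A", "B", "B"],
    "measure_code": [24, 24, 24, 24, 24, 24],
    "result": [1.0, 10.0, 100.0, 1000.0, 10.0, 1000.0],
})

g = df.groupby(["station_code", "measure_code"])["result"]

# Per-row copies of each group's summary statistics.
median = g.transform("median")
mean = g.transform("mean")
q25 = g.transform(lambda s: s.quantile(0.25))
q75 = g.transform(lambda s: s.quantile(0.75))

# 1 when the condition holds for the row, 0 otherwise.
df["result_gt_median"] = (df["result"] > median).astype(int)
df["result_gt_mean"] = (df["result"] > mean).astype(int)
df["result_lte_25pctl"] = (df["result"] <= q25).astype(int)
df["result_gte_75pctl"] = (df["result"] >= q75).astype(int)

print(df)
```

Each flag is a binary outcome relative to the record's own station and measure code group, which is the form a logistic regression expects for its dependent variable.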

Elided Columns

This dataset excludes the constant and empty columns from the source dataset. The constant columns and their values are:

program                      BeachWatch
parentproject                BeachWatch_San Diego County
project                      BeachWatch_San Diego County
locationcode                 SurfZone
collectiondepth              -88
unitcollectiondepth          NR
sampletypecode               Grab
collectionreplicate          1
resultsreplicate             1
labsampleid                  Not Recorded
matrixname                   samplewater
mdl                          -88
rl                           -88
batchverification            NR
compliancecode               NR
eventcode                    WQ
protocolcode                 Not Recorded
collectionmethodname         Water_Grab
collectiondevicedescription  Not Recorded
calibrationdate              0000-00-00
positionwatercolumn          Not Recorded
preppreservationname         Not Recorded
preppreservationdate         0000-00-00 00:00:00
digestextractmethod          Not Recorded
digestextractdate            0000-00-00
analysisdate                 0000-00-00
dilutionfactor               -88
expectedvalue                0
submissioncode               NR
county                       San Diego
county_fips                  73
regional_board               San Diego
rb_number                    9
sampleid                     Not Recorded

The dataset also excludes these Null columns:

  • observation
  • samplecomments
  • collectioncomments
  • resultscomments
  • batchcomments
  • groupsamples
  • occupationmethod
  • startingbank
  • distancefrombank
  • unitdistancefrombank
  • streamwidth
  • unitstreamwidth
  • stationwaterdepth
  • unitstationwaterdepth
  • hydromod
  • hydromodloc
  • locationdetailwqcomments
  • channelwidth
  • upstreamlength
  • downstreamlength
  • totalreach
  • locationdetailbacomments
  • huc8
  • huc8_number
  • huc10
  • huc10_number
  • huc12
  • huc12_number
  • waterbody_type


The most prevalent measure code in this dataset is 24, for Enterococcus (analyte) measured with Enterolert (methodname) in units of MPN/100 mL. This is probably because in 2004 the EPA changed its recommendations to use Enterococcus as a primary indicator bacteria in coastal waters:

EPA subsequently recommended the use of E. coli or enterococci for fresh
recreational waters and enterococci for marine recreational waters because
levels of these organisms more accurately predict acute gastrointestinal
illness than levels of fecal coliforms.
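To check which measure code is most prevalent, the codes can be counted directly. A minimal sketch follows, using made-up values in place of the real measure_code column:

```python
import pandas as pd

# Hypothetical measure_code values standing in for the real column.
df = pd.DataFrame({"measure_code": [24, 24, 7, 24, 12, 7]})

counts = df["measure_code"].value_counts()  # record count per code
most_common = counts.idxmax()               # the code with the most records

# Keep only records for the most prevalent code.
subset = df[df["measure_code"] == most_common]
print(most_common, len(subset))
```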



Accessing Packages in Metapack

import metapack as mp
# ZIP Package
pkg = mp.open_package('')
# CSV Package
pkg = mp.open_package('') 

resource = pkg.resource('resource_name') # Get a resource
df = resource.dataframe() # Create a pandas Dataframe
gdf = resource.geoframe() # Create a GeoPandas GeoDataFrame


URLs used in the creation of this data package.

  • Beachwatch source data

Last Modified 2018-08-10T22:40:33