To demonstrate the fitness for purpose5 of the UGLI dataset, we explored how our dasymetric population estimation method might affect policy-relevant inference compared to previously published estimates. We investigated two key environmental hazards: wildfire and coastal flooding. In each case, raster data products are available and have recently been used to assess wildfire risk24 and flood risk25. The U.S. Forest Service has produced a Wildfire Hazard Potential (WHP) map for the contiguous United States at 270 m pixel resolution, with five categories of hazard26. The U.S. Federal Emergency Management Agency (FEMA) has released flood hazard data products for many U.S. counties, including the Water Surface Elevation (WSE) for a 1% annual-chance flood event (one expected to occur, on average, once every 100 years). The WSE product has a pixel resolution of 10 m27. Both products define the spatial area at risk, not the number of people within that area.
For each of the two hazard categories, we compared population estimates from four sources: (1) UGLI (this study), (2) the U.S. Environmental Protection Agency (EPA)28, (3) Microsoft29, and (4) Facebook9 (Table 2). Additionally, we compared all methods to a fifth, naive method that assumes individuals are evenly distributed across the geographic area of each census block group30. Each of these datasets is produced with a different method. Although all are freely available, only the EPA product can be updated by a third party. The EPA product uses EPA's Intelligent Dasymetric Mapping toolkit, which combines U.S. Census data with land cover and topography data28. The Microsoft product uses building footprints and their sizes to disaggregate census block group population estimates. The Facebook dataset uses artificial intelligence algorithms whose structure is largely undisclosed. We chose these population sources because they are all available for the United States and are provided at roughly the same spatial resolution as UGLI. The fifth method, which distributes the population evenly within each block group, served as the baseline against which all other methods were compared.
Selection of counties for the case study
For the wildfire case study, we took a random sample of 15 counties, stratified by population, from the eleven western states of the contiguous United States (WA, OR, CA, ID, NV, MT, WY, CO, UT, AZ, NM), sampling three counties in each population quintile. For the final visualization, we chose five counties with enough spatial variation in wildfire risk to differentiate the population methods. For the flood case study, we took a population-stratified random sample from all counties bordering a coastline of the contiguous United States for which flood hazard data products were available. Again, we chose five counties to display that best illustrate the differences among population methods.
We obtained the Wildfire Hazard Potential raster product for the entire contiguous United States and the water surface elevation product for the 1% flood event for each of the selected counties. We obtained the gridded population estimates for the three comparison methods as raster layers covering the contiguous United States (Table 2). Finally, we used previously obtained 2016 population estimates from the American Community Survey for each of the case study counties, along with the boundaries of each block group as a polygon layer (Table 1).
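As an illustration only, the population-stratified draw described above could be sketched as follows. The input format, function name, and synthetic county table are assumptions made for this example and are not taken from the code published with the dataset.

```python
import random

def stratified_sample(counties, n_strata=5, per_stratum=3, seed=0):
    """Sample counties at random within population quantile strata.

    counties: list of (county_id, population) tuples (hypothetical format).
    Returns one list of sampled counties per stratum, ordered from the
    least to the most populous stratum.
    """
    rng = random.Random(seed)
    ranked = sorted(counties, key=lambda c: c[1])
    n = len(ranked)
    samples = []
    for k in range(n_strata):
        # Integer bounds split the ranked list into near-equal strata.
        lo, hi = k * n // n_strata, (k + 1) * n // n_strata
        samples.append(rng.sample(ranked[lo:hi], per_stratum))
    return samples

# Synthetic stand-in for a county table: 100 counties, made-up populations.
counties = [(f"county_{i:03d}", 500 * (i + 1)) for i in range(100)]
picked = stratified_sample(counties)  # 5 quintiles x 3 counties = 15
```

Fixing the seed makes the draw repeatable, which matters if the case study counties need to be regenerated later.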
Initial raster processing
We clipped the wildfire hazard potential raster layer and the population raster layers (U.S. EPA31, Microsoft29, and Facebook15) to the extent of each case study county. Water surface elevation rasters were already provided at the single-county level. For simplicity, we converted the two environmental raster layers to binary form (i.e., at risk and not at risk). For the wildfire layer, we treated all pixels in the moderate, high, and very high hazard categories as at risk, and the rest as not at risk. For the flood layer, we treated all pixels with water surface elevation > 0 as at risk. Next, we converted the wildfire and flood rasters to polygons by merging all adjacent pixels with the same value into a single polygon. Finally, we transformed these polygon layers into the coordinate reference system of each of the population rasters.
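The binarization and merging steps can be illustrated with a minimal sketch on small in-memory grids. The class codes (1 = very low through 5 = very high), the use of None for missing data, and the choice of 4-connectivity are assumptions made for this example. An actual workflow would operate on georeferenced rasters with GIS software and would output polygon geometries rather than region labels.

```python
from collections import deque

# Assumed class codes for the wildfire raster (hypothetical encoding):
# 1 = very low, 2 = low, 3 = moderate, 4 = high, 5 = very high.
AT_RISK_FIRE_CLASSES = {3, 4, 5}

def binarize_fire(grid):
    """1 where the hazard class is moderate or higher, else 0."""
    return [[1 if v in AT_RISK_FIRE_CLASSES else 0 for v in row] for row in grid]

def binarize_flood(grid):
    """1 where water surface elevation > 0, else 0 (None = no data)."""
    return [[1 if (v is not None and v > 0) else 0 for v in row] for row in grid]

def label_regions(binary):
    """Group 4-connected pixels that share a value into numbered regions."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    region_value = {}  # region id -> binary value (1 = at risk)
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue  # pixel already belongs to a region
            next_id += 1
            value = binary[r][c]
            region_value[next_id] = value
            labels[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill over equal-valued neighbours
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx] and binary[ny][nx] == value):
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
    return labels, region_value

fire_binary = binarize_fire([[1, 2, 3], [1, 4, 5], [2, 2, 1]])
flood_binary = binarize_flood([[None, 0.0, 1.2], [0.0, 0.5, 2.0]])
labels, region_value = label_regions(fire_binary)
```

Each labelled region corresponds to one polygon in the vectorized layer, and `region_value` records whether that polygon is at risk.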
Estimated population totals in each risk category
We overlaid the polygon layers representing wildfire and flood risk on the population raster layers for each dasymetric estimation method. For each wildfire and flood polygon, we summed the population over all pixels contained within that polygon and then calculated grand totals for each risk category in each county.
For the block group population polygons, we calculated the area of intersection between each block group polygon and each wildfire or flood polygon. We multiplied the population total of each block group by the proportion of its area that overlapped the at-risk polygons to obtain the population at risk in that block group, and then summed these values to obtain totals for each county.
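A minimal sketch of this zonal summation, assuming the hazard polygons have already been rasterized onto the same grid as the population layer so that every pixel carries a polygon identifier. The function names and toy grids are hypothetical and serve only to show the arithmetic.

```python
def zonal_population(pop_grid, zone_grid):
    """Sum population over all pixels that fall in each zone (polygon id)."""
    totals = {}
    for pop_row, zone_row in zip(pop_grid, zone_grid):
        for pop, zone in zip(pop_row, zone_row):
            totals[zone] = totals.get(zone, 0.0) + pop
    return totals

def totals_by_risk(zone_totals, zone_risk):
    """Aggregate per-polygon sums into totals per risk category (0 or 1)."""
    out = {0: 0.0, 1: 0.0}
    for zone, total in zone_totals.items():
        out[zone_risk[zone]] += total
    return out

# Toy 3 x 3 county: population per pixel and the polygon id of each pixel.
pop_grid = [[10, 0, 5], [20, 2, 3], [0, 40, 0]]
zone_grid = [[1, 1, 2], [1, 2, 2], [1, 1, 1]]
zone_risk = {1: 0, 2: 1}  # polygon 2 is at risk, polygon 1 is not

zone_totals = zonal_population(pop_grid, zone_grid)
county_totals = totals_by_risk(zone_totals, zone_risk)
```

In this toy county, 10 of the 80 residents fall inside the at-risk polygon.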
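The area-weighted calculation for the evenly distributed baseline can be sketched as below. Axis-aligned rectangles stand in for real block group and hazard polygons, and the hazard rectangles are assumed not to overlap one another; both are simplifications made for this example. An actual implementation would compute polygon intersections with GIS software.

```python
def rect_area(rect):
    """Area of an (xmin, ymin, xmax, ymax) rectangle; 0 if it is empty."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def rect_intersection(a, b):
    """Intersection of two rectangles (may be empty, i.e. have zero area)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def population_at_risk(block_rect, block_pop, hazard_rects):
    """Block group population times the share of its area that is at risk.

    Assumes residents are spread evenly over the block group and that the
    hazard rectangles do not overlap one another.
    """
    overlap = sum(rect_area(rect_intersection(block_rect, h)) for h in hazard_rects)
    return block_pop * overlap / rect_area(block_rect)

# Two at-risk areas and two block groups: (rectangle, population).
hazards = [(3, 0, 6, 2), (0, 1.5, 1, 5)]
blocks = [((0, 0, 4, 2), 1000), ((4, 0, 8, 2), 400)]

per_block = [population_at_risk(rect, pop, hazards) for rect, pop in blocks]
county_total = sum(per_block)
```

The first block group has 2.5 of its 8 area units inside hazard areas, so 31.25% of its 1,000 residents are counted as at risk.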
The results of this technical validation confirm that our dataset provides estimates of the population at risk from floods and wildfires that are highly comparable to those from other dasymetric mapping products (Fig. 3). In the case of floods, all four population estimates were lower than the estimate from the naive method. Whereas the naive method distributes people evenly across each census block group, each of the dasymetric methods correctly identifies floodplains as locations with fewer people than upland areas. The consistency observed here indicates that the UGLI processing contains no evident logical errors and that it produces first-order results comparable to those of other methods. In the case of wildfire risk, we observed greater variation among population datasets. In three of the five counties, the difference was small. One of the remaining two counties has a very small population (750 people), which probably results in greater variance among the methods. Although there is no obvious reason why the population methods diverge in some counties, the divergence presumably stems from the model assumptions used in each case. Data are not available to determine which method produces the most accurate result.
Our dataset offers several advantages over other dasymetric population datasets, while sharing many of the same uncertainties. The code provided with this data publication can be rerun as new data become available and customized to include different data sources, and it is highly reproducible. Moreover, the dataset is produced in an intuitive way, which should facilitate communication with stakeholders. Future improvements could incorporate additional factors that influence where people live32. The uncertainty in the population estimates provided by UGLI likely stems from the underlying data (e.g., estimates of impervious surface) and assumptions (e.g., how population is distributed over impervious surfaces). The NLCD impervious surface data have a reported accuracy20 greater than 90%, which is excellent, and we expect the accuracy of UGLI to improve as the NLCD improves. The ability to map population at 30 m grain is essential not only for understanding populations at risk from natural disasters but also for identifying populations that benefit from ecosystem services such as access to water, trails, forest cover, and conservation lands. As more data with these qualities become available, we expect a greater diversity of uses to emerge.