The Safety Data Lake: Enabling Analysis of Disparate Safety Data Sets

In July 2018, the UL Standards & Engagement Data Science team introduced a new portal that consolidates open data from multiple safety-focused databases into a single environment. Named the Safety Data Lake, the platform simplifies access to these sources, which are curated for content and quality, to support product safety analysis and research.

Amalgamating data from disparate sources

The Data Lake contains information from the U.S. Consumer Product Safety Commission (CPSC) SaferProducts.gov website and the National Electronic Injury Surveillance System (NEISS). Additional data sets include recall data from the EU’s Safety Gate, the FDA’s medical device incident database (Manufacturer and User Facility Device Experience, or MAUDE), and Pipeline and Hazardous Materials Safety Administration (PHMSA) incident data. The PHMSA data is part of ongoing Underwriters Laboratories research into lithium-ion battery incidents on aircraft.

The site provides data visualizations and summary statistics for each of the data files, with the ability to search and query the data in straightforward ways (keyword search, for example). Data Lake visitors can use an API to bring the data into Excel spreadsheets and Power BI for further analysis. The tool was originally launched as an internal resource for UL Standards & Engagement and UL Research Institutes employees but is now publicly available at https://opendata.ul.org/.
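
To make the idea of a keyword search concrete, here is a minimal Python sketch that runs over a few in-memory records. It does not call the Safety Data Lake API; the sample records, the field names ("source", "narrative") and the keyword_search helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical sample records. The field names are assumptions for
# illustration and do not reflect the portal's actual schema.
RECORDS = [
    {"source": "NEISS", "narrative": "Patient burned hand on space heater"},
    {"source": "SaferProducts.gov", "narrative": "Battery in hoverboard overheated and caught fire"},
    {"source": "PHMSA", "narrative": "Lithium-ion battery smoke event in aircraft cabin"},
]


def keyword_search(records, keyword, field="narrative"):
    """Return the records whose text field contains the keyword, ignoring case."""
    needle = keyword.casefold()
    return [r for r in records if needle in str(r.get(field, "")).casefold()]


hits = keyword_search(RECORDS, "battery")
for hit in hits:
    print(hit["source"], "|", hit["narrative"])
```

A simple substring match like this is only a starting point; a production search would typically add tokenization, stemming and ranking.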

If the phrase “data lake” is new to you, you’re not alone. It’s an emerging concept within the data science community. Unlike a traditional data warehouse, which is highly structured, a data lake imposes minimal structure. Data from different sources can be intermingled so that tools such as machine learning algorithms can work on the whole set rather than on structured subsets. Ultimately, it is a more fluid, flexible and user-driven platform.
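
One way to picture the difference is "schema-on-read": records are stored as they arrive, and structure is applied only when they are read. The Python sketch below is an illustration under that assumption, not a description of how the Safety Data Lake is built; the raw records and the read_text helper are hypothetical.

```python
import json

# Hypothetical raw records from two sources with different shapes,
# stored as-is rather than forced into one table layout.
RAW_LAKE = [
    '{"source": "recalls", "product": "space heater", "hazard": "fire"}',
    '{"source": "injuries", "narrative": "burn from space heater", "age": 34}',
]


def read_text(raw):
    """Apply structure at read time: join every string value except the source tag."""
    record = json.loads(raw)
    return " ".join(
        value
        for key, value in record.items()
        if key != "source" and isinstance(value, str)
    )


# Both records can now be analyzed together despite their different fields.
texts = [read_text(raw) for raw in RAW_LAKE]
print(texts)
```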

Where this work will take us

Future capabilities envisioned for the Safety Data Lake include natural language processing of incident narratives based on user-defined subjects, search across all data sets simultaneously, and recommendation engines that sort through the data to find incidents similar to the one being viewed. The team aims to include additional CPSC data (violations, fines, etc.) and plans to extend the concept to other sources of open data, such as the National Fire Incident Reporting System (NFIRS), the European Injury Data Base, and more.
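
As a rough illustration of what finding similar incidents could look like, the Python sketch below ranks narratives by Jaccard similarity, the share of words two texts have in common. This is one simple technique picked for the example, not the team's planned approach, and the sample narratives and function names are hypothetical.

```python
import re


def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(query, narratives, top_n=2):
    """Return the top_n (score, narrative) pairs most similar to the query."""
    query_tokens = tokens(query)
    scored = [(jaccard(query_tokens, tokens(n)), n) for n in narratives]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical incident narratives, for illustration only.
NARRATIVES = [
    "Laptop battery overheated and caught fire",
    "Child fell from bunk bed",
    "Phone battery overheated while charging",
]

results = most_similar("battery overheated in luggage", NARRATIVES)
for score, narrative in results:
    print(f"{score:.2f}  {narrative}")
```

Word overlap ignores synonyms and word order, so a real recommendation engine would likely rely on richer text representations.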

Finally, the team is investigating other potentially public-facing platforms that would make safety data easier to use for comprehensive research and analysis.

Interested in learning more about the Safety Data Lake and the Data Science team? Contact us.

Fast Facts

A data lake intermingles multiple sources of data to allow for simultaneous analysis of the entire set.
The Safety Data Lake comprises data from seven unique sources, with additional data sources and capabilities to be added over time.