Kathy Szigeti


2022

Current State of Microplastic Pollution Research Data: Trends in Availability and Sources of Open Data
Tia Jenkins, Bhaleka Persaud, Win Cowger, Kathy Szigeti, Dominique G. Roche, Erin Clary, Stephanie Slowinski, Benjamin Lei, Amila Abeynayaka, Ebenezer S. Nyadjro, Thomas Maes, Leah M. Thornton Hampton, Melanie Bergmann, Julian Aherne, Sherri A. Mason, John F. Honek, Fereidoun Rezanezhad, Amy Lusher, Andy M. Booth, Rodney D. L. Smith, Philippe Van Cappellen
Frontiers in Environmental Science, Volume 10

The rapid growth in microplastic pollution research is influencing funding priorities, environmental policy, and public perceptions of risks to water quality and environmental and human health. Ensuring that environmental microplastics research data are findable, accessible, interoperable, and reusable (FAIR) is essential to inform policy and mitigation strategies. We present a bibliographic analysis of data sharing practices in the environmental microplastics research community, highlighting the state of openness of microplastics data. A stratified (by year) random subset of 785 of 6,608 microplastics articles indexed in Web of Science indicates that, since 2006, less than a third (28.5%) contained a data sharing statement. These statements further show that most often, the data were provided in the articles’ supplementary material (38.8%) and only 13.8% via a data repository. Of the 279 microplastics datasets found in online data repositories, 20.4% presented only metadata with access to the data requiring additional approval. Although increasing, the rate of microplastic data sharing still lags behind that of publication of peer-reviewed articles on environmental microplastics. About a quarter of the repository data originated from North America (12.8%) and Europe (13.4%). Marine and estuarine environments are the most frequently sampled systems (26.2%); sediments (18.8%) and water (15.3%) are the predominant media. Of the accessible datasets, 15.4% and 18.2% do not have adequate metadata to determine the sampling location and media type, respectively. We discuss five recommendations to strengthen data sharing practices in the environmental microplastic research community.
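
The abstract mentions a random subset stratified by year but does not spell out the allocation rule. The sketch below illustrates one common choice, proportional allocation, under that assumption. The function name `stratified_sample`, the record layout, and the `seed` parameter are hypothetical and not taken from the article.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, total, seed=0):
    """Draw roughly `total` records, split across strata in proportion to size.

    `key` maps a record to its stratum (for example, its publication year).
    Proportional allocation is an assumption, not the article's stated method.
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    n = len(records)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # Share of the sample owed to this stratum, with at least one record
        # per stratum and never more than the stratum holds.
        k = max(1, round(total * len(members) / n))
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Because each stratum's share is rounded, the returned sample can differ from `total` by a few records. For scale, the subset reported in the abstract (785 of 6,608 articles) corresponds to a sampling fraction of about 11.9%.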

2021

Rescuing historical climate observations to support hydrological research
Ogundepo Odunayo, Naveela N. Sookoo, Gautam Bathla, Anthony Cavallin, Bhaleka Persaud, Kathy Szigeti, Philippe Van Cappellen, Jimmy Lin
Proceedings of the 21st ACM Symposium on Document Engineering

The acceleration of climate change and its impact highlight the need for long-term reliable climate data at high spatiotemporal resolution to answer key science questions in cold regions hydrology. Prior to the digital age, climate records were archived on paper. For example, from the 1950s to the 1990s, solar radiation data from recording stations worldwide were published in booklets by the former Union of Soviet Socialist Republics (USSR) Hydrometeorological Service. As a result, the data are not easily accessible by most researchers. The overarching aim of this research is to develop techniques to convert paper-based climate records into a machine-readable format to support environmental research in cold regions. This study compares the performance of a proprietary optical character recognition (OCR) service with an open-source OCR tool for digitizing hydrometeorological data. We built a digitization pipeline combining different image preprocessing techniques, semantic segmentation, and an open-source OCR engine for extracting data and metadata recorded in the scanned documents. Each page contains blocks of text with station names and tables containing the climate data. The process begins with image preprocessing to reduce noise and to improve quality before the page content is segmented to detect tables and finally run through an OCR engine for text extraction. We outline the digitization process and report on initial results, including different segmentation approaches, preprocessing image algorithms, and OCR techniques to ensure accurate extraction and organization of relevant metadata from thousands of scanned climate records. We evaluated the performance of Tesseract OCR and ABBYY FineReader on text extraction. We find that although ABBYY FineReader has better accuracy on the sample data, our custom extraction pipeline using Tesseract is efficient and scalable because it is flexible and allows for more customization.
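
The three-stage flow described in the abstract (preprocess, segment, then OCR) can be pictured with a toy sketch. This is not the authors' pipeline: fixed-threshold binarization stands in for their image preprocessing, a blank-row split stands in for semantic segmentation, and the `ocr` argument is a placeholder callable where an engine such as Tesseract would be plugged in. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def binarize(image: Image, threshold: int = 128) -> Image:
    """Preprocessing stand-in: mark dark pixels as ink (1), the rest as 0."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def segment_rows(binary: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Segmentation stand-in: split the page into blocks at blank rows.

    Returns half-open (start, end) row ranges that contain ink.
    """
    blocks = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new block begins
        elif not has_ink and start is not None:
            blocks.append((start, y))      # a blank row closes the block
            start = None
    if start is not None:
        blocks.append((start, len(binary)))
    return blocks


def digitize(image: Image, ocr: Callable[[Image], object]) -> list:
    """Run the three stages in order and collect one OCR result per block."""
    binary = binarize(image)
    return [ocr(binary[a:b]) for a, b in segment_rows(binary)]
```

Keeping the OCR engine behind a plain callable mirrors the flexibility the abstract attributes to the custom pipeline: each stage can be swapped without touching the others.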