MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing users to load the link data and entity embeddings effortlessly. Using MMEAD takes only a few lines of code. Finally, we show how MMEAD can be used for IR research that uses entity information. We show how to improve recall@1000 and MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this resource. We also demonstrate how entity expansions can be used for interactive search applications.
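The two metrics named above, recall@1000 and MRR@10, can be sketched in a few lines. This is a minimal illustration, not MMEAD's or the official MS MARCO evaluation code: the function names, the `run` and `qrels` dictionaries, and the toy identifiers are all hypothetical, and the sketch averages over the queries present in the run.

```python
from typing import Dict, List, Set


def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(ranked[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run) if run else 0.0


def recall_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 1000) -> float:
    """Average fraction of each query's relevant passages found in the top k."""
    scores = []
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        if not relevant:
            continue  # skip queries without any judged relevant passage
        found = len(relevant.intersection(ranked[:k]))
        scores.append(found / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): ranked passage ids per query, and relevance judgments.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d2", "d8"}}

print(mrr_at_k(run, qrels, k=10))
print(recall_at_k(run, qrels, k=1000))
```

Entity-based query expansion would change the contents of `run`; the metric code stays the same, which makes before-and-after comparisons straightforward.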
Lexical and semantic matching capture different successful approaches to text retrieval and the fusion of their results has proven to be more effective and robust than either alone. Prior work performs hybrid retrieval by conducting lexical and semantic matching using different systems (e.g., Lucene and Faiss, respectively) and then fusing their model outputs. In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs). Our experiments show that DLRs can effectively approximate the original lexical representations, preserving effectiveness while improving query latency. Furthermore, we can combine dense lexical and semantic representations to generate dense hybrid representations (DHRs) that are more flexible and yield faster retrieval compared to existing hybrid techniques. In addition, we explore jointly training lexical and semantic representations in a single model and empirically show that the resulting DHRs are able to combine the advantages of the individual components. Our best DHR model is competitive with state-of-the-art single-vector and multi-vector dense retrievers in both in-domain and zero-shot evaluation settings. Furthermore, our model is both faster and requires smaller indexes, making our dense representation framework an attractive approach to text retrieval. Our code is available at https://github.com/castorini/dhr.
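To make the idea of densifying a lexical vector concrete, here is one plausible sketch. The abstract does not spell out the procedure, so everything below is an assumption for illustration: the vocabulary-sized vector is cut into contiguous slices, each slice is max-pooled, and scoring uses an inner product that counts a slice only when query and document kept the same position. The names `densify` and `gated_dot` are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple

Dense = Tuple[List[float], List[int]]  # (pooled values, position kept per slice)


def densify(sparse: List[float], num_slices: int) -> Dense:
    """Max-pool a vocabulary-sized vector over contiguous, equal-width slices."""
    if num_slices <= 0 or len(sparse) % num_slices != 0:
        raise ValueError("vector length must be a positive multiple of num_slices")
    width = len(sparse) // num_slices
    values, indices = [], []
    for s in range(num_slices):
        chunk = sparse[s * width:(s + 1) * width]
        best = max(range(width), key=lambda i: chunk[i])  # first argmax in the slice
        values.append(chunk[best])
        indices.append(best)
    return values, indices


def gated_dot(query: Dense, doc: Dense) -> float:
    """Inner product that counts a slice only if both sides kept the same term."""
    (q_vals, q_idx), (d_vals, d_idx) = query, doc
    return sum(q * d for q, d, i, j in zip(q_vals, d_vals, q_idx, d_idx) if i == j)


# Toy term-weight vectors over a six-term vocabulary (hypothetical values).
query_vec = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
doc_vec = [0.5, 1.5, 3.0, 0.0, 2.0, 0.0]

approx = gated_dot(densify(query_vec, 3), densify(doc_vec, 3))
exact = sum(q * d for q, d in zip(query_vec, doc_vec))
print(approx, exact)
```

In this toy case the pooled score equals the exact sparse inner product. In general it is only an approximation, because each slice keeps a single term and drops the rest.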
We present a hybrid text and geospatial search application for hydrographic datasets built on the open-source Lucene search library. Our goal is to demonstrate that it is possible to build custom GIS applications by integrating existing open-source components and data sources, which contrasts with existing approaches based on monolithic platforms such as ArcGIS and QGIS. Lucene provides rich index structures and search capabilities for free text and geometries; the former has already been integrated and exposed via our group's Anserini and Pyserini IR toolkits. In this work, we extend these toolkits to include geospatial capabilities. Combining knowledge extracted from Wikidata with the HydroSHEDS dataset, our application enables text and geospatial search of rivers worldwide.
Ten best practices to strengthen stewardship and sharing of water science data in Canada
K. A. Dukacz,
Gopal Chandra Saha,
Jason J. Venkiteswaran,
Homa Kheyrollah Pour,
Brent B. Wolfe,
Sean K. Carey,
John W. Pomeroy,
C. M. DeBeer,
J. M. Waddington,
Philippe Van Cappellen,
Hydrological Processes, Volume 35, Issue 11
Water science data are a valuable asset that both underpins the original research project and bolsters new research questions, particularly in view of the increasingly complex water issues facing Canada and the world. Whilst there is general support for making data more broadly accessible, and a number of water science journals and funding agencies have adopted policies that require researchers to share data in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles, there are still questions about effective management of data to protect their usefulness over time. Incorporating data management practices and standards at the outset of a water science research project will enable researchers to efficiently locate, analyze and use data throughout the project lifecycle, and will ensure the data maintain their value after the project has ended. Here, some common misconceptions about data management are highlighted, along with insights and practical advice to assist established and early career water science researchers as they integrate data management best practices and tools into their research. Freely available tools and training opportunities made available in Canada through Global Water Futures, the Portage Network, Gordon Foundation's DataStream, Compute Canada, and university libraries, among others, are compiled. These include webinars, training videos, and individual support for the water science community that together enable researchers to protect their data assets and meet the expectations of journals and funders. The perspectives shared here have been developed as part of the Global Water Futures programme's efforts to improve data management and promote the use of common data practices and standards in the context of water science in Canada. Ten best practices are proposed that may be broadly applicable to other disciplines in the natural sciences and can be adopted and adapted globally.
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
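As a rough illustration of what BM25 scoring over bag-of-words representations involves, the following sketch implements a textbook BM25 variant in plain Python. It does not use Pyserini or Lucene and may differ in detail from their scoring. The function name, the toy documents, and the parameter values `k1=0.9` and `b=0.4` are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 0.9, b: float = 0.4) -> Dict[str, float]:
    """Score every document against the query with a textbook BM25 formula.

    Assumes a non-empty collection of pre-tokenized documents.
    """
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for docid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[docid] = score
    return scores


# Toy collection (hypothetical), already tokenized and lowercased.
docs = {
    "d1": ["river", "flow", "data"],
    "d2": ["neural", "ranking", "model"],
    "d3": ["river", "river", "basin"],
}
scores = bm25_scores(["river"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A dense retriever would replace the term statistics with nearest-neighbor search over encoded vectors, and a hybrid system would combine the two score lists.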
Great Lakes Runoff Intercomparison Project Phase 3: Lake Erie (GRIP-E)
Bryan A. Tolson,
Helen C. Shen,
Tricia A. Stadnyk,
Lauren M. Fry,
Emily A. Bradley,
André Guy Tranquille Temgoua,
N. B. Basu,
Narayan Kumar Shrestha,
James R. Craig,
Journal of Hydrologic Engineering, Volume 26, Issue 9
Abstract. Hydrologic model intercomparison studies help to evaluate the agility of models to simulate variables such as streamflow, evaporation, and soil moisture. This study is the third in a sequen...
Accurate streamflow prediction largely relies on historical meteorological records and streamflow measurements. For many regions, however, such data are only scarcely available. Facing this problem, many studies simply trained their machine learning models on the region's available data, leaving possible repercussions of this strategy unclear. In this study, we evaluate the sensitivity of tree- and LSTM-based models to limited training data, both in terms of geographic diversity and different time spans. We feed the models meteorological observations disseminated with the CAMELS dataset, and individually restrict the training period length, number of training basins, and input sequence length. We quantify how additional training data improve predictions and how many previous days of forcings we should feed the models to obtain best predictions for each training set size. Further, our findings show that tree- and LSTM-based models provide similarly accurate predictions on small datasets, while LSTMs are superior given more training data.
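The input sequence length mentioned above determines how many previous days of forcings each training sample contains. The sketch below shows one way such samples might be built. It is not the study's code: the function name and toy values are hypothetical, and the choice to let each window end on the target day itself is an assumption.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[List[float]], float]


def make_windows(forcings: Sequence[Sequence[float]],
                 discharge: Sequence[float],
                 seq_len: int) -> List[Sample]:
    """Pair each target day with the seq_len days of forcings ending on it."""
    if len(forcings) != len(discharge):
        raise ValueError("forcings and discharge must cover the same days")
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    samples = []
    # The first seq_len - 1 days lack enough history and yield no sample.
    for t in range(seq_len - 1, len(discharge)):
        window = [list(day) for day in forcings[t - seq_len + 1:t + 1]]
        samples.append((window, discharge[t]))
    return samples


# Toy series (hypothetical): one forcing variable per day, four days.
forcings = [[1.0], [2.0], [3.0], [4.0]]
discharge = [10.0, 20.0, 30.0, 40.0]

samples = make_windows(forcings, discharge, seq_len=3)
print(len(samples), samples[0])
```

Restricting the training period or the number of basins then amounts to slicing the input series or the list of basins before calling the function, while `seq_len` is varied independently.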
The acceleration of climate change and its impact highlight the need for long-term reliable climate data at high spatiotemporal resolution to answer key science questions in cold regions hydrology. Prior to the digital age, climate records were archived on paper. For example, from the 1950s to the 1990s, solar radiation data from recording stations worldwide were published in booklets by the former Union of Soviet Socialist Republics (USSR) Hydrometeorological Service. As a result, the data are not easily accessible by most researchers. The overarching aim of this research is to develop techniques to convert paper-based climate records into a machine-readable format to support environmental research in cold regions. This study compares the performance of a proprietary optical character recognition (OCR) service with an open-source OCR tool for digitizing hydrometeorological data. We built a digitization pipeline combining different image preprocessing techniques, semantic segmentation, and an open-source OCR engine for extracting data and metadata recorded in the scanned documents. Each page contains blocks of text with station names and tables containing the climate data. The process begins with image preprocessing to reduce noise and to improve quality before the page content is segmented to detect tables and finally run through an OCR engine for text extraction. We outline the digitization process and report on initial results, including different segmentation approaches, image preprocessing algorithms, and OCR techniques to ensure accurate extraction and organization of relevant metadata from thousands of scanned climate records. We evaluated the performance of Tesseract OCR and ABBYY FineReader on text extraction. We find that although ABBYY FineReader has better accuracy on the sample data, our custom extraction pipeline using Tesseract is efficient and scalable because it is flexible and allows for more customization.
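The abstract does not say how OCR accuracy was measured. One common choice for comparing OCR engines is the character error rate (CER), the edit distance between the OCR output and a manually transcribed reference, divided by the reference length. The sketch below assumes that measure purely for illustration; the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical table row: the OCR output confuses the digit 0 with the letter O.
reference = "12.5 14.0"
ocr_output = "12.5 14.O"
print(character_error_rate(reference, ocr_output))
```

For tabular climate data, a per-cell exact-match rate may be a useful complement, since a single wrong digit changes the recorded value.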
Abstract. Long Short-Term Memory (LSTM) networks have been applied to daily discharge prediction with remarkable success. Many practical applications, however, require predictions at more granular timescales. For instance, accurate prediction of short but extreme flood peaks can make a lifesaving difference, yet such peaks may escape the coarse temporal resolution of daily predictions. Naively training an LSTM on hourly data, however, entails very long input sequences that make learning difficult and computationally expensive. In this study, we propose two multi-timescale LSTM (MTS-LSTM) architectures that jointly predict multiple timescales within one model, as they process long-past inputs at a different temporal resolution than more recent inputs. In a benchmark on 516 basins across the continental United States, these models achieved significantly higher Nash–Sutcliffe efficiency (NSE) values than the US National Water Model. Compared to naive prediction with distinct LSTMs per timescale, the multi-timescale architectures are computationally more efficient with no loss in accuracy. Beyond prediction quality, the multi-timescale LSTM can process different input variables at different timescales, which is especially relevant to operational applications where the lead time of meteorological forcings depends on their temporal resolution.
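The Nash–Sutcliffe efficiency used in the benchmark compares the squared error of the simulation with the variance of the observations around their mean: NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))². A value of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the observed mean. A minimal sketch follows; the function name and toy values are illustrative.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency of a simulated series against observations."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance_sum = sum((o - mean_obs) ** 2 for o in observed)
    if variance_sum == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - squared_error / variance_sum


# Toy discharge values (hypothetical).
observed = [1.0, 2.0, 3.0]
print(nse(observed, [1.0, 2.0, 3.0]))  # perfect simulation
print(nse(observed, [2.0, 2.0, 2.0]))  # always predicting the observed mean
```

In a multi-basin benchmark, NSE is typically computed per basin and then summarized across basins, for example by its median.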
Abstract. Long Short-Term Memory Networks (LSTMs) have been applied to daily discharge prediction with remarkable success. Many practical scenarios, however, require predictions at more granular timescales. For instance, accurate prediction of short but extreme flood peaks can make a life-saving difference, yet such peaks may escape the coarse temporal resolution of daily predictions. Naively training an LSTM on hourly data, however, entails very long input sequences that make learning hard and computationally expensive. In this study, we propose two Multi-Timescale LSTM (MTS-LSTM) architectures that jointly predict multiple timescales within one model, as they process long-past inputs at a single temporal resolution and branch out into each individual timescale for more recent input steps. We test these models on 516 basins across the continental United States and benchmark against the US National Water Model. Compared to naive prediction with a distinct LSTM per timescale, the multi-timescale architectures are computationally more efficient with no loss in accuracy. Beyond prediction quality, the multi-timescale LSTM can process different input variables at different timescales, which is especially relevant to operational applications where the lead time of meteorological forcings depends on their temporal resolution.
Data-intensive research and decision-making continue to gain adoption across diverse organizations. As researchers and practitioners increasingly rely on analyzing large data products both to answer scientific questions and to meet operational needs, data acquisition and pre-processing become critical tasks. For environmental science, the Canadian Surface Prediction Archive (CaSPAr) facilitates easy access to custom subsets of numerical weather predictions. We demonstrate a new open-source interface for CaSPAr that provides easy-to-use map-based querying capabilities and automates data ingestion into the CaSPAr batch processing server.