2023 NSF HDR Ecosystem Conference

October 16-18, 2023

NSF Harnessing the Data Revolution (HDR) 2023 Ecosystem Conference Unites Data-intensive Research Community

More than 100 scientists, engineers and educators gathered in Denver, Colorado, in October to discuss solutions to some of the most pressing challenges in data-intensive research, education and workforce development. The 2023 NSF Harnessing the Data Revolution (HDR) Ecosystem Conference brought together representatives of academia, government and the private sector at all career stages, from graduate students to senior professors, with the goal of expanding the HDR “ecosystem” to other related NSF-supported initiatives.

Since 2019, the NSF Harnessing the Data Revolution (HDR) initiative has led the charge in embedding foundational advances in artificial intelligence/machine learning (AI/ML) into grand challenges spanning diverse scientific domains. Represented at the conference were the HDR Institutes for Data-Intensive Research in Science and Engineering (Institutes), the HDR Data Science Corps (DSCs), and the HDR Transdisciplinary Research In Principles Of Data Science (TRIPODS), as well as members of the wider data-intensive research community.

The overarching goals of the conference were four-fold:

  1. Community building to strengthen relationships among HDR entities and wider data-intensive research communities
  2. Cross-learning to build on successes, best practices and innovative products
  3. Providing a forum for reflection on accomplishments and future goals of each HDR entity, and
  4. Identifying cross-cutting challenges in data-intensive research among HDR entities and beyond, developing new collaborations, and seeking future opportunities.

Over three full days, conference attendees participated in a variety of keynote presentations, discussion panels, and “unconference” activities, including a pitch session and subsequent asset mapping for the winning pitches. Topical sessions, some of which were developed prior to the conference and others that arose out of facilitated conversations at the conference, included geospatial knowledge discovery harnessing pre-trained language models on CyberGISX, uncertainty quantification, 10 simple rules for model metadata (FAIR data), data ethics, and resources for training the next generation of scientists and engineers.

The keynote speakers, Yisong Yue from the California Institute of Technology and David W. Hogg from New York University and the Flatiron Institute, inspired attendees with examples of ML approaches ranging from using Bayesian optimization to guide experimental design to assisting clinicians with real-world decision making. The keynote speakers also challenged participants to explore ways to build trust in ML methods, and to be aware of and limit biases and systematic errors that can be present in ML simulations.

While the Institutes work on extremely diverse domain challenges – from melting glaciers to particle physics – conference attendees found common ground with data science and team science challenges:

  • How to detect out-of-distribution events and lay the groundwork for uncertainty quantification and generalizability?
  • How to move beyond search to more complex hypothesis generation and ultimately causality?
  • How to advance past out-of-sample predictive performance in evaluating models?

At the intersection of data science and team science, the Institutes collectively face the following questions:

  • How do we efficiently onboard team members across disciplines?
  • How do we create streamlined data workflows and manage heterogeneous data sources?
  • How do we catalyze projects that are compelling across disciplines?
  • How do we ensure that the projects bring intellectual and professional value for participants from both the domain side and the data side?

Attendees also found strong commonality in challenges related to broader impacts:

  • How do we work together to recruit and onboard undergraduates and post-baccalaureate students into interdisciplinary research projects?
  • How do we work together to engage the broader computer science community in domain challenges?

The conference culminated on its final day in an ideation expo, in which six teams proposed ways to address challenges that were discussed during the conference. Participants were particularly excited about two of the ideas that emerged: (i) creating machine learning challenges to spur cross-disciplinary research and (ii) developing an HDR-wide education repository. Both of these ideas are moving forward after the conference. The A3D3 Institute is spearheading a cross-domain ML challenge and has proposed presenting awards for the challenge at next year’s annual HDR ecosystem conference. The HDR-wide education repository is being developed by members of all five HDR Institutes, with input from members of DSCs. At the individual level, two-thirds of participants surveyed responded that they had formed a new collaboration at the conference.

Perspectives and insights from conference participants include:

  • Despite very different scientific objectives, there were clear common themes that emerged across the institutes: organizing data, evaluating complex models, and managing people with different goals and incentives.
  • I enjoyed participating in the mentorship panel/dinner. It was a nice opportunity to provide some guidance on how to navigate life as junior faculty.
  • It was interesting talking to other data scientists and hearing how they navigated both collaborations within their institutes and also how they communicated the scientific value of interdisciplinary work to their home departments.

The 2023 NSF HDR Ecosystem Conference successfully fostered a collaborative environment, uniting diverse experts to address the multifaceted challenges of data-intensive research. Looking to the future, we expect that the active efforts to converge around (i) data challenges, (ii) post-baccalaureate scholars, and (iii) education and outreach repositories will rapidly pay dividends.

The 2024 HDR Ecosystem Conference is planned for September 9-12 on the campus of the University of Illinois at Urbana-Champaign. This conference will build on the previous HDR ecosystem events. During this event, HDR ecosystem participants will showcase science accomplishments from their projects and prospects for further advancing data-driven discovery. We will involve AI institutes, key industry partners and representatives from high-performance computing providers to strategize together with all participants on collaborative ways to best support data and AI-driven workflows for research, education, and outreach. The current results from the cross-HDR machine learning challenges will be presented and discussed, along with awards for challenge winners and strategies to evolve the challenges to make the most of interdisciplinary activities at the intersection of AI and data-intensive science.