Home
Machine learning research typically focuses on techniques for developing models for a given dataset. In real-world applications, however, data is often messy, and improving the model is not always the best way to improve performance: one can also improve the dataset itself instead of treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for improving datasets systematically, often yielding significant gains in practical ML applications. While good data scientists have traditionally relied on ad hoc trial and error and intuition to improve datasets by hand, DCAI treats data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling behavior has become predictable, yet building and using datasets remains a labor-intensive and expensive process, with little infrastructure or best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient, open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
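To make the data-centric idea concrete, the sketch below shows one illustrative step of dataset improvement: flagging training examples whose given labels look suspicious so they can be reviewed and corrected before retraining. It is a minimal sketch under simple assumptions (integer class labels 0..K-1 and a scikit-learn-style feature matrix); the helper name and the 5% threshold are hypothetical choices, not part of any particular DCAI tool.

```python
# A minimal, illustrative data-centric step: rank training examples by the
# out-of-fold predicted probability of their given label and flag the
# lowest-scoring ones as candidates for relabeling review.
# The helper name and the 5% threshold are hypothetical choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspicious_labels(X, y, quantile=0.05):
    """Return indices of examples whose given label receives unusually low
    out-of-fold probability (assumes y holds integer classes 0..K-1)."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities avoid scoring examples the model has memorized.
    probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    prob_of_given_label = probs[np.arange(len(y)), y]
    threshold = np.quantile(prob_of_given_label, quantile)
    return np.where(prob_of_given_label <= threshold)[0]
```

In a data-centric workflow, the flagged indices would be sent to annotators for review, and the cleaned dataset, rather than a new model architecture, becomes the lever for improving accuracy.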
Agenda
Date: December 15th, 14:00-18:00

| Time | Title | Format | Presenter/Author |
|---|---|---|---|
| 14:00-14:05 | Opening Remarks | 5 min | Organizers |
| 14:05-14:45 | Keynote Presentation | 30 min + 10 min QA session | Chandan Reddy |
| 14:45-15:25 | Keynote Presentation | 30 min + 10 min QA session | Dawei Zhou |
| 15:25-16:00 | Coffee Break | 35 min | Organizers |
| 16:00-16:40 | Keynote Presentation | 30 min + 10 min QA session | Dongjin Song |
| 16:40-16:55 | Paper Presentation: Can Causal DAGs Generate Data-based Explanations of Black-box Models? | 10 min + 5 min QA session | Arman Ashkari and El Kindi Rezig |
| 16:55-17:10 | Paper Presentation: Leveraging Structured and Unstructured Data for Tabular Data Cleaning | 10 min + 5 min QA session | Pavitra Mehra and El Kindi Rezig |
| 17:10-17:25 | Paper Presentation: From Words to Actions: A Comprehensive Approach to Identifying Incel Behavior on Reddit | 10 min + 5 min QA session | Ahmet Y. Demirbas, Jakir Hossain, and Ahmet Erdem Sariyuce |
| 17:25-17:40 | Paper Presentation: Active Learning with Alternating Acquisition Functions: Balancing the Exploration-Exploitation Dilemma | 10 min + 5 min QA session | Cédric Jung, Shirin Salehi, and Anke Schmeink |
| 17:40-17:55 | Paper Presentation: Integrating Flow and Structure in Diagrams for Data Science | 10 min + 5 min QA session | Enea Vincenzo Napolitano, Elio Masciari, and Carlos Ordonez |
| 17:55-18:00 | Closing Remarks | 5 min | Organizers |
Topics
We welcome a wide array of submissions focused on data-centric AI, spanning theories, algorithms, applications, systems, and tools. Topics include, but are not limited to:
- Automated Data Science Methods
  - Data cleaning, denoising, and interpolation
  - Feature selection and generation
  - Data refinement and feature-instance joint selection
  - Data quality improvement, representation learning, and reconstruction
  - Outlier detection and removal
- Tools and Methodologies for Expediting Open-source Dataset Preparation
  - Tools that reduce the time needed to source and prepare high-quality data
  - Tools for consistent data labeling and data quality improvement
  - Tools for generating high-quality supervised learning training data
  - Tools for dataset control, high-level editing, and searching public resources
  - Tools for dataset feedback incorporation, coverage understanding, and editing
  - Dataset importers and exporters for easy data combination and consumption
  - System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
  - Data selection techniques, semi-supervised learning, and few-shot learning
  - Weak supervision, transfer learning, and self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
  - Datasets for bias evaluation and analysis
  - Algorithms for automated bias elimination and model training with biased data
Submission Details
Important Dates (Anywhere on Earth)
- Workshop Papers Submission: Nov 5, 2024
- Notification of Workshop Papers Acceptance: Nov 15, 2024
- Camera-ready Deadline and Copyright Form: Nov 23, 2024
- Conference: Dec 15, 2024
Keynote Presentations
Long-Tailed Learning in the Open and Dynamic World: Theories, Algorithms, and Applications
Presenter: Dawei Zhou
Bio: Dawei Zhou is an Assistant Professor in the Computer Science Department at Virginia Tech and the director of the Virginia Tech Learning on Graphs (VLOG) Lab. Zhou's primary research focuses on open-world machine learning, with applications in hypothesis generation and validation, financial fraud detection, cyber security, risk management, predictive maintenance, and healthcare. He obtained his Ph.D. from the Computer Science Department of the University of Illinois Urbana-Champaign (UIUC). He has authored more than 60 publications in premier academic venues across AI, data mining, and information retrieval (e.g., ICML, NeurIPS, AAAI, IJCAI, KDD, ICDM, SDM, TKDD, DMKD, WWW, CIKM) and has served as Vice Program Chair, Proceedings Chair, Local Chair, Social Media and Publicity Chair, Session Chair, and (Senior) Program Committee Member at various top ML and AI conferences (e.g., KDD, NeurIPS, ICML, WWW, SIGIR, ICLR, AAAI, IJCAI, BigData). His research is generously supported by Virginia Tech, NSF, DARPA, DHS, the Commonwealth Cyber Initiative, 4VA, Deloitte, Amazon, and Cisco. His work has been recognized by the 24th CNSF Capitol Hill Science Exhibition, a Cisco Faculty Research Award (2023), the AAAI New Faculty Highlights roster (2024), an Amazon-Initiative Research Award (2024), and an NSF CAREER Award (2024).
Abstract: A common and fundamental property of real-world data is the long-tailed distribution, where the majority of examples come from a few head categories while the rest belong to a massive number of tail categories. This data characteristic arises across many domains, including financial fraud detection, e-commerce recommendation, scientific discovery, and rare disease diagnosis. Despite tremendous progress, the vast majority of existing long-tailed learning work is conducted in a closed-world environment with predefined domains, data distributions, and downstream tasks. A natural and fundamental research question remains largely open: how can we enable open-world long-tailed learning (OpenLT), in which data is collected from heterogeneous sources with varying distributions and the patterns of interest are evolving and open-ended? In this talk, I will discuss our group's recent work on 1) OpenLT Theory - characterizing the task complexity and generalization performance of long-tailed learning, 2) OpenLT Algorithm - developing a generic computational framework for long-tailed learning with label scarcity and highly skewed data distributions, and 3) OpenLT Application - drawing on key application domains to discuss our proposed techniques and theoretical results for open-world long-tailed learning. Finally, I will conclude the talk and share thoughts on future research.
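For readers new to long-tailed learning, the short sketch below shows one standard baseline for highly skewed label distributions: class-balanced re-weighting of the cross-entropy loss via the effective number of samples (Cui et al., 2019). It is illustrative background only and is not the OpenLT theory or algorithms presented in this keynote.

```python
# Class-balanced cross-entropy via the "effective number of samples"
# (Cui et al., 2019): a common baseline for long-tailed label distributions.
# Illustrative only; not the OpenLT framework described in the talk.
import torch
import torch.nn.functional as F

def class_balanced_weights(counts_per_class, beta=0.999):
    """Weight each class by the inverse of its effective number of samples."""
    counts = torch.as_tensor(counts_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)   # normalize to mean 1

def class_balanced_cross_entropy(logits, targets, counts_per_class, beta=0.999):
    """Cross-entropy in which rare (tail) classes receive larger per-class weights."""
    weights = class_balanced_weights(counts_per_class, beta).to(logits.device)
    return F.cross_entropy(logits, targets, weight=weights)
```

Tail classes with few examples receive larger weights, partially offsetting the head classes' dominance of the gradient.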
Towards Data-Centric Time Series Analysis
Presenter: Dongjin Song
Bio: Dongjin Song has been an Assistant Professor in the School of Computing at the University of Connecticut since Fall 2020. Previously, he was a Research Staff Member at NEC Labs America in Princeton, NJ. He earned his Ph.D. in Electrical and Computer Engineering (ECE) from the University of California, San Diego (UCSD) in 2016. His research interests include machine learning, data science, and their applications in time series data analysis and graph representation learning. His work has been published in top-tier data science and artificial intelligence venues, including NeurIPS, ICML, ICLR, KDD, ICDM, SDM, AAAI, IJCAI, CVPR, and ICCV. Three of his papers have been recognized as the most influential papers by paperdigest.org. He serves as an Associate Editor for Pattern Recognition and Neurocomputing, and has contributed as an Area Chair or Senior Program Committee Member for conferences such as AAAI, IJCAI, ICDM, and CIKM. He has also co-organized the AI for Time Series (AI4TS) Workshop at IJCAI, AAAI, ICDM, and SDM, as well as the MiLeTS workshops at KDD. He received the prestigious NSF CAREER Award and the Frontiers of Science Award (FSA) in 2024.
Abstract: The increasing ubiquity of time series data across various domains, from healthcare to finance, calls for innovative approaches to analyze and interpret complex temporal dynamics. This talk explores the emerging field of data-centric methodologies in time series analysis, drawing insights from two recent advances: Rank Supervised Contrastive Learning (RankSCL) and Semantic Space Informed Prompt Learning with Large Language Models (S2IP-LLM). RankSCL enhances time series classification by integrating rank-aware contrastive learning, assigning differential importance to positive pairs and employing targeted augmentation to enrich class-specific boundaries. S2IP-LLM, on the other hand, introduces a novel alignment of pre-trained semantic spaces with time series embeddings, leveraging large language models to achieve superior forecasting performance by augmenting time series embeddings based on pre-trained word token embeddings. Together, these approaches highlight the potential of integrating advanced representation learning and data augmentation techniques to push the boundaries of predictive performance and interpretability in time series tasks. The talk will showcase empirical results on benchmark datasets, emphasizing how data-centric innovations enable robust and generalizable solutions to real-world time series challenges.
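As orientation for the contrastive-learning part of this talk, the sketch below implements a plain supervised contrastive loss over a batch of time-series embeddings. It weights all positive pairs uniformly; the rank-aware weighting of positives and targeted augmentation that define RankSCL, and the S2IP-LLM semantic-space alignment, are not reproduced here.

```python
# A plain supervised contrastive loss over a batch of embeddings: each anchor
# is pulled toward same-label examples and pushed from different-label ones.
# Uniform weighting of positives; RankSCL's rank-aware weighting and targeted
# augmentation are not implemented in this sketch.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    z = F.normalize(embeddings, dim=1)                  # unit-norm embeddings
    sim = z @ z.t() / temperature                       # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = positives.sum(dim=1)
    # Mean log-probability of each anchor's positives; anchors with no positives are skipped.
    pos_log_prob = log_prob.masked_fill(~positives, 0.0).sum(dim=1)
    loss_per_anchor = -pos_log_prob / pos_counts.clamp(min=1)
    return loss_per_anchor[pos_counts > 0].mean()
```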
Scientific Equation Discovery via Programming with Large Language Models
Presenter: Chandan Reddy
Bio: Chandan Reddy is a Professor in the Department of Computer Science at Virginia Tech. He received his Ph.D. from Cornell University and his M.S. from Michigan State University. His primary research interests include machine learning and natural language processing, with applications in healthcare, software, e-commerce, and human resource management. Dr. Reddy's research has been funded by organizations such as the NSF, NIH, DOE, DOT, and various industries. He has authored over 190 peer-reviewed articles in leading conferences and journals. He has received several awards for his research, including the Best Application Paper Award at the ACM SIGKDD conference in 2010, the Best Poster Award at the IEEE VAST conference in 2014, and the Best Student Paper Award at the IEEE ICDM conference in 2016. He was also a finalist in the INFORMS Franz Edelman Award Competition in 2011. Dr. Reddy serves (or has served) on the editorial boards of journals such as ACM TKDD, ACM TIST, NPJ AI, and IEEE Big Data. He is a Senior Member of the IEEE and a Distinguished Member of the ACM. More information about his work is available at https://creddy.net.
Abstract: Mathematical equation discovery is a crucial aspect of computational scientific discovery, traditionally approached through symbolic regression (SR) methods that focus mainly on data-driven equation search. Current approaches often struggle to fully leverage the rich domain-specific knowledge that scientists typically rely on. We present LLM-SR, an iterative approach that combines the power of large language models (LLMs) with evolutionary program search and data-driven optimization to discover scientific equations more effectively and efficiently while incorporating scientific prior knowledge. LLM-SR integrates several key aspects of the discovery process, namely, scientific knowledge representation and reasoning (via LLMs’ prompting and prior knowledge), hypothesis generation (equation skeleton proposals produced by LLMs), data-driven evaluation and optimization, and evolutionary search for iterative refinement. Through this integration, our approach discovers interpretable and physically meaningful equations while ensuring efficient exploration of the equation search space and generalization to out-of-domain data. We will demonstrate LLM-SR’s effectiveness across various scientific domains - nonlinear oscillators, bacterial growth, and material stress behavior. This work not only improves the accuracy and interpretability of discovered equations but also enhances the efficiency of the search process by leveraging scientific prior knowledge.
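To illustrate the data-driven evaluation step that equation-skeleton approaches such as LLM-SR rely on, the sketch below fits the free parameters of one hypothetical candidate skeleton (a damped oscillation, standing in for an LLM-proposed program) and scores it by mean squared error. The LLM prompting, knowledge injection, and evolutionary search loop of LLM-SR are not shown, and the skeleton and data here are synthetic illustrations.

```python
# Score one hypothetical equation skeleton by fitting its free parameters to
# data; in skeleton-based equation discovery, many such candidates would be
# proposed, scored, and refined. The skeleton and data below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def candidate_skeleton(t, a, b, omega):
    """Hypothetical skeleton: exponentially damped oscillation."""
    return a * np.exp(-b * t) * np.cos(omega * t)

def score_skeleton(skeleton, t, y, p0):
    """Fit the skeleton's parameters and return (mean squared error, fitted params)."""
    params, _ = curve_fit(skeleton, t, y, p0=p0, maxfev=10000)
    mse = float(np.mean((skeleton(t, *params) - y) ** 2))
    return mse, params

# Example: score the candidate on noisy synthetic observations.
t = np.linspace(0, 10, 200)
y = 1.5 * np.exp(-0.3 * t) * np.cos(2.0 * t) + 0.05 * np.random.randn(t.size)
mse, params = score_skeleton(candidate_skeleton, t, y, p0=[1.0, 0.1, 1.0])
```

In the full system, candidate skeletons proposed by the LLM would be scored this way against the data and the best-performing ones retained and refined over successive generations.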
Speakers
- Dr. Dawei Zhou, Virginia Tech
- Dr. Dongjin Song, University of Connecticut
- Dr. Chandan Reddy, Virginia Tech
Organizing Committee
- Hui Xiong, The Hong Kong University of Science and Technology (Guangzhou)
- Yanjie Fu, Arizona State University
- Haifeng Chen, NEC Laboratories America, Inc.
- Kunpeng Liu, Portland State University
- Dongjie Wang, University of Kansas
- Charu Aggarwal, IBM T. J. Watson Research Center