Home
Machine learning research typically focuses on techniques for developing models for a given dataset. In real-world applications, however, data is often messy, and improving the model is not always the best way to improve performance: one can also improve the dataset itself instead of treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for improving datasets systematically, often yielding significant gains in practical ML applications. While good data scientists have traditionally relied on ad hoc trial and error and intuition to improve datasets by hand, DCAI treats data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling behavior has become predictable, yet building and using datasets remains a labor-intensive and expensive process, with little infrastructure or best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient, open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
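To make the data-centric idea concrete, the sketch below shows one illustrative step of dataset improvement: flagging training examples whose given labels look suspicious so they can be reviewed and corrected before retraining. It is a minimal sketch under simple assumptions (integer class labels 0..K-1 and a scikit-learn-style feature matrix); the helper name and the 5% threshold are hypothetical choices, not part of any particular DCAI tool.

```python
# A minimal, illustrative data-centric step: rank training examples by the
# out-of-fold predicted probability of their given label and flag the
# lowest-scoring ones as candidates for relabeling review.
# The helper name and the 5% threshold are hypothetical choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspicious_labels(X, y, quantile=0.05):
    """Return indices of examples whose given label receives unusually low
    out-of-fold probability (assumes y holds integer classes 0..K-1)."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities avoid scoring examples the model has memorized.
    probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    prob_of_given_label = probs[np.arange(len(y)), y]
    threshold = np.quantile(prob_of_given_label, quantile)
    return np.where(prob_of_given_label <= threshold)[0]
```

In a data-centric workflow, the flagged indices would be sent to annotators for review, and the cleaned dataset, rather than a new model architecture, becomes the lever for improving accuracy.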
Agenda
Date: December 15th, 14:00-18:00

| Time | Title | Format | Presenter/Author |
|---|---|---|---|
| 14:00-14:05 | Opening Remarks | 5 min | Organizers |
| 14:05-14:45 | Keynote Presentation | 30 min + 10 min QA session | Chandan Reddy |
| 14:45-15:25 | Keynote Presentation | 30 min + 10 min QA session | Dawei Zhou |
| 15:25-16:00 | Coffee Break | 35 min | Organizers |
| 16:00-16:40 | Keynote Presentation | 30 min + 10 min QA session | Dongjin Song |
| 16:40-16:55 | Paper Presentation: Can Causal DAGs Generate Data-based Explanations of Black-box Models? | 10 min + 5 min QA session | Arman Ashkari and El Kindi Rezig |
| 16:55-17:10 | Paper Presentation: Leveraging Structured and Unstructured Data for Tabular Data Cleaning | 10 min + 5 min QA session | Pavitra Mehra and El Kindi Rezig |
| 17:10-17:25 | Paper Presentation: From Words to Actions: A Comprehensive Approach to Identifying Incel Behavior on Reddit | 10 min + 5 min QA session | Ahmet Y. Demirbas, Jakir Hossain, and Ahmet Erdem Sariyuce |
| 17:25-17:40 | Paper Presentation: Active Learning with Alternating Acquisition Functions: Balancing the Exploration-Exploitation Dilemma | 10 min + 5 min QA session | Cédric Jung, Shirin Salehi, and Anke Schmeink |
| 17:40-17:55 | Paper Presentation: Integrating Flow and Structure in Diagrams for Data Science | 10 min + 5 min QA session | Enea Vincenzo Napolitano, Elio Masciari, and Carlos Ordonez |
| 17:55-18:00 | Closing Remarks | 5 min | Organizers |
Topics
We welcome a wide array of submissions focused on data-centric AI, spanning theories, algorithms, applications, systems, and tools. Topics include, but are not limited to:
- Automated Data Science Methods
  - Data cleaning, denoising, and interpolation
  - Feature selection and generation
  - Data refinement and feature-instance joint selection
  - Data quality improvement, representation learning, and reconstruction
  - Outlier detection and removal
- Tools and Methodologies for Expediting Open-source Dataset Preparation
  - Tools that reduce the time needed to source and prepare high-quality data
  - Tools for consistent data labeling and data quality improvement
  - Tools for generating high-quality supervised learning training data
  - Tools for dataset control, high-level editing, and searching public resources
  - Tools for dataset feedback incorporation, coverage understanding, and editing
  - Dataset importers and exporters for easy data combination and consumption
  - System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
  - Data selection techniques, semi-supervised learning, and few-shot learning
  - Weak supervision, transfer learning, and self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
  - Datasets for bias evaluation and analysis
  - Algorithms for automated bias elimination and model training with biased data
Submission Details
Important Dates (Anywhere on Earth)
- Workshop Papers Submission: Nov 5, 2024
- Notification of Workshop Papers Acceptance: Nov 15, 2024
- Camera-ready Deadline and Copyright Form: Nov 23, 2024
- Conference: Dec 15, 2024
Keynote Presentations
Long-Tailed Learning in the Open and Dynamic World: Theories, Algorithms, and Applications
Presenter: Dawei Zhou
Bio: Dawei Zhou is an Assistant Professor in the Computer Science Department at Virginia Tech and the director of the Virginia Tech Learning on Graphs (VLOG) Lab. Zhou's primary research focuses on open-world machine learning, with applications in hypothesis generation and validation, financial fraud detection, cyber security, risk management, predictive maintenance, and healthcare. He obtained his Ph.D. from the Computer Science Department of the University of Illinois Urbana-Champaign (UIUC). He has authored more than 60 publications in premier academic venues across AI, data mining, and information retrieval (e.g., ICML, NeurIPS, AAAI, IJCAI, KDD, ICDM, SDM, TKDD, DMKD, WWW, CIKM) and has served as Vice Program Chair, Proceedings Chair, Local Chair, Social Media and Publicity Chair, Session Chair, and (Senior) Program Committee Member at various top ML and AI conferences (e.g., KDD, NeurIPS, ICML, WWW, SIGIR, ICLR, AAAI, IJCAI, BigData). His research is generously supported by Virginia Tech, NSF, DARPA, DHS, the Commonwealth Cyber Initiative, 4VA, Deloitte, Amazon, and Cisco. His work has been recognized by the 24th CNSF Capitol Hill Science Exhibition, a Cisco Faculty Research Award (2023), the AAAI New Faculty Highlights roster (2024), an Amazon-Initiative Research Award (2024), and an NSF CAREER Award (2024).
Abstract: A common and fundamental property of real-world data is the long-tailed distribution, where the majority of examples come from a few head categories while the rest belong to a massive number of tail categories. This data characteristic arises across many domains, including financial fraud detection, e-commerce recommendation, scientific discovery, and rare disease diagnosis. Despite tremendous progress, the vast majority of existing long-tailed learning work is conducted in a closed-world environment with predefined domains, data distributions, and downstream tasks. A natural and fundamental research question remains largely open: how can we enable open-world long-tailed learning (OpenLT), in which data is collected from heterogeneous sources with varying distributions and the patterns of interest are evolving and open-ended? In this talk, I will discuss our group's recent work on 1) OpenLT Theory - characterizing the task complexity and generalization performance of long-tailed learning, 2) OpenLT Algorithm - developing a generic computational framework for long-tailed learning with label scarcity and highly skewed data distributions, and 3) OpenLT Application - drawing on key application domains to discuss our proposed techniques and theoretical results for open-world long-tailed learning. Finally, I will conclude the talk and share thoughts on future research.
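For readers new to long-tailed learning, the short sketch below shows one standard baseline for highly skewed label distributions: class-balanced re-weighting of the cross-entropy loss via the effective number of samples (Cui et al., 2019). It is illustrative background only and is not the OpenLT theory or algorithms presented in this keynote.

```python
# Class-balanced cross-entropy via the "effective number of samples"
# (Cui et al., 2019): a common baseline for long-tailed label distributions.
# Illustrative only; not the OpenLT framework described in the talk.
import torch
import torch.nn.functional as F

def class_balanced_weights(counts_per_class, beta=0.999):
    """Weight each class by the inverse of its effective number of samples."""
    counts = torch.as_tensor(counts_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)   # normalize to mean 1

def class_balanced_cross_entropy(logits, targets, counts_per_class, beta=0.999):
    """Cross-entropy in which rare (tail) classes receive larger per-class weights."""
    weights = class_balanced_weights(counts_per_class, beta).to(logits.device)
    return F.cross_entropy(logits, targets, weight=weights)
```

Tail classes with few examples receive larger weights, partially offsetting the head classes' dominance of the gradient.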
Towards Data-Centric Time Series Analysis
Presenter: Dongjin Song
Bio: Dongjin Song has been an Assistant Professor in the School of Computing at the University of Connecticut since Fall 2020. Previously, he was a Research Staff Member at NEC Labs America in Princeton, NJ. He earned his Ph.D. in Electrical and Computer Engineering (ECE) from the University of California, San Diego (UCSD) in 2016. His research interests include machine learning, data science, and their applications in time series data analysis and graph representation learning. His work has been published in top-tier data science and artificial intelligence venues, including NeurIPS, ICML, ICLR, KDD, ICDM, SDM, AAAI, IJCAI, CVPR, and ICCV. Three of his papers have been recognized as the most influential papers by paperdigest.org. He serves as an Associate Editor for Pattern Recognition and Neurocomputing, and has contributed as an Area Chair or Senior Program Committee Member for conferences such as AAAI, IJCAI, ICDM, and CIKM. He has also co-organized the AI for Time Series (AI4TS) Workshop at IJCAI, AAAI, ICDM, and SDM, as well as the MiLeTS workshops at KDD. He received the prestigious NSF CAREER Award and the Frontiers of Science Award (FSA) in 2024.
Abstract: The increasing ubiquity of time series data across various domains, from healthcare to finance, calls for innovative approaches to analyze and interpret complex temporal dynamics. This talk explores the emerging field of data-centric methodologies in time series analysis, drawing insights from two recent advances: Rank Supervised Contrastive Learning (RankSCL) and Semantic Space Informed Prompt Learning with Large Language Models (S2IP-LLM). RankSCL enhances time series classification by integrating rank-aware contrastive learning, assigning differential importance to positive pairs and employing targeted augmentation to enrich class-specific boundaries. S2IP-LLM, on the other hand, introduces a novel alignment of pre-trained semantic spaces with time series embeddings, leveraging large language models to achieve superior forecasting performance by augmenting time series embeddings based on pre-trained word token embeddings. Together, these approaches highlight the potential of integrating advanced representation learning and data augmentation techniques to push the boundaries of predictive performance and interpretability in time series tasks. The talk will showcase empirical results on benchmark datasets, emphasizing how data-centric innovations enable robust and generalizable solutions to real-world time series challenges.
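As orientation for the contrastive-learning part of this talk, the sketch below implements a plain supervised contrastive loss over a batch of time-series embeddings. It weights all positive pairs uniformly; the rank-aware weighting of positives and targeted augmentation that define RankSCL, and the S2IP-LLM semantic-space alignment, are not reproduced here.

```python
# A plain supervised contrastive loss over a batch of embeddings: each anchor
# is pulled toward same-label examples and pushed from different-label ones.
# Uniform weighting of positives; RankSCL's rank-aware weighting and targeted
# augmentation are not implemented in this sketch.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    z = F.normalize(embeddings, dim=1)                  # unit-norm embeddings
    sim = z @ z.t() / temperature                       # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = positives.sum(dim=1)
    # Mean log-probability of each anchor's positives; anchors with no positives are skipped.
    pos_log_prob = log_prob.masked_fill(~positives, 0.0).sum(dim=1)
    loss_per_anchor = -pos_log_prob / pos_counts.clamp(min=1)
    return loss_per_anchor[pos_counts > 0].mean()
```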
Scientific Equation Discovery via Programming with Large Language Models
Presenter: Chandan Reddy
Bio: Chandan Reddy is a Professor in the Department of Computer Science at Virginia Tech. He received his Ph.D. from Cornell University and his M.S. from Michigan State University. His primary research interests include machine learning and natural language processing, with applications in healthcare, software, e-commerce, and human resource management. Dr. Reddy's research has been funded by organizations such as the NSF, NIH, DOE, DOT, and various industries. He has authored over 190 peer-reviewed articles in leading conferences and journals. He has received several awards for his research, including the Best Application Paper Award at the ACM SIGKDD conference in 2010, the Best Poster Award at the IEEE VAST conference in 2014, and the Best Student Paper Award at the IEEE ICDM conference in 2016. He was also a finalist in the INFORMS Franz Edelman Award Competition in 2011. Dr. Reddy serves (or has served) on the editorial boards of journals such as ACM TKDD, ACM TIST, NPJ AI, and IEEE Big Data. He is a Senior Member of the IEEE and a Distinguished Member of the ACM. More information about his work is available at https://creddy.net.
Abstract: Mathematical equation discovery is a crucial aspect of computational scientific discovery, traditionally approached through symbolic regression (SR) methods that focus mainly on data-driven equation search. Current approaches often struggle to fully leverage the rich domain-specific knowledge that scientists typically rely on. We present LLM-SR, an iterative approach that combines the power of large language models (LLMs) with evolutionary program search and data-driven optimization to discover scientific equations more effectively and efficiently while incorporating scientific prior knowledge. LLM-SR integrates several key aspects of the discovery process, namely, scientific knowledge representation and reasoning (via LLMs’ prompting and prior knowledge), hypothesis generation (equation skeleton proposals produced by LLMs), data-driven evaluation and optimization, and evolutionary search for iterative refinement. Through this integration, our approach discovers interpretable and physically meaningful equations while ensuring efficient exploration of the equation search space and generalization to out-of-domain data. We will demonstrate LLM-SR’s effectiveness across various scientific domains - nonlinear oscillators, bacterial growth, and material stress behavior. This work not only improves the accuracy and interpretability of discovered equations but also enhances the efficiency of the search process by leveraging scientific prior knowledge.
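To illustrate the data-driven evaluation step that equation-skeleton approaches such as LLM-SR rely on, the sketch below fits the free parameters of one hypothetical candidate skeleton (a damped oscillation, standing in for an LLM-proposed program) and scores it by mean squared error. The LLM prompting, knowledge injection, and evolutionary search loop of LLM-SR are not shown, and the skeleton and data here are synthetic illustrations.

```python
# Score one hypothetical equation skeleton by fitting its free parameters to
# data; in skeleton-based equation discovery, many such candidates would be
# proposed, scored, and refined. The skeleton and data below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def candidate_skeleton(t, a, b, omega):
    """Hypothetical skeleton: exponentially damped oscillation."""
    return a * np.exp(-b * t) * np.cos(omega * t)

def score_skeleton(skeleton, t, y, p0):
    """Fit the skeleton's parameters and return (mean squared error, fitted params)."""
    params, _ = curve_fit(skeleton, t, y, p0=p0, maxfev=10000)
    mse = float(np.mean((skeleton(t, *params) - y) ** 2))
    return mse, params

# Example: score the candidate on noisy synthetic observations.
t = np.linspace(0, 10, 200)
y = 1.5 * np.exp(-0.3 * t) * np.cos(2.0 * t) + 0.05 * np.random.randn(t.size)
mse, params = score_skeleton(candidate_skeleton, t, y, p0=[1.0, 0.1, 1.0])
```

In the full system, candidate skeletons proposed by the LLM would be scored this way against the data and the best-performing ones retained and refined over successive generations.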
Speakers
- Dr. Dawei Zhou, Virginia Tech
- Dr. Dongjin Song, University of Connecticut
- Dr. Chandan Reddy, Virginia Tech
Organizing Committee
- Hui Xiong, The Hong Kong University of Science and Technology (Guangzhou)
- Yanjie Fu, Arizona State University
- Haifeng Chen, NEC Laboratories America, Inc.
- Kunpeng Liu, Portland State University
- Dongjie Wang, University of Kansas
- Charu Aggarwal, IBM T. J. Watson Research Center