Machine learning research typically focuses on techniques for developing models for a given dataset. In real-world applications, however, data is often messy, and improving models is not always the best way to enhance performance: one can also improve the dataset itself instead of treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for improving datasets systematically, often yielding significant gains in practical ML applications. While good data scientists have traditionally improved datasets manually through intuition and ad hoc trial and error, DCAI approaches data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling behavior has become predictable. Building and using datasets, however, remains a labor-intensive and expensive process, with little infrastructure or best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
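To make "improving the dataset itself" concrete, here is a minimal sketch of one systematic data-improvement step: using out-of-fold model confidence to surface likely label errors for human review. The toy dataset, the logistic-regression model, and the number of examples flagged are illustrative assumptions, not a prescribed DCAI pipeline.

```python
# A minimal sketch of one data-centric step: flagging likely label errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy dataset with a few labels flipped to simulate annotation noise.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=25, replace=False)
y_noisy = y.copy()
y_noisy[noisy] = 1 - y_noisy[noisy]

# Out-of-fold predicted probabilities avoid scoring a model on the
# same labels it was trained on.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Flag the examples whose given label receives the lowest out-of-fold
# confidence; these are candidates for re-labeling, not automatic deletions.
given_label_conf = proba[np.arange(len(y_noisy)), y_noisy]
suspects = np.argsort(given_label_conf)[:25]
print(f"{np.isin(suspects, noisy).mean():.0%} of flagged examples were truly flipped")
```

In a real pipeline, the flagged examples would be routed to annotators and the model retrained on the corrected labels, closing the data-improvement loop.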
Topics
We welcome a wide array of submissions focused on data-centric AI, encompassing topics such as theories, algorithms, applications, systems, and tools. These topics include but are not limited to:
- Automated Data Science Methods
  - Data cleaning, denoising, and interpolation
  - Feature selection and generation
  - Data refinement, feature-instance joint selection
  - Data quality improvement, representation learning, reconstruction
  - Outlier detection and removal (see the sketch after this list)
- Tools and Methodologies for Expediting Open-Source Dataset Preparation
  - Time-saving tools for sourcing and preparing high-quality data
  - Tools for consistent data labeling and data quality improvement
  - Tools for generating high-quality supervised learning training data
  - Tools for dataset control, high-level editing, and searching public resources
  - Tools for dataset feedback incorporation, coverage understanding, and editing
  - Dataset importers and exporters for easy data combination and consumption
  - System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
  - Data selection techniques, semi-supervised learning, few-shot learning
  - Weak supervision methods, transfer learning, self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
  - Datasets for bias evaluation and analysis
  - Algorithms for automated bias elimination and model training with biased data
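As one concrete illustration of the outlier detection and removal topic above, the sketch below uses scikit-learn's IsolationForest to flag and drop anomalous rows. The synthetic data and the 5% contamination rate are illustrative assumptions, not recommended defaults.

```python
# A minimal sketch of outlier detection and removal with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic table: dense inliers plus a few scattered anomalies.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
outliers = rng.uniform(low=-8.0, high=8.0, size=(15, 2))
X = np.vstack([inliers, outliers])

# fit_predict returns +1 for inliers and -1 for detected outliers.
mask = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1
X_clean = X[mask]
print(f"kept {mask.sum()} of {len(X)} rows")
```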
Submission Details
Important Dates (Anywhere on Earth)
- Workshop Papers Submission: Nov 5, 2024
- Notification of Workshop Papers Acceptance: Nov 15, 2024
- Camera-ready Deadline and Copyright Form: Nov 23, 2024
- Conference: TBD
Keynote Presentations
Long-Tailed Learning in the Open and Dynamic World: Theories, Algorithms, and Applications
Presenter: Dawei Zhou
Bio: Dawei Zhou is an Assistant Professor in the Computer Science Department at Virginia Tech and the director of the Virginia Tech Learning on Graphs (VLOG) Lab. Zhou's primary research focuses on open-world machine learning, with applications in hypothesis generation and validation, financial fraud detection, cyber security, risk management, predictive maintenance, and healthcare. He obtained his Ph.D. degree from the Computer Science Department of the University of Illinois Urbana-Champaign (UIUC). He has authored more than 60 publications in premier academic venues across AI, data mining, and information retrieval (e.g., ICML, NeurIPS, AAAI, IJCAI, KDD, ICDM, SDM, TKDD, DMKD, WWW, CIKM) and has served as Vice Program Chair, Proceedings Chair, Local Chair, Social Media and Publicity Chair, Session Chair, and (Senior) Program Committee Member at various top ML and AI conferences (e.g., KDD, NeurIPS, ICML, WWW, SIGIR, ICLR, AAAI, IJCAI, BigData). His research is generously supported by Virginia Tech, NSF, DARPA, DHS, Commonwealth Cyber Initiative, 4VA, Deloitte, Amazon, and Cisco. His work has been recognized by the 24th CNSF Capitol Hill Science Exhibition, the Cisco Faculty Research Award (2023), the AAAI New Faculty Highlights roster (2024), the Amazon-Initiative Research Award (2024), and the NSF CAREER Award (2024).
Abstract: A common and fundamental property of real-world data is the long-tailed distribution, where the majority of examples come from a few head categories, while the rest of the examples belong to a massive number of tail categories. This data characteristic appears across a broad range of domains, including financial fraud detection, e-commerce recommendation, scientific discovery, and rare disease diagnosis. Despite the tremendous progress that has been made, the vast majority of existing long-tailed learning work is essentially conducted in a closed-world environment with predefined domains, data distributions, and downstream tasks. A natural and fundamental research question remains largely open: How can we enable open-world long-tailed learning (OpenLT) in which data is collected from heterogeneous sources with varying data distribution and the patterns of interest are evolving and open-ended? In this talk, I will discuss our group's recent work on 1) OpenLT Theory – characterizing the task complexity and generalization performance of long-tailed learning, 2) OpenLT Algorithm – developing a generic computational framework for long-tailed learning with label scarcity and highly skewed data distribution, and 3) OpenLT Application – hinging on key application domains to discuss our proposed techniques and theoretical results for open-world long-tailed learning. Finally, I will conclude this talk and share thoughts about my future research.
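For readers new to the setting, the short sketch below computes the class-balanced loss weights of Cui et al. (2019), a standard long-tailed-learning baseline that down-weights head classes relative to tail classes. The class counts and the beta value are illustrative assumptions; this is general background on the head-vs-tail skew the talk addresses, not the speaker's OpenLT framework.

```python
# A minimal sketch of class-balanced re-weighting via the
# "effective number of samples" (Cui et al., 2019).
import numpy as np

# Illustrative long-tailed class counts: a few head classes dominate.
counts = np.array([5000, 2000, 500, 100, 20, 5])
beta = 0.999  # hyperparameter; beta -> 1 approaches inverse-frequency weighting

# Effective number of samples per class: (1 - beta^n) / (1 - beta).
effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
weights = 1.0 / effective_num
weights = weights / weights.sum() * len(counts)  # normalize to mean 1

for n, w in zip(counts, weights):
    print(f"class with {n:5d} examples -> loss weight {w:.3f}")
```

These per-class weights would typically be passed to a weighted cross-entropy loss so that the handful of tail examples contributes meaningfully to training.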
Organizing Committee
- Hui Xiong, The Hong Kong University of Science and Technology (Guangzhou)
- Yanjie Fu, Arizona State University
- Haifeng Chen, NEC Laboratories America, Inc.
- Chandan Reddy, Virginia Tech
- Kunpeng Liu, Portland State University
- Dongjie Wang, University of Kansas
- Charu Aggarwal, IBM T. J. Watson Research Center
- Wei Ding, University of Massachusetts Boston