Machine learning typically focuses on techniques for developing models for a given dataset. However, in real-world applications, data is often messy, and improving models is not always the best way to enhance performance. One can also improve the dataset itself rather than treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for systematically improving datasets, often yielding significant gains in practical ML applications. While skilled data scientists have long relied on ad hoc trial and error and intuition to improve datasets manually, DCAI treats data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling behavior has become predictable. However, building and using datasets remains a labor-intensive and expensive process, with little infrastructure or few best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
Topics
We welcome a wide array of submissions focused on data-centric AI, spanning theories, algorithms, applications, systems, and tools. Topics include but are not limited to:
- Automated Data Science Methods
  - Data cleaning, denoising, and interpolation
  - Feature selection and generation
  - Data refinement, feature-instance joint selection
  - Data quality improvement, representation learning, reconstruction
  - Outlier detection and removal
- Tools and Methodologies for Expediting Open-Source Dataset Preparation
  - Tools for accelerating the sourcing and preparation of high-quality data
  - Tools for consistent data labeling and data quality improvement
  - Tools for generating high-quality supervised learning training data
  - Tools for dataset control, high-level editing, and searching public resources
  - Tools for dataset feedback incorporation, coverage understanding, and editing
  - Dataset importers and exporters for easy data combination and consumption
  - System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
  - Data selection techniques, semi-supervised learning, few-shot learning
  - Weak supervision methods, transfer learning, self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
  - Datasets for bias evaluation and analysis
  - Algorithms for automated bias elimination and for model training with biased data
Submission Details
Important Dates
- Workshop Papers Submission: Nov. 15, 2023
- Notification of Workshop Papers Acceptance: Nov. 18, 2023
- Camera-ready Deadline and Copyright Form: Nov. 22, 2023
- Workshop Day: Dec. 15, 2023
Agenda
Date: Dec. 15th (Central European Standard Time)

| Time | Title | Format | Presenter/Author |
|------|-------|--------|------------------|
| 8:00-8:05 | Opening Remarks | | |
| 8:05-8:40 | Keynote Presentation: Label-efficient Learning for Time Series | 30 min + 5 min QA session | Min Wu |
| 8:40-9:15 | Keynote Presentation: Unified Neuro-Symbolic Models for Mathematical Understanding and Generation | 30 min + 5 min QA session | Chandan K. Reddy |
| 9:15-9:50 | Keynote Presentation: Addressing Data Quality Issues with Data-Centric AI Approaches | 30 min + 5 min QA session | Jae-Gil Lee |
| 9:50-10:02 | Tensor Space Model-based Textual Data Augmentation for Text Classification | 10 min + 2 min QA session | Minsuk Chang and Han-joon Kim |
| 10:02-10:14 | LLM-TAKE: Theme-Aware Keyword Extraction Using Large Language Models | 10 min + 2 min QA session | Reza Yousefi Maragheh, Chenhao Fang, Charan chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, and Kannan Achan |
| 10:14-10:26 | Mutually Exclusive Learning for Generators with Multi-Label Classifiers | 10 min + 2 min QA session | Digya Acharya, Hera Siddiqui, Eduardo Pasiliao Jr., and Chaity Banerjee |
| 10:26-10:38 | MIAE: A Mobile Application Recommendation Method Based on a NTK Model | 10 min + 2 min QA session | Jiahui Han, Qufei Zhang, Xiaoying Yang, and Jinyi Wang |
| 10:38-10:50 | ASI: Accuracy-Stability Index for Evaluating Deep Learning Models | 10 min + 2 min QA session | Wei Dai and Daniel Berleant |
| 10:50-11:02 | Combining Block Bootstrap with Exponential Smoothing for Reinforcing Non-Emergency Urban Services Prediction | 10 min + 2 min QA session | Kshira Sagar Sahoo, Shivam Krishana, and Monowar Bhuyan |
| 11:02-11:14 | Novel NBA Fantasy League driven by Engineered Team Chemistry and Scaled Position Statistics | 10 min + 2 min QA session | Ganesh Arkanath, Nishad Gupta, Hasan Kurban, Parichit Sharma, Madhavan K R, Elham Khorasani Buxton, and Mehmet M Dalkilic |
| 11:14-11:26 | Enabling Cross-Language Data Integration and Scalable Analytics in Decentralized Finance | 10 min + 2 min QA session | Conor Flynn, Kristin Bennett, John Erickson, Aaron Green, and Oshani Seneviratne |
| 11:26-11:38 | FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | 10 min + 2 min QA session | Ensiye Kiyamousavi, Boris Kraychev, and Ivan Koychev |
| 11:38-11:50 | GPT in Data Science: A Practical Exploration of Model Selection | 10 min + 2 min QA session | Nathalia Nascimento, Cristina Tavares, Paulo Alencar, and Donald Cowan |
| 11:50-12:02 | Tissue-Specific Color Encoding and GAN Synthesis for Enhanced Medical Image Generation | 10 min + 2 min QA session | Yu Shi, Hannah Tang, Jianxin Sun, Xinyan Xie, Huijing Du, Dandan Zheng, Chi Zhang, and Hongfeng Yu |
| 12:02-12:14 | Effect of Varied Datasets on Training of a Segmentation Model Used in Visual Navigation | 10 min + 2 min QA session | Marin Wada, Miho Adachi, and Ryusuke Miyamoto |
| 12:14-12:26 | Does a Dense Point Cloud for Training Data Generation Improve Segmentation Accuracy? | 10 min + 2 min QA session | Marin Wada, Hiroaki Sudo, Miho Adachi, and Ryusuke Miyamoto |
| 12:26-12:30 | Closing Remarks | | |
Organizing Committee
Hui Xiong
The Hong Kong University of Science and Technology (Guangzhou)
Yanjie Fu
Arizona State University
Kunpeng Liu
Portland State University
Chang-Tien Lu
Virginia Tech
Speakers
- Dr. Jae-Gil Lee, Korea Advanced Institute of Science and Technology
- Dr. Min Wu, Institute for Infocomm Research, A*STAR
- Dr. Chandan K. Reddy, Virginia Polytechnic Institute and State University