

Machine learning typically focuses on techniques for developing models for a given dataset. However, in real-world applications, data is often messy, and improving models is not always the best way to enhance performance. One can also improve the dataset itself rather than treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for improving datasets systematically, often yielding significant improvements in practical ML applications. While good data scientists have traditionally improved datasets manually through intuition and ad hoc trial and error, DCAI approaches data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling rules have become predictable. However, building and using datasets remains a labor-intensive and expensive process, with little infrastructure or best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
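As a toy illustration of the data-centric idea (improving the dataset while holding the model fixed), the sketch below applies a simple z-score filter to remove a corrupted label before fitting a trivial "model" (the sample mean). The function name, data, and threshold are purely illustrative, not part of any workshop artifact.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Drop points whose z-score exceeds the threshold (a simple data-cleaning step)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / stdev <= threshold]

# A small sample with one corrupted measurement (95.0).
raw = [10.1, 9.8, 10.3, 9.9, 10.0, 95.0]
clean = zscore_filter(raw, threshold=2.0)

# The same "model" (the sample mean) fit to raw vs. cleaned data:
# the raw estimate is badly skewed, the cleaned one is near the true value.
raw_estimate = statistics.mean(raw)
clean_estimate = statistics.mean(clean)
```

The model never changes here; only the dataset does, yet the estimate improves dramatically, which is the data-centric perspective in miniature.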


Topics


We welcome a wide array of submissions focused on data-centric AI, encompassing topics such as theories, algorithms, applications, systems, and tools. These topics include but are not limited to:

  • Automated Data Science Methods
    • Data cleaning, denoising, and interpolation
    • Feature selection and generation
    • Data refinement, feature-instance joint selection
    • Data quality improvement, representation learning, reconstruction
    • Outlier detection and removal
  • Tools and Methodologies for Expediting Open-source Dataset Preparation
    • Tools for accelerating the sourcing and preparation of high-quality data
    • Tools for consistent data labeling, data quality improvement
    • Tools for generating high-quality supervised learning training data
    • Tools for dataset control, high-level editing, searching public resources
    • Tools for dataset feedback incorporation, coverage understanding, editing
    • Dataset importers and exporters for easy data combination and consumption
    • System architectures and interfaces for dataset tool composition
  • Algorithms for Handling Limited Labeled Data and Label Efficiency
    • Data selection techniques, semi-supervised learning, few-shot learning
    • Weak supervision methods, transfer learning, self-supervised learning approaches
  • Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
    • Datasets for bias evaluation and analysis
    • Algorithms for automated bias elimination, model training with biased data
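To make the weak-supervision topic above concrete, the sketch below aggregates a few heuristic labeling functions by majority vote to produce training labels without manual annotation, a simplified version of the programmatic-labeling idea. All function names, rules, and example texts are invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Illustrative labeling functions for tagging short reviews: positive (1) or negative (0).
def lf_contains_great(text):
    return 1 if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text):
    return 0 if "terrible" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return 1 if text.endswith("!") else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_great, lf_contains_terrible, lf_exclamation]

def weak_label(text):
    """Aggregate labeling-function votes by majority; abstain if no function fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

labels = [weak_label(t) for t in ["Great movie!", "Terrible plot", "Just okay"]]
```

Real weak-supervision systems additionally model the accuracy and correlations of the labeling functions rather than using a plain majority vote, but the dataset-building workflow is the same.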

Submission Details


We invite the submission of short papers (up to 6 pages) and full papers (up to 10 pages), including all content and references. Submissions must be in PDF format and formatted according to the standard IEEE Conference Proceedings template. Submitted papers will be assessed based on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via the online submission system. For questions about the workshop or submissions, please email yanjie.fu@asu.edu.

Important Dates


  • Workshop Papers Submission: Nov. 15, 2023
  • Notification of Workshop Papers Acceptance: Nov. 18, 2023
  • Camera-ready Deadline and Copyright Form: Nov. 22, 2023
  • Workshop Day: Dec. 15, 2023

Agenda


Date: Dec. 15th (Central European Standard Time)
Time | Title | Format | Presenter/Author
8:00-8:05 | Opening Remarks
8:05-8:40 | Keynote Presentation: Label-efficient Learning for Time Series | 30 min + 5 min Q&A | Min Wu
8:40-9:15 | Keynote Presentation: Unified Neuro-Symbolic Models for Mathematical Understanding and Generation | 30 min + 5 min Q&A | Chandan K. Reddy
9:15-9:50 | Keynote Presentation: Addressing Data Quality Issues with Data-Centric AI Approaches | 30 min + 5 min Q&A | Jae-Gil Lee
9:50-10:02 | Tensor Space Model-based Textual Data Augmentation for Text Classification | 10 min + 2 min Q&A | Minsuk Chang and Han-joon Kim
10:02-10:14 | LLM-TAKE: Theme-Aware Keyword Extraction Using Large Language Models | 10 min + 2 min Q&A | Reza Yousefi Maragheh, Chenhao Fang, Charan chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, and Kannan Achan
10:14-10:26 | Mutually Exclusive Learning for Generators with Multi-Label Classifiers | 10 min + 2 min Q&A | Digya Acharya, Hera Siddiqui, Eduardo Pasiliao Jr., and Chaity Banerjee
10:26-10:38 | MIAE: A Mobile Application Recommendation Method Based on an NTK Model | 10 min + 2 min Q&A | Jiahui Han, Qufei Zhang, Xiaoying Yang, and Jinyi Wang
10:38-10:50 | ASI: Accuracy-Stability Index for Evaluating Deep Learning Models | 10 min + 2 min Q&A | Wei Dai and Daniel Berleant
10:50-11:02 | Combining Block Bootstrap with Exponential Smoothing for Reinforcing Non-Emergency Urban Services Prediction | 10 min + 2 min Q&A | Kshira Sagar Sahoo, Shivam Krishana, and Monowar Bhuyan
11:02-11:14 | Novel NBA Fantasy League driven by Engineered Team Chemistry and Scaled Position Statistics | 10 min + 2 min Q&A | Ganesh Arkanath, Nishad Gupta, Hasan Kurban, Parichit Sharma, Madhavan K R, Elham Khorasani Buxton, and Mehmet M Dalkilic
11:14-11:26 | Enabling Cross-Language Data Integration and Scalable Analytics in Decentralized Finance | 10 min + 2 min Q&A | Conor Flynn, Kristin Bennett, John Erickson, Aaron Green, and Oshani Seneviratne
11:26-11:38 | FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | 10 min + 2 min Q&A | Ensiye Kiyamousavi, Boris Kraychev, and Ivan Koychev
11:38-11:50 | GPT in Data Science: A Practical Exploration of Model Selection | 10 min + 2 min Q&A | Nathalia Nascimento, Cristina Tavares, Paulo Alencar, and Donald Cowan
11:50-12:02 | Tissue-Specific Color Encoding and GAN Synthesis for Enhanced Medical Image Generation | 10 min + 2 min Q&A | Yu Shi, Hannah Tang, Jianxin Sun, Xinyan Xie, Huijing Du, Dandan Zheng, Chi Zhang, and Hongfeng Yu
12:02-12:14 | Effect of Varied Datasets on Training of a Segmentation Model Used in Visual Navigation | 10 min + 2 min Q&A | Marin Wada, Miho Adachi, and Ryusuke Miyamoto
12:14-12:26 | Does a Dense Point Cloud for Training Data Generation Improve Segmentation Accuracy? | 10 min + 2 min Q&A | Marin Wada, Hiroaki Sudo, Miho Adachi, and Ryusuke Miyamoto
12:26-12:30 | Closing Remarks

Organizing Committee


  • Hui Xiong, The Hong Kong University of Science and Technology (Guangzhou)
  • Yanjie Fu, Arizona State University
  • Kunpeng Liu, Portland State University
  • Chang-Tien Lu, Virginia Tech


Speakers


  • Dr. Jae-Gil Lee, Korea Advanced Institute of Science and Technology
  • Dr. Min Wu, Institute for Infocomm Research, A*STAR
  • Dr. Chandan K. Reddy, Virginia Polytechnic Institute and State University