Machine learning typically focuses on techniques for developing models for a given dataset. However, in real-world applications, data is often messy, and improving models is not always the best way to enhance performance. One can also improve the dataset itself rather than treating it as a fixed input. Data-Centric AI (DCAI) is an emerging field that studies techniques for systematically improving datasets, often yielding significant gains in practical ML applications. While skilled data scientists have long relied on ad hoc trial and error and intuition to improve datasets manually, DCAI treats data improvement as a systematic engineering discipline.

DCAI marks a recent shift in focus from modeling to the underlying data used to train and evaluate models. Common model architectures now dominate many tasks, and scaling behavior has become predictable. However, building and using datasets remains a labor-intensive and expensive process, with little infrastructure or few best practices to make it easier, cheaper, and more repeatable. The DCAI movement aims to address these issues by developing high-productivity, efficient open data engineering tools for managing data in modern ML systems.

This workshop aims to foster a vibrant interdisciplinary DCAI community that can tackle practical data problems, including data collection and generation, data labeling, data preprocessing and augmentation, data quality evaluation, data debt, and data governance. Many of these areas are still in their early stages, and the workshop seeks to bring them together to define and shape the DCAI movement that will influence the future of AI and ML. Interested parties can take an active role in shaping this future by submitting papers in response to the call for papers below.
Topics
We welcome a wide array of submissions focused on data-centric AI, spanning theories, algorithms, applications, systems, and tools. Topics include but are not limited to:
- Automated Data Science Methods
  - Data cleaning, denoising, and interpolation
  - Feature selection and generation
  - Data refinement, feature-instance joint selection
  - Data quality improvement, representation learning, reconstruction
  - Outlier detection and removal
- Tools and Methodologies for Expediting Open-Source Dataset Preparation
  - Tools for accelerating the sourcing and preparation of high-quality data
  - Tools for consistent data labeling and data quality improvement
  - Tools for generating high-quality supervised learning training data
  - Tools for dataset control, high-level editing, and searching public resources
  - Tools for dataset feedback incorporation, coverage understanding, and editing
  - Dataset importers and exporters for easy data combination and consumption
  - System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
  - Data selection techniques, semi-supervised learning, few-shot learning
  - Weak supervision methods, transfer learning, self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
  - Datasets for bias evaluation and analysis
  - Algorithms for automated bias elimination and for model training with biased data
Submission Details
Important Dates
- Workshop Papers Submission: Nov. 15, 2023
- Notification of Workshop Papers Acceptance: Nov. 18, 2023
- Camera-ready Deadline and Copyright Form: Nov. 22, 2023
- Workshop Day: Dec. 15, 2023
Agenda
Date: Dec. 15th (Central European Standard Time)

| Time | Title | Format | Presenter/Author |
|------|-------|--------|------------------|
| 8:00-8:05 | Opening Remarks | | |
| 8:05-8:40 | Keynote Presentation: Label-efficient Learning for Time Series | 30 min + 5 min QA session | Min Wu |
| 8:40-9:15 | Keynote Presentation: Unified Neuro-Symbolic Models for Mathematical Understanding and Generation | 30 min + 5 min QA session | Chandan K. Reddy |
| 9:15-9:50 | Keynote Presentation: Addressing Data Quality Issues with Data-Centric AI Approaches | 30 min + 5 min QA session | Jae-Gil Lee |
| 9:50-10:02 | Tensor Space Model-based Textual Data Augmentation for Text Classification | 10 min + 2 min QA session | Minsuk Chang and Han-joon Kim |
| 10:02-10:14 | LLM-TAKE: Theme-Aware Keyword Extraction Using Large Language Models | 10 min + 2 min QA session | Reza Yousefi Maragheh, Chenhao Fang, Charan chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, and Kannan Achan |
| 10:14-10:26 | Mutually Exclusive Learning for Generators with Multi-Label Classifiers | 10 min + 2 min QA session | Digya Acharya, Hera Siddiqui, Eduardo Pasiliao Jr., and Chaity Banerjee |
| 10:26-10:38 | MIAE: A Mobile Application Recommendation Method Based on a NTK Model | 10 min + 2 min QA session | Jiahui Han, Qufei Zhang, Xiaoying Yang, and Jinyi Wang |
| 10:38-10:50 | ASI: Accuracy-Stability Index for Evaluating Deep Learning Models | 10 min + 2 min QA session | Wei Dai and Daniel Berleant |
| 10:50-11:02 | Combining Block Bootstrap with Exponential Smoothing for Reinforcing Non-Emergency Urban Services Prediction | 10 min + 2 min QA session | Kshira Sagar Sahoo, Shivam Krishana, and Monowar Bhuyan |
| 11:02-11:14 | Novel NBA Fantasy League driven by Engineered Team Chemistry and Scaled Position Statistics | 10 min + 2 min QA session | Ganesh Arkanath, Nishad Gupta, Hasan Kurban, Parichit Sharma, Madhavan K R, Elham Khorasani Buxton, and Mehmet M Dalkilic |
| 11:14-11:26 | Enabling Cross-Language Data Integration and Scalable Analytics in Decentralized Finance | 10 min + 2 min QA session | Conor Flynn, Kristin Bennett, John Erickson, Aaron Green, and Oshani Seneviratne |
| 11:26-11:38 | FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | 10 min + 2 min QA session | Ensiye Kiyamousavi, Boris Kraychev, and Ivan Koychev |
| 11:38-11:50 | GPT in Data Science: A Practical Exploration of Model Selection | 10 min + 2 min QA session | Nathalia Nascimento, Cristina Tavares, Paulo Alencar, and Donald Cowan |
| 11:50-12:02 | Tissue-Specific Color Encoding and GAN Synthesis for Enhanced Medical Image Generation | 10 min + 2 min QA session | Yu Shi, Hannah Tang, Jianxin Sun, Xinyan Xie, Huijing Du, Dandan Zheng, Chi Zhang, and Hongfeng Yu |
| 12:02-12:14 | Effect of Varied Datasets on Training of a Segmentation Model Used in Visual Navigation | 10 min + 2 min QA session | Marin Wada, Miho Adachi, and Ryusuke Miyamoto |
| 12:14-12:26 | Does a Dense Point Cloud for Training Data Generation Improve Segmentation Accuracy? | 10 min + 2 min QA session | Marin Wada, Hiroaki Sudo, Miho Adachi, and Ryusuke Miyamoto |
| 12:26-12:30 | Closing Remarks | | |
Organizing Committee
Hui Xiong
The Hong Kong University of Science and Technology (Guangzhou)
Yanjie Fu
Arizona State University
Kunpeng Liu
Portland State University
Chang-Tien Lu
Virginia Tech
Speakers
- Dr. Jae-Gil Lee, Korea Advanced Institute of Science and Technology
- Dr. Min Wu, Institute for Infocomm Research, A*STAR
- Dr. Chandan K. Reddy, Virginia Polytechnic Institute and State University