Home
Machine learning typically focuses on developing models for given datasets, but real-world data is often messy. Improving the dataset itself can be a more effective way to enhance performance than improving the model alone. Data-Centric AI (DCAI) is an emerging field that studies how to improve datasets systematically, which can yield significant gains in ML applications. DCAI treats data improvement as an engineering discipline, shifting the focus from modeling to the underlying data. This workshop aims to build an interdisciplinary DCAI community to tackle data problems such as collection, labeling, preprocessing, quality evaluation, data debt, and governance. Interested parties can help shape the future of AI and ML by submitting papers in response to the call for papers.
Agenda
Date: October 25, 2024
Time | Title | Format | Presenter/Author
14:00-14:05 | Opening Remarks | 5 min | Organizers
14:05-14:45 | Keynote Presentation: Data diversity for understanding safety across modalities | 30 min + 10 min Q&A session | Alicia Parrish
14:45-15:30 | Keynote Presentation: Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting | 30 min + 10 min Q&A session | Chang-Tien Lu
15:30-16:00 | Coffee Break | 30 min | Organizers
16:00-16:40 | Keynote Presentation: Actionable Decision Making with Small Data | 30 min + 10 min Q&A session | Hua Wei
16:40-16:48 | Paper Presentation: AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation | 6 min + 2 min Q&A session | Rui Meng, Ye Liu, Semih Yavuz, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, and Yingbo Zhou
16:48-16:56 | Paper Presentation: Transfer Learning for E-commerce Query Product Type Prediction | 6 min + 2 min Q&A session | Anna Tigunova, Ghadir Eraisha, and Thomas Ricatte
16:56-17:04 | Paper Presentation: Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models | 6 min + 2 min Q&A session | Bryan Zhang, Taichi Nakatani, and Stephan Walter
17:04-17:12 | Paper Presentation: Towards Scalability and Extensibility of Query Reformulation Modeling in E-commerce Search | 6 min + 2 min Q&A session | Ziqi Zhang, Yupin Huang, Quan Deng, Jinghui Xiao, Vivek Mittal, and Jingyuan Deng
17:12-17:20 | Paper Presentation: Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval | 6 min + 2 min Q&A session | Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, and William M. Campbell
17:20-17:28 | Paper Presentation: Assessing Data Copyright in Large Language Model via Partial Information Probing | 6 min + 2 min Q&A session | Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang
17:28-17:30 | Closing Remarks | 2 min | Organizers
Topics
We welcome a wide array of submissions focused on data-centric AI, encompassing topics such as theories, algorithms, applications, systems, and tools. These topics include but are not limited to:
- Automated Data Science Methods
- Data cleaning, denoising, and interpolation
- Feature selection and generation
- Data refinement, feature-instance joint selection
- Data quality improvement, representation learning, reconstruction
- Outlier detection and removal
- Tools and Methodologies for Expediting Open-source Dataset Preparation
- Time acceleration tools for sourcing and preparing high-quality data
- Tools for consistent data labeling, data quality improvement
- Tools for generating high-quality supervised learning training data
- Tools for dataset control, high-level editing, searching public resources
- Tools for dataset feedback incorporation, coverage understanding, editing
- Dataset importers and exporters for easy data combination and consumption
- System architectures and interfaces for dataset tool composition
- Algorithms for Handling Limited Labeled Data and Label Efficiency
- Data selection techniques, semi-supervised learning, few-shot learning
- Weak supervision methods, transfer learning, self-supervised learning approaches
- Algorithms for Dealing with Biased, Shifted, Drifted, and Out of Distribution Data
- Datasets for bias evaluation and analysis
- Algorithms for automated bias elimination, model training with biased data
Submission Details
Important Dates (Anywhere on Earth)
- Workshop Papers Submission: July 29, 2024
- Notification of Workshop Papers Acceptance: August 30, 2024
- Camera-ready Deadline and Copyright Form: September 15, 2024
- Workshop Day: October 25, 2024
Organizing Committee
Steering Co-Chairs
Hui Xiong
The Hong Kong University of Science and Technology (Guangzhou)
Vipin Kumar
University of Minnesota
Organizing Committee
Yanjie Fu
Arizona State University
Kunpeng Liu
Portland State University
Pengyang Wang
University of Macau
Pengfei Wang
Chinese Academy of Sciences
Dongjie Wang
University of Kansas
Meng Xiao
Chinese Academy of Sciences
Yanyong Huang
Southwestern University of Finance and Economics
Wei Fan
University of Oxford
Ziyue Qiao
Great Bay University
Zhengzhang Chen
NEC Laboratories America
Publicity Co-Chairs
Pengyang Wang
University of Macau
Dongjie Wang
University of Kansas
Web Co-Chairs
Nanxu Gong
Arizona State University
Wangyang Ying
Arizona State University
Accepted Papers
- Rui Meng, Ye Liu, Semih Yavuz, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, and Yingbo Zhou, "AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation"
- Anna Tigunova, Ghadir Eraisha, and Thomas Ricatte, "Transfer Learning for E-commerce Query Product Type Prediction"
- Bryan Zhang, Taichi Nakatani, and Stephan Walter, "Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models"
- Ziqi Zhang, Yupin Huang, Quan Deng, Jinghui Xiao, Vivek Mittal, and Jingyuan Deng, "Towards Scalability and Extensibility of Query Reformulation Modeling in E-commerce Search"
- Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, and William M. Campbell, "Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval"
- Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang, "Assessing Data Copyright in Large Language Model via Partial Information Probing"
Keynote Presentations
Data diversity for understanding safety across modalities
Presenter: Alicia Parrish
Bio: Alicia Parrish is a research scientist working on data-centric responsible AI at Google DeepMind in NYC. Her work focuses on better understanding how to ensure data diversity and high data quality through the entire data pipeline: how the data is collected, who and where it is collected from, and how it is aggregated or sampled. By looking across the entire pipeline, her work traces the impact of early data choices all the way through to the kinds of conclusions we're able to make about model behavior. Alicia holds a PhD in linguistics from New York University.
Abstract: Though datasets to test the safety of large generative models often rely on binary labels, human perspectives are much more nuanced. How people interpret safety is deeply influenced by their socio-cultural backgrounds and lived experiences. As we aim to evaluate systems that will be deployed in real-world settings, it is critical to understand how the diverse perspectives of humans involved in creating the model evaluations can be leveraged. In this talk, I provide an overview of two recent datasets that enable the study of human diversity in relation to safety evaluation: (i) the DICES dataset, which captures diverse demographic perspectives on safety ratings for text-to-text models, highlighting the impact of individual backgrounds on safety assessments, and (ii) the Adversarial Nibbler dataset, which engages red teamers from distinct geographic regions to gather adversarial prompts for text-to-image models, revealing the importance of geographic diversity in identifying novel potential harms. I show that embracing the diversity of those involved in creating and rating data enables us to uncover nuanced perspectives that are relevant to ensuring the safety of generative models.
Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting
Presenter: Chang-Tien Lu
Bio: Chang-Tien Lu is a Professor in the Department of Computer Science, Curriculum Lead at the Innovation Campus, and Associate Director of the Sanghani Center for AI and Data Analytics at Virginia Tech. He received his Ph.D. from the University of Minnesota, Twin Cities, in 2001. Dr. Lu currently serves as an Associate Editor for ACM Transactions on Spatial Algorithms and Systems, Data & Knowledge Engineering, IEEE Transactions on Big Data, and GeoInformatica. He has held prominent roles in organizing major conferences, including serving as General Co-Chair of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems in 2009, 2020, and 2021, the International Symposium on Spatial and Temporal Databases in 2017, and the IEEE International Conference on Big Data in 2024. He also served as Secretary (2008–2011) and Vice Chair (2011–2014) of ACM SIGSPATIAL. Dr. Lu's research spans spatial informatics, urban computing, artificial intelligence, and intelligent transportation systems. He has authored over 250 publications in top-tier journals and conferences, with funding from the NSF, NIH, DoD, and DoE. He is recognized as an ACM Distinguished Scientist and an IEEE Fellow.
Abstract: In an era where information flows rapidly through digital platforms, social media data offers a powerful tool for predicting and responding to global challenges. This talk presents advanced techniques for forecasting societal events and epidemics by leveraging social media data. I will introduce an automated system that analyzes open-source data from platforms such as Twitter, blogs, and news articles to predict events like natural disasters and public demonstrations, addressing challenges like unstructured text and dynamic relationships. Additionally, I will discuss SimNest, a deep learning framework that integrates computational epidemiology with social media data to enhance real-time disease outbreak monitoring and improve epidemic response strategies. Finally, I will explore a multi-task learning framework that improves event forecasting across regions by addressing issues such as data imbalances and geographic differences, with notable success demonstrated using Twitter data from Latin America. These approaches highlight the power of combining social media and machine learning to improve prediction accuracy and enable timely interventions for both societal events and health crises.
Actionable Decision Making with Small Data
Presenter: Hua Wei
Bio: Hua Wei is an assistant professor at the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). He is also affiliated with the Lawrence Berkeley National Laboratory. Before joining ASU, he worked as an Assistant Professor at New Jersey Institute of Technology and a Staff Researcher at Tencent AI Lab. He received his PhD from Pennsylvania State University in 2020 under the supervision of Dr. Zhenhui (Jessie) Li. Before that, he received his master's and bachelor's degrees in Computer Science from Beihang University (BUAA), working with Prof. Jinpeng Huai and Dr. Tianyu Wo.
Abstract: Decision making in the real world is challenging because of small data: real-world data can be sparse and hard to obtain. This talk explores strategies for data generation in real-world small-data settings to support decision making. The objective is to improve the accuracy and efficiency of data generation by training machine learning models to learn from real-world data. Specifically, this talk will introduce some of our latest work on spatio-temporal data, along with the follow-up generative and control models designed for small data.
Program Committee
- Dr. Yong Ge, University of Arizona
- Dr. Hao Liu, The Hong Kong University of Science and Technology (Guangzhou)
- Dr. Kunpeng Liu, Portland State University
- Dr. Qi Liu, University of Science and Technology of China
- Dr. Yanchi Liu, NEC Labs America
- Dr. Leilei Sun, Beihang University
- Dr. Pengfei Wang, Chinese Academy of Sciences
- Dr. Pengyang Wang, University of Macau
- Dr. Senzhang Wang, Central South University
- Dr. Keli Xiao, Stony Brook University
- Dr. Yang Yang, Nanjing University of Science and Technology
- Dr. Zijun Yao, University of Kansas
- Dr. Denghui Zhang, Stevens Institute of Technology
- Dr. Wei Zhang, University of Central Florida
- Dr. Xi Zhang, Chinese Academy of Sciences
- Dr. Dongjie Wang, University of Kansas
- Dr. Muhammad Zunnurain Hussain, Bahria University Lahore Campus
- Dr. Venkata Nedunoori, Dentsu International
Volunteers
- Mr. Haihua Xu, University of Macau
- Ms. Qi Hao, University of Macau