Top Conference Sharing Forum

CVGIP’24 Special Section – Recent Advances in Computer Vision, Machine Learning, and Their Applications

Top Conference Sharing Forum: CVPR 2024 Highlights

The SS04 special session, "Recent Advances in Computer Vision, Machine Learning, and Their Applications," invites authors of papers published at CVPR 2024, the premier international computer vision conference, to share their latest findings in computer vision, machine learning, and related applications. The topics span a diverse range of applications, including 3D scene synthesis, indoor room layout estimation, virtual try-on, robotic grasping, and activity recognition in interactive traffic scenes. The speakers come from Taiwan's top universities (National Taiwan University, National Tsing Hua University, and National Yang Ming Chiao Tung University) as well as the leading industry startup XYZ Robotics. We cordially invite attendees from Taiwan and abroad to join us.

TOPICS
  1. GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
  2. No More Ambiguity in 360° Room Layout via Bi-Layout Estimation
  3. Artifact Does Matter! Low-artifact High-resolution Virtual Try-On via Diffusion-based Warp-and-Fuse Consistent Texture
  4. MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images
  5. Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes

❶ GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

Abstract:
Accurate 3D scene understanding is essential for various applications such as autonomous driving, robotics, and augmented reality. In this talk, I will introduce our work Generalizable Semantic Neural Radiance Fields (GSNeRF), published at CVPR 2024, a novel approach that integrates neural radiance fields with semantic understanding to improve 3D scene synthesis. GSNeRF features two key stages: Semantic Geo-Reasoning, which combines semantic and geometric features from multi-view inputs, and Depth-Guided Visual Rendering, which uses depth information to generate high-quality novel-view images and semantic maps. Our experiments demonstrate that GSNeRF produces both high-quality novel-view images and accurate semantic segmentation for unseen scenes, addressing the challenge of integrating semantics with novel-view synthesis and contributing to advances in 3D vision.
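
For readers curious how such a two-stage pipeline can be organized, the short PyTorch sketch below separates a semantic geo-reasoning step (fusing multi-view features into density and semantic logits) from a depth-guided rendering step (volume rendering that composites color, semantics, and depth with the same weights). The module names, tensor shapes, and the simple average-pool fusion are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a two-stage semantic NeRF pipeline (illustrative assumptions only).
import torch
import torch.nn as nn

class SemanticGeoReasoner(nn.Module):
    """Fuses per-view features into per-sample density and semantic logits (assumed design)."""
    def __init__(self, feat_dim=32, num_classes=10):
        super().__init__()
        self.geo_head = nn.Linear(feat_dim, 1)            # density cue per sample
        self.sem_head = nn.Linear(feat_dim, num_classes)  # semantic logits per sample

    def forward(self, multiview_feats):
        # multiview_feats: (views, rays, samples, feat_dim); average-pool across views
        fused = multiview_feats.mean(dim=0)
        sigma = torch.relu(self.geo_head(fused)).squeeze(-1)   # (rays, samples)
        sem_logits = self.sem_head(fused)                      # (rays, samples, classes)
        return sigma, sem_logits

def depth_guided_render(sigma, sem_logits, rgb, z_vals):
    """Standard volume rendering: the same weights composite color, semantics, and depth."""
    delta = z_vals[..., 1:] - z_vals[..., :-1]
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans                                   # (rays, samples)
    rgb_map = (weights[..., None] * rgb).sum(dim=-2)          # (rays, 3)
    sem_map = (weights[..., None] * sem_logits).sum(dim=-2)   # (rays, classes)
    depth_map = (weights * z_vals).sum(dim=-1)                # (rays,)
    return rgb_map, sem_map, depth_map

if __name__ == "__main__":
    views, rays, samples, feat_dim = 3, 1024, 64, 32
    reasoner = SemanticGeoReasoner(feat_dim)
    feats = torch.randn(views, rays, samples, feat_dim)
    rgb = torch.rand(rays, samples, 3)
    z_vals = torch.linspace(2.0, 6.0, samples).expand(rays, samples)
    sigma, sem = reasoner(feats)
    rgb_map, sem_map, depth_map = depth_guided_render(sigma, sem, rgb, z_vals)
    print(rgb_map.shape, sem_map.shape, depth_map.shape)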

Speaker Bio:
Sheng-Yu Huang (黃聖喻) received his B.S. degree in Electrical Engineering from National Taiwan University in 2019. In the same year, he joined the Vision and Learning Lab as a Ph.D. student, advised by Prof. Yu-Chiang Frank Wang. Sheng-Yu's research focuses mainly on 3D vision, including point cloud perception, point cloud completion, multi-modal 3D generation, 3D scene understanding, and 3D scene reconstruction. Several of his papers have been published at international conferences and in journals such as CVPR, NeurIPS, ECCV, ICIP, and TPAMI. He has also served as a reviewer for multiple international conferences and journals, including CVPR, ECCV, ICLR, ICML, NeurIPS, and IJCV.
Advisor Bio:
Yu-Chiang Frank Wang (王鈺強) received his B.S. degree in Electrical Engineering from National Taiwan University in 2001, and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from Carnegie Mellon University in 2004 and 2009, respectively. In 2009, Dr. Wang joined the Research Center for Information Technology Innovation (CITI) of Academia Sinica, leading the Multimedia and Machine Learning Lab. He joined the Department of Electrical Engineering at National Taiwan University as an Associate Professor in 2017 and was promoted to Professor in 2019. Since 2022, Dr. Wang has been with NVIDIA, where he serves as Research Director of Deep Learning & Computer Vision and leads NVIDIA Research Taiwan. With continuing research focuses on computer vision and machine learning, his recent research topics include deep learning for vision & language, transfer learning, and 3D vision. Dr. Wang serves as an organizing committee member and area chair of multiple international conferences, including CVPR, ICCV, ECCV, and ACCV. Several of his papers have been nominated for best paper awards at IEEE ICIP, ICME, AVSS, and MVA. He was twice selected as an Outstanding Young Researcher by the Ministry of Science and Technology of Taiwan (2013-2015 and 2017-2019) and received the Technological Research Innovation Award from the College of EECS at NTU. In 2022, he received the Y. Z. Hsu Scientific Paper Award in Artificial Intelligence from the Far Eastern Y. Z. Hsu Science & Technology Memorial Foundation, and in 2023 he was recognized as an Outstanding Young Scholar by the Foundation for the Advancement of Outstanding Scholarship.

❷ No More Ambiguity in 360° Room Layout via Bi-Layout Estimation

Abstract:
Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360° room layout estimation models. To address this issue, we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions, while the other extends to encompass all visible areas. Our model employs two global context embeddings, where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module, the image feature retrieves relevant context from these embeddings, generating layout-aware features for precise bi-layout predictions. A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions. To circumvent the need for manual correction of ambiguous annotations during testing, we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets, notably outperforming leading approaches. Specifically, on the MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the full test set and from 54.80% to 59.97% in subsets with significant ambiguity.
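
As a rough illustration of the bi-layout idea, the PyTorch sketch below lets one shared panoramic feature query two learned global context embeddings, producing two per-column layout predictions whose disagreement flags ambiguous regions. The shapes, the cross-attention guidance, and the disagreement threshold are assumptions for illustration only, not the paper's architecture.

# Minimal sketch of a bi-layout head with two context embeddings (illustrative assumptions only).
import torch
import torch.nn as nn

class BiLayoutHead(nn.Module):
    def __init__(self, dim=64, ctx_tokens=8):
        super().__init__()
        # Two global context embeddings, one per layout type ("stop" vs. "extend").
        self.ctx = nn.Parameter(torch.randn(2, ctx_tokens, dim))
        self.guide = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_boundary = nn.Linear(dim, 1)   # per-column boundary height

    def forward(self, img_feat):
        # img_feat: (batch, width, dim) column-wise feature of the 360° panorama
        boundaries = []
        for k in range(2):
            ctx = self.ctx[k].expand(img_feat.size(0), -1, -1)
            guided, _ = self.guide(query=img_feat, key=ctx, value=ctx)
            boundaries.append(self.to_boundary(guided).squeeze(-1))  # (batch, width)
        return boundaries  # [stop_layout, extend_layout]

if __name__ == "__main__":
    head = BiLayoutHead()
    feat = torch.randn(2, 256, 64)
    stop_layout, extend_layout = head(feat)
    # Columns where the two predictions disagree are treated as ambiguous regions.
    ambiguous = (stop_layout - extend_layout).abs() > 0.1
    print(stop_layout.shape, ambiguous.float().mean().item())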

Speaker Bio:
Jin-Cheng Jhang (張晉承) is currently a member of the Vision Science Lab (VSLab), advised by Prof. Min Sun. He is pursuing his master’s degree in the Department of Electrical Engineering at National Tsing Hua University (NTHU). His research primarily focuses on 3D Indoor Scene Understanding, encompassing Room Layout Estimation and 3D Object Detection. Before joining VSLab, Jin-Cheng received his B.S. degree in Power Mechanical Engineering from NTHU. During his undergraduate years, he was actively involved with the DIT Robotics team and served as the 9th team leader. Notably, he graduated with the prestigious Dr. Mei Yi-Chi Memorial Prize, the highest honor for newly graduated students at NTHU.
Advisor Bio:
Min Sun (孫民) received his graduate degree from the University of Michigan, Ann Arbor, and his M.Sc. and Ph.D. degrees from Stanford University. He is currently an associate professor in the EE Department at National Tsing Hua University (NTHU). Before joining NTHU, he was a postdoc in CSE at UW. His research interests include 2D+3D object recognition, 2D+3D scene understanding, and human pose estimation. More recently, he has focused on developing self-evolved neural network architectures for visual perception and designing specialized perception methods for 360-degree images/videos. He won the Best Paper Award at 3DRR 2007, and was a recipient of the W. Michael Blumenthal Family Fund Fellowship in 2007, the Outstanding Research Award from MOST Taiwan in 2018, and the Ta-You Wu Memorial Award from MOST Taiwan in 2018.

❸ Artifact Does Matter! Low-artifact High-resolution Virtual Try-On via Diffusion-based Warp-and-Fuse Consistent Texture

Abstract:
In virtual try-on technology, achieving realistic fitting of clothing on human subjects without sacrificing detail is a significant challenge. Traditional approaches, especially those using Generative Adversarial Networks (GANs), often produce noticeable artifacts, while diffusion-based methods struggle with maintaining consistent texture and suffer from high computational demands. To overcome these limitations, we propose the Low-artifact High-resolution Virtual Try-on via Diffusion-based Warp-and-Fuse Consistent Texture (LA-VTON). This novel framework introduces Conditional Texture Warping (CTW) and Conditional Texture Fusing (CTF) modules. CTW improves warping stability through simplified denoising steps, and CTF ensures texture consistency and enhances computational efficiency, achieving inference times 17X faster than existing diffusion-based methods. Experiments show that LA-VTON surpasses current SOTA high-resolution virtual try-on methods in both visual quality and efficiency, marking a significant advancement in high-resolution virtual try-on and setting a new standard in digital fashion realism.
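
To make the warp-and-fuse structure concrete, the sketch below shows a CTW-like module that predicts a dense offset used to warp the garment onto the person, followed by a CTF-like module that blends the warped garment into the person image with a predicted mask. The tiny convolutional networks stand in for the diffusion-based modules, and every name, shape, and parameter here is an illustrative assumption rather than the LA-VTON implementation.

# Minimal sketch of a warp-then-fuse virtual try-on pipeline (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalTextureWarp(nn.Module):
    """Predicts a 2-channel offset, conditioned on person + garment, to warp the garment."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 2, 3, padding=1))

    def forward(self, person, garment):
        b, _, h, w = person.shape
        offset = self.net(torch.cat([person, garment], dim=1)).permute(0, 2, 3, 1)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                                indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)
        return F.grid_sample(garment, grid + 0.1 * offset, align_corners=True)

class ConditionalTextureFuse(nn.Module):
    """Blends the warped garment into the person image with a predicted soft mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, person, warped_garment):
        mask = self.net(torch.cat([person, warped_garment], dim=1))
        return mask * warped_garment + (1 - mask) * person

if __name__ == "__main__":
    person, garment = torch.rand(1, 3, 128, 96), torch.rand(1, 3, 128, 96)
    warped = ConditionalTextureWarp()(person, garment)
    result = ConditionalTextureFuse()(person, warped)
    print(result.shape)  # (1, 3, 128, 96)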

Speaker Bio:
Chiang Tseng (曾薔) is a Ph.D. student in the Institute of Electrical and Computer Engineering at National Yang Ming Chiao Tung University. Tseng's research focuses on generative AI and virtual try-on technologies, with a passion for exploring innovative solutions and contributing to advances in these fields.
Advisor Bio:
Hong-Han Shuai received the B.S. degree from the Department of Electrical Engineering, National Taiwan University (NTU), Taipei, Taiwan, R.O.C., in 2007, the M.S. degree in computer science from NTU in 2009, and the Ph.D. degree from the Graduate Institute of Communication Engineering, NTU, in 2015. He is currently an Associate Professor at National Yang Ming Chiao Tung University (NYCU). His research interests are in the areas of multimedia processing, machine learning, social network analysis, and data mining. His works have appeared in top-tier conferences such as MM, CVPR, AAAI, KDD, WWW, ICDM, CIKM, and VLDB, and top-tier journals such as TKDE, TMM, TNNLS, and JIOT. Moreover, he has served as a PC member for international conferences, including MM, AAAI, IJCAI, and WWW, and as an invited reviewer for journals including TKDE, TMM, JVCI, and JIOT.

❹ MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images

Abstract:
Recent learning methods for object pose estimation require resource-intensive training for each individual object instance or category, hampering their scalability in real applications when confronted with previously unseen objects. In this paper, we propose MatchU, a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images. MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects. We rely on learning geometric 3D descriptors that are rotation-invariant by design. By encoding pose-agnostic geometry, the learned descriptors naturally generalize to unseen objects and capture symmetries. To tackle ambiguous associations using 3D geometry only, we fuse additional RGB information into our descriptor. This is achieved through a novel attention-based mechanism that fuses cross-modal information, together with a matching loss that leverages the latent space learned from RGB data to guide the descriptor learning process. Extensive experiments reveal the generalizability of both the RGB-D fusion strategy as well as the descriptor efficacy. Benefiting from the novel designs, MatchU surpasses all existing methods by a significant margin in terms of both accuracy and speed, even without the requirement of expensive re-training or rendering.
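
The describe-and-match flow can be illustrated with a few lines of PyTorch: per-point descriptors for the object model and the observed scene are matched by mutual nearest neighbours, and a standard Kabsch/SVD fit recovers the rigid 6D pose from the correspondences. The descriptors below are synthetic stand-ins for MatchU's learned fused RGB-D descriptors, and the matching and pose-fitting steps are generic textbook procedures, not the paper's exact method.

# Minimal sketch of descriptor matching followed by rigid pose fitting (illustrative assumptions only).
import math
import torch

def mutual_nearest_matches(desc_a, desc_b):
    """Keep only pairs that are each other's nearest neighbour in descriptor space."""
    sim = desc_a @ desc_b.T
    ab = sim.argmax(dim=1)                 # best b for each a
    ba = sim.argmax(dim=0)                 # best a for each b
    idx_a = torch.arange(desc_a.size(0))
    keep = ba[ab] == idx_a                 # mutual consistency check
    return idx_a[keep], ab[keep]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src points onto dst points."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = torch.linalg.svd(src_c.T @ dst_c)
    D = torch.eye(3)
    D[2, 2] = torch.sign(torch.linalg.det(Vt.T @ U.T))  # avoid reflections
    R = Vt.T @ D @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

if __name__ == "__main__":
    c, s = math.cos(0.7), math.sin(0.7)
    R_gt = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    model_pts = torch.randn(500, 3)
    scene_pts = model_pts @ R_gt.T + torch.tensor([0.2, -0.1, 0.5])
    # Synthetic descriptors, identical per point, so matching is exact by construction.
    desc_model = torch.nn.functional.normalize(torch.randn(500, 32), dim=1)
    desc_scene = desc_model.clone()
    ia, ib = mutual_nearest_matches(desc_model, desc_scene)
    R, t = kabsch(model_pts[ia], scene_pts[ib])
    print(torch.allclose(R, R_gt, atol=1e-4), t)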

Speaker Bio:
Peter Yu (俞冠廷), co-founder and CTO of XYZ Robotics, received his Ph.D. in EECS from MIT in 2018 and started XYZ Robotics afterwards. His research interest was robotic manipulation, specifically state estimation with physical contact to track object state during manipulation. He was a Best Paper Award finalist at IROS 2016. Beyond academic research, he has worked on the front line of robot systems with the belief that they can improve people's lives. In 2013, he participated in the DARPA Robotics Challenge to solve disaster relief tasks. From 2015 to 2017, he served as the system lead in the Amazon Robotics Challenge and won first place in the stowing task in 2017.
Advisor Bio:
Benjamin Busam is a Senior Research Scientist at the Technical University of Munich, coordinating the computer vision activities at the Chair for Computer Aided Medical Procedures. Formerly Head of Research at FRAMOS Imaging Systems, he led the 3D Computer Vision Team at Huawei Research, London from 2018 to 2020. Benjamin studied Mathematics at TUM. In his subsequent postgraduate programme, he continued in Mathematics and Physics at ParisTech, France, and at the University of Melbourne, Australia, before graduating with distinction from TU Munich in 2014. Continuing a mathematical focus on projective geometry and 3D point cloud matching, he now works on 2D/3D computer vision for pose estimation, depth mapping, and mobile AR, as well as multi-modal sensor fusion. For his work on adaptable high-resolution real-time stereo tracking, he received the EMVA Young Professional Award 2015 from the European Machine Vision Association and was named Innovation Pioneer of the Year 2019 by Noah's Ark Laboratory, London. He has received multiple Outstanding Reviewer Awards, at 3DV 2020, 3DV 2021, ECCV 2022, and CVPR 2024.

❺ Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes

Abstract:
In this paper, we study multi-label atomic activity recognition. Despite the notable progress in action recognition, it is still challenging to recognize atomic activities due to a deficiency in a holistic understanding of both multiple road users’ motions and their contextual information. In this paper, we introduce Action-slot, a slot attention-based approach that learns visual action-centric representations, capturing both motion and contextual information. Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur, without the need for explicit perception guidance. To further enhance slot attention, we introduce a background slot that competes with action slots, aiding the training process in avoiding unnecessary focus on background regions devoid of activities. Yet, the imbalanced class distribution in the existing dataset hampers the assessment of rare activities. To address the limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS and features a balanced distribution of atomic activities. To validate the effectiveness of our method, we conduct comprehensive experiments and ablation studies against various action recognition baselines. We also show that the performance of multi-label atomic activity recognition on real-world datasets can be improved by pretraining representations on TACO. We will release our source code and dataset.
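
As a minimal illustration of action slots, the PyTorch sketch below runs one simplified slot-attention step in which several action slots plus one background slot compete (via a softmax over slots) for spatio-temporal video tokens, and each action slot is decoded into multi-label activity scores. The dimensions, the single attention iteration, and the per-slot classifier are assumptions for illustration, not the Action-slot architecture.

# Minimal sketch of slot attention with a background slot for multi-label recognition
# (illustrative assumptions only).
import torch
import torch.nn as nn

class ActionSlotSketch(nn.Module):
    def __init__(self, dim=64, num_action_slots=8, num_classes=64):
        super().__init__()
        # Learnable initial slots; the last one acts as the background slot.
        self.slots = nn.Parameter(torch.randn(num_action_slots + 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):
        # feats: (batch, tokens, dim) flattened spatio-temporal video features
        b = feats.size(0)
        slots = self.slots.unsqueeze(0).expand(b, -1, -1)
        q, k, v = self.to_q(slots), self.to_k(feats), self.to_v(feats)
        # Softmax over slots: slots (including background) compete for each token.
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=1)
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)   # weighted mean over tokens
        updated = attn @ v                                       # (batch, slots, dim)
        action_slots = updated[:, :-1]                           # drop the background slot
        logits = self.classifier(action_slots)                   # per-slot class logits
        return torch.sigmoid(logits.max(dim=1).values)           # multi-label scores

if __name__ == "__main__":
    model = ActionSlotSketch()
    video_feats = torch.randn(2, 16 * 49, 64)   # e.g. 16 frames x 7x7 spatial grid
    scores = model(video_feats)
    print(scores.shape)   # (2, 64) per-class probabilities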

Speaker Bio:
Chi-Hsi Kung (孔啟熙) is a research assistant at National Chiao Tung University in Taiwan, focusing on visual action representation learning and event identification. He received his M.Sc. from National Tsing Hua University and his B.Sc. from National Taipei University.
Advisor Bio:
Yi-Ting Chen (陳奕廷) is an assistant professor in the Department of Computer Science at National Yang Ming Chiao Tung University. He received his Ph.D. degree in Electrical and Computer Engineering from Purdue University in 2015. Before joining NYCU, he worked as a senior research scientist at Honda Research Institute USA from 2015 to 2020. His goal is to develop human-centered intelligent systems, such as intelligent driving systems and assistive robotics, that empower human capabilities. He is a recipient of the NSTC 2030 International Outstanding Scholar Program in 2024, the Yushan Fellow Program Administrative Support Grant from the Ministry of Education, Taiwan, in 2021, the Junior Faculty Award from NYCU in 2021, and the Government Scholarship to Study Abroad from the Ministry of Education, Taiwan.