Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

CVPR 2023 Highlight

Jiarui Xu1*, Sifei Liu2†, Arash Vahdat2†, Wonmin Byeon2, Xiaolong Wang1, Shalini De Mello2
1University of California San Diego, 2NVIDIA
(* the work was done during an internship at NVIDIA, † equal contribution)

Segment and categorize any object, even ones not seen during training

Abstract

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have shown the remarkable capability of generating high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We propose to leverage the frozen representation of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over previous state of the art.

Video

Problem Overview

We propose to learn open-vocabulary panoptic segmentation with the internal representation of text-to-image diffusion models. K-Means clustering of the diffusion model's internal representation shows semantically differentiated and localized information wherein objects are nicely grouped together (middle figure). We leverage these dense and rich diffusion features to perform open-vocabulary panoptic segmentation (right figure).
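
The clustering step described above can be illustrated with a minimal sketch. It assumes a dense feature map has already been extracted and is stored as nested lists of shape H x W x C. The toy values, the function names, and the deterministic farthest-point initialisation are illustrative assumptions, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an assumption made here
    # so that the toy example is reproducible).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 2 x 4 feature map with 2 channels: the left half and the right half
# have clearly different features, standing in for two objects.
feature_map = [
    [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]],
    [[0.1, 0.1], [0.0, 0.0], [4.9, 5.0], [5.0, 4.9]],
]
height, width = len(feature_map), len(feature_map[0])

# Flatten to one feature vector per pixel, cluster, and reshape to H x W.
pixels = [feat for row in feature_map for feat in row]
flat_labels = kmeans(pixels, k=2)
label_map = [flat_labels[r * width:(r + 1) * width] for r in range(height)]
```

Each entry of `label_map` is the cluster index of one spatial location, so pixels with similar features end up in the same group.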

ODISE Training Pipeline

ODISE leverages both a text-to-image diffusion model and a discriminative model to learn open-vocabulary panoptic segmentation. We first encode the input image into an implicit text embedding with an implicit captioner (image encoder \(\mathcal{V}\) and MLP). With the image and its implicit caption as input, we extract diffusion features from a frozen text-to-image diffusion UNet. From the UNet features, a mask generator predicts class-agnostic binary masks and their associated mask embedding features. To categorize the masks, we take the dot product between the mask embeddings and the text embeddings of either the training category names (red box) or the nouns in the image caption (green box). The resulting similarity matrix for mask classification is supervised either by a cross-entropy loss on the ground-truth category labels (red solid path) or by a grounding loss on the paired image captions (green dashed path).
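
The category-label branch of this pipeline (dot-product similarity followed by cross-entropy) can be sketched as follows. This is a toy illustration: the embedding values and function names are hypothetical, and details such as embedding normalisation or a temperature factor are omitted because the text above does not specify them. The grounding loss on captions is not sketched for the same reason.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(mask_embs, text_embs):
    """Rows index masks, columns index category names."""
    return [[dot(m, t) for t in text_embs] for m in mask_embs]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one mask, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def predict(sim):
    """Pick the highest-scoring category for each mask."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Hypothetical toy embeddings: 2 masks, 3 category names, dimension 2.
mask_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
gt_labels = [0, 1]  # ground-truth category index of each mask

sim = similarity_matrix(mask_embs, text_embs)
preds = predict(sim)
loss = sum(cross_entropy(row, t) for row, t in zip(sim, gt_labels)) / len(sim)
```

At inference time the same similarity matrix can be computed against the text embeddings of any list of category names, which is what makes the classification open-vocabulary.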

Qualitative Results

To demonstrate open-vocabulary recognition capabilities, we merge the category names of LVIS, COCO, and ADE20K and perform open-vocabulary inference directly over the resulting \({\sim} 1.5k\) classes.
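
Merging several label sets into one inference vocabulary can be sketched as below. The short lists are placeholders rather than the real LVIS, COCO, or ADE20K category lists, and the case-insensitive de-duplication rule is an assumption made for illustration.

```python
def merge_vocabularies(vocabularies):
    """Concatenate category lists, dropping case-insensitive duplicates
    while keeping the first spelling and the original order."""
    seen = set()
    merged = []
    for vocab in vocabularies:
        for name in vocab:
            key = name.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(name.strip())
    return merged


# Placeholder label sets standing in for three datasets.
dataset_a = ["person", "dog"]
dataset_b = ["Dog", "tree"]
dataset_c = ["tree", "wall"]

vocabulary = merge_vocabularies([dataset_a, dataset_b, dataset_c])
```

The merged list would then be passed through the text encoder once, and every predicted mask is scored against all of its entries.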

Open-Vocabulary Panoptic Segmentation on COCO

Open-Vocabulary Panoptic Segmentation on ADE20K

Open-Vocabulary Panoptic Segmentation on Ego4D

BibTeX


@article{xu2022odise,
  author    = {Xu, Jiarui and Liu, Sifei and Vahdat, Arash and Byeon, Wonmin and Wang, Xiaolong and De Mello, Shalini},
  title     = {{ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models}},
  journal   = {arXiv preprint arXiv:2303.04803},
  year      = {2023},
}
