Xiangyu Zhang (张祥雨), Megvii Research (旷视研究院)
New Perspectives on Architecture Design for Vision Foundation Models
Inspiration from ViTs

Recent Research Hotspots
- Deeper architectures with stronger performance
- Lightweight architectures with high inference efficiency
- Automated model design and neural architecture search (NAS)
- Dynamic models
- Attention models and Vision Transformers (ViTs)

How Should We View Vision Transformers?
A sample of conflicting takes from the community:
- "Wow, the performance is strong: SOTA everywhere!"
- "The ImageNet results lean on so many tricks that the comparison is completely unfair. Dare to compare against ResNet-D + SE under the same setting?"
- "Compared with CNNs they dominate downstream tasks, with no tricks and no elaborate design!"
- "The more data and the harder the task, the better they get. Quick, churn out a wave of SSL + Transformer papers!"
- "Slow and memory-hungry; without a pile of GPUs you cannot even join in. Are the gains fundamental? It seems attention built from just Q + R already works. Can that still be called a transformer?"
- "Transformers shed the inductive biases of CNNs, so with big data they should naturally be better!"
- "Large kernels plus dynamic networks are the key to downstream gains. Large kernel matters! Making large-kernel CNNs great again!"
- "Transformers unify the modeling paradigm across domains; now we can build multimodal large models!"
- "Shedding inductive bias? Making MLPs great again!"
- "Transformer is all you need! Whatever the task, it can be solved efficiently in one unified form."
- "The CNN design philosophy: use large kernels (e.g., depthwise 7x7) to enlarge the receptive field and 1x1 convolutions to build depth; dynamic networks are merely a compression technique."
- "Transformers can be hacked into being efficient!"
- "Used well, re-parameterization makes everyone strong!"

Understanding ViTs: Potential Advantages
- Flexible data formats (tensors, sets, sequences, graphs)
- Long-range relation modeling
- Stronger representational capacity [1]
- Architectural soundness [2]
- Robustness to occlusion and noise [3, 4]

[1] Cordonnier, Jean-Baptiste, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. ICLR 2020.
[2] Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth.
[3] Xie, Cihang, et al. Feature denoising for improving adversarial robustness. CVPR 2019.
[4] Paul, Sayak, and Pin-Yu Chen. Vision transformers are robust learners.

Rethinking ViTs
- Multi-head self-attention (MHSA) may not be essential.

Zhu, Xizhou, et al. An empirical study of spatial attention mechanisms in deep networks. ICCV 2019.

Rethinking ViTs (cont'd)
Design elements of ViTs [1, 2, 3], contrasted in the sketch below:
- Sparse connectivity
- Weight sharing
- Dynamic weight

[1] Han, Qi, et al. Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight.
[2] Zhao, Yucheng, et al. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP.
[3] Rao, Yongming, et al. Global filter networks for image classification.
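To make the three elements concrete, here is a minimal PyTorch sketch (my illustration, not code from the talk; all class and variable names are mine). A depthwise convolution already has sparse connectivity and weight sharing but static weights; predicting the window weights from the input adds the dynamic-weight property that attention brings.

```python
# Minimal sketch (not from the talk): the three design elements as code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticDWConv(nn.Module):
    """Sparse connectivity + weight sharing, but static weights."""
    def __init__(self, dim, k=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.conv(x)

class DynamicDWConv(nn.Module):
    """Same sparsity pattern, but the k*k mixing weights are predicted
    from each position's feature: attention-style dynamic weight."""
    def __init__(self, dim, k=7):
        super().__init__()
        self.k = k
        self.to_weight = nn.Conv2d(dim, k * k, 1)  # per-pixel kernel logits

    def forward(self, x):
        B, C, H, W = x.shape
        w = self.to_weight(x).softmax(dim=1)                 # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k * self.k, H * W)
        w = w.view(B, 1, self.k * self.k, H * W)
        out = (patches * w).sum(dim=2)                       # weighted window mix
        return out.view(B, C, H, W)

x = torch.randn(2, 64, 56, 56)
print(StaticDWConv(64)(x).shape, DynamicDWConv(64)(x).shape)
```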

What Did ViTs Actually Get Right?
- The design elements and potential advantages of ViTs are not unique to ViTs.
- What can we learn from vision transformers?
  - Large kernel design
  - High-order relation modeling

Inspiration from Vision Transformers
- Large kernel models
- High-order relation modeling

Spatial Modeling
- Vision Transformers: global MHSA [1], or local MHSA over windows (e.g., 7x7) [2, 3]
- CNNs: large kernels (e.g., 5x5) [4], or stacks of 3x3 (depthwise/group) convolutions [5]; a block in this style is sketched below

[1] Dosovitskiy, Alexey, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021.
[2] Liu, Ze, et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV 2021.
[3] Dong, Xiaoyi, et al. CSWin transformer: A general vision transformer backbone with cross-shaped windows.
[4] Szegedy, Christian, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI 2017.
[5] He, Kaiming, et al. Deep residual learning for image recognition. CVPR 2016.
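A minimal sketch of the CNN side of this comparison, assuming a ConvNeXt-style ordering (my illustration, not an architecture from the talk): a depthwise 7x7 convolution does the spatial mixing that MHSA does in ViTs, while 1x1 convolutions supply depth and channel mixing.

```python
# Minimal sketch (assumed ConvNeXt-style ordering, not the talk's design):
# large depthwise kernel for spatial mixing, 1x1 convs for channel mixing.
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, k=7, expansion=4):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)  # dw k x k
        self.norm = nn.BatchNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, dim * expansion, 1)  # 1x1: channel mixing
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expansion, dim, 1)

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.spatial(x)))))

x = torch.randn(1, 96, 56, 56)
print(LargeKernelBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```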

Advantages of Large Kernels
- Large kernels enlarge the effective receptive field (ERF) more efficiently than stacking extra depth; see the comparison below.

Luo, Wenjie, et al. Understanding the effective receptive field in deep convolutional neural networks. NIPS 2016.
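A short worked comparison, paraphrasing Luo et al.'s analysis rather than quoting a formula from the slides: for n stacked k x k convolutions,

```latex
% Theoretical receptive field: linear in both depth n and kernel size k.
% Effective receptive field (Luo et al.): roughly Gaussian, with radius
% linear in k but only square-root in n, so enlarging k grows the ERF
% far more efficiently than adding layers.
\[
  r_{\text{theo}} = n(k - 1) + 1,
  \qquad
  r_{\text{eff}} = \mathcal{O}\!\left(k \sqrt{n}\right)
\]
```

For instance, one 7x7 layer and three 3x3 layers share the same theoretical receptive field of 7, but the single large kernel spreads its ERF over a wider radius.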

Advantages of Large Kernels (cont'd)
- Large kernels partly sidestep the optimization difficulties that come with increasing model depth.
  - VGG-style models are hard to train at great depth [1].
  - The effective depth of ResNets may be quite shallow [2, 3].

[1] Simonyan, Karen, and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR 2015.
[2] Veit, Andreas, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. NIPS 2016.
[3] De, Soham, and Samuel L. Smith. Batch normalization biases residual blocks towards the identity function in deep networks. NeurIPS 2020.

Advantages of Large Kernels (cont'd)
- Large kernels bring clear gains on FCN-based downstream tasks:
  - Detection (e.g., deformable conv [1], DetNAS [2])
  - Segmentation (e.g., global conv [3], dilated conv [4])

[1] Dai, Jifeng, et al. Deformable convolutional networks. ICCV 2017.
[2] Chen, Yukang, et al. DetNAS: Backbone search for object detection. NeurIPS 2019.
[3] Peng, Chao, et al. Large kernel matters - improve semantic segmentation by global convolutional network. CVPR 2017.
[4] Chen, Liang-Chieh, et al. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 2017.

Why Have Large-Kernel CNNs Not Become Mainstream?
- Scaling studies tune the architecture hyperparameters [1, 2]:
  - Width
  - Depth
  - Input resolution
  - Kernel size, the emphasized and rarely scaled dimension (EfficientNet's compound-scaling rule, recalled below, covers only the first three)

[1] Tan, Mingxing, and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. ICML 2019.
[2] Radosavovic, Ilija, et al. Designing network design spaces. CVPR 2020.

Large Kernel CNN Design
- Problem: the limitations of ImageNet.
  - ImageNet classification may be biased towards texture features [1].
  - ImageNet classification places little demand on the receptive field [2, 3].
- Remedy: stronger training and data-augmentation strategies.

[1] Geirhos, Robert, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR 2019.
[2] Peng, Chao, et al. Large kernel matters - improve semantic segmentation by global convolutional network. CVPR 2017.
[3] Srinivas, Aravind, et al. Bottleneck transformers for visual recognition. CVPR 2021.
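For reference, the compound-scaling rule from [1]: a single coefficient phi jointly scales depth d, width w, and resolution r, while kernel size is not among the scaled factors, which helps explain why large kernels rarely fall out of such scaling studies.

```latex
% EfficientNet compound scaling (Tan & Le, ICML 2019): one coefficient phi
% scales depth, width, and input resolution together; kernel size is not
% one of the scaled dimensions.
\[
  d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi},
  \qquad \text{s.t.}\ \ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
  \ \ \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\]
```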

Large Kernel CNN Design (cont'd)
- Problem: large kernels are not efficient enough.
- Remedies:
  - Shallower structures [1]
  - Kernel decomposition [2]
  - FFT conv [3] (sketched below)
  - Sparse operators [4]

[1] Liu, Ze, et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV 2021.
[2] Han, Qi, et al. Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight.
[3] Rao, Yongming, et al. Global filter networks for image classification.
[4] Zhu, Xizhou, et al. Deformable DETR: Deformable transformers for end-to-end object detection. ICLR 2021.
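A minimal sketch of an FFT-based global filter in the spirit of [3] (names and shapes here are my assumptions, not the paper's code): convolution with an image-sized kernel becomes an elementwise product in frequency space, so the cost stays O(HW log HW) no matter how large the kernel is.

```python
# Minimal sketch of an FFT-based global filter (GFNet-style, my assumptions).
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    def __init__(self, dim, h, w):
        super().__init__()
        # One learnable complex filter per channel over the rfft2 frequencies.
        self.filter = nn.Parameter(torch.randn(dim, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                      # x: (B, C, H, W)
        X = torch.fft.rfft2(x, norm="ortho")   # (B, C, H, W//2+1), complex
        X = X * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(X, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 64, 56, 56)
print(GlobalFilter(64, 56, 56)(x).shape)  # torch.Size([2, 64, 56, 56])
```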

Large Kernel CNN Design (cont'd)
- Problem: large kernels struggle to capture local features at the same time.
- Remedy: structural re-parameterization [1, 2]; a merge sketch follows.

[1] Ding, Xiaohan, et al. RepVGG: Making VGG-style convnets great again. CVPR 2021.
[2] Ding, Xiaohan, et al. RepMLP: Re-parameterizing convolutions into fully-connected layers for image recognition.
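A minimal sketch of the re-parameterization idea (my illustration, not the talk's code): train a large kernel with a parallel small kernel so local features are preserved, then merge the two into a single convolution for inference. The merge works because convolution is linear in the kernel: zero-pad the small kernel to the large size, centered, and add the weights and biases.

```python
# Minimal sketch (my illustration) of structural re-parameterization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_small_into_large(large: nn.Conv2d, small: nn.Conv2d) -> nn.Conv2d:
    """Fold a small conv into an equivalent large conv (same stride,
    'same' padding, same groups); returns one merged conv."""
    k_l, k_s = large.kernel_size[0], small.kernel_size[0]
    pad = (k_l - k_s) // 2
    merged = nn.Conv2d(large.in_channels, large.out_channels, k_l,
                       padding=k_l // 2, groups=large.groups, bias=True)
    merged.weight.data = large.weight.data + F.pad(
        small.weight.data, [pad] * 4)          # zero-pad 3x3 -> 7x7, centered
    merged.bias.data = large.bias.data + small.bias.data
    return merged

# Parallel-branch training graph: y = conv7(x) + conv3(x)
conv7 = nn.Conv2d(16, 16, 7, padding=3, groups=16, bias=True)
conv3 = nn.Conv2d(16, 16, 3, padding=1, groups=16, bias=True)
x = torch.randn(1, 16, 32, 32)
y_train = conv7(x) + conv3(x)
y_infer = merge_small_into_large(conv7, conv3)(x)
print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```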

Large Kernel CNN Design (cont'd)
- Problem: spatial modeling and semantic modeling place different demands on depth.
  - Spatial modeling: the size of the receptive field and the order of the relations.
  - Semantic modeling: the complexity of the semantics.
- Remedy: abandon the paradigm of stacking a single repeated unit; design the spatial and depth dimensions separately.

Summary: Towards Vision Models with Large Receptive Fields
- Adopt stronger training and data-augmentation strategies.
- Use shallow and efficient spatial operators.
- Add architectural priors with structural re-parameterization and similar methods.
- Design the spatial and depth dimensions separately.

Inspiration from Vision Transformers
- Large kernel models
- High-order relation modeling

High-order Relation Modeling
