
Deep Neural Networks II
Institute of Automation, Chinese Academy of Sciences, 吳高巍 (Gaowei Wu), 2015-10-27

Outline
- Common deep learning models: Deep Boltzmann Machines (DBM), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN)
- Deep learning in practice

Common deep learning models: Deep Boltzmann Machines (DBM)

Review: Deep Belief Network (DBN)
- Probabilistic generative model
- Deep architecture with multiple layers
- Binary stochastic neurons
- Unsupervised pre-training, followed by supervised fine-tuning
- DBN greedy layer-wise training

RBM training
- Learning objective: maximum likelihood, max Σ_{i=1..N} log p(v_i)
- Energy function: E(v, h) = -v^T W h - b^T v - a^T h, with the Boltzmann distribution p(v, h) ∝ exp(-E(v, h))
- RBM CD-K algorithm; RBM CD-1 algorithm (a minimal CD-1 sketch follows this list)
- AutoEncoder vs. RBM
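A minimal CD-1 update sketch in NumPy, assuming a binary RBM with the energy function above (visible bias b, hidden bias a) and a single binary training vector; the sizes, learning rate, and variable names are illustrative, not the lecture's reference implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM with energy E(v,h) = -v^T W h - b^T v - a^T h."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: p(h=1 | v0) and a binary sample h0
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct v1 ~ p(v | h0), then recompute p(h=1 | v1)
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)
    # Approximate likelihood gradient: <v h>_data - <v h>_reconstruction
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    a += lr * (ph0 - ph1)
    return W, a, b

# Illustrative sizes: 6 visible units, 3 hidden units, one binary training vector
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3)) * 0.1
a, b = np.zeros(3), np.zeros(6)
v0 = rng.integers(0, 2, 6).astype(float)
W, a, b = cd1_update(v0, W, a, b, rng=rng)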

Deep Boltzmann Machines (DBM)
- Undirected connections between all layers; no connections between neurons within a layer
- High-level representations are built from unlabeled data; labeled data are used only to fine-tune the network (Salakhutdinov & Hinton, 2009)

DBM vs. DBN
- In a multi-layer model, undirected connections between layers form a complete Boltzmann machine (Deep Boltzmann Machine), in contrast to the Deep Belief Network

DBM training
- Training uses both directions (the layers above and below)
- When training a single layer, two or more hidden layers must be considered at the same time
- The energy model differs from that of an RBM

Two-layer DBM
- Pre-training: can (must) initialize from stacked RBMs
- Generative fine-tuning: positive phase, variational approximation (mean-field); negative phase, persistent chain (stochastic approximation)
- Discriminative fine-tuning: backpropagation

Example: two-layer BM on MNIST
- 60,000 training and 10,000 test examples
- 0.9 million parameters; Gibbs sampler run for 100,000 steps
- After discriminative fine-tuning: 0.95% error rate (compare with DBN 1.2%, SVM 1.4%)

Example: NORB dataset (R. Salakhutdinov)

Why greedy layer-wise training works

- Regularization hypothesis: pre-training "constrains" the parameters to a region relevant to the unsupervised dataset, giving better generalization, since representations that better describe unlabeled data are also more discriminative for labeled data
- Optimization hypothesis: unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can

Common deep learning models: Convolutional Neural Networks (CNN)

Convolutional neural networks
- In the 1960s, Hubel and Wiesel studied neurons in the cat's visual cortex that are locally sensitive and orientation-selective; this distinctive network structure can effectively reduce the complexity of a feedback neural network
- A convolutional neural network is a special kind of deep neural network: its neurons are not fully connected, and the connection weights of certain neurons in the same layer are shared (i.e., identical)

Hubel-Wiesel architecture
- Based on studies of the cat primary visual cortex (area V1): simple cells and complex cells
- A two-layer neural network models the simple and complex cells of the primary visual cortex; the neurons of each layer are organized into 2-D planes
- The "simple cell" layer extracts local features from its input; the "complex cell" layer combines the corresponding sub-regions of the "simple cell" layer, giving the whole network some invariance to local transformations

Local connectivity
- Local receptive fields: spatial relationships in an image are local, so nearby pixels are strongly related while distant pixels are only weakly correlated
- Reduces the number of weights that must be trained

Local connectivity and parameter sharing
- The statistics of one part of an image are the same as those of the other parts
- Detect the same feature at different positions of the input: translation invariance

Convolution
- 1-D convolution: for a length-m kernel w, c_j = w^T x_{j : j+m-1}; 2-D convolution slides the kernel over image patches in the same way
- Convolution gives sparse connections and parameter sharing (a 2-D convolution sketch follows this list)
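A minimal 2-D "valid" convolution sketch in NumPy, written as cross-correlation (as most CNN implementations do); the image size and kernel values are illustrative.

import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image with a small kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel weights are reused at every location: parameter sharing
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 vertical-edge detector applied to a random 8x8 image
img = np.random.default_rng(0).random((8, 8))
k = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
print(conv2d_valid(img, k).shape)  # (6, 6)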

Multiple convolution kernels
- Each convolution kernel maps the image to another image (a feature map)
- Two kernels produce two images, which can be viewed as different channels of one image
- Example: convolving 4 input channels to obtain 2 output channels

Pooling
- After features have been obtained by convolution, the next step is to use them for classification
- Convolution exploits the "stationary" statistics of images; pooling aggregates the features at different positions (subsampling), turning convolved features into pooled features
- For a pooling region R of feature values x_i: average pool s = (1/|R|) Σ_i x_i; max pool s = max_i x_i; L2 pool s = sqrt(Σ_i x_i^2) (a max-pooling sketch follows this list)
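A minimal non-overlapping 2x2 max-pooling sketch in NumPy, assuming the feature-map height and width are divisible by the pool size; names and sizes are illustrative.

import numpy as np

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling of a 2-D feature map (H and W divisible by `size`)."""
    H, W = fmap.shape
    # Reshape into (H/size, size, W/size, size) blocks and take the max of each block
    blocks = fmap.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))  # [[ 5.  7.] [13. 15.]]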

CNN basic structure
- Convolutional layers and subsampling (pooling) layers
- Example structure: input layer, C1: 4 feature maps, S1: 4 feature maps, C2: 6 feature maps, S2: 6 feature maps
- A convolutional neural network is a multi-layer neural network; each layer consists of several 2-D planes, and each plane consists of many independent neurons

CNN training
- Supervised training with the backpropagation (BP) algorithm
- Forward pass: take a sample (X, Yp) from the training set, feed X into the network, and compute the actual output Op = Fn( ... F2( F1( Xp W(1) ) W(2) ) ... W(n) )
- Backward pass: compute the difference between the actual output Op and the desired output Yp, and adjust the weight matrices by backpropagation so as to minimize the error

CNN backpropagation
- Cost function: minimize the squared error (MSE) or the relative entropy
- Backpropagation mainly involves three parts: the output layer (choosing the cost function and differentiating it), pooling (downsampling of the data and upsampling of the residuals), and the convolutional layers (convolution of the data and "deconvolution" of the residuals)

Example: the LeNet-5 character recognition system
- At the time, most banks in the United States used it to recognize handwritten digits on checks

LeNet-5: core ideas of convolutional networks
- Combining three structural ideas (local receptive fields, weight sharing, and temporal or spatial subsampling) yields a degree of invariance to shift, scale, and deformation
- The close coupling between layers and spatial information makes CNNs well suited to image processing and understanding; the topology of the image matches the topology of the network well
- Explicit feature extraction is avoided: features are learned implicitly from the training data, and feature extraction and pattern classification are carried out and produced together during training
- Weight sharing reduces the number of trainable parameters, making the network structure simpler and more adaptable

ImageNet CNN (Krizhevsky, Sutskever, Hinton, 2012)
- Structure: conv-relu-maxpool-norm
- Very good implementation, running on two GPUs; ReLU transfer function; dropout trick
- Also trains on full ImageNet (15M images, 15,000 classes)

Improvements to CNNs
- Rectified linear function: faster convergence, sparser activations
- Dropout: set hidden units to 0 with some probability; dropout can be viewed as a form of model averaging (a minimal sketch follows this list)
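A minimal inverted-dropout sketch in NumPy, assuming a drop probability p applied to a hidden activation vector at training time; the scaling convention and names are illustrative (the original 2012 formulation instead rescales at test time).

import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Zero each hidden unit with probability p during training."""
    if not train or p == 0.0:
        return h  # at test time, use the full (expected) activations
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    # Scale by 1/(1-p) so the expected activation matches test time (model-averaging view)
    return h * mask / (1.0 - p)

h = np.ones(8)
print(dropout(h, p=0.5))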

Improvements to CNNs: local contrast normalization
- Subtracting a low-pass smoothed version of the layer
- Just another convolution in fact (with fixed coefficients)
- Lots of variants (per feature map, across feature maps, ...)
- Empirically, seems to help a bit (1-2%) on ImageNet (a minimal sketch follows this list)
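A minimal subtractive local contrast normalization sketch, assuming a Gaussian low-pass filter from SciPy as the fixed-coefficient convolution; the kernel width and feature-map size are illustrative, and this ignores the divisive and cross-map variants mentioned above.

import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(fmap, sigma=2.0):
    """Subtractive LCN: remove a low-pass (Gaussian-smoothed) version of the feature map."""
    smoothed = gaussian_filter(fmap, sigma=sigma)  # convolution with fixed coefficients
    return fmap - smoothed

fmap = np.random.default_rng(0).random((16, 16))
print(local_contrast_normalize(fmap).mean())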

Improvements to CNNs: nonlinearities and pooling (Jarrett et al., 2009)

CNNs in the 2010s

Common deep learning models: Recurrent Neural Networks (RNN); Long Short Term Memory (LSTM)

Modeling sequences
- Learning from sequence data often requires transforming an input sequence into an output sequence in a different domain, e.g. turning a sequence of sound pressures into a sequence of word identities
- If there is no separate target sequence, a "teacher signal" can be obtained by predicting the next item of the input sequence: the target output sequence is the input sequence with an advance of 1 step

- For temporal sequences there is a natural order for the predictions
- Predicting the next item of a sequence blurs the distinction between supervised and unsupervised learning: it uses methods designed for supervised learning, but it doesn't require a separate teaching signal

Recurrent Neural Networks (RNN)
- The hidden layer has edges connecting it to the hidden layer at the next time step

- RNNs are powerful: a distributed hidden state allows them to store a lot of information about the past efficiently, and non-linear dynamics allow them to update their hidden state in complicated ways
- In general, the input and output of an RNN differ at each time step; e.g. in sequence learning, the items of a sequence are fed in one after another and each item has its own corresponding output

RNN unrolled in time
- In an RNN, the same parameters are used at every time step (unfold the network over time)

Training RNNs with backpropagation
- Backpropagation with tied weights: the BP algorithm makes it easy to enforce linear constraints between weights
- Compute the gradients as usual, then modify them so that the constraint is preserved; if the weights satisfy the constraint initially, they will continue to satisfy it
- To constrain w1 = w2 we need Δw1 = Δw2: compute ∂E/∂w1 and ∂E/∂w2, and use ∂E/∂w1 + ∂E/∂w2 to update both w1 and w2

BPTT (Back Propagation Through Time)
- An RNN can be viewed as a multi-layer feed-forward network with shared weights; training it is training a feed-forward network with weight constraints, i.e. a training algorithm in the time domain
- The forward pass builds up a stack of the activities of all the units at each time step

- The backward pass peels activities off the stack to compute the error derivatives at each time step
- After the backward pass, we add together the derivatives at all the different time steps for each weight

BPTT forward pass; BPTT backward pass (a minimal sketch follows)
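A minimal BPTT sketch for a vanilla RNN with a tanh hidden layer, linear outputs, and a squared-error loss; the weight names, shapes, and loss are illustrative, and biases are omitted for brevity.

import numpy as np

def bptt(xs, ys, Wxh, Whh, Why):
    """Forward pass stores activities at every step; backward pass sums dW over time."""
    T, h_dim = len(xs), Whh.shape[0]
    hs = {-1: np.zeros(h_dim)}
    # Forward pass: build up a stack of the hidden activities
    for t in range(T):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros(h_dim)
    # Backward pass: peel activities off the stack, accumulating shared-weight gradients
    for t in reversed(range(T)):
        dy = (Why @ hs[t]) - ys[t]           # d(squared error)/d(output) at step t
        dWhy += np.outer(dy, hs[t])
        dh = Why.T @ dy + dh_next            # gradient flowing into h_t
        draw = (1.0 - hs[t] ** 2) * dh       # back through tanh
        dWxh += np.outer(draw, xs[t])
        dWhh += np.outer(draw, hs[t - 1])
        dh_next = Whh.T @ draw
    return dWxh, dWhh, dWhy

# Illustrative run: 5 time steps, 3-d inputs, 4-d hidden state, 2-d outputs
rng = np.random.default_rng(0)
xs, ys = rng.random((5, 3)), rng.random((5, 2))
Wxh, Whh, Why = rng.random((4, 3)) * 0.1, rng.random((4, 4)) * 0.1, rng.random((2, 4)) * 0.1
dWxh, dWhh, dWhy = bptt(xs, ys, Wxh, Whh, Why)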

Difficulties with backpropagation through time
- The backward pass is linear: there is a big difference between the forward and backward passes
- In the forward pass we use squashing functions (like the logistic) to prevent the activity vectors from exploding; the backward pass is completely linear, so if you double the error derivatives at the final layer, all the error derivatives will double
- The forward pass determines the slope of the linear function used for backpropagating through each neuron

Exploding or vanishing gradients
- When backpropagating through many layers: if the weights are small, the gradients shrink exponentially; if the weights are big, the gradients grow exponentially (e.g. a per-step factor of 0.9 shrinks to about 3e-5 over 100 steps, while 1.1 grows to about 1.4e4)
- When training long sequences (e.g. 100 time steps), the gradients in an RNN easily explode or vanish
- This can be avoided to some extent by initializing the weights very carefully

- Even with good initialization, it is hard to detect that the current target output depends on an input from many steps earlier, so RNNs have difficulty dealing with long-range dependencies

Long Short Term Memory (LSTM)
- LSTM solves the problem of long-term memory in RNNs (like hundreds of time steps); Hochreiter & Schmidhuber (1997)
- An LSTM is a memory cell that uses logistic and linear units with multiplicative interactions

- Information gets into the cell whenever its "write" gate is on; information stays in the cell so long as its "keep/forget" gate is on; information can be read from the cell by turning on its "read" gate
- [Figure: backpropagation through a memory cell. A stored value (1.7) is written into the cell, kept across several time steps while the keep gate is on, and read out later, so error signals can be carried over long time spans.]

LSTM forward pass
- Input gates, forget gates, cells, output gates, cell outputs (a minimal single-step sketch follows)
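A minimal single-step LSTM forward sketch in NumPy with the standard input, forget, and output gates and a tanh cell input; the weight layout, sizes, and names are illustrative, and biases are omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step: gates decide what to write to, keep in, and read from the cell."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["i"] @ z)          # input (write) gate
    f = sigmoid(W["f"] @ z)          # forget (keep) gate
    o = sigmoid(W["o"] @ z)          # output (read) gate
    g = np.tanh(W["g"] @ z)          # candidate cell input
    c = f * c_prev + i * g           # cell state: keep old content, write new content
    h = o * np.tanh(c)               # cell output: read the (squashed) cell state
    return h, c

# Illustrative sizes: 3-dimensional input, 4-dimensional hidden/cell state
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 7)) * 0.1 for k in "ifog"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W)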

LSTM backward pass

Example: handwriting generation
- Uses LSTM, trained on the IAM online handwriting database (/graves/handwriting.html)

Deep learning in practice

Software
- Theano: Python, automatic differentiation
- Caffe: C++, fast, highly modularized

- Torch 7: C++ and Lua
- PaddlePaddle
- NICTA deep learning toolkit

Monitoring the gradient
- Gradient descent on parameters w_0, ..., w_n: repeat the update w_j ← w_j - ε ∂L(w_0, ..., w_n)/∂w_j until convergence
- Finite difference check: ∂L/∂w_j = lim_{h→0} [ L(..., w_j + h, ...) - L(..., w_j - h, ...) ] / (2h)

Mini-batch
- Use mini-batches: w_j ← w_j - (ε/|B|) Σ_{x_i ∈ B} ∂L(x_i; w_0, ..., w_n)/∂w_j
- Improves convergence speed and allows parallelism (a minimal gradient-check sketch follows)
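A minimal central finite-difference gradient check, assuming a scalar loss of a parameter vector; the quadratic test loss and step size h are illustrative.

import numpy as np

def numerical_grad(loss, w, h=1e-5):
    """Central finite differences: (L(w_j + h) - L(w_j - h)) / (2h) for each parameter."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += h
        w_minus[j] -= h
        grad[j] = (loss(w_plus) - loss(w_minus)) / (2.0 * h)
    return grad

# Check an analytic gradient against the numerical one for L(w) = 0.5 * ||w||^2
loss = lambda w: 0.5 * np.sum(w ** 2)
w = np.array([1.0, -2.0, 3.0])
print(np.max(np.abs(numerical_grad(loss, w) - w)))  # should be tiny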

Learning rate
- Initialization: the largest learning rate that does not cause divergence of the training criterion
- Decreasing learning rate, e.g. ε_t = ε_0 τ / max(t, τ)
- AdaGrad (Duchi et al. 2011)

Hyper-parameter search
- Hyper-parameters: learning rate, early stopping, weight decay, layer-specific settings
- Grid search; multi-resolution search; random search (James Bergstra and Yoshua Bengio, Random Search for Hyper-Parameter Optimization, Journal of Machine Learning Research, 13: 281-305, 2012; a minimal random-search sketch follows)
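A minimal random-search sketch over two hyper-parameters (learning rate and weight decay), in the spirit of Bergstra & Bengio (2012); the sampling ranges and the dummy validation score are illustrative placeholders for a real training-and-validation run.

import numpy as np

def random_search(evaluate, n_trials=20, rng=None):
    """Sample hyper-parameters independently at random and keep the best trial."""
    if rng is None:
        rng = np.random.default_rng(0)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
            "weight_decay": 10 ** rng.uniform(-6, -2),
        }
        score = evaluate(params)  # e.g. validation accuracy of a model trained with `params`
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Dummy evaluation: pretend the best learning rate is around 1e-2
dummy = lambda p: -abs(np.log10(p["learning_rate"]) + 2.0)
print(random_search(dummy))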

Summary
- Deep Boltzmann Machines (DBM)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Deep learning in practice

References
- R. Salakhutdinov and G. Hinton. Deep Boltzmann Machines. AISTATS, 2009.

- R. Salakhutdinov and G. Hinton. A Better Way to Pretrain Deep Boltzmann Machines. NIPS, 2012.
- Jake Bouvrie. Notes on Convolutional Neural Networks. 2014.
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, Vol. 86, No. 11, pp. 2278-2324, 1998.
- Y. LeCun, Y. Bengio. Convolutional Networks for Images, Speech, and Time-Series. In M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, 1995, pp. 255-258.
- S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8): 1735-1780, 1997.
