Machine Learning Fundamentals: Feature Extraction and Preprocessing (Lesson 4)

The examples discussed for linear regression used simple numeric explanatory variables, such as the diameter of a pizza. Many machine learning problems instead require learning from observations of categorical variables, text, or images. In this lesson you will learn basic techniques for preprocessing such data and creating feature representations of these observations. These techniques can be used with linear regression as well as with the models we will discuss in the next lesson.

Outline:
- Extracting features from categorical variables
- Extracting features from text
- Extracting features from images
- Data normalization

Extracting features from categorical variables

Types of variables:
- Nominal: categories, states, or "names of things", e.g. Hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes.
- Binary: a nominal attribute with only two states (0 and 1). A symmetric binary attribute has two equally important outcomes, e.g. gender; an asymmetric binary attribute has outcomes that are not equally important, e.g. a medical test (positive vs. negative). By convention, the more important outcome (e.g. HIV positive) is coded as 1.
- Ordinal: the values have a meaningful order (ranking), but the magnitude between successive values is unknown, e.g. size = {small, medium, large}, grades, army ranks.
- Interval: measured on a scale of equal-sized units, and the values are ordered, e.g. temperature in degrees Celsius or Fahrenheit, calendar dates. There is no true zero point.
- Ratio: has an inherent zero point, so we can say that one value is a multiple of another (10 K is twice as hot as 5 K), e.g. temperature in Kelvin, length, counts, monetary quantities.

One-of-K (one-hot) encoding

Categorical variables are commonly encoded with one-of-K or one-hot encoding, in which the explanatory variable is represented with one binary feature for each of its possible values. For example, assume our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this variable with one binary feature for each of the three possible cities.

sklearn.feature_extraction: feature extraction

sklearn.feature_extraction.DictVectorizer transforms lists of feature-value mappings into vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators. When feature values are strings, this transformer performs binary one-hot (aka one-of-K) encoding: one boolean-valued feature is constructed for each possible string value that the feature can take. For example, a feature "f" that can take the values "ham" and "spam" becomes two features in the output, one for "f=ham" and one for "f=spam". Features that do not occur in a sample (mapping) have value zero in the resulting array/matrix.

Example use of DictVectorizer (it extracts features and turns them into vectors):

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])
Example use of DictVectorizer for one-hot encoding the city variable:

>>> from sklearn.feature_extraction import DictVectorizer
>>> onehot_encoder = DictVectorizer(sparse=False)
>>> D = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
>>> X = onehot_encoder.fit_transform(D)
>>> X
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
>>> onehot_encoder.feature_names_
['city=Chapel Hill', 'city=New York', 'city=San Francisco']

Could we instead represent the value of a categorical explanatory variable with a single integer feature?
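This question is worth pausing on. A single integer code makes the categories look ordered and evenly spaced, which is meaningless for a nominal variable and would mislead a linear model. The sketch below is a minimal illustration (the particular integer codes are our own choice, not part of the slides) contrasting the two encodings.

# Contrast a single-integer encoding with one-hot encoding for the 'city' variable.
from sklearn.feature_extraction import DictVectorizer

cities = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]

# Single-integer encoding: implies Chapel Hill < New York < San Francisco
# and equal spacing between them, which is not meaningful for nominal data.
integer_codes = {'Chapel Hill': 0, 'New York': 1, 'San Francisco': 2}
X_int = [[integer_codes[d['city']]] for d in cities]
print(X_int)        # [[1], [2], [0]]

# One-hot encoding: one binary column per possible value, no implied ordering.
onehot = DictVectorizer(sparse=False)
X_onehot = onehot.fit_transform(cities)
print(X_onehot)     # one row per sample, one column per city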

Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must be transformed into a different representation that encodes as much of its meaning as possible in a feature vector. In the following sections we will review variations of the most common representation of text used in machine learning: the bag-of-words model.

The bag-of-words representation

The most common representation of text is the bag-of-words model. This representation uses a multiset, or bag, that encodes the words appearing in a text; it does not encode any of the text's syntax, ignores the order of words, and ignores all grammar. Bag-of-words can be viewed as an extension of one-hot encoding: it creates one feature for each word of interest in the text. The model is motivated by the intuition that documents containing similar words often have similar meanings. Despite the limited information it encodes, the bag-of-words model can be used effectively for document classification and retrieval.

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents:
- feature_extraction.text.CountVectorizer: convert a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer: convert a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfTransformer: transform a count matrix to a normalized tf or tf-idf representation
- feature_extraction.text.TfidfVectorizer: convert a collection of raw documents to a matrix of TF-IDF features

sklearn.feature_extraction.text.CountVectorizer converts a collection of text documents to a matrix of token counts. The implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, the number of features will be equal to the vocabulary size found by analyzing the data.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball',
...           'Duke lost the basketball game']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'in': 3, 'the': 6, 'lost': 4, 'played': 5, 'basketball': 0, 'duke': 1, 'game': 2}

Adding a third document to the corpus grows the vocabulary:

>>> corpus = ['UNC played Duke in basketball',
...           'Duke lost the basketball game',
...           'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 9, 'in': 4, 'the': 8, 'lost': 5, 'sandwich': 7, 'played': 6, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Now our feature vectors are as follows:

UNC played Duke in basketball = [[0 1 1 0 1 0 1 0 0 1]]
Duke lost the basketball game = [[0 1 1 1 0 1 0 0 1 0]]
I ate a sandwich              = [[1 0 0 0 0 0 0 1 0 0]]

The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than they are to the third document's feature vector when using a metric such as Euclidean distance.

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> counts = [[0, 1, 1, 0, 0, 1, 0, 1],
...           [0, 1, 1, 1, 1, 0, 0, 0],
...           [1, 0, 0, 0, 0, 0, 1, 0]]
>>> print('Distance between 1st and 2nd documents:', euclidean_distances([counts[0]], [counts[1]]))
Distance between 1st and 2nd documents: [[2.]]
>>> print('Distance between 1st and 3rd documents:', euclidean_distances([counts[0]], [counts[2]]))
Distance between 1st and 3rd documents: [[2.44948974]]
>>> print('Distance between 2nd and 3rd documents:', euclidean_distances([counts[1]], [counts[2]]))
Distance between 2nd and 3rd documents: [[2.44948974]]

For real applications: high-dimensional feature vectors

The first problem is that high-dimensional vectors require more memory than smaller vectors. The second problem is known as the curse of dimensionality, or the Hughes effect. As the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the features' values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize.
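One of the utilities listed above, HashingVectorizer, keeps the width of the feature vectors fixed no matter how large the vocabulary grows, which is one way to contain the memory cost just described. A minimal sketch follows (the n_features value is an arbitrary choice for illustration, not something prescribed by the slides).

# Hash tokens into a fixed number of columns instead of building a vocabulary.
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']

# n_features caps the dimensionality; the trade-off is that the mapping is
# irreversible (there is no vocabulary_ to map columns back to words).
vectorizer = HashingVectorizer(n_features=2**8)
X = vectorizer.transform(corpus)
print(X.shape)   # (3, 256), regardless of how many distinct words the corpus contains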
Stop-word filtering

A simple strategy is to remove words that are common to most of the documents in the corpus. These words, called stop words, include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop words are often functional words that contribute to the document's meaning through grammar rather than their denotations. The CountVectorizer class can filter stop words provided via the stop_words keyword argument, and it also includes a basic English stop list.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball',
...           'Duke lost the basketball game',
...           'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True, stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'lost': 4, 'sandwich': 6, 'played': 5, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Stemming and lemmatization

While stop-word filtering is an easy strategy for dimensionality reduction, most stop lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are stemming and lemmatization. Stemming extracts the stem or root form of a word, which is not necessarily a meaningful word in itself. Lemmatization reduces an inflected word of any form to its base dictionary form (the lemma), which does carry the complete meaning. We can use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus, as sketched below.
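A minimal sketch with NLTK (the example word and part-of-speech tags are our own illustration; the lemmatizer additionally requires the WordNet corpus to be downloaded):

# Requires: pip install nltk, then nltk.download('wordnet') for the lemmatizer data.
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('gathering'))               # 'gather'    (crude suffix stripping)
print(lemmatizer.lemmatize('gathering', 'v'))  # 'gather'    (treated as a verb)
print(lemmatizer.lemmatize('gathering', 'n'))  # 'gathering' (the noun is already a lemma)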
Extending bag-of-words with TF-IDF weights

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times the word appears in the document. Several variants of term frequency are commonly used: the normalized term frequency, the logarithmically scaled term frequency, and the augmented term frequency.

Normalization, logarithmically scaled term frequencies, and augmented term frequencies can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations: the feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus. The inverse document frequency (IDF) addresses this by down-weighting terms that appear in many documents, and a term's TF-IDF value is the product of its term frequency and its inverse document frequency.
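The slides name these quantities without reproducing their formulas (the equation images did not survive extraction). As a hedged restatement, the standard textbook forms are given below; here f(t, d) is the raw count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. scikit-learn's TfidfVectorizer uses a slightly different smoothed IDF by default.

\[ \text{normalized tf:} \quad \mathrm{tf}(t,d) = \frac{f(t,d)}{\lVert \mathbf{x}_d \rVert} \]
\[ \text{logarithmically scaled tf:} \quad \mathrm{tf}(t,d) = \log\bigl(f(t,d) + 1\bigr) \]
\[ \text{augmented tf:} \quad \mathrm{tf}(t,d) = 0.5 + \frac{0.5\, f(t,d)}{\max_{t'} f(t',d)} \]
\[ \text{inverse document frequency:} \quad \mathrm{idf}(t) = \log\frac{N}{1 + \mathrm{df}(t)} \]
\[ \text{TF-IDF:} \quad \text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t) \]

where \lVert \mathbf{x}_d \rVert is a document-length normalizer such as the total number of tokens in d (or the L2 norm of the raw count vector).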

The TfidfVectorizer class wraps CountVectorizer and TfidfTransformer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['The dog ate a sandwich and I ate a sandwich',
...           'The wizard transfigured a sandwich']
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
>>> print(vectorizer.vocabulary_)
{'sandwich': 2, 'dog': 1, 'transfigured': 3, 'ate': 0, 'wizard': 4}

Text classification in practice

TF-IDF features are typically fed to a machine learning classifier. Deep-learning-based text classification goes further:
- FastText: averages the word and character n-gram vectors of an entire document to obtain a document vector, then applies a softmax multi-class classifier. It relies on two tricks: character-level n-gram features and hierarchical softmax.
- Word2Vec: one of the word embedding methods, proposed by Mikolov at Google in 2013. Because Word2Vec takes context into account, it performs better than earlier embedding methods (though not as well as the methods introduced after 2018).
- BERT (Bidirectional Encoder Representations from Transformers): a language representation model introduced by Google in October 2018 in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which set new state-of-the-art results on 11 different NLP benchmarks.

Extracting features from images

Computer vision is the study and design of computational artifacts that process and understand images. These artifacts sometimes employ machine learning. An overview of computer vision is far beyond the scope of this course, but in this section we will review some basic techniques used in computer vision to represent images in machine learning problems.

Extracting features from pixel intensities

The digits dataset included with scikit-learn contains grayscale images of more than 1,700 hand-written digits between zero and nine. Each image is eight pixels on a side, and each pixel is represented by an intensity value between zero and 16; white is the most intense and is indicated by zero, and black is the least intense and is indicated by 16. (The slide shows an example digit image from the dataset, not reproduced here.) A basic feature representation for an image can be constructed by reshaping the matrix into a vector by concatenating its rows together.
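A minimal sketch of this reshaping step, using the scikit-learn digits dataset described above (the print statements are our own):

# Load the bundled digits dataset and flatten one 8x8 image into a 64-dimensional vector.
from sklearn.datasets import load_digits

digits = load_digits()
first_image = digits.images[0]      # an 8x8 array of intensities in the range [0, 16]
print(first_image.shape)            # (8, 8)

# Concatenate the rows into a single feature vector
feature_vector = first_image.reshape(-1)
print(feature_vector.shape)         # (64,)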
This representation has several problems: it produces large feature vectors, and it is sensitive to changes in the scale, rotation, and translation of images. Furthermore, learning from pixel intensities is itself problematic, as the model can become sensitive to changes in illumination. Modern computer vision applications frequently use either hand-engineered feature extraction methods that are applicable to many different problems, or automatically learn features without supervision using techniques such as deep learning.

Extracting points of interest as features

Humans can quickly recognize many objects without observing every attribute of the object. This intuition is motivation to create representations of only the most informative attributes of an image. These informative attributes, or points of interest, are points that are surrounded by rich textures and can be reproduced despite perturbing the image. Edges and corners are two common types of points of interest. Let's use scikit-image to extract points of interest (the slide's example figure is not reproduced here; a sketch follows).
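As a stand-in for the missing figure, the hedged sketch below detects Harris corners, one of the classic detectors listed next, with scikit-image on one of its bundled sample images (the choice of image and parameters is our own).

# Detect Harris corners as points of interest with scikit-image.
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import corner_harris, corner_peaks

# Use a bundled sample image in place of the slide's figure
image = rgb2gray(data.astronaut())

# Compute the Harris corner response, then keep local maxima as points of interest
response = corner_harris(image)
corners = corner_peaks(response, min_distance=5)
print(corners.shape)   # (n_corners, 2) array of (row, col) coordinates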

preprocessing>>>import

numpyas

np>>>X=np.array([[1.,-1.,

2.],...

[2.,

0.,

0.],...

[0.,

1.,-1.]])>>>X_scaled=preprocessing.scale(X)

>>>X_scaled

array([[0.

...,-1.22...,

1.33...],

[1.22...,

0.

...,-0.26...],

[-1.22...,

1.22...,-1.06...]])

>>>#處理后數(shù)據(jù)的均值和方差>>>X_scaled.mean(axis=0)array([0.,

0.,

0.])

>>>X_scaled.std(axis=0)arr
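The StandardScaler class mentioned above keeps the training-set mean and standard deviation so the identical transformation can be applied to new data; a minimal sketch using the same X as above (the test row is our own illustration):

# Fit the scaler on training data, then reuse its stored parameters on test data.
from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.mean_)               # per-column means: [1.  0.  0.33333333]
print(scaler.transform(X_train))  # same result as preprocessing.scale(X_train)

# New (test) data is transformed with the training-set parameters, not its own
X_test = np.array([[-1., 1., 0.]])
print(scaler.transform(X_test))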
