Deep Features for Text Spotting

Max Jaderberg, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford

Abstract. The goal of this work is text spotting in natural images. This is divided into two sequential tasks: detecting word regions in the image, and recognizing the words within these regions. We make the following contributions: first, we develop a Convolutional Neural Network (CNN) classifier that can be used for both tasks. The CNN has a novel architecture that enables efficient feature sharing (by using a number of layers in common) for text detection, character case-sensitive and insensitive classification, and bigram classification. It exceeds the state-of-the-art performance for all of these. Second, we make a number of technical changes over the traditional CNN architectures, including no downsampling for a per-pixel sliding window, and multi-mode learning with a mixture of linear models (maxout). Third, we have a method of automated data mining of Flickr that generates word and character level annotations. Finally, these components are used together to form an end-to-end, state-of-the-art text spotting system. We evaluate the text-spotting system on two standard benchmarks, the ICDAR Robust Reading dataset and the Street View Text dataset, and demonstrate improvements over the state-of-the-art on multiple measures.

1 Introduction

While text recognition from scanned documents is well studied and there are many available systems, the automatic detection and recognition of text within images, text spotting (Fig. 1), is far less developed. However, text contained within images can be of great semantic value, and so is an important step towards both information retrieval and autonomous systems. For example, text spotting of numbers in street view data allows the automatic localization of house numbers in maps [20], reading street and shop signs gives robotic vehicles scene context [39], and indexing large volumes of video data with text obtained by text spotting enables fast and accurate retrieval of video data from a text search [26].

[Fig. 1. (a) An end-to-end text spotting result from the presented system on the SVT dataset. (b) Randomly sampled cropped word data automatically mined from Flickr with a weak baseline system, generating extra training data.]

[…] pipeline. To achieve this we use a Convolutional Neural Network (CNN) [27] and generate a per-pixel text/no-text saliency map, a case-sensitive and case-insensitive character saliency map, and a bigram saliency map. The text saliency map drives the proposal of word bounding boxes, while the character and bigram saliency maps assist in recognizing the word within each bounding box through a combination of soft costs. Our work is inspired by the excellent performance of CNNs for character classification [6,8,47].

Our contributions are threefold: first, we introduce a method to share features [44] which allows us to extend our character classifiers to other tasks, such as character detection and bigram classification, at a very small extra cost: we first generate a single rich feature set, by training a strongly supervised character classifier, and then use the intermediate hidden layers as features for the text detection, character case-sensitive and insensitive classification, and bigram classification. This procedure makes best use of the available training data: plentiful for character/non-character but less so for the other tasks. It is reminiscent of the Caffe idea [14], but here it is not necessary to have external sources of training data. A second key novelty in the context of text detection is to leverage the convolutional structure of the CNN to process the entire image in one go instead of running CNN classifiers on each cropped character proposal [27]. This allows us to generate efficiently, in a single pass, all the features required to detect word bounding boxes, and that we use for recognizing words from a fixed lexicon using the Viterbi algorithm. We also make a technical contribution in showing that our CNN architecture, using maxout [21] as the non-linear activation function, has superior performance to the more standard rectified linear unit.

Our third contribution is a method for automatically mining and annotating data (Fig. 1). Since CNNs can have many millions of trainable parameters, we require a large corpus of training data to minimize overfitting, and mining is useful to cheaply extend the available data. Our mining method crawls images from the Internet to automatically generate word level and character level bounding box annotations, and a separate method is used to automatically generate character level bounding box annotations when only word level bounding box annotations are supplied.

In the following we first describe the data mining procedure (Sect. 2) and then the CNN architecture and training (Sect. 3). Our end-to-end (image in, text out) text spotting pipeline is described in Sect. 4. Finally, Sect. 5 evaluates the method on a number of standard benchmarks. We show that the performance exceeds the state of the art across multiple measures.

Related Work. Decomposing the text-spotting problem into text detection and text recognition was first proposed by [12]. Authors have subsequently focused solely on text detection [7,11,16,50,51], or text recognition [31,36,41], or on combining both in end-to-end systems [40,39,49,32-34,45,35,6,8,48]. Text detection methods are based either on connected components (CCs) [11,16,50,49,32-35] or on sliding windows [40,7,39,45]. Connected component methods segment pixels into characters, then group these into words. For example, Epshtein et al. take characters as CCs of the stroke width transform [16], while Neumann and Matas [34,33] use Extremal Regions [29], or more recently oriented strokes [35], as CCs representing characters. Sliding window methods approach text spotting as a standard task of object detection. For example, Wang et al. [45] use a random ferns [38] sliding window classifier to find characters in an image, grouping them using a pictorial structures model [18] for a fixed lexicon. Wang & Wu et al. [47] build on the fixed lexicon problem by using CNNs [27] with unsupervised pre-training as in [13]. Alsharif et al. [6] and Bissacco et al. [8] also use CNNs for character classification; both methods over-segment a word bounding box and find an approximate solution to the optimal word recognition result, in [8] using beam search and in [6] using a Hidden Markov Model. The works by Mishra et al. [31] and Novikova et al. [36] focus purely on text recognition, assuming a perfect text detector has produced cropped images of words. In [36], Novikova combines both visual and lexicon consistency into a single probabilistic model.

2 Data mining for word and character annotations

In this section we describe a method for automatically mining suitable photo sharing websites to acquire word and character level annotated data. This annotation is used to provide additional training data for the CNN in Sect. 5.

Word Mining. Photo sharing websites such as Flickr [3] contain a large range of scenes, including those containing text. In particular, the “Typography and Lettering” group on Flickr [4] contains mainly photos or graphics containing text. As the text depicted in the scenes is the focus of the images, the user-given titles of the images often include the text in the scene. Capitalizing on this weakly supervised information, we develop a system to find title text within the image, automatically generating word and character level bounding box annotations. Using a weak baseline text-spotting system based on the Stroke Width Transform (SWT) [16] and described in Sect. 5, we generate candidate word detections for each image from Flickr. If a detected word is the same as any of the image's title text words, and there are the same number of characters from the SWT detection phase as word characters, we say that this is an accurate word detection, and use this detection as positive text training data.
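To make the matching rule concrete, here is a minimal sketch; the tuple layout and every name in it are illustrative, not from the paper:

```python
def accurate_word_detections(candidates, title):
    """Keep an SWT candidate only if its word matches an image-title word
    and the number of detected character regions equals the word length."""
    title_words = {w.lower() for w in title.split()}
    return [
        (word, char_boxes, word_box)
        for word, char_boxes, word_box in candidates
        if word.lower() in title_words and len(char_boxes) == len(word)
    ]
```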

We set the parameters so that the recall of this process is very low (out of 130,000 images, only 15,000 words were found), but the precision is greater than 99%. This means the precision is high enough for the mined Flickr data to be used as positive training data, but the recall is too low for it to be used for background no-text training data. We will refer to this dataset as FlickrType; it contains 6792 images, 14920 words, and 71579 characters. Fig. 1 shows some positive cropped words randomly sampled from the automatically generated FlickrType dataset. Although this procedure will cause a bias towards scene text that can be found with a simple end-to-end pipeline, it still generates more training examples that can be used to prevent the overfitting of our models.

Automatic Character Annotation. In addition to mining data from Flickr, we also use the word recognition system described in Sect. 4.2 to automatically generate character level bounding box annotations for datasets which only have word level bounding box annotations. For each cropped word, we perform the optimal fitting of the ground truth text to the character map using the method described in Sect. 4.2. This places inter-character breakpoints with implied character centers, which can be used as rough character bounding boxes. We do this for the SVT and Oxford Cornmarket datasets (described in Sect. 5), allowing us to train and test on an extra 22,000 cropped characters from those datasets.

3 Feature learning using a Convolutional Neural Network

The workhorse of a text-spotting system is the character classifier. The output of this classifier is used to recognize words and, in our system, to detect image regions that contain text. Text-spotting systems appear to be particularly sensitive to the performance of character classification; for example, in [8] increasing the accuracy of the character classifier by 7% led to a 25% increase in word recognition. In this section we therefore concentrate on maximizing the performance of this component.

To classify an image patch $x$ as one of the possible characters (or background), we extract a set of features $\phi(x) = (\phi_1(x), \phi_2(x), \dots, \phi_K(x))$ and then learn a binary classifier $f_c$ for each character $c$ of the alphabet $\mathcal{C}$. Classifiers are learned to yield a posterior probability distribution $p(c|x) = f_c(\phi(x))$ over characters, and the latter is maximized to recognize the character $\bar{c}$ contained in patch $x$: $\bar{c} = \arg\max_{c \in \mathcal{C}} p(c|x)$. Traditionally, features are manually engineered and optimized through a laborious trial-and-error cycle involving adjusting the features and re-learning the classifiers. In this work, we propose instead to learn the representation using a CNN [27], jointly optimizing the performance of the features as well as of the classifiers. As noted in the recent literature, a well designed learnable representation of this type can in fact yield substantial performance gains [25].

CNNs are obtained by stacking multiple layers of features. A convolutional layer consists of $K$ linear filters followed by a non-linear response function. The input to a convolutional layer is a feature map $z_i(u,v)$, where $(u,v) \in \Omega_i$ are spatial coordinates and $z_i(u,v) \in \mathbb{R}^C$ contains $C$ scalar features or channels $z_i^k(u,v)$. The output is a new feature map $z_{i+1}$ such that $z_{i+1}^k = h_i(W_{ik} \ast z_i + b_{ik})$, where $W_{ik}$ and $b_{ik}$ denote the $k$-th filter kernel and bias respectively, and $h_i$ is a non-linear activation function such as the Rectified Linear Unit (ReLU) $h_i(z) = \max\{0, z\}$. Convolutional layers can be intertwined with normalization, subsampling, and max-pooling layers which build translation invariance in local neighborhoods. The process starts with $z_1 = x$ and ends by connecting the last feature map to a logistic regressor for classification. All the parameters of the model are jointly optimized to minimize the classification loss over a training set using Stochastic Gradient Descent (SGD), back-propagation, and the other improvements discussed in Sect. 3.1.

Instead of using ReLUs as the activation function $h_i$, in our experiments it was found empirically that maxout [21] yields superior performance. Maxout, in particular when used in the final classification layer, can be thought of as taking the maximum response over a mixture of $n$ linear models, allowing the CNN to easily model multiple modes of the data. The simplest maxout operator, given two feature channels $z_i^1$ and $z_i^2$, is simply their pointwise maximum: $h_i(z_i(u,v)) = \max\{z_i^1(u,v), z_i^2(u,v)\}$. More generally, the $k$-th maxout operator $h_i^k$ is obtained by selecting a subset $G_i^k \subset \{1, 2, \dots, K\}$ of feature channels and computing the maximum over them: $h_i^k(z_i(u,v)) = \max_{k' \in G_i^k} z_i^{k'}(u,v)$. While different grouping strategies are possible, here groups are formed by taking $g$ consecutive channels of the input map: $G_i^1 = \{1, 2, \dots, g\}$, $G_i^2 = \{g+1, g+2, \dots, 2g\}$, and so on. Hence, given $K$ feature channels as input, maxout constructs $K' = K/g$ new channels.
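As a concrete illustration, a minimal sketch of channel-grouped maxout on an (N, K, H, W) feature map; PyTorch is our choice here (the paper's implementation is built on cuda-convnet [25]):

```python
import torch

def maxout(z: torch.Tensor, g: int) -> torch.Tensor:
    """Pointwise maximum over groups of g consecutive channels:
    (N, K, H, W) -> (N, K // g, H, W)."""
    n, k, h, w = z.shape
    assert k % g == 0, "channel count K must be divisible by group size g"
    return z.view(n, k // g, g, h, w).max(dim=2).values

y = maxout(torch.randn(1, 96, 16, 16), g=2)   # 96 channels -> 48
```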

This section discusses the details of learning the character classifiers. Training is divided into two stages. In the first stage, a case-insensitive CNN character classifier is learned. In the second stage, the resulting feature maps are applied to other classification problems as needed. The output is four state-of-the-art CNN classifiers: a character/background classifier, a case-insensitive character classifier, a case-sensitive character classifier, and a bigram classifier.

Stage 1: Bootstrapping the case-insensitive classifier. The case-insensitive classifier uses a four-layer CNN outputting a probability $p(c|x)$ over an alphabet $\mathcal{C}$ including all 26 letters, 10 digits, and a noise/background (no-text) class, giving a total of 37 classes (Fig. 2). The input $z_1 = x$ of the CNN are grayscale cropped character images of 24 × 24 pixels, zero-centered and normalized by subtracting the patch mean and dividing by the standard deviation.
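The normalization is per patch, using each patch's own statistics; a one-line sketch (the small epsilon, guarding against constant patches, is our addition):

```python
import torch

def normalize_patch(x: torch.Tensor) -> torch.Tensor:
    """Zero-center a patch by its own mean and scale by its own std."""
    return (x - x.mean()) / (x.std() + 1e-8)   # epsilon is our choice
```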

[Fig. 3. Visualizations of each character class learnt from the 37-way case-insensitive character classifier CNN. Each image is synthetically generated by maximizing the posterior probability of a particular class. This is implemented by back-propagating the error from a cost layer that aims to maximize the score of that class [43,17].]

Due to the small input size, no spatial pooling or downsampling is performed. Starting from the first layer, the input image is convolved with 96 filters of size 9 × 9, resulting in a map of size 16 × 16 (to avoid boundary effects) and 96 channels. The 96 channels are then pooled with maxout in groups of size g = 2, resulting in 48 channels. The sequence continues by convolving with 128, 512, 148 filters of side 9, 8, 1 and maxout groups of size g = 2, 4, 4, resulting in feature maps with 64, 128, 37 channels and size 8 × 8, 1 × 1, 1 × 1 respectively. The last 37 channels are fed into a soft-max to convert them into character probabilities.
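Assembling the stated kernel sizes, filter counts, and maxout group sizes gives the following sketch, again assuming PyTorch; the module layout is ours, but each shape annotation follows the text:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout over groups of g consecutive channels (see the sketch above)."""
    def __init__(self, g: int):
        super().__init__()
        self.g = g

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        n, k, h, w = z.shape
        return z.view(n, k // self.g, self.g, h, w).max(dim=2).values

char_cnn = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=9),     # 24x24 input -> 16x16, 96 channels
    Maxout(2),                           # -> 48 channels
    nn.Conv2d(48, 128, kernel_size=9),   # -> 8x8
    Maxout(2),                           # -> 64 channels
    nn.Conv2d(64, 512, kernel_size=8),   # -> 1x1
    Maxout(4),                           # -> 128 channels
    nn.Conv2d(128, 148, kernel_size=1),  # -> 1x1
    Maxout(4),                           # -> 37 channels, one per class
)

x = torch.randn(8, 1, 24, 24)                  # batch of normalized patches
probs = char_cnn(x).flatten(1).softmax(dim=1)  # (8, 37) class probabilities
```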

In practice we use 48 channels in the final classification layer rather than 37, as the software we use, based on cuda-convnet [25], is optimized for multiples of 16 convolutional filters; we do however use the additional 11 classes as extra no-text classes, abstracting this to 37 output classes. We train using stochastic gradient descent and back-propagation, and also use dropout [22] in all layers except the first convolutional layer to help prevent overfitting. Dropout simply involves randomly zeroing a proportion of the parameters; the proportion we keep for each layer is 1, 0.5, 0.5, 0.5. The training data is augmented by random rotations and noise injection.
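The keep probabilities 1, 0.5, 0.5, 0.5 translate to no dropout on the first layer and a drop probability of 0.5 elsewhere. Where exactly each dropout sits relative to its convolution is not spelled out; one plausible placement, reusing Maxout and the shapes from the sketch above:

```python
import torch.nn as nn

char_cnn_dropout = nn.Sequential(
    nn.Conv2d(1, 96, 9), Maxout(2),                      # no dropout here
    nn.Dropout(0.5), nn.Conv2d(48, 128, 9), Maxout(2),
    nn.Dropout(0.5), nn.Conv2d(64, 512, 8), Maxout(4),
    nn.Dropout(0.5), nn.Conv2d(128, 148, 1), Maxout(4),
)
```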

By omitting any downsampling in our network and ensuring that the output for each class is one pixel in size, it is immediate to apply the learnt filters to a full image in a convolutional manner and obtain a per-pixel output without a loss of resolution, as shown in the second image of Fig. 4. Fig. 3 illustrates the learned CNN by using the visualization technique of [43].
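Since every layer is a valid convolution with no pooling, the 24 × 24 patch classifier can be applied to a whole image in one forward pass; a sketch reusing char_cnn from above (the 11 + 12 padding split follows from the 24-pixel receptive field and is our inference):

```python
import torch.nn.functional as F

def saliency_maps(char_cnn, image):
    """Dense per-pixel application of the patch classifier.
    image: (1, 1, H, W); returns (1, 37, H, W) class probability maps."""
    padded = F.pad(image, (11, 12, 11, 12))  # restore output resolution
    return char_cnn(padded).softmax(dim=1)
```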

Stage 2: Learning the other character classifiers. Training on a large amount of annotated data, and also including a no-text class in our alphabet, means the hidden layers of the network produce feature maps highly adept at discriminating characters, which can be adapted for other classification tasks related to text. We use the outputs of the second convolutional layer as our set of discriminative features, $\phi(x) = z_2$. From these features, we train a 2-way text/no-text classifier¹, a 63-way case-sensitive character classifier, and a bigram classifier, each one using a two-layer CNN acting on $\phi(x)$ (Fig. 2). The last two layers of each of these three CNNs result in feature maps with 128-2, 128-63, and 128-604 channels respectively, all resulting from maxout grouping of size g = 4. These are all trained with $\phi(x)$ as input, with dropout of 0.5 on all layers, and fine-tuned by adaptively reducing the learning rate. The bigram classifier recognises instances of two adjacent characters, e.g. Fig. 6.

These CNNs could have been learned independently. However, sharing the first two layers has two key advantages. First, the low-level features learned from case-insensitive character classification allow sharing training data among tasks, reducing overfitting and improving performance in classification tasks with less informative labels (text/no-text classification), or tasks with fewer training examples (case-sensitive character classification, bigram classification). Second, it allows sharing computations, significantly increasing the efficiency.
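A sketch of this shared-trunk arrangement, reusing char_cnn and Maxout from above; the internal head shapes (512 filters of side 8, then 4C filters of side 1) are inferred from the stated 128-C channel counts and g = 4, so treat them as our reading rather than the paper's specification:

```python
import torch
import torch.nn as nn

def head(num_classes: int) -> nn.Sequential:
    """Two-layer CNN head acting on the shared features phi(x) = z2
    (64 channels at 8x8 resolution for a 24x24 input patch)."""
    return nn.Sequential(
        nn.Conv2d(64, 512, kernel_size=8), Maxout(4),               # -> 128
        nn.Conv2d(128, 4 * num_classes, kernel_size=1), Maxout(4),  # -> C
    )

trunk = char_cnn[:4]       # conv1+maxout, conv2+maxout: phi(x) = z2
text_head = head(2)        # text / no-text
case_head = head(63)       # case-sensitive characters
bigram_head = head(604)    # bigrams

phi = trunk(torch.randn(1, 1, 24, 24))  # shared features, computed once
is_text = text_head(phi)                # each head reuses phi(x)
```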

4 End-to-End Pipeline

This section describes the various stages of the proposed end-to-end text spotting system, making use of the features learnt in Sect. 3. The pipeline starts with a detection phase (Sect. 4.1) that takes a raw image and generates candidate bounding boxes of words, making use of the text/no-text classifier. The words contained within these bounding boxes are then recognized against a fixed lexicon of words (Sect. 4.2), driven by the character classifiers, bigram classifier, and other geometric cues.

The aim of the detection phase is to start from a large, raw pixel input image and generate a set of rectangular bounding boxes, each of which should contain the image of a word. This detection process (Fig. 4) is tuned for high recall, and generates a set of candidate word bounding boxes. The process starts by computing a text saliency map by evaluating the character/background CNN classifier in a sliding window fashion across the image, which has been appropriately zero-padded so that the resulting text saliency map is the same resolution as the original image.

¹ Training a dedicated classifier was found to yield superior performance to using the background class in the 37-way case-insensitive character classifier.

[Fig. 4. The detector phase for a single scale. From left to right: input image, CNN-generated text saliency map using the text/no-text classifier, after the run-length smoothing phase, after the word splitting phase, the implied bounding boxes. Subsequently, the bounding boxes are combined at multiple scales and undergo filtering and non-maximal suppression.]

As the CNN is trained to detect text at a single canonical height, this process is repeated for 16 different scales to target text heights between 16 and 260 pixels by resizing the input image.
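In code, the scale schedule might look as follows; the paper fixes the range (16 to 260 pixels) and the number of scales (16), while the geometric spacing and the 24-pixel canonical height are our assumptions:

```python
import numpy as np

target_heights = np.geomspace(16, 260, num=16)  # text heights to cover
scale_factors = 24.0 / target_heights           # resize factor per scale
```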

Given these saliency maps, word bounding boxes are generated independently at each scale in two steps. The first step is to identify lines of text. To this end, the probability map is first thresholded to find local regions of high probability. Then these regions are connected into text lines by using the run-length smoothing algorithm (RLSA): for each row of pixels, the mean $\mu$ and standard deviation $\sigma$ of the spacings between probability peaks are computed, and neighboring regions are connected if the space between them is less than $3\mu - 0.5\sigma$. Finding connected components of the linked regions results in candidate text lines.
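A simplified per-row sketch of the RLSA linking rule, operating on one thresholded row (the real system measures spacings between probability peaks of the saliency map):

```python
import numpy as np

def rlsa_row(row: np.ndarray) -> np.ndarray:
    """Bridge gaps between 'on' pixels in a binary row when a gap is
    shorter than 3*mu - 0.5*sigma of the row's gap statistics."""
    on = np.flatnonzero(row)
    if on.size < 2:
        return row
    gaps = np.diff(on) - 1            # zeros between consecutive on-pixels
    widths = gaps[gaps > 0]
    if widths.size == 0:
        return row
    thresh = 3 * widths.mean() - 0.5 * widths.std()
    out = row.copy()
    for left, right, gap in zip(on[:-1], on[1:], gaps):
        if 0 < gap < thresh:
            out[left:right + 1] = 1   # link the neighboring regions
    return out
```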

The next step is to split text lines into words. For this, the image is cropped to just that of a text line and Otsu thresholding [37] is applied to roughly segment foreground characters from the background. Adjacent connected components (which are hopefully segmented characters) are then connected if their horizontal spacings are less than the mean horizontal spacing for the text line, again using RLSA. The resulting connected components give candidate bounding boxes for individual words, which are then added to the global set of bounding boxes at all scales.
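A rough sketch of the word-splitting step, assuming dark text on a light background and threshold_otsu from scikit-image; the column-projection bookkeeping is ours:

```python
import numpy as np
from skimage.filters import threshold_otsu

def word_spans(line_img: np.ndarray):
    """Split a cropped text-line image into horizontal word spans by
    merging character runs whose gap is below the line's mean gap."""
    fg = line_img < threshold_otsu(line_img)   # foreground characters
    cols = fg.any(axis=0).astype(int)          # column projection
    edges = np.diff(np.pad(cols, 1))
    starts = np.flatnonzero(edges == 1)        # run starts (inclusive)
    ends = np.flatnonzero(edges == -1)         # run ends (exclusive)
    if starts.size == 0:
        return []
    gaps = starts[1:] - ends[:-1]
    spans, (s, e) = [], (starts[0], ends[0])
    for gap, s2, e2 in zip(gaps, starts[1:], ends[1:]):
        if gap < gaps.mean():                  # small gap: same word
            e = e2
        else:
            spans.append((s, e))
            s, e = s2, e2
    spans.append((s, e))
    return spans
```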

Finally, these bounding boxes are filtered based on geometric constraints (box height, aspect ratio, etc.) and undergo non-maximal suppression, sorting them by decreasing average per-pixel text saliency score.

The aim of the word recognition stage is to take the candidate cropped word images $I \in \mathbb{R}^{W \times H}$ of width $W$ and height $H$ and estimate the text contained in them. In order to recognize a word from a fixed lexicon, each word hypothesis is scored using […]
