Lesson 3 Utilisation of the GPU Architecture for HPC


Vocabulary · Important Sentences · Questions and Answers · Problems

1 Introduction

Graphics Processing Units (GPUs), which commonly accompany standard Central Processing Units (CPUs) in consumer PCs, are special purpose processors designed to efficiently perform the calculations necessary to generate visual output from program data. Video games have particularly high rendering demands, and this market has driven the development of GPUs, which in comparison to CPUs offer extremely high performance for the monetary cost.

Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.[1] In particular, there is potential to use GPUs to boost the performance of the types of simulations commonly done on traditional HPC (High Performance Computing) systems such as HPCx. There are challenges to be overcome, however, to realise this potential.

The demands placed on GPUs by their native applications are, however, usually quite unique, and as such the GPU architecture is quite different from that of the CPU. Graphics processing is inherently extremely parallel, so it can be highly threaded and performed on the large numbers (typically hundreds) of processing cores found in the GPU chip. The GPU memory system is quite different to the standard CPU equivalent system. Furthermore, the GPU architecture reflects the fact that graphics processing typically does not require the same level of accuracy and precision as scientific simulation. Specialised software development is currently required to enable applications to efficiently utilise the GPU architecture.

This report first gives a discussion of scientific computing on GPUs. Then, we describe the porting of an HPC benchmark application to the NVIDIA TESLA GPU architecture, and give performance results compared with the use of a standard CPU.

2 Background

2.1 GPUs

The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, GPUs typically contain 100 or more basic cores. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The GPU clock frequency is typically lower than that of a CPU, but this gap has been closing over the last few years. Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.[2]

This section introduces the architectural design of GPUs. NVIDIA's products are focused on here, but offerings from other GPU manufacturers, such as ATI, are similar. Fig. 1 illustrates the layout of a GPU. It can be seen that there are many processing cores (processors) to perform computation, each grouped into multiprocessors. There are several levels of memory which differ in terms of access speed and scope. The Registers have processor scope; the Shared Memory, Constant Cache and Texture Cache have multiprocessor scope; and the Device (or Global) Memory can be accessed by all cores on a chip. Note that the GPU memory address space is separate from that of the CPU, and copying of data between the devices must be managed in software. Typically, the CPU will run the program skeleton, and offload one or more computationally demanding code sections to the GPU. Thus, the GPU effectively accelerates the application. The CPU is referred to as the Host and the GPU as the Device. Functions that run on the Device are called kernels.

Fig. 1 Architectural layout of NVIDIA GPU chip and memory

On the GPU, operations are performed by threads that are grouped into blocks, which are in turn arranged on a grid. Each block is executed by a single processor; however, if there are enough resources available, several blocks can be active at the same time on a processor. The processor will time-slice the blocks to improve performance, one block performing calculations while another is waiting for a memory read, for example. Some of the memory available to the GPU exhibits considerable latency; however, by using this method of time-slicing, this latency can be hidden for suitable applications.

A group of 32 threads is called a warp, and 16 threads a half-warp. GPUs achieve best performance when half-warps of threads perform the same operation. This is because in this situation, the threads can be executed in parallel. Conditionals can mean that threads do not perform the same operations, and so they must be serialised. Such threads are said to be divergent. This also applies for Global Memory accesses: if the threads of a half-warp access Global Memory together and obey certain rules to qualify as being coalesced, then they access the memory in parallel, and it will only take the time of a single access for all threads of the half-warp to access the memory.
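As a hypothetical illustration (these kernels are ours, not taken from the report), the first kernel below forces even and odd threads of each half-warp down different branches and so is serialised, while in the second each thread of a half-warp reads consecutive words of an aligned, contiguous segment, a pattern that qualifies as coalesced:

    /* Divergent: even and odd threads of the same half-warp take
       different branches, so the two paths execute one after the other. */
    __global__ void divergent_scale(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }

    /* Coalesced: thread k of each half-warp touches word k of a
       contiguous, aligned segment, so the 16 accesses complete together. */
    __global__ void coalesced_copy(float *dst, const float *src)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }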

Global Memory is located in the graphics card's GDDR3 memory. This can be accessed by all threads, although it is usually slower than on-chip memory. Memory access is significantly improved if memory accesses are coalesced, as this allows all the threads of a half-warp to access the memory simultaneously.

Shared Memory can only be accessed by threads in the same block. Because it is on chip, the Shared Memory space is much faster than the Local and Global Memory spaces. Approximately 16 KB of Shared Memory is available on each MP (multiprocessor); however, to permit each MP to have several blocks active at a time (which improves performance), it is advisable to use as little Shared Memory as possible per block. A little less than 16 KB is effectively available due to the storage of internal variables.
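As a hypothetical sketch of Shared Memory use (the kernel is ours, not from the report), each block below stages a 256-element tile in a __shared__ array, synchronises, and writes the tile back reversed; it should be launched with 256 threads per block. Its access patterns also anticipate the bank rules described in the next paragraph.

    __global__ void reverse_tile(float *data)
    {
        __shared__ float tile[256];     /* on chip, one copy per block */
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];       /* coalesced read from Global Memory,
                                           conflict-free write (bank t % 16) */
        __syncthreads();                /* all threads of the block wait here */

        data[base + t] = tile[255 - t]; /* read the tile in reverse: still one
                                           word per bank within a half-warp   */
    }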

Shared Memory consists of 16 memory banks. When Shared Memory is allocated, each consecutive 32-bit word is placed on a different memory bank. To achieve maximum memory performance, bank conflicts (two threads trying to access the same bank at the same time) must be avoided. In the case of a bank conflict, the conflicting memory accesses are serialised; otherwise, memory access by each half-warp is done in parallel.

Constant Memory is read-only memory that is cached. It is located in Global Memory; however, there is a cache located on each multiprocessor. If the requested memory is in the cache, then access is as fast as Shared Memory, but if it is not, then the access will be the same as a Global Memory access.

Texture Memory is read-only memory that is cached and is optimized for 2D spatial locality. This means that accessing [a][b] and [a+1][b], say, will probably get better speed than if [a][b] and [a+54][b] were accessed instead.[3] The Texture Cache is 16 KB per processor. This is a different 16 KB from the Shared Memory, so using the Texture Cache does not reduce available Shared Memory.
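A sketch of how Texture Memory was used at the time of the report, via the legacy texture-reference API (the names blur_x and bind_image are illustrative, and error checking is omitted):

    texture<float, 2, cudaReadModeElementType> tex;   /* texture reference */

    /* Each thread averages a texel with its x-neighbour: neighbouring
       threads fetch neighbouring texels, matching the 2D-locality cache. */
    __global__ void blur_x(float *out, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y;
        out[y * width + x] = 0.5f * (tex2D(tex, x, y) + tex2D(tex, x + 1, y));
    }

    /* Host side: copy the image into a cudaArray and bind it to tex. */
    void bind_image(const float *h_img, int width, int height)
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray *arr;
        cudaMallocArray(&arr, &desc, width, height);
        cudaMemcpyToArray(arr, 0, 0, h_img, width * height * sizeof(float),
                          cudaMemcpyHostToDevice);
        cudaBindTextureToArray(tex, arr, desc);
    }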

Register memory also exists, and its access speed is similar to that of Shared Memory. Each thread in a block has its own independent version of the register variables declared. Variables that are too large will be placed in Local Memory, which is located in Global Memory. The Local Memory space is not cached, so accesses to it are as expensive as normal accesses to Global Memory.

2.2 CUDA

CUDA (Compute Unified Device Architecture) is a programming language developed by NVIDIA to facilitate writing programs that run on CUDA-enabled GPUs. It is an extension of C and is compiled using the nvcc compiler. The most commonly used extensions are:

· cudaMalloc* to allocate memory on the Device;
· cudaMemcpy* to copy data between the Host and Device and between different locations on the Device;
· kernelname<<<grid dimensions, block dimensions>>>(parameters) to launch a kernel;
· threadIdx.x, blockIdx.x, blockDim.x, and gridDim.x to identify the thread, block, block dimension, and grid dimension in the x direction.
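The following minimal program (a sketch of our own; the names double_elements and d_data are illustrative) exercises each of these extensions: it allocates Device memory, copies an array across, launches a kernel of 4 blocks of 256 threads to double each element, and copies the result back.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Kernel: each thread doubles one element of the array. */
    __global__ void double_elements(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global index */
        if (i < n)
            data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1024;
        float h_data[1024];
        float *d_data;
        int i;

        for (i = 0; i < n; i++)
            h_data[i] = (float)i;

        cudaMalloc((void **)&d_data, n * sizeof(float)); /* Device alloc  */
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);              /* Host->Device  */

        double_elements<<<n / 256, 256>>>(d_data, n);    /* launch kernel */

        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);              /* Device->Host  */
        cudaFree(d_data);

        printf("h_data[10] = %f\n", h_data[10]);         /* expect 20.0   */
        return 0;
    }

Such a file is compiled with the nvcc compiler mentioned above, e.g. nvcc example.cu -o example.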

CUDA addressed a number of issues that affected developing programs for GPUs, which previously required much specialist knowledge. CUDA is quite simple, so it will not take much time for a programmer already familiar with C to begin using it. CUDA also possesses a number of other benefits over previous methods of GPU programming. One of these is that it permits threads to access any location in the GPU memory and to read and write to as many memory locations as necessary. These were previously quite limiting constraints, and so easing them represents a significant advantage for CUDA. Another major benefit is permitting access to Shared Memory, which was previously not possible.

To make adoption of CUDA as easy as possible, NVIDIA has created CUDA U, which contains a well-written tutorial with exercises as well as links to course notes and videos of CUDA courses taught at the University of Illinois. A Reference Manual and Programming Guide are also available.

The CUDA SDK contains many example codes that can be used to test the installation of a GPU and, as the source codes are provided, demonstrate CUDA programming techniques. One of the provided codes is a template, providing the basic structure on which programs can be based.

One of the main features of CUDA is the provision of a Linear Algebra library (CUBLAS) and an FFT library (CUFFT). These greatly ease the implementation of many scientific codes on a GPU.
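As a sketch of the style of use (the wrapper function is ours; the calls below are from the original, handle-less CUBLAS interface of the period), a single-precision matrix multiply C = A*B might look like:

    #include <cublas.h>

    /* C = A * B for n x n single-precision matrices, via CUBLAS. */
    void gpu_sgemm(const float *A, const float *B, float *C, int n)
    {
        float *dA, *dB, *dC;

        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);

        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n); /* Host->Device */
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

        /* Column-major SGEMM: C = 1.0 * A * B + 0.0 * C */
        cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n); /* Device->Host */

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }

Note that CUBLAS inherits the column-major storage convention of BLAS. CUFFT is similar in spirit, built around a plan-then-execute pattern (cufftPlan1d, cufftExecC2C).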

2.3 Review of GPU Successes

In this section, some recent work on the use of GPUs for scientific computing is highlighted.

· The Theoretical and Computational Biophysics group at the University of Illinois at Urbana-Champaign has used GPUs to achieve accelerations of between 20 and 100 times for molecular modelling applications.

· Professor Mike Giles of Oxford University achieved a 100 times speed-up for a LIBOR Monte Carlo application and a 50 times speed-up for a 3D Laplace Solver. The Laplace Solver was implemented on the GPU using only Global and Shared Memory. It uses a Jacobi iteration of a Laplace discretisation on a uniform 3D grid (see the sketch after this list). The LIBOR Monte Carlo code used was quite similar to the original CPU code. It uses Global and Constant Memory.

· Many other UK researchers are also experimenting with GPUs. NVIDIA has a showcase of applications reported to them. GPGPU.org also maintains a list of researchers using GPUs.

· RapidMind achieved a 2.4 times speed-up for BLAS SGEMM, 2.7 times for FFT, and 32.2 times for Black-Scholes.
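To make the Jacobi scheme concrete, a minimal Global Memory kernel for one sweep of the 3D Laplace discretisation might look like the following (our illustration, not Giles's actual code): each interior point is replaced by the average of its six neighbours, with separate old and new arrays, as Jacobi iteration requires.

    /* Index into a flattened NX x NY x NZ grid. */
    #define IDX(i, j, k) ((i) + NX * ((j) + NY * (k)))

    __global__ void jacobi_step(const float *u_old, float *u_new,
                                int NX, int NY, int NZ)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k;

        if (i < 1 || i >= NX - 1 || j < 1 || j >= NY - 1)
            return;                       /* skip boundary points */

        for (k = 1; k < NZ - 1; k++)      /* march along the third axis */
            u_new[IDX(i, j, k)] =
                (u_old[IDX(i - 1, j, k)] + u_old[IDX(i + 1, j, k)]
               + u_old[IDX(i, j - 1, k)] + u_old[IDX(i, j + 1, k)]
               + u_old[IDX(i, j, k - 1)] + u_old[IDX(i, j, k + 1)]) / 6.0f;
    }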

2.4 GPU Disadvantages and Alternative Acceleration Technologies

In this section, some disadvantages of the GPU architecture are discussed, and some alternative acceleration technologies are briefly described. The key limitation of GPUs is the requirement for a high level of parallelism to be inherent to the application to enable exploitation of the many cores. Furthermore, graphics processing typically does not require the same level of accuracy and precision as scientific simulation, and this is reflected in the fact that GPUs typically lack both error correction functionality and double precision computational functionality. This is expected to improve with future GPU architectures.

Another common criticism of GPUs is their large power consumption. The NVIDIA Tesla C870 uses up to 170 W peak, and 120 W typical. The amount of heat produced would make it difficult to cluster large numbers of GPUs together.

GPUs also place greater constraints on programmers than CPUs. To avoid significant performance degradation, it is necessary to avoid conditionals inside kernels. Avoiding non-coalesced Global Memory accesses, which can also severely degrade performance, is very difficult for many applications. The lack of any inter-block communication functionality means that it is not possible for threads in a block to determine when the threads in another block have completed their calculation. This means that if results of computation from other blocks are required, then the only solution is for the kernel to exit and another to be launched, which guarantees that all of the blocks have completed.
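A minimal sketch of that relaunch pattern follows (our illustration, reusing the hypothetical jacobi_step kernel from Section 2.3): kernel launches issued to the Device execute in order, so beginning a new sweep implicitly waits until every block of the previous sweep has completed.

    /* Host code: each relaunch acts as a device-wide barrier between
       sweeps, since launches execute in issue order.                  */
    void run_jacobi(float *d_old, float *d_new,
                    int NX, int NY, int NZ, int niters)
    {
        dim3 block(16, 16);
        dim3 grid(NX / 16, NY / 16);
        float *tmp;
        int iter;

        for (iter = 0; iter < niters; iter++) {
            jacobi_step<<<grid, block>>>(d_old, d_new, NX, NY, NZ);
            tmp = d_old; d_old = d_new; d_new = tmp;  /* swap on the Host */
        }
        cudaThreadSynchronize();   /* Host waits for the final sweep */
    }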

Finally, GPUs suffer from large latency in CPU-GPU communication. This bottleneck can mean that unless the amount of processing done on the GPU is great enough, it may be faster to simply perform calculations on the CPU. There are other alternative acceleration technologies available, some of which are briefly described below.

Clearspeed: One alternative to GPUs is the use of processors designed especially for HPC applications, such as those offered by Clearspeed. These products are usually quite similar to GPUs, with a few modifications that usually make them more suitable for HPC applications. One of these differences is that all internal and external memory contains ECC (Error Correction Code) to detect and correct 'soft errors'. 'Soft errors' are random one-bit errors that are caused by external factors such as cosmic rays.

In the graphics market such errors are tolerable, and so GPUs do not contain ECC; however, for HPC applications it is often desirable or required. Clearspeed products also have more cores than GPUs, but they run at a slower clock speed to reduce heat output. Double precision is also available.

Specialised products such as Clearspeed processors have a much smaller market than that of GPUs. This gives GPUs a number of advantages, such as economies of scale, greater availability, and more money spent on R&D.

Intel Larrabee: Another alternative that is likely to generate much interest when it is released in 2009-2010 is Intel's Larrabee processor. This will be a many-core x86 processor with vector capability. It has the significant advantage over GPUs of making inter-processor communication possible. It should also solve a number of other problems that affect GPUs, such as the latency of CPU-GPU communication. It will initially be aimed at the graphics market, although specialised HPC products based on it are possible in the future. It is likely that it will also contain ECC to minimise 'soft errors'. AMD is also developing a similar product, currently named 'AMD Fusion'; however, few details have been released yet.

Cell Processor: A Cell chip contains one Power Processor Element (PPE) and several Synergistic Processing Elements (SPEs). The PPE acts mainly to control the SPEs, which do most of the calculations. Cell processors are quite similar to GPUs. For some applications GPUs outperform Cell processors, while for others the opposite is true.

FPGAs: Field Programmable Gate Arrays (FPGAs) are programmable semiconductor devices that are based around a matrix of configurable logic blocks connected via programmable interconnects. As opposed to normal microprocessors, where the design of the device is fixed by the manufacturer, FPGAs can be programmed to compute the exact algorithm required by a given application. This makes them very powerful and versatile. The main disadvantages are that they are usually quite difficult to program, and they are also slow if high precision is required. For certain tasks they are popular, however. Several time-consuming algorithms in astronomy where only 4-bit precision is necessary are very suitable for FPGAs, for example.

3 GPU Acceleration of an HPC Benchmark (Omitted)

4 Conclusions

GPUs, originally designed to satisfy the rendering computational demands of video games, potentially offer performance benefits for more general purpose applications, including HPC simulations. The differences between the GPU and standard CPU architectures result in the requirement that significant effort must be invested to enable efficient use of the GPU architecture for such applications.

We described the GPU architecture and the methods used for software development, and reported that there is potential for the use of GPUs in HPC: there have been notable successes in several research areas. We described the porting of an HPC benchmark application to the GPU architecture, where several degrees of optimisation were performed, and benchmarked the resulting codes against code run on a standard CPU. The GPU was seen to offer up to a factor of 7.5 performance improvement.

Vocabulary

1. render vt. to give, provide, or pay (help, a service, tribute); to present, submit, or deliver; to perform (music or a play); to translate or express in another language; to cause to be or become, to put into a certain state; to hand over, give up (with up), give back (with back); to depict or express; to plaster (a wall); to melt down (fat); to reword, to translate (often with in or into). vi. to give compensation; to melt down fat. n. in the field of computer graphics, a render is the output of the rendering process.

2. harness n. harness (for a horse); a safety harness or belt (to prevent falls). vt. to put a harness on (a horse, etc.); to bring under control and make use of, to harness (a resource).

3. susceptible adj. easily influenced or affected; impressionable, emotional; allergic, sensitive; liable to be infected by; able to withstand; admitting of, capable of, open to.

4. scope n. room or opportunity (for action or ability); the range (of matters handled or studied); -scope (an observation instrument); field of view; breadth of knowledge or understanding; extent (of activity, influence, or coverage); capacity, power; length.

5. thread n. thread, fine line; a clue, a train of thought; something threadlike; a thin strip; a screw thread; clothes. vt. to pass (a needle, thread, etc.) through; to load (film) into a projector; to string together; to thread (film, rope) into; to sew with thread; to weave thread into.

6. warp n. a bend or twist; the warp (lengthwise threads in weaving). vt. & vi. to bend, to distort. vt. to make (behaviour, judgment, etc.) unreasonable; to pervert.

7. divergent adj. differing, at variance; branching apart; divergent, spreading.

8. texture n. feel, texture, quality (of a surface or fabric); mouthfeel; the harmonious unity or spirit (of music or literature).

9. coalesce vi. to unite, to merge.

Important Sentences

[1] Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.

Here "as to" means "with regard to", and "to harness" means to bring under control and direct the force of: the sentence says that people naturally became interested in directing and controlling the GPU's processing power so as to strengthen more general purpose computation.

[2] Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.
