Lesson 3 Utilisation of the GPU Architecture for HPC
(Lesson 3: GPUs for High-Performance Computing)
Vocabulary | Important Sentences | Questions and Answers | Problems
1 Introduction
Graphics Processing Units (GPUs), which commonly accompany standard Central Processing Units (CPUs) in consumer PCs, are special purpose processors designed to efficiently perform the calculations necessary to generate visual output from program data. Video games have particularly high rendering demands, and this market has driven the development of GPUs, which, in comparison to CPUs, offer extremely high performance for the monetary cost.
Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.[1] In particular, there is potential to use GPUs to boost the performance of the types of simulations commonly done on traditional HPC (High Performance Computing) systems such as HPCx. There are challenges to be overcome, however, to realise this potential.
The demands placed on GPUs by their native applications are, however, usually quite distinctive, and as such the GPU architecture is quite different from that of the CPU. Graphics processing is inherently extremely parallel, so it can be highly threaded and performed on the large numbers (typically hundreds) of processing cores found in the GPU chip. The GPU memory system is quite different to the standard CPU equivalent system. Furthermore, the GPU architecture reflects the fact that graphics processing typically does not require the same level of accuracy and precision as scientific simulation. Specialised software development is currently required to enable applications to efficiently utilise the GPU architecture.
This report first gives a discussion on scientific computing on GPUs. Then, we describe the porting of an HPC benchmark application to the NVIDIA TESLA GPU architecture, and give performance results comparing to use of a standard CPU.
2 Background
2.1 GPUs
The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, GPUs typically contain 100 or more basic cores. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The GPU clock frequency is typically lower than that of a CPU, but this gap has been closing over the last few years. Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.[2]
This section introduces the architectural design of GPUs. NVIDIA’s products are focused on here but offerings from other GPU manufacturers, such as ATI, are similar. Fig. 1 illustrates the layout of a GPU. It can be seen that there are many processing cores (processors) to perform computation, each grouped into multiprocessors. There are several levels of memory which differ in terms of access speed and scope. The Registers have processor scope; the Shared Memory, Constant Cache and Texture Cache have multiprocessor scope; and the Device (or Global) memory can be accessed by all cores on a chip. Note that the GPU memory address space is separate from that for the CPU, and copying of data between the devices must be managed in software. Typically, the CPU will run the program skeleton, and offload one or more computationally demanding code sections to the GPU. Thus, the GPU effectively accelerates the application. The CPU is referred to as the Host and the GPU as the Device. Functions that run on the Device are called kernels.
Fig. 1 Architectural layout of NVIDIA GPU chip and memory
On the GPU, operations are performed by threads that are grouped into blocks, which are in turn arranged on a grid. Each block is executed by a single multiprocessor; however, if there are enough resources available, several blocks can be active at the same time on a multiprocessor. The multiprocessor will time-slice the blocks to improve performance, one block performing calculations while another is waiting for a memory read, for example. Some of the memory available to the GPU exhibits considerable latency; however, by using this method of time-slicing, this latency can be hidden for applications that are suitable.
A group of 32 threads is called a warp, and 16 threads a half-warp. GPUs achieve best performance when half-warps of threads perform the same operation. This is because in this situation, the threads can be executed in parallel. Conditionals can mean that threads do not perform the same operations and so they must be serialised. Such threads are said to be divergent. This also applies for Global Memory accesses: if the threads of a half-warp access Global Memory together and obey certain rules to qualify as being coalesced, then they access the memory in parallel and it will only take the time of a single access for all threads of the half-warp to access the memory.
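The cost of divergence can be illustrated with a small model. The sketch below is plain Python, not CUDA, and rests on a simplifying assumption that is not stated in the text: a half-warp needs one execution pass for each distinct branch taken by its threads. The function name serialised_passes is hypothetical.

```python
HALF_WARP = 16  # threads per half-warp, as stated in the text

def serialised_passes(branch_of_thread, num_threads):
    """Count execution passes per half-warp under a simplified model:
    one pass for each distinct branch taken by the half-warp's threads."""
    passes = []
    for start in range(0, num_threads, HALF_WARP):
        lanes = range(start, min(start + HALF_WARP, num_threads))
        distinct = {branch_of_thread(t) for t in lanes}
        passes.append(len(distinct))
    return passes

# Branch chosen per half-warp: all threads of a half-warp agree.
uniform = serialised_passes(lambda t: (t // HALF_WARP) % 2, 64)
# Branch chosen by thread parity: every half-warp splits two ways.
divergent = serialised_passes(lambda t: t % 2, 64)
print(uniform)    # [1, 1, 1, 1]
print(divergent)  # [2, 2, 2, 2]
```

In this model, arranging a conditional so that it depends on the half-warp index rather than on the individual thread index removes the serialisation.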
Global Memory is located in the graphics card’s GDDR3 memory. This can be accessed by all threads, although it is usually slower than on-chip memory. Memory access is significantly improved if memory accesses are coalesced, as this allows all the threads of a half-warp to access the memory simultaneously.
Shared Memory can only be accessed by threads in the same block. Because it is on chip, the Shared Memory space is much faster than the local and Global Memory spaces. Approximately 16KB of Shared Memory is available on each MP (multiprocessor); however, to permit each MP to have several blocks active at a time (which improves performance), it is advisable to use as little Shared Memory as possible per block. A little less than 16KB is effectively available due to storage of internal variables.
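The trade-off between Shared Memory per block and the number of simultaneously active blocks can be made concrete with a little arithmetic. The following Python sketch considers Shared Memory only and ignores every other limit (registers, thread counts). The reserved_bytes parameter stands in for the internal variables mentioned above; its value is an assumption, since the text does not give one.

```python
SHARED_PER_MP = 16 * 1024  # bytes of Shared Memory per multiprocessor (from the text)

def max_active_blocks(shared_bytes_per_block, reserved_bytes=0):
    """Upper bound on blocks resident on one multiprocessor at once,
    considering Shared Memory only (other hardware limits are ignored)."""
    if shared_bytes_per_block <= 0:
        raise ValueError("shared_bytes_per_block must be positive")
    usable = SHARED_PER_MP - reserved_bytes
    return usable // shared_bytes_per_block

for kib in (2, 4, 8, 16):
    print(kib, "KB per block ->", max_active_blocks(kib * 1024), "block(s)")
```

With any non-zero reserved_bytes, a block that asks for the full 16KB no longer fits at all, which is one reason to stay well below the nominal limit.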
Shared Memory consists of 16 memory banks. When Shared Memory is allocated, each consecutive 32-bit word is placed on a different memory bank. To achieve maximum memory performance, bank conflicts (two threads trying to access the same bank at the same time) must be avoided. In the case of a bank conflict, the conflicting memory accesses are serialised; otherwise memory access by each half-warp is done in parallel.
Constant Memory is read-only memory that is cached. It is located in Global Memory; however, there is a cache located on each multiprocessor. If the requested memory is in the cache, then access is as fast as Shared Memory; if it is not, then the access will be the same as a Global Memory access.
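The bank mapping described above for Shared Memory can be explored with a short Python model. It assumes that thread t of a half-warp reads the 32-bit word at index t * stride_words, and that word w lives in bank w mod 16, which follows from consecutive words being placed on consecutive banks. The function name conflict_degree is hypothetical, and the model treats every access to the same bank as a conflict, so it does not cover hardware special cases such as several threads reading the very same word.

```python
from collections import Counter

NUM_BANKS = 16   # Shared Memory banks, as stated in the text
HALF_WARP = 16   # threads that access Shared Memory together

def conflict_degree(stride_words):
    """Largest number of threads in a half-warp that hit the same bank when
    thread t reads 32-bit word t * stride_words. 1 means conflict-free."""
    banks = Counter((t * stride_words) % NUM_BANKS for t in range(HALF_WARP))
    return max(banks.values())

for stride in (1, 2, 3, 8, 16):
    print("stride", stride, "->", conflict_degree(stride), "way access")
# stride 1 -> 1, stride 2 -> 2, stride 3 -> 1, stride 8 -> 8, stride 16 -> 16
```

Odd strides are conflict-free in this model because they share no common factor with the 16 banks, which is why padding an array row by one word is a common way to avoid conflicts.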
Texture Memory is read-only memory that is cached and is optimized for 2D spatial locality. This means that accessing [a][b] and [a+1][b], say, will probably get better speed than if [a][b] and [a+54][b] were accessed instead.[3] The Texture Cache is 16KB per multiprocessor. This is a different 16KB to the Shared Memory, so using the Texture Cache does not reduce available Shared Memory.
Register memory is also available, and its access speed is similar to that of Shared Memory. Each thread in a block has its own independent version of the register variables declared. Variables that are too large will be placed in Local Memory, which is located in Global Memory. The Local Memory space is not cached, so accesses to it are as expensive as normal accesses to Global Memory.
2.2 CUDA
CUDA (Compute Unified Device Architecture) is a programming language developed by NVIDIA to facilitate writing programs that run on CUDA-enabled GPUs. It is an extension of C and is compiled using the nvcc compiler. The most commonly used extensions are cudaMalloc* to allocate memory on the device; cudaMemcpy* to copy data between the host and device and between different locations on the device; kernelname<<<grid dimensions, block dimensions>>>(parameters) to launch a kernel; and threadIdx.x, blockIdx.x, blockDim.x, and gridDim.x to identify the thread, block, block dimension, and grid dimension in the x direction.
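To show how these built-in indices are typically combined, the sketch below emulates a one-dimensional kernel launch sequentially in plain Python. It is not CUDA code: launch and add_vectors are hypothetical names, and the arguments block_idx, thread_idx, block_dim and grid_dim merely stand in for blockIdx.x, threadIdx.x, blockDim.x and gridDim.x.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate kernel<<<grid_dim, block_dim>>>(*args) in 1D."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, grid_dim, *args)

def add_vectors(block_idx, thread_idx, block_dim, grid_dim, a, b, out):
    # Global element index, following the blockIdx.x * blockDim.x + threadIdx.x pattern.
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the last block may extend past the array
        out[i] = a[i] + b[i]

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered
launch(add_vectors, grid_dim, block_dim, a, b, out)
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

On a real device the two loops in launch would not exist; each (block, thread) pair runs as its own thread, which is why the bounds guard inside the kernel is needed.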
CUDA addressed a number of issues that affected developing programs for GPUs, which previously required much specialist knowledge. CUDA is quite simple, so it will not take much time for a programmer already familiar with C to begin using it. CUDA also possesses a number of other benefits over previous methods of GPU programming. One of these is that it permits threads to access any location in the GPU memory and to read and write to as many memory locations as necessary. These were previously quite limiting constraints, and so easing them represents a significant advantage for CUDA. Another major benefit is permitting access to Shared Memory, which was previously not possible.
To make adoption of CUDA as easy as possible, NVIDIA has created CUDA U, which contains a well-written tutorial with exercises as well as links to course notes and videos of CUDA courses taught at the University of Illinois. A Reference Manual and Programming Guide are also available.
The CUDA SDK contains many example codes that can be used to test the installation of a GPU and, as the source codes are provided, demonstrate CUDA programming techniques. One of the provided codes is a template, providing the basic structure on which programs can be based.
One of the main features of CUDA is the provision of a Linear Algebra library (CUBLAS) and an FFT library (CUFFT). These greatly ease the implementation of many scientific codes on a GPU.
2.3 Review of GPU Successes
In this section, some recent work involving using GPUs for scientific computing is highlighted.
· The Theoretical and Computational Biophysics group at the University of Illinois at Urbana-Champaign has used GPUs to achieve accelerations of between 20 and 100 times for molecular modelling applications.
· Professor Mike Giles of Oxford University achieved a 100 times speed-up for a LIBOR Monte Carlo application and a 50 times speed-up for a 3D Laplace Solver. The Laplace Solver was implemented on the GPU using only Global and Shared Memory. It uses a Jacobi iteration of a Laplace discretisation on a uniform 3D grid. The LIBOR Monte Carlo code used was quite similar to the original CPU code. It uses Global and Constant Memory.
· Many other UK researchers are also experimenting with GPUs. NVIDIA has a showcase of applications reported to them. GPGPU.org also maintains a list of researchers using GPUs.
· RapidMind achieved a 2.4 times speed-up for BLAS SGEMM, 2.7 times for FFT, and 32.2 times for Black-Scholes.
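To give a feel for why a Jacobi iteration on a uniform 3D grid, as mentioned in the list above, maps well onto many cores, here is a minimal CPU-side sketch in plain Python. It is not the code referred to in the text: the grid size, the boundary condition and the function name jacobi_sweep are all assumptions chosen for illustration.

```python
def jacobi_sweep(u, n):
    """One Jacobi iteration for Laplace's equation on an n*n*n uniform grid.
    Each interior point becomes the mean of its six neighbours; boundary
    values stay fixed. Only the old grid is read, and a new grid is written."""
    new = [[[u[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k] +
                                u[i][j - 1][k] + u[i][j + 1][k] +
                                u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return new

n = 5
# Assumed boundary condition: u = 1 on the face i == 0, u = 0 on the rest.
u = [[[1.0 if i == 0 else 0.0 for k in range(n)] for j in range(n)]
     for i in range(n)]
for _ in range(50):
    u = jacobi_sweep(u, n)
centre = u[2][2][2]
print(centre)  # approaches 1/6, as symmetry between the six faces suggests
```

Because every interior update reads only values from the previous sweep, all points can be updated independently, so one GPU thread per grid point is a natural decomposition.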
2.4 GPU Disadvantages and Alternative Acceleration Technologies
In this section, some disadvantages of the GPU architecture are discussed, and some alternative acceleration technologies are briefly described. The key limitation of GPUs is the requirement for a high level of parallelism to be inherent to the application to enable exploitation of the many cores. Furthermore, graphics processing typically does not require the same level of accuracy and precision as scientific simulation, and this is reflected in the fact that GPUs typically lack both error correction functionality and double precision computational functionality. This is expected to improve with future GPU architectures.
Another common criticism of GPUs is the large power consumption. The NVIDIA Tesla C870 uses up to 170W peak, and 120W typical. The amount of heat produced would make it difficult to cluster large numbers of GPUs together.
GPUs also place greater constraints on programmers than CPUs. To avoid significant performance degradation it is necessary to avoid conditionals inside kernels. Non-coalesced Global Memory accesses, which can also severely degrade performance, are very difficult to avoid for many applications. The lack of any inter-block communication functionality means that it is not possible for threads in a block to determine when the threads in another block have completed their calculation. This means that if results of computation from other blocks are required, then the only solution is for the kernel to exit and another to be launched, guaranteeing that all of the blocks have completed.
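The exit-and-relaunch pattern can be illustrated with a sum over a large array. In the Python sketch below, each call to block_partial_sums plays the role of one kernel launch in which blocks work in isolation, and the return from that call plays the role of the kernel exit that guarantees all blocks have finished. The function names and the choice of a summation are assumptions made for illustration, and the code is a sequential model, not CUDA.

```python
def block_partial_sums(data, block_dim):
    """One 'kernel launch': each block sums its own slice and writes a single
    partial result, without seeing the results of any other block."""
    grid_dim = (len(data) + block_dim - 1) // block_dim
    partial = [0] * grid_dim
    for block_idx in range(grid_dim):
        start = block_idx * block_dim
        partial[block_idx] = sum(data[start:start + block_dim])
    return partial

def total_sum(data, block_dim):
    """Relaunch until one value remains. A new launch only starts after the
    previous one has returned, so every partial result is complete."""
    if block_dim < 2:
        raise ValueError("block_dim must be at least 2 for the data to shrink")
    values = list(data)
    launches = 0
    while len(values) > 1:
        values = block_partial_sums(values, block_dim)
        launches += 1
    return (values[0] if values else 0), launches

result, launches = total_sum(range(1, 101), block_dim=8)
print(result, launches)  # 5050 3
```

Here 100 values shrink to 13 partial sums, then 2, then 1, so three launches are needed. Each extra launch is the price paid for the missing inter-block synchronisation.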
Finally, GPUs suffer from large latency in CPU-GPU communication. This bottleneck can mean that unless the amount of processing that is done on the GPU is great enough, it may be faster to simply perform calculations on the CPU.
There are other alternative acceleration technologies available, some of which are briefly described below.
Clearspeed
One alternative to GPUs is the use of processors designed especially for HPC applications, such as those offered by Clearspeed. These products are usually quite similar to GPUs, with a few modifications that usually make them more suitable for HPC applications. One of these differences is that all internal and external memory contains ECC (Error Correction Code) to detect and correct ‘soft errors’. ‘Soft errors’ are random one-bit errors that are caused by external factors such as cosmic rays.
In the graphics market such errors are tolerable, and so GPUs do not contain ECC; for HPC applications, however, it is often desirable or required. Clearspeed products also have more cores than GPUs, but they run at a slower clock speed to reduce heat output. Double precision is also available.
Specialised products such as Clearspeed processors have a much smaller market than that of GPUs. This gives GPUs a number of advantages, such as economies of scale, greater availability, and more money spent on R&D.
Intel Larrabee
Another alternative that is likely to generate much interest when it is released in 2009-2010 is Intel’s Larrabee processor. This will be a many-core x86 processor with vector capability. It has the significant advantage over GPUs of making inter-processor communication possible. It should also solve a number of other problems that affect GPUs, such as the latency of CPU-GPU communication. It will initially be aimed at the graphics market, although specialised HPC products based on it are possible in the future. It is likely that it will also contain ECC to minimise ‘soft errors’. AMD is also developing a similar product, currently named ‘AMD Fusion’; however, few details have been released yet.
Cell Processor
A Cell chip contains one Power Processor Element (PPE) and several Synergistic Processing Elements (SPEs). The PPE acts mainly to control the SPEs, which do most of the calculations. Cell processors are quite similar to GPUs. For some applications GPUs outperform Cell Processors, while for others the opposite is true.
FPGAs
Field Programmable Gate Arrays (FPGAs) are programmable semiconductor devices that are based around a matrix of configurable logic blocks connected via programmable interconnects. As opposed to normal microprocessors, where the design of the device is fixed by the manufacturer, FPGAs can be programmed to compute the exact algorithm required by a given application. This makes them very powerful and versatile. The main disadvantages are that they are usually quite difficult to program, and they are also slow if high precision is required. For certain tasks they are popular, however. Several time-consuming algorithms in Astronomy where only 4-bit precision is necessary are very suitable for FPGAs, for example.
3 GPU Acceleration of an HPC Benchmark (Omitted)
4 Conclusions
GPUs, originally designed to satisfy the rendering computational demands of video games, potentially offer performance benefits for more general purpose applications, including HPC simulations. The differences between the GPU and standard CPU architectures result in the requirement that significant effort must be invested to enable efficient use of the GPU architecture for such applications.
We described the GPU architecture and methods used for software development, and reported that there is potential for the use of GPUs in HPC: there have been notable successes in several research areas. We described the porting of an HPC benchmark application to the GPU architecture, where several degrees of optimisation were performed, and benchmarked the resulting codes against code run on a standard CPU. The GPU was seen to offer up to a factor of 7.5 performance improvement.
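Vocabulary
1. render vt. to repay, give back, give; to present, provide, issue; to perform (a play or a piece of music); to translate; to cause, to make (something become or be in a certain state); to submit, hand over; to plaster (a wall); to melt down (fat); to express in another language, translate (often with in or into); to give up, surrender (with up); to return, give back (with back); to pay (dues, tribute); to provide (help, a service); to depict, portray. vi. to give compensation; to render fat. n. in computer graphics, a renderer.
2. harness n. harness (for a horse); safety harness (straps that prevent a fall). vt. to put a harness on (a horse etc.); to control and make use of.
3. susceptible adj. easily influenced, impressionable; sensitive, allergic; prone to (infection by); capable of undergoing; emotional, sentimental; admitting of, open to.
4. scope n. room, opportunity (for action or ability); range (of a matter being handled or studied); -scope (a viewing instrument); field of view; breadth of understanding, outlook; extent (of activity or influence); ability, power; length.
5. thread n. thread, fine cord; line of thought, train of reasoning; thread-like object; thin strand; screw thread; clothing. vt. to pass (a needle, thread etc.) through; to load (film) into a projector; to string together; to fit (film, rope) into; to sew with thread; to weave thread into.
6. warp n. bend, twist; warp (the lengthwise threads in weaving); warp yarn. vt. & vi. to bend, to become twisted. vt. to distort (behaviour etc.); to make perverse.
7. divergent adj. differing, in disagreement; branching apart; diverging, spreading.
8. texture n. feel, tactile quality, consistency; mouthfeel; (in music or literature) sense of harmony and unity, character.
9. coalesce vi. to unite, merge.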
Important Sentences
[1] Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.
Gloss: Naturally, people became interested in whether the processing power offered by GPUs could be used to strengthen more general purpose computation. "As to" means "concerning". "Harness" means "to bring under control and direct the force of"; here it refers to controlling and directing the GPU's processing power so as to boost general purpose computation.
[2] Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.