計(jì)算機(jī)體系結(jié)構(gòu)pipelining_第1頁
計(jì)算機(jī)體系結(jié)構(gòu)pipelining_第2頁
計(jì)算機(jī)體系結(jié)構(gòu)pipelining_第3頁
計(jì)算機(jī)體系結(jié)構(gòu)pipelining_第4頁
計(jì)算機(jī)體系結(jié)構(gòu)pipelining_第5頁
已閱讀5頁,還剩83頁未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1Pipelining:BasicandIntermediateConcepts

IntroductionTheMajorHurdleofPipelining

PipelineHazardsImplementationMulti-cycleOperationsTheMIPSR4000PipelineCrosscuttingIssues

ScoreboardingConclusionCDA5155–Spring2012

Copyright?2012PrabhatMishra2IntroductionInearlyCPUs,deepcombinationallogicnetworkswereusedbetweenstateupdates.Signaldelaysmayvarywidelyacrosspaths.Newinputcannotbeprovidedtothenetworkuntiltheslowestpathshavefinished.Slowclockspeed,slowprocessingrates.LogicGate3IntroductionInpipelineddesign,logicnetworksaredividedintoshallowslices(pipelinestages).Delaysthroughthenetworkaremadeuniform.Anewinputcanbeprovidedtoeachsliceassoonasitsquick,shallownetworkhasfinished.Clockcycleisonlyaslongasthesloweststage.4PipeliningPipeliningisanimplementationtechniquewherebymultipleinstructionsareoverlappedinexecutionTakesadvantageofparallelismthatexistsamongtheactionsneededtoexecuteaninstruction

fetch

instructionfrommemory

decode

tofigureoutwhattodo

read

sourceoperands

execute

write

results5SimpleRISCDatapathIFIDEXMEMWBProgram

CounterNextPCInst.

Reg.Load

fr.Mem.

Data6BasicRISCPipeliningBasicidea:Eachinstructionspends1clockcycleineachofthe5executionstages.During1clockcycle,thepipelinecanprocess(indifferentstages)5differentinstructions.7AlternativeVisualization8VisualizationwithPipelineRegisters9LimitsofPipeliningIncreasingthenumberofpipelinestagesinagivenlogicblockbyafactorofn

Generallyallowsincreasingclockspeedandthroughputbyafactorofalmostn.Usuallylessthann

becauseofoverheadssuchaslatchesandbalanceofdelayineachstage.But,pipelininghasanaturallimit:Atleast1layeroflogicgatesperpipelinestage.Practicalminimumisusuallyseveralgates(2-10).Commercialdesignsarerapidlynearingthispoint.10FewRelatedTermsClockPeriod

=Max{timedelayofastage}1k+otherdelay(e.g.,skew,latchdelay)FrequencyReciprocaloftheclockperiodSpeedupkstagepipeline,ninstructions

kwhenn>>k.EfficiencyRatioofitsactualspeeduptotheidealspeedupThroughputNumberofinstructionsthatcanbecompletedperunittime.11BeyondPipeliningTherearesomeproblemswithclockedlogicHighpowerdissipationforclocksignaldistributionthroughoutchipupto40%ormoreoftotalpowerClocksignaltimingdifferences(clockskew)betweendifferentportionsofchipCanbesignificant,interferewithproperexecutionEachclockcyclecanonlybeasfastastheworst-casedelayoverallpipelinestagesintheentiredesign.Manyresultsendupwaitingwhentheydon’tneedto.AnalternativeSelf-timedlogic,Asynchronouslogic,DataflowcircuitsEachlogicblockusesexplicit“handshaking”signalsGloballyasynchronouslocallysynchronous(GALS)12Outline

IntroductionTheMajorHurdleofPipelining

ImplementationMulti-cycleOperationsTheMIPSR4000PipelineCrosscuttingIssuesConclusion13PipelineHazardsHazardsarecircumstanceswhichmayleadtostalls(delays,“bubbles”)inthepipelineifnotaddressed.Threemajortypes:Structuralhazards:NotenoughHWresourcestokeepallinstrs.moving.DatahazardsDataresultsofearlierinstructionsnotavailableyet.ControlhazardsControldecisionsresultingfromearlierinstr.(branches)notyetmade;don’tknowwhichnewinstr.toexecute.14StructuralHazardExampleTheprocessorhasacombinedinstruction+datamemorywithonly1readport15HazardsProduce“Bubbles”16TextualViewApipelinestalledforastructuralhazard–aloadwithonlyonememoryport17ThreeTypesofDataHazardsLeti

beanearlierinstruction,j

alaterone.RAW(readafterwrite)j

triestoreadavaluebeforei

writesitWAW(writeafterwrite)i

andj

writetosameplace,butinthewrongorder.Onlyoccursifmorethan1pipelinestagecanwriteWAR(writeafterread)j

writesanewvaluetoalocationbeforei

hasreadtheoldone.Onlyoccursifwritescanhappenbeforereadsinpipeline(in-order).18DataHazardExample19ForwardingforDataHazards20AnotherForwardingExample21AnUnavoidableStall22Stallinginmidstofinstruction23DataHazardPreventionAclevercompilercanoftenrescheduleinstructionstoavoidastall.Asimpleexample:Originalcode:

lwr2,0(r4)

addr1,r2,r3Stallhappenshere.

lwr5,4(r4)Transformedcode:

lwr2,0(r4)

lwr5,4(r4)

addr1,r2,r3Nostallneeded.

24SimpleRISCPipelineStallStatistics%ofloadsthatcauseastall25Control(Branch)HazardsSupposethenewPCvalueisnotcomputeduntiltheMEMstage.Thenwemuststall3clocksaftereverybranch!26PerformanceofPipelineswithStallsSpeedupAvg.inst.time(unpipelined)/

Avg.inst.time(pipelined)…CPIunpipelined/

CPIpipelined…PipelineDepth/(1+Pipelinestallcyclesperinstruction)PipelineDepth/(1+BranchfrequencyxBranchpenalty)27DelayedBranchesMachinecodesequence: Branchinstruction Delayslotinstruction(s) Post-branchinstructionsBranchistaken

(iftaken)atthispoint28SchedulingtheBranch-DelaySlot29Outline

IntroductionTheMajorHurdleofPipeliningImplementation

Multi-cycleOperationsTheMIPSR4000PipelineCrosscuttingIssuesConclusion30SimpleRISCDatapathIFIDEXMEMWBProgram

CounterNextPCInst.

Reg.Load

fr.Mem.

Data31DescriptionofPipeStages32DataHazardDetection33HazardDetectionLogicExample:Detectingwhetheraninstructionthathasjustbeenfetchedneedstobestalledbecauseofaprecedingload.ID/EX.IR[rt]==IF/ID.IR[rs]ID/EX.IR[rt]==IF/ID.IR[rt]ID/EX.IR[rt]==IF/ID.IR[rs]34ForwardingSituationsinDLX35ImplementingForwardinginHW36EarlyBranchResolution37OriginalRISCDatapathIFIDEXMEMWBProgram

CounterNextPCInst.

Reg.Load

fr.Mem.

Data38NewPipelineLogic39ControlInstructionStatistics40StatisticsonTakenBranches41Predict-Not-Taken42StaticBranchPredictionEarlierwediscussedpredict-taken,predict-not-takenstaticpredictionstrategiesApplieduniformlyacrossallbranchesinprogramStaticanalysisincompilermaybeabletodobetter,ifitcannon-uniformlypredictwhethereachspecificbranchislikelytobetakenOneway:backwardstaken,forwardsnottaken.Ifwecandobetter,itcanhelpwithstaticcodeschedulingtoreducedatahazardstalls…Alsomayassistlaterdynamicprediction43PredictionHelpsStaticSchedulingLDR1,0(R2)DSUBUR1,R1,R3BEQZR1,elseOR R4,R5,R6DADDUR10,R4,E3J afterelse:DADDUR7,R8,R9…after:PotentialloaddelaytofillIf-then-else

controlflowCode

movements

toconsider:SomedatadependencesWhichwaywillthis

branchgo?If

caseElse

case44SomeStaticPredictionSchemesAlwayspredicttaken34%mispredictrateonSPEC(range9%-54%)Backwardspredicttaken,forwardsnottakenInSPEC,morethan?offorwardsaretaken!Thisdoesworsethan“alwayspredicttaken”strategyUsu.notbetterthan30-40%mispredictionrateBetterthaneither:UseprofileinformationCollectstatisticsonearlierprogramruns.Workswellbecauseindividualbranchestendtobestronglybiased(takenornot)givenaveragedataBiasremainsstableacrossmultipleruns45Profile-BasedPredictorStatisticsFloating-Point46Predict-Takenvs.Profile-BasedFloating-pointInstructionsbetweenmis-predictions47TypesofExceptions(Interrupts,Faults)I/Odevicerequest,timereventInvokingOSservicesfromauserprogramTracing(single-stepping)throughprogramBreakpointsIntegerarithmeticoverflow,dividebyzeroFParithmeticanomaly(overflow,underflow,

,NaN,etc.)Pagefault(pagenotinphysicalmemory)MisalignedmemoryaccessMemory-protectionviolation(acc.mem.notalloc’edtoproc.)Illegal(undefinedorunimplemented)instructionHardwaremalfunctionPower-relatedinterrupt(e.g.batterylow,powerfailure)48TerminologyAcrossArchitectures49ExceptionCategorizationsSynchronousvs.asynchronousEventsynchronizedwithprogramexecution?Userrequestedvs.coercedEventcausedintentionallybyuserprogram?Usermaskable(canbedisabled)ornotCaneventbedisabled?WithininstructionsorbetweeninstructionsDoeseventpreventinstructionfromcompleting?ResumevsterminateDoestheprogramcontinuefromwhereitleftoffafterexceptionishandled,ordoesitstop?50RestartableExceptionsRequirements:Exceptionmayoccurwithininstruction.Programmustcontinueafterexceptionishandled.Examples:Virtualmemorypagefault.Difficultbecause:Pipelinestatemustbesaved.Oneapproach,foreasycases:1.Forceatrapinst.intopipelineonnextIF.2.Clearpipelinebehindfaultinginstruction.3.ExceptionhandlersavesPCoffaultinginstr.51Precisevs.ImpreciseHandlingMachinesmaysupporteitherorbothmodesofexceptionhandling:Preciseexceptionhandling:Correctlyimplementallpossiblecombinationsofexceptionsinallcircumstances.Maybearequirementforsomesystems/applications.Maybe10xslower!Easierforintegerthanfloating-point.Usefulfordebuggingcode.Impreciseexceptionhandling:Onlycorrectlyimplementthemostcommoncases.Softwaremayavoidsomeexceptions.Onlystatisticalguaranteeofcorrectness,throughtesting.52ExceptionsinMIPSpipelineInstructionFetch,&MemorystagesPagefaultoninstruction/datafetchMisalignedmemoryaccessMemory-protectionviolationInstructionDecodestageUndefined/illegalopcodeExecutionstageArithmeticexceptionWrite-BackstageNone!53Out-of-OrderExceptionsConsiderthefollowingcodesequence:LWIFIDEXMEMWBADDIFIDEXMEMWBTheADDmaycauseanexceptionduringIF,beforeLWcausesanexceptionduringMEM!Can’trestartPContheADD!Solution:Notetheexceptioninastatusvector,carriedalong.Disablewritesforthatinstruction.Resolveallexceptionsatalatestage(e.g.WB).54Outline

IntroductionTheMajorHurdleofPipeliningImplementationMulti-cycleOperations

TheMIPSR4000PipelineCrosscuttingIssuesConclusion55Multi-cycleOperationsforFP56PipelinedMultiple-IssueFPU57FPUPipeliningIssuesinDLXNoticeinstructionsmaycompleteout-of-order:MULTDIFIDM1M2M3M4M5M6M7MEWBADDDIFIDA1A2A3A4MEWBLDIFIDEX

MEWBSDIFIDEX

MEWBRaisesthepossibilityofWAWhazards,andstructuralhazardsinMEM&WBstages.Structuralhazardsmayoccurespeciallyoftenwithnon-pipelinedDIVunit.Out-of-ordercompletionimpactsexceptionhandling.58TypicalFPCodeSeq.withStallsMUL.DstallsinID1cyclewaitingfornewvalueofF4fromMEMstageofL.DADD.Dstalls1cycleinIFwaitingforMUL.DtoleaveID,then6cyclesinIDwaitingfornewF0tobereturnedbyMUL.DstageM7.S.Dstalls6cyclesinIFwaitingforADD.DtoleaveID,then2cyclesinEXwaitingfornewF2tobereturnedbyADD.DstageA4,then1morecycleinEXwaitingforADD.DtoclearMEMstage.ClockCycleNumberInstruction1234567891011121314151617L.DF4,0(R2)IFIDEXMEWBMUL.DF0,F4,F6IFIDstallM1M2M3M4M5M6M7MEWBADD.DF2,F0,F8IFstallIDstallstallstallstallstallstallA1A2A3A4MEWBS.DF2,0(R2)IFstallstallstallstallstallstallIDEXstallstallstallME59IssuesinMulti-CycleOperationsStallforRAWislongerandmorefrequent(Fig.A33)WAWispossible;WARisnot(why?)StructuralHazardpossiblefornon-pipelinedunitMultipleWBsarelikely(Fig.A.34)HandlinghazardsatIssue(ID)stage:Checkstructuralhazards:functionalunit,WBportCheckRAWhazards:IssuewithforwardingCheckWAWhazards:NotissuetomakesurewriteinorderDetectandstallinstructionbeforeMEMandWBstages60FPStallStatisticsperFPoperation61ISADesignImpactsPipeliningVariableinstructionlengths&runtimes:Introducesdelaysduetopipelineinequities.Complicateshazard-detection&preciseexceptionsSophisticatedaddressingmodes:Post-autoincrementcomplicateshazarddetection,restarting,introducesWAR&WAWhazards.Multiple-indirectmodescomplicatepipelinecontrol&timing.Self-modifyingcode:Whatifyouoverwriteaninstructioninthepipe?Implicitconditioncodes:WARhazards,restarts62Outline

IntroductionTheMajorHurdleofPipeliningImplementationMulti-cycleOperationsTheMIPSR4000Pipeline

CrosscuttingIssuesConclusion63RealMIPSR4000“SuperPipeline”IF,IS-Instructioncachefetch,First&Secondhalves.RF-Inst.decode,RegisterFetch,hazardcheck…EX-Execution(EAcalc,ALUop,targetcalc…)DF,DS-Datacacheaccess,First&Secondhalves.TC-TagCheck,didcacheaccesshit?WB-Write-Backforloads®ister-registerops.64R4000:Two-CycleLoadDelay65R4000:Three-CycleBranchDelay66R4000FPFunctionalUnitStagesU–Unpackfloating-pointnumbersFPadderfunctionalunitstages:A–MantissaADDstageR–RoundingstageS–OperandshiftstageFPmultiplierfunctionalunitstages:E–ExceptionteststageM–FirststageofmultiplierN–SecondstageofmultiplierFPdividerfunctionunitstages:D–Dividepipelinestage67LatencyandInitiationIntervalFPInstructionLatencyInitiation

intervalMIPSR4000PipestagesAdd,subtract43U,S+A,A+R,R+SMultiply84U,E+M,M,M,M,N,N+A,RDivide3635U,A,R,D27,D+A,D+R,D+A,D+R,A,RSquareroot112111U,E,(A+R)108,A,RNegate21U,SAbsolutevalue21U,SFPcompare32U,A,RU–unpackA–mantissaaddR–roundS–shiftE–exceptiontestM–multiply1ststageN–multiply2ndstageD–divideBothunitsusedinsameclockcyclePairofunitsusedon

108consecutivecycles68FPMultiplyfollowedbyAddClockcycleOpIssue/

Stall0123456789101112mulIssueUEMMMMNNARaddIssueUSAARRSIssueUSAARRSIssueUSAARRSStallUSAARRSStallUSAARRSIssueUSAARRSIssueUSAARRS69TheMIPSR4300PipelineManufacturedbyNEC64-bitprocessorimplementsMIPS64ISAUsedinembeddedapplicationsNintendo-64gameprocessor,networkrouter,…MultipleEXstagesforfloating-pointpipelineOut-of-ordercompletion,preciseexceptionsNECVR4122:Integerdatapath,softwareforFPoperations70Outline

IntroductionTheMajorHurdleofPipeliningImplementationMulti-cycleOperationsTheMIPSR4000PipelineCrosscuttingIssues

Conclusion71RISCISAandEfficiencyofPipelinesSimpleinstructionsetEasiertoschedulecodetoimproveperformanceStaticschedulingbycompilerDynamicschedulingbyhardwareLeadstoout-of-orderexecution(completion)RequiresmechanismtoensurecorrectexecutionScoreboarding72ScoreboardingTechniqueforimplementinganinstructionqueuethatsupportsdynamicreordering.DevelopedonCDC6600(decadesago).ReorderingmustcheckWAR/WAWhazards:

DIVDF0,F2,F4

Long-running

ADDDF10,F0,F8

DependsonDIV

SUBDF8,F8,F14

Anti-dependsonADDGoal:Beginexecutionofinstructionsasearlyaspossible73Simplescoreboardeddatapath74PipelinewithScoreboarding0. (F)Fetchinstructionfromcacheorprefetchbuffer(I)Issueinst.toanexecutionpath(whennostructural/WAWhazards)(R)Readoperands(whennoRAWhazardsremain)(E)Executeinstruction(possiblymulti-cycle)(W)Writeresults(whennoWARhazardsremain)InstructionFetchInstructionIssuePre-issue

bufferExecutionunit1Executionunit2…WriteresultsPre-execution

buffersPost-execution

buffersRead

operandsRead

operandsScoreboard/

ControlUnitInstructionDecode751.InstructionIssue(IS)StageReceivenewly-fetchedinstructionDecodebinaryinstructionformatCheckforstructuralhazards:Instructionneedsexecutionunitcurrentlyinuse,whoseinitiationintervalhasn’tpassed?CheckforWAWhazards:Instructionwantstowritetoaregisterthatanactiveinstruction(issued,butnotyetfinished)wantstowriteto?Badiftheyfinishout-of-order!Stallallcurrent(&future)instructionissuing,untilnoneofthesehazardsremain.Issueinstructions(in-order)totheappropriateexecutionunits&trackstatusonscoreboard.Replacesfirsthalf

ofIDstage762.ReadOperands(RO)StageReceiveinstructionissuedtofunctionalunit.CheckforRAWhazards:Areallsourceoperandsavailableyet?Ifno:Holdinstructioninapre-executionbuffer.Ifbufferhasonly1entry,thisandallnot-yet-issuedinstructionsusingthisfunctionalunitmustwait.Ifyes:Readoperandsfromregisterfile,&startinstructiondowntheexecutionunit’spipeline.Replacessecondhalf

ofIDstage773.Execution(EX)StageOnceoperandsarereceived,beginexecutionoftheinstructionintheexecutionunit.Executionmaytakemultiplecycles.Whenresultisready,notifyscoreboardofinstructioncompletion.ReplacesoldEXstage784.WriteResult(WR)stageReceivecompletedinstruction&itsresultfromexecutionunit.CheckforWARhazards:Doesanypreviously-issuedinstructionthathasnotyetreaditsoperandsdependontheoldvalueweareabouttooverwrite?(Doesitanti-dependonus?)Whileyes:

Stallinstructioninapost-executionbuffer.Whenno:

Writeinstructionresulttoregisterfile.ReplacesWBstage79ScoreboardImplementationOnetypicalimplementationusesthreetables:Instructionstatus,foreachinstructiononthescoreboardWhichstageofexecutionistheinstructioncurrentlyin?FunctionalUnit(FU)status,foreachFU:Whatinstruction(ifany)isbeingprocessed?Ifinst.isinROstage,thenforeachoperand:Whatregisteristheoperandcomingfrom?Istheoperandready?Ifnot,whichFUwillproducetheoperand?Registerresultstatus,foreachreg.intheISA:Whichcurrently-runningFU(ifany)isscheduledtooverwritethegivenregister?80FunctionalUnitStatusTableForeachfunctionalunit,thefollowingfields:Busy–Istheunitbusy(Yes/No)?Op–WhichexactopcodetoperformintheFU?Fi–DestinationregisterofinstructionintheFUFj,Fk–SourceregistersofinstructionThesefieldsareonlyneededduringROstage:Qj,Qk–FUstowritenewvaluesofsourceregisters,or0Rj,Rk–AreoperandsFj,Fkready?(Yes/No)Registerresultstatustablehasonly1field:Result–Whichcurrently-executingFUwillwriteitsresulttothisregister?81#1:ScoreboardAfter2ndLD’sEX82#2:JustbeforeMULTD’sWRstage83#3:JustbeforeDIVD’sWRstage84ScoreboardingLogicInthetablebelow:FU=functionalunitusedbygiveninstructionD,S1,S2=giveninstr’sdestination&sourceregsop=operationtobeperformedResult[reg]=Registerresultstatustableentry

forregisteridentifiedbyreg(completedbyendofcycle)(true@start

ofcycle)(AvoidsWAW)(AvoidsWAR)(AvoidsRAW)(Releaseotherwriters

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論