Parallel Processing and Architecture courseware (hitsz-lec01)
Chapter 1: Fundamentals of Computer Design

Original slides created by: David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley
/~pattrsn
/~cs252

Outline:
  Introduction
  Quantitative Principles of Computer Design
  Classes of Computers
  Computer Architecture
  Trends in Technology
  Power in Integrated Circuits
  Trends in Cost
  Dependability
  Performance
  Fallacies and Pitfalls

What is Computer Architecture?
  Functional operation of the individual HW units within a computer system, and the flow of information and control among them.
  [Figure: Computer Architecture in the center, connected to Technology, Programming Language, Applications, OS, Hardware Organization, Parallelism, Interface Design (ISA), and Measurement & Evaluation]

Abstraction Layers in Modern Systems
  Layers, top to bottom: Application, Algorithm, Programming Language, Operating System/Virtual Machine, Instruction Set Architecture (ISA), Microarchitecture, Gates/Register-Transfer Level (RTL), Circuits, Devices, Physics
  Original domain of the computer architect ('50s-'80s)
  Domain of recent computer architecture ('90s)
  Reliability, power, …
  Parallel computing, security, …
  Reinvigoration of computer architecture, mid-2000s onward.

Computer Systems: Technology Trends
  1988: Supercomputers, Massively Parallel Processors, Mini-supercomputers, Minicomputers, Workstations, PCs
  2002: Powerful PCs and SMP Workstations, Network of SMP Workstations, Mainframes, Supercomputers, Embedded Computers

Crossroads: Conventional Wisdom in Comp. Arch
  Old Conventional Wisdom: Power is free, transistors expensive
  New Conventional Wisdom: "Power wall": power expensive, xtors free (can put more on chip than can afford to turn on)
  Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (out-of-order, speculation, …)
  New CW: "ILP wall": law of diminishing returns on more HW for ILP
  Old CW: Multiplies are slow, memory access is fast
  New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
  Old CW: Uniprocessor performance 2X / 1.5 yrs
  New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
    Uniprocessor performance now 2X / 5(?) yrs
    Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
    More, simpler processors are more power efficient

Crossroads: Uniprocessor Performance
  VAX: 25%/year, 1978 to 1986
  RISC + x86: 52%/year, 1986 to 2002
  RISC + x86: ??%/year, 2002 to present (less than 20%)
  From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006

Change in Chip Design
  Intel 4004 (1971): 4-bit processor,

  2312 transistors, 0.4 MHz,
  10 micron PMOS, 11 mm^2 chip
  RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm^2 chip
  125 mm^2 chip, 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
  RISC II shrinks to ~0.02 mm^2 at 65 nm
  Caches via DRAM or 1 transistor SRAM ()?
  Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
  Processor is the new transistor?

Taking Advantage of Parallelism
  Increasing throughput of server computer via multiple processors or multiple disks
  Detailed HW design
    Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
    Multiple memory banks searched in parallel in set-associative caches
  Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
    Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible
    Classic 5-stage pipeline:
      1) Instruction Fetch (Ifetch),
      2) Register Read (Reg),
      3) Execute (ALU),
      4) Data Memory Access (Dmem),
      5) Register Write (Reg)

Pipelined Instruction Execution
  [Figure: four instructions in program order, each passing through Ifetch, Reg, ALU, DMem, Reg, staggered by one clock cycle over Cycles 1 to 7]

Limits to pipelining

  Hazards prevent the next instruction from executing during its designated clock cycle
    Structural hazards: attempt to use the same hardware to do two different things at once
    Data hazards: instruction depends on result of prior instruction still in the pipeline
    Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
  [Figure: pipeline diagram, instruction order vs. time in clock cycles, stages Ifetch, Reg, ALU, DMem, Reg]

The Principle of Locality
  The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
  Two different types of locality:
    Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
    Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
  Last 30 years, HW relied on locality for memory perf.
  [Figure: processor (P), cache ($), memory (MEM)]

Levels of the Memory Hierarchy
  Level             Capacity          Access time                  Cost
  CPU Registers     100s Bytes        300-500 ps (0.3-0.5 ns)
  L1 and L2 Cache   10s-100s KBytes   ~1 ns - ~10 ns               $1000s/GByte
  Main Memory       GBytes            80 ns - 200 ns               ~$100/GByte
  Disk              10s TBytes        10 ms (10,000,000 ns)        ~$1/GByte
  Tape              infinite          sec-min                      ~$1/GByte

  Staging between levels     Xfer unit            Managed by       Size
  Registers <-> L1 Cache     Instr. operands      prog./compiler   1-8 bytes
  L1 Cache <-> L2 Cache      Blocks               cache cntl       32-64 bytes
  L2 Cache <-> Memory        Blocks               cache cntl       64-128 bytes
  Memory <-> Disk            Pages                OS               4K-8K bytes
  Disk <-> Tape              Files                user/operator    Mbytes

  Upper levels are faster; lower levels are larger.

What Computer Architecture brings to Table
  Other fields often borrow ideas from architecture
  Quantitative Principles of Design
    Take Advantage of Parallelism
    Principle of Locality
    Focus on the Common Case
    Amdahl's Law
    The Processor Performance Equation
  Careful, quantitative comparisons
    Define, quantify, and summarize relative performance
    Define and quantify relative cost
    Define and quantify dependability
    Define and quantify power
  Culture of anticipating and exploiting advances in technology
  Culture of well-defined interfaces that are carefully implemented and thoroughly checked

Comp. Arch. is an Integrated Approach
  What really matters is the functioning of the complete system: hardware, runtime system, compiler, operating system, and application
    In networking, this is called the "End to End argument"
  Computer architecture is not just about transistors, individual instructions, or particular implementations
    E.g., original RISC projects replaced complex instructions with a compiler + simple instructions

Computer Architecture is Design and Analysis
  Architecture is an iterative process:
    Searching the space of possible designs
    At all levels of computer systems
  [Figure: creativity produces good, mediocre, and bad ideas, which are filtered by cost/performance analysis]

Focus on the Common Case
  Common sense guides computer design
    Since it's engineering, common sense is valuable
  In making a design trade-off, favor the frequent case over the infrequent case
    E.g., instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st
    E.g., if database server has 50 disks/processor, storage dependability dominates system dependability, so optimize it 1st
  Frequent case is often simpler and can be done faster than the infrequent case
    E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
    May slow down overflow, but overall performance improved by optimizing for the normal case
  What is the frequent case, and how much is performance improved by making that case faster => Amdahl's Law
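The question that closes this slide (how much does overall performance improve when one case is made faster?) is the one Amdahl's Law answers. A minimal Python sketch of the standard formula follows; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides. The sample values (40% of the time sped up by 10X) are chosen to match an I/O-bound server that spends 60% of its time waiting on I/O.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is made `speedup_enhanced` times faster (Amdahl's Law)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on speedup as the enhanced part becomes infinitely fast."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


# 40% of the time is CPU work that becomes 10X faster; 60% is unchanged.
print(f"overall speedup: {amdahl_speedup(0.4, 10.0):.2f}")
print(f"best possible:   {amdahl_limit(0.4):.2f}")
```

The second function shows why the unenhanced fraction dominates: even an infinitely fast CPU cannot push this server past 1/0.6 of its original performance.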

Amdahl's Law
  ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
  Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
  Best you could ever hope to do:
  Speedup_maximum = 1 / (1 - Fraction_enhanced)

Amdahl's Law example
  New CPU 10X faster
  I/O bound server, so 60% time waiting for I/O
  Speedup_overall = 1 / [(1 - 0.4) + 0.4 / 10] = 1 / 0.64 = 1.56
  Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective it's just 1.6X faster

Processor performance equation
  CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
           = inst count x CPI x cycle time

                  Inst Count   CPI    Clock Rate
  Program         X
  Compiler        X            (X)
  Inst. Set       X            X
  Organization                 X      X
  Technology                          X

Relating Metrics
  CPU execution time
    Measured time for a running program
    Easy to measure
  Clock cycles
    The number of clock pulses for a running program
    Hard to measure
  Instruction count
    The number of instructions executed by the program
    Can be measured by using software tools that profile the execution or by using a simulator of the architecture
  CPI
    Clock cycles per instruction
    Computing the CPI needs the clock cycles and the instruction count for each instruction type

Clocks
  A digital circuit has a clock that runs at a constant rate (like a person's pulse); the clock is used for signal synchronization
  Cycle time = time for one full cycle (seconds per cycle)
  Clock rate = cycles per second (Hertz or Hz)
    Also known as clock frequency
  Scientific prefixes used with cycle time and clock rate:
    Prefix   Symbol   Multiple
    tera     T        10^12
    giga     G        10^9
    mega     M        10^6
    kilo     k        10^3
    milli    m        10^-3
    micro    u        10^-6
    nano     n        10^-9
    pico     p        10^-12

What's a Clock Cycle?
  Old days: 10 levels of gates
  Today: determined by numerous time-of-flight issues + gate delays
    clock propagation, wire lengths, drivers
  [Figure: latch or register, followed by combinational logic]

CPI (Clock cycles per instruction)
  The average number of clock cycles each instruction takes to execute
  A floating point intensive application might have a higher CPI
  CPU clock cycles = Instruction count x CPI
  CPU time = CPU clock cycles x Clock cycle time
  CPU time = Instruction count x CPI x Clock cycle time
  CPU time = (Instruction count x CPI) / Clock rate

  Suppose we have two implementations of the same instruction set architecture (ISA).
  For some program,
    Machine A has a clock cycle time of 10 ns and a CPI of 2.0
    Machine B has a clock cycle time of 20 ns and a CPI of 1.2
  Which machine is faster for this program, and by how much?
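The comparison can be checked numerically with the CPU time equation. The sketch below is a minimal illustration: `cpu_time_ns` is a hypothetical helper name, the instruction count is set to 1 because it cancels when two implementations of the same ISA run the same program, and the CPIs of 2.0 and 1.2 are the values the worked answer uses.

```python
def cpu_time_ns(instruction_count: float, cpi: float, cycle_time_ns: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


INSTRUCTION_COUNT = 1.0  # same program on the same ISA, so the count cancels

time_a = cpu_time_ns(INSTRUCTION_COUNT, cpi=2.0, cycle_time_ns=10.0)
time_b = cpu_time_ns(INSTRUCTION_COUNT, cpi=1.2, cycle_time_ns=20.0)

# Performance is the reciprocal of execution time, so
# Performance A / Performance B = time B / time A.
ratio = time_b / time_a
print(f"time A = {time_a:.0f} ns per instruction, time B = {time_b:.0f} ns per instruction")
print(f"A is {ratio:.1f} times faster than B")
```

A lower CPI does not make Machine B faster here, because its clock cycle is twice as long; only the product of CPI and cycle time matters.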

CPI Example Answer:
  Machine A: clock cycle = 10 ns, CPI = 2.0
  Machine B: clock cycle = 20 ns, CPI = 1.2
  CPU clock cycles A = Instruction Count x 2.0
  CPU clock cycles B = Instruction Count x 1.2
  CPU time A = CPU clock cycles A x clock cycle time = Instruction Count x 2.0 x 10 ns = 20 x Instruction Count ns
  CPU time B = Instruction Count x 1.2 x 20 ns = 24 x Instruction Count ns
  Performance A / Performance B = Execution time B / Execution time A = (24 x I) / (20 x I) = 1.2
  Thus, A is 1.2 times faster than B

Three main classes of computers
  Desktop: personal computer
  Server: web servers, file servers, database servers
  Embedded: handheld devices (phones, cameras), dedicated parallel computers

  Feature                          Desktop                Server               Embedded
  Price of system                  $500-$5000             $5000-$5,000,000     $10-$100,000
  Price of microprocessor module   $50-$500               $200-$10,000         $.01-$100
  Critical system design issues    Price-performance,     Throughput,          Price, Power consumption,
                                   Graphics performance   Availability,        Application-specific
                                                          Scalability          performance

Instruction Set Architecture: Critical Interface
  [Figure: the instruction set is the interface between software and hardware]
  Properties of a good abstraction
    Lasts through many generations (portability)
    Used in many different ways (generality)
    Provides convenient functionality to higher levels
    Permits an efficient implementation at lower levels

Example: MIPS architecture
  Programmable storage
    2^32 x bytes
    31 x 32-bit GPRs (R0 = 0)
    32 x 32-bit FP regs (paired DP)
    HI, LO, PC
  [Figure: registers r0 (= 0), r1, ..., r31, PC, lo, hi]
  Data types? Format? Addressing modes?
  Arithmetic logical
    Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
    AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
    SLL, SRL, SRA, SLLV, SRLV, SRAV
  Memory Access
    LB, LBU, LH, LHU, LW, LWL, LWR
    SB, SH, SW, SWL, SWR
  Control
    J, JAL, JR, JALR
    BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
  32-bit instructions on word boundary

MIPS architecture instruction set format
  [Figure: instruction formats for register to register, transfer and branches, and jumps]

ISA vs. Computer Architecture
  Old definition of computer architecture = instruction set design
    Other aspects of computer design called implementation
    Insinuates implementation is uninteresting or less challenging
  Our view is computer architecture >> ISA
  Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
  Since instruction set design is not where the action is, some conclude computer architecture (using old definition) is not where the action is
    We disagree on conclusion
    Agree that ISA is not where the action is (ISA in CA:AQA 4/e appendix)

Moore's Law: 2X transistors / "year"
  "Cramming More Components onto Integrated Circuits"
    Gordon Moore, Electronics, 1965
  Number of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

Tracking Technology Performance Trends
  Drill down into 4 technologies:
    Disks, Memory, Network, Processors
  Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)
    Performance milestones in each technology
  Compare for bandwidth vs. latency improvements in performance over time
  Bandwidth: number of events per unit time
    E.g., Mbits/second over network, Mbytes/second from disk
  Latency: elapsed time for a single event
    E.g., one-way network delay in microseconds,

    average disk access time in milliseconds

Disks: Archaic (Nostalgic) v. Modern (Newfangled)
                   CDC Wren I, 1983   Seagate 373453, 2003
  Rotation speed   3600 RPM           15000 RPM (4X)
  Capacity         0.03 GBytes        73.4 GBytes (2500X)
  Tracks/Inch      800                64000 (80X)
  Bits/Inch        9550               533,000 (60X)
  Platters         Three 5.25"        Four 2.5" (in 3.5" form factor)
  Bandwidth        0.6 MBytes/sec     86 MBytes/sec (140X)
  Latency          48.3 ms            5.7 ms (8X)
  Cache            none               8 MBytes

Latency Lags Bandwidth (for last ~20 years)
  Performance milestones, shown as (latency improvement, bandwidth improvement):
    Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  (latency = simple operation w/o contention, BW = best-case)

Memory: Archaic (Nostalgic) v. Modern (Newfangled)
                   1980 DRAM (asynchronous)       2000 Double Data Rate Synchr. (clocked) DRAM
  Capacity         0.06 Mbits/chip                256.00 Mbits/chip (4000X)
  Transistors      64,000 xtors, 35 mm^2          256,000,000 xtors, 204 mm^2
  Data bus, pins   16-bit data bus per module,    64-bit data bus per DIMM,
                   16 pins/chip                   66 pins/chip (4X)
  Bandwidth        13 Mbytes/sec                  1600 Mbytes/sec (120X)
  Latency          225 ns                         52 ns (4X)
  Block transfer   (no block transfer)            Block transfers (page mode)

Latency Lags Bandwidth (last ~20 years)
  Performance milestones added:
    Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)

LANs: Archaic (Nostalgic) v. Modern (Newfangled)
                     Ethernet 802.3    Ethernet 802.3ae
  Year of standard   1978              2003
  Link speed         10 Mbits/s        10,000 Mbits/s (1000X)
  Latency            3000 µsec         190 µsec (15X)
  Media              Shared media      Switched media
  Cable              Coaxial cable     Category 5 copper wire
  Coaxial cable: copper core, insulator, braided outer conductor, plastic covering
  Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect; "Cat 5" is 4 twisted pairs in a bundle

Latency Lags Bandwidth (last ~20 years)
  Performance milestones added:
    Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)
                   1982 Intel 80286             2001 Intel Pentium 4
  Clock rate       12.5 MHz                     1500 MHz (120X)
  Peak rate        2 MIPS (peak)                4500 MIPS (peak) (2250X)
  Latency          320 ns                       15 ns (20X)
  Transistors      134,000 xtors, 47 mm^2       42,000,000 xtors, 217 mm^2
  Data bus, pins   16-bit data bus, 68 pins     64-bit data bus, 423 pins
  Organization     Microcode interpreter,       3-way superscalar, dynamic translate to RISC,
                   separate FPU chip            superpipelined (22 stage), out-of-order execution
  Caches           (no caches)                  On-chip 8 KB data cache, 96 KB instr. trace cache,
                                                256 KB L2 cache

Latency Lags Bandwidth (last ~20 years)
  Performance milestones, all four technologies:
    Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
    Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
    Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
    Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  CPU high, memory low ("Memory Wall")

Rule of Thumb for Latency Lagging BW
  In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
  (and capacity improves faster than bandwidth)
  Stated alternatively:
  Bandwidth improves by more than the square of the improvement in latency

6 Reasons Latency Lags Bandwidth
  1. Moore's Law helps BW more than latency
    Faster transistors, more transistors, more pins help bandwidth
      MPU transistors: 0.130 vs. 42 M xtors (300X)
      DRAM transistors: 0.064 vs. 256 M xtors (4000X)
      MPU pins: 68 vs. 423 pins (6X)
      DRAM pins: 16 vs. 66 pins (4X)
    Smaller, faster transistors, but they communicate over (relatively) longer lines: limits latency
      Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
      MPU die size: 47 vs. 217 mm^2 (ratio sqrt: 2X)
      DRAM die size: 35 vs. 204 mm^2 (ratio sqrt: 2X)

6 Reasons Latency Lags Bandwidth (cont'd)
  2. Distance limits latency
    Size of DRAM block => long bit and word lines => most of DRAM access time
    Speed of light and computers on network
    1. & 2. explains linear latency vs. square BW?
  3. Bandwidth easier to sell ("bigger = better")
    E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 µsec latency Ethernet
    4400 MB/s DIMM ("PC4400") vs. 50 ns latency
    Even if just marketing, customers now trained
    Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

6 Reasons Latency Lags Bandwidth (cont'd)
  4. Latency helps BW, but not vice versa
    Spinning disk faster improves both bandwidth and rotational latency
      3600 RPM => 15000 RPM = 4.2X
      Average rotational latency: 8.3 ms => 2.0 ms
      Things being equal, also helps BW by 4.2X
    Lower DRAM latency => more accesses/second (higher bandwidth)
    Higher linear density helps disk BW (and capacity), but not disk latency
      9,550 BPI => 533,000 BPI => 60X in BW

6 Reasons Latency Lags Bandwidth (cont'd)
  5. Bandwidth hurts latency
    Queues help bandwidth, hurt latency (Queuing Theory)
    Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency
  6. Operating system overhead hurts latency more than bandwidth
    Long messages amortize overhead;

    overhead bigger part of short messages

Define and quantify power (1/2)
  For CMOS chips, the traditional dominant energy consumption has been in switching transistors, called dynamic power:
    Power_dynamic = 1/2 x Capacitive load x Voltage^2 x Frequency switched
  For mobile devices, energy is the better metric:
    Energy_dynamic = Capacitive load x Voltage^2
  For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
  Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
  Dropping voltage helps both, so went from 5V to 1V
  To save energy & dynamic power, most CPUs now turn off the clock of inactive modules (e.g. Fl. Pt. Unit)

Example of quantifying power
  Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?
    Power_dynamic,new / Power_dynamic,old = (0.85 x Voltage)^2 x (0.85 x Frequency switched) / (Voltage^2 x Frequency switched) = 0.85^3 ≈ 0.6

Define and quantify power (2/2)
  Because leakage current flows even when a transistor is off, static power is now important too
  Leakage current increases in processors with smaller transistor sizes
  Increasing the number of transistors increases power even if they are turned off
  In 2006, the goal for leakage is 25% of total power consumption; high performance designs are at 40%
  Very low power systems even gate voltage to inactive modules to control loss due to leakage

Cost of Integrated Circuits depends on several factors:
  Time: the price drops with time as the learning curve increases
  Volume: the price drops as volume increases
  Commodities: many manufacturers produce the same product
    Competition brings prices down

[Figure: The price of Intel Pentium 4 and Pentium M]
[Figure: AMD Opteron microprocessor die]
[Figure: A 300 mm silicon wafer contains 117 AMD Opteron microprocessor chips in a 90 nm process]

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Cost of die = Cost of wafer / (Dies per wafer x Die yield)

Dies per wafer = Pi x (Wafer diameter / 2)^2 / Die area - Pi x Wafer diameter / Sqrt(2 x Die area)
  Example:
    Wafer diameter = 300 mm
    Die area = 1.5 cm x 1.5 cm = 2.25 cm^2
    Dies per wafer = 270

Die yield = Wafer yield x (1 + Defects per unit area x Die area / a)^(-a)
  Wafer yield: measures how many wafers are completely bad
  a = 4
    Empirical formula; a corresponds to masking levels in the manufacturing process
  Example (defect density = 0.4 per cm^2):
    Die area = 1.5 cm x 1.5 cm = 2.25 cm^2: die yield = 0.44
    Die area = 1.0 cm x 1.0 cm = 1 cm^2: die yield = 0.68
  A smaller die area gives a higher die yield
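The dies-per-wafer and die-yield examples above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function and parameter names are illustrative, a wafer yield of 1.0 is assumed (the examples do not give one), and all lengths are converted to centimetres so they match the die areas in cm^2.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area divided by die area, minus a correction for the partial
    dies lost around the circular edge of the wafer."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float,
              wafer_yield: float = 1.0, alpha: float = 4.0) -> float:
    """Empirical yield model with alpha = 4; wafer_yield = 1.0 is an assumption."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


# 300 mm wafer = 30 cm; die of 1.5 cm x 1.5 cm = 2.25 cm^2
print(f"dies per wafer: {dies_per_wafer(30.0, 2.25):.0f}")
print(f"die yield, 2.25 cm^2 die: {die_yield(0.4, 2.25):.2f}")
print(f"die yield, 1.00 cm^2 die: {die_yield(0.4, 1.00):.2f}")
```

Dividing a wafer cost by the product of these two results gives the cost per good die, which is why die area affects cost twice: fewer dies fit on the wafer, and fewer of them work.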

Define and quantify dependability (1/3)
  How to decide when a system is operating properly?
  Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service will be dependable
  Systems alternate between 2 states of service with respect to an SLA:
    1. Service accomplishment, where the service is delivered as specified in the SLA
    2. Service interruption, where the delivered service is different from the SLA
  Failure = transition from state 1 to state 2
  Restoration = transition from state 2 to state 1

Define and quantify dependability (2/3)
  Module reliability = measure of continuous service accomplishment (or time to failure).
  2 metrics:
    Mean Time To Failure (MTTF) measures reliability
    Failures In Time (FIT) = 1/MTTF, the rate of failures
      Traditionally reported as failures per billion hours of operation
  Mean Time To Repair (MTTR) measures service interruption
    Mean Time Between Failures (MTBF) = MTTF + MTTR
  Module availability measures service as it alternates between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
  Module availability = MTTF / (MTTF + MTTR)

Example calculating reliability
  If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), the overall failure rate is the sum of the failure rates of the modules
  Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
    Failure rate = 10 x (1 / 1,000,000) + 1 / 500,000 + 1 / 200,000 = (10 + 2 + 5) / 1,000,000 = 17 / 1,000,000 per hour = 17,000 FIT
    MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours

How to Quantify Performance?
  Time to run the task (ExTime)
    Execution time, response time, latency
  Tasks per day, hour, week, sec, ns … (Performance)
    Throughput, bandwidth

  Plane               Speed      DC to Paris   Passengers   Throughput (pmph)
  Boeing 747          610 mph    6.5 hours     470          286,700
  BAC/Sud Concorde    1350 mph   3 hours       132          178,200

Definition: Performance
  Performance is in units of things per sec
    bigger is better
  If we are primarily concerned with response time:
    performance(X) = 1 / execution_time(X)
  "X is n times faster than Y" means:
    n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)

Performance: What to measure
  Usually rely on benchmarks vs. real workloads
  To increase predictability, collections of benchmark applications, called benchmark suites, are popular
  SPEC CPU: popular desktop benchmark suite
    CPU only, split between integer and floating point programs
    SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
    SPEC CPU2006 to be announced Spring 2006
    SPECSFS (NFS file server) and SPECWeb (Web server) added as server benchmarks
  Transaction Processing Council measures server performance and cost-performance for databases
    TPC-C: complex query for Online Transaction Processing
    TPC-H: models ad hoc decision support
    TPC-W: a transactional web benchmark
    TPC-App: application server and web services benchmark

SPEC: System Performance Evaluation Cooperative
  First Round 1989
    10 programs yielding a single number ("SPECmarks")
  Second Round 1992
    SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
    Compiler flags unlimited.
  March 93 new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
    "benchmarks useful for 3 years"
    Single flag setting for all programs: SPECint_base95, SPECfp_base95

  SPEC CPU2000 (12 integer benchmarks, CINT2000, and 14 floating-point benchmarks, CFP2000)

Normalized Execution Time
  Normalize execution time to a reference machine
  Two common methods
    Arithmetic mean
    Geometric mean
  Comparison
    Arithmetic mean
      Used to predict performance
      May not be consistent
    Geometric mean
      Independent of the running times of the individual programs
      Cannot be used to predict relative execution time for a workload

Normalized Execution Time: Example
              Time on A   Time on B   Normalized to A (A, B)   Normalized to B (A, B)
  Program 1   1           10          1, 1…
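The contrast between the two means can be made concrete with a small Python sketch. The data here is partly hypothetical: Program 1 uses the times 1 and 10 from the example row, while Program 2 (1000 on A, 100 on B) is an invented row added only to show the effect. The helper names are illustrative.

```python
import math


def normalize(times, reference_times):
    """Execution times divided, program by program, by the reference machine's times."""
    return [t / r for t, r in zip(times, reference_times)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


time_on_a = [1.0, 1000.0]   # Program 1 from the example; Program 2 is hypothetical
time_on_b = [10.0, 100.0]

b_normalized_to_a = normalize(time_on_b, time_on_a)   # A is the reference machine
a_normalized_to_b = normalize(time_on_a, time_on_b)   # B is the reference machine

# Arithmetic mean: each machine looks about 5X slower than whichever one is the reference.
print(f"AM of B normalized to A: {arithmetic_mean(b_normalized_to_a):.2f}")
print(f"AM of A normalized to B: {arithmetic_mean(a_normalized_to_b):.2f}")

# Geometric mean: the verdict does not depend on the reference machine.
print(f"GM of B normalized to A: {geometric_mean(b_normalized_to_a):.2f}")
print(f"GM of A normalized to B: {geometric_mean(a_normalized_to_b):.2f}")
```

With this data the arithmetic mean of normalized times contradicts itself when the reference machine changes, while the geometric mean rates the two machines as equal either way. Neither number predicts the total running time of the workload, which is 1001 on A and 110 on B.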
