ECE 259 / CPS 221 Advanced Computer Architecture II
(Parallel Computer Architecture)
Shared Memory MPs: COMA & Beyond

Copyright 2004 Daniel J. Sorin, Duke University

Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline
  • Cache Only Memory Architecture (COMA)
      • Basics
      • Data Diffusion Machine (DDM)
      • Simple COMA (S-COMA)
      • Reactive NUMA
  • Hierarchical Coherence
      • Basics
      • Sequent NUMA-Q
      • Chip Multiprocessor (CMP)
      • Sun Wildfire
  • Token Coherence

Review
  • Basic idea of directories
      • Per-processor cache hierarchies
      • Directory interleaved with memory
  • Directory limitations/drawbacks
      • Limited capacity for replication
      • High design & implementation cost
      • Single hard-wired protocol
      • Limitations of shared physical address space

Cache Only Memory Architecture (COMA)
  • Make all memory available for migration & replication
  • All memory is DRAM cache called Attraction Memory
  • Examples
      • Data Diffusion Machine (next)
      • Flat COMA (fixed home for directory but not data)
      • KSR-1 (hierarchy of snooping rings)
  • But how do you
      • Find data?
      • Deal with replacements?

COMA example: Data Diffusion Machine (DDM)
  • All hardware COMA
  • Attraction Memory: one giant hardware cache
      • Maintains both address tags and state
      • Data addressed, allocated, & kept coherent in blocks
      • Directory info on a per cache-block basis
  • Not home based:
      • Data is migratory: AM attracts data
      • Must find a home when replacing the data
      • Must find the directory entry before finding the data

DDM Directory

  • Directory is hierarchical in a tree form
  • Each is a set-associative cache of directory info
  • Tree maintains inclusion:
      • Higher levels keep replica of lower sub-trees
  • [Figure: tree of directory (D) nodes]

DDM Coherence/Placement Protocol
  • Simple write-invalidate protocol
  • Cache states: Invalid, Shared, Exclusive
  • Must traverse the directory:
      • To find a copy on a read or write miss
      • To invalidate on a write to Shared
  • Directory is hierarchical set-associative caches
      • Q1: Is the block in my sub-tree?
      • Q2: Does the block exist outside my sub-tree?
      • Request goes up until Q2=no and then down
      • Request goes down until Q1=no or leaf
  • On a replacement:
      • for an Exclusive copy, must find another home (HARD!)
      • for a Shared copy, must make sure other copies exist
      • else must find another home

Simple COMA (S-COMA)
  • (Pure) COMA
      • Block granularity to find/allocate/replace (complex hardware)
      • Block granularity for coherence/transfers (good for false sharing)
  • Software DSM
      • Page granularity to find/allocate/replace (use VM: good)
      • Page granularity for coherence/transfers (bad for false sharing)
  • Simple COMA
      • Page granularity to find/allocate/replace (use VM: good)
      • Block granularity for coherence/transfers (good for false sharing)
      • Blocks act like sub-blocks on page

S-COMA-like Examples
  • Wisconsin Typhoon [Reinhardt et al. ISCA 1994]
      • On access, VM system checks if page present
      • On access, HW/SW checks block state
      • Failure invokes user-level protocol in SW
      • Good flexibility, but SW slow & users don't want to write protocols
  • Sun Wildfire [Hagersten/Koster HPCA 1999]
      • Begin with up to four SMP nodes
      • Add pseudo-processor board to each as proxy for rest of system
      • Can run CC-NUMA directory protocol
      • Can selectively use S-COMA (called Coherent Memory Replication)
      • Selects between them with a competitive algorithm [Falsafi/Wood ISCA97]
      • Hierarchical method of building parallel machines
      • WE'LL TALK MORE ABOUT THIS LATER
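
The slide above only says that the choice between CC-NUMA and S-COMA is made "with a competitive algorithm". As a rough illustration of what such a policy could look like, here is a minimal sketch that assumes a per-page counter of refetches (remote misses to blocks that were cached before and then evicted) and moves a page to S-COMA once the counter reaches a threshold. The class name `PagePolicy`, the method `on_remote_miss`, and the threshold value are hypothetical, chosen for illustration; they are not taken from the slides or from the Wildfire hardware.

```python
from dataclasses import dataclass, field

CC_NUMA = "CC-NUMA"
S_COMA = "S-COMA"


@dataclass
class PagePolicy:
    """Per-node bookkeeping for remote pages (illustrative only)."""
    threshold: int = 64                          # hypothetical relocation threshold
    refetches: dict = field(default_factory=dict)  # page -> refetch count
    mode: dict = field(default_factory=dict)       # page -> current mode

    def on_remote_miss(self, page: int, is_refetch: bool) -> str:
        """Record one remote miss and return the page's mode afterwards."""
        # Every page starts out in CC-NUMA mode.
        current = self.mode.setdefault(page, CC_NUMA)
        if current == CC_NUMA and is_refetch:
            count = self.refetches.get(page, 0) + 1
            self.refetches[page] = count
            # Enough refetches: replicate the page locally (S-COMA).
            if count >= self.threshold:
                self.mode[page] = S_COMA
        return self.mode[page]


policy = PagePolicy(threshold=3)
for is_refetch in (False, True, True, True):
    print(policy.on_remote_miss(7, is_refetch))
```

The idea behind a threshold of this kind is to bound the damage of a wrong guess: a page pays at most a fixed number of extra remote refetches before it is replicated locally, while pages that are rarely refetched never pay the page-allocation cost.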

A Taxonomy of Issues
  • Allocation/Replication
      • Cache line vs page
  • Access Control (Coherence)
      • Cache line vs page
      • HW vs SW
  • Protocol Processing
      • HW vs SW
  • Communication
      • Cache line vs page
      • HW vs SW (message passing)

Reactive NUMA (R-NUMA)
  • PRESENTATION

Outline
  • Cache Only Memory Architecture (COMA)
      • Basics
      • Data Diffusion Machine (DDM)
      • Reactive NUMA (R-NUMA)
  • Hierarchical Coherence
      • Basics
      • NUMA-Q
      • Chip multiprocessor (CMP)
      • Sun Wildfire
      • Intel Profusion

Hierarchical Coherence
  • Many older systems were flat
      • E.g., a directory that points to 1K processors
  • Use hierarchy
      • Intra-node coherence (e.g., snooping in SMP node)
      • Inter-node coherence (e.g., directory between nodes)
  • Why?
      • Divide & conquer markets (e.g., sell node)
      • Divide & conquer complexity (but must interface protocols)

Example Two-level Hierarchies

Advantages of Multiprocessor Nodes
  • Amortization of node fixed costs over multiple processors
      • Applies even if processors simply packaged together but not coherent
  • Can use commodity SMPs
  • Fewer nodes for directory to keep track of (coarser grain)
  • Much communication may be contained within node (cheaper)
  • Nodes prefetch data for each other (fewer "remote" misses)
  • Combining of requests (like hierarchical, only two-level)
  • Can even share caches (overlapping of working sets)
  • Benefits depend on sharing pattern (and mapping)
      • Good for widely read-shared: e.g. tree data in Barnes-Hut
      • Good for nearest-neighbor, if properly mapped
      • Not so good for all-to-all communication

Disadvantages of Coherent MP Nodes
  • Bandwidth shared among nodes
      • All-to-all example
      • Applies to coherent or not
  • Bus increases latency to local memory
  • With coherence, typically wait for local snoop results before sending remote requests
  • Snoopy bus at remote node increases delays there, too, increasing latency and reducing bandwidth
  • Overall, may hurt performance if sharing patterns don't comply

Sequent NUMA-Q System Overview
  • Use of high-volume SMPs as building blocks
  • Quad bus is 532MB/s, split-transaction, with in-order responses
      • Limited facility for out-of-order responses for off-node accesses
  • Cross-node interconnect is 1GB/s unidirectional ring
  • Larger SCI systems built by bridging multiple rings

NUMA-Q IQ-Link Board
  • IQ-Link board plays the role of Hub Chip in SGI Origin
  • Can generate interrupts between quads
  • Remote cache (visible to SCI) block size is 64 bytes (32MB, 4-way)
      • Processor caches not visible to SCI (snoopy-coherent within SMP node)
  • Remote cache is inclusive with respect to processor caches on SMP
  • Data Pump (GaAs) implements SCI, pulls off relevant packets
  • [Figure labels: IQ-Link board components]
      • Interface to quad bus. Manages remote cache data and bus logic. Pseudo-memory controller and pseudo-processor.
      • Interface to data pump, OBIC, interrupt controller and directory tags. Manages SCI protocol using programmable engines.

NUMA-Q cont.
  • IQ-Link is key
  • Local directory: home (I), fresh (S), gone (E) + pointer
  • “L3” remote ca
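
The remote-cache parameters quoted on the IQ-Link Board slide (64-byte blocks, 32MB, 4-way) determine how many blocks and sets the cache holds. The following is a small worked calculation, assuming that "32MB" means 32 × 2^20 bytes and that all sizes are powers of two; the function name `cache_geometry` is made up for this sketch.

```python
def cache_geometry(capacity_bytes: int, block_bytes: int, ways: int) -> dict:
    """Derive block count, set count, and address-bit split for a
    set-associative cache. Assumes power-of-two sizes throughout."""
    blocks = capacity_bytes // block_bytes
    sets = blocks // ways
    return {
        "blocks": blocks,
        "sets": sets,
        # log2 of a power of two via bit_length()
        "offset_bits": block_bytes.bit_length() - 1,
        "index_bits": sets.bit_length() - 1,
    }


# Parameters from the slide: 32MB capacity, 64-byte blocks, 4-way.
geometry = cache_geometry(32 * 2**20, 64, 4)
print(geometry)
# {'blocks': 524288, 'sets': 131072, 'offset_bits': 6, 'index_bits': 17}
```

Under these assumptions, each remote-cache lookup would use 6 block-offset bits and 17 set-index bits, with the remaining address bits compared against the tags of the four ways in the selected set.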
