High-Performance Computing Exercises

1. Which programming approaches suit parallelism within a single machine, and which suit parallelism across machines? Single machine: threading, OpenMP. Across machines: MPI.

2. Example: the peak compute capability of an HPC cluster. Consider a cluster of 256 compute nodes, each with two Intel X5670 processors (2.93 GHz Westmere, six cores; current mainstream Intel processors deliver 4 double-precision floating-point operations per clock cycle). Peak performance: 2.93 GHz × 4 Flops/Hz × 6 cores × 2 CPUs × 256 nodes = 36,003.8 GFlops. Since 1 GFlops is one billion operations per second, 36,003.8 GFlops ≈ 36.003 TFlops of peak performance.

3. What is the basis for the Top500 ranking? High Performance Linpack (HPL) benchmark results.

4. What is currently the most popular GPU development environment? CUDA.

5. How many compute nodes are needed to build a 200 TFlops HPC cluster from dual-socket 2.93 GHz Intel Westmere six-core X5670 processors? Nodes = 200 TFlops / (2 CPUs × 2.93 GHz × 6 cores × 4 Flops/Hz) = 200,000 / 140.64 ≈ 1422.

6. What were Tianhe-1A's measured speed and efficiency in the TOP500 ranking? 2.57 PFlops, at 55% efficiency.
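Questions 2 and 5 run the same peak-performance arithmetic in opposite directions, so a small C helper makes both answers easy to check. This is a minimal sketch; the function name and layout are ours, not part of the course material:

#include <stdio.h>

/* peak GFlops = clock (GHz) x flops/cycle x cores x CPUs per node x nodes */
static double peak_gflops(double ghz, int flops_per_cycle,
                          int cores, int cpus, int nodes)
{
    return ghz * flops_per_cycle * cores * cpus * nodes;
}

int main(void)
{
    /* Question 2: 256 dual-socket X5670 nodes -> 36003.8 GFlops */
    printf("peak = %.1f GFlops\n", peak_gflops(2.93, 4, 6, 2, 256));

    /* Question 5: nodes needed for 200 TFlops, from one node's peak */
    double per_node = peak_gflops(2.93, 4, 6, 2, 1);   /* 140.64 GFlops */
    printf("nodes = %.0f\n", 200000.0 / per_node);     /* ~1422 */
    return 0;
}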

7. How is RDMA (Remote Direct Memory Access) implemented? When sending and receiving, the data does not have to be copied into intermediate buffers; it is delivered directly to the other side, bypassing the kernel and achieving zero-copy.

8. What is InfiniBand's lowest communication latency? 1-1.3 µs MPI end-to-end; 0.9-1 µs InfiniBand latency for RDMA operations.

9. How does GPU-Direct speed up applications? By removing the memory copy between InfiniBand and the GPU. GPUs provide a cost-effective way of building supercomputers: dense packaging of compute flops with high memory bandwidth.

10. Which characteristic of the network equipment determines MPI_Allreduce performance? The time for MPI_Allreduce keeps increasing as the cluster size scales; in other words, the scale of the cluster determines MPI_Allreduce performance.

11. How fast is the supercomputer currently ranked first in the world? The K computer: 10 PFlops (ten quadrillion operations per second), at 93% efficiency.

12. Which of the following count as embedded devices? A. router B. robot C. microwave oven D. laptop.

13. The top two factors in choosing an embedded operating system are:

A. cost B. after-sales support C. availability of source code D. the surrounding community E. development tools.

14. The main challenges in building embedded Linux are: A. the breadth of knowledge required B. the complexity of deep customization C. ever-increasing maintenance costs D. stability and security E. open-source projects are often of low quality.

15. The main purposes of the Yocto Project are: A. to build a unified embedded-Linux community B. to provide high-quality tools that make building embedded Linux easy, so you can focus on the work on top of it C. to include a set of tested metadata that drives the cross-compilation of the most important open-source projects D. to provide flexible extension interfaces that make it easy to import new projects or new board support packages (BSPs).

16. Describe the work involved in cross-compiling an open-source project: Patch, Configure, Compile, Install, Sysroot, Package, Do_rootfs.

What is the basis for the Top500 ranking (repeated)? High Performance Linpack (HPL) benchmark results.

Write code to create a thread that computes the sum of the elements of an array. Answer:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct arguments {
    double *array;
    int size;
    double *sum;
};

void *do_work(void *arg);

int main(int argc, char *argv[])
{
    double array[100];
    double sum;
    pthread_t worker_thread;
    void *return_value;
    struct arguments *arg;

    arg = (struct arguments *)calloc(1, sizeof(struct arguments));
    arg->array = array;        /* the array would be filled in here */
    arg->size = 100;
    arg->sum = &sum;

    if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
        fprintf(stderr, "Error while creating thread\n");
        exit(1);
    }
    if (pthread_join(worker_thread, &return_value)) {
        fprintf(stderr, "Error while waiting for thread\n");
        exit(1);
    }
    printf("sum = %g\n", sum); /* safe: the worker has finished */
    return 0;
}

void *do_work(void *arg)
{
    struct arguments *argument = (struct arguments *)arg;
    int i, size = argument->size;
    double *array = argument->array;
    double *sum = argument->sum;

    *sum = 0;
    for (i = 0; i < size; i++)
        *sum += array[i];
    return NULL;
}

Parallelization steps for computing a sum s over an array A of n elements with p threads:
Decomposition: computing the sum s.
Assignment: thread k sums s_k = f(A[k·n/p]) + … + f(A[(k+1)·n/p − 1]); thread 1 then sums s = s_1 + … + s_p (for simplicity of this example) and communicates s to the other threads.
Orchestration: starting up the threads; communicating and synchronizing with thread 1.
Mapping: processor j runs thread j.
MFlops: millions of floating-point operations per second. POSIX: Portable Operating System Interface, the portable operating-system interface of Unix.

Thread: a stream of code that can be scheduled as an independent unit (compare: process). Issues to watch when writing multithreaded code: (1) load balance; (2) correct access to shared variables (through critical sections or mutex locks). 35. User-level threads: many-to-one mapping; they need no OS support and have low operation overhead, but when one thread blocks, the others block too. Kernel-level threads: one-to-one mapping; each kernel thread is scheduled independently and the OS carries out the thread operations; kernel threads can execute in parallel on multiple processors, and when one thread blocks the others can still be scheduled; thread-scheduling overhead is higher, but the OS can adapt to changes in the number of threads. Multithreading with pthreads:

pthread_t: the type of a thread. pthread_create(): creates a thread. pthread_mutex_t: the type of a mutex lock. pthread_mutex_lock(): locks a mutex. pthread_self(): returns the thread identifier of the calling thread. Signature: int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
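To illustrate item (2) from the list of multithreading issues above, correct access to shared variables through a mutex, here is a minimal sketch using the pthread calls just listed; the counter variable and the thread count are our own choices, not from the notes:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical section */
        counter++;                           /* safe shared update */
        pthread_mutex_unlock(&counter_lock); /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);      /* always 400000 */
    return 0;
}

Without the lock/unlock pair, the four workers would race on counter and the final value would be unpredictable.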

(1) Computing the sum of the array elements: the same pthread program as above; note that main must call pthread_join(worker_thread, &return_value) and check its result before reading the sum. RDMA (Remote Direct Memory Access): implemented through zero-copy and kernel-bypass techniques.

37. What is InfiniBand's lowest communication latency? From the "InfiniBand: Highest Performance" slide: highest throughput (40 Gb/s node to node and 120 Gb/s switch to switch; nearly 90M MPI messages per second; send/receive and RDMA operations with zero-copy); lowest latency (1-1.3 µs MPI end-to-end; 0.9-1 µs InfiniBand latency for RDMA operations); lowest CPU overhead (full transport offload maximizes CPU availability for user applications).

38. Computational science, alongside theoretical and experimental science, is one of the three pillars of human understanding of nature. Application areas: the US HPCC program covers magnetic recording technology, new drug design, high-speed civil aviation, catalysis, fuel combustion, ocean modeling, ozone depletion, digital analysis, air pollution, protein structure design, image understanding, and code breaking.

HPC units of measure: Gflop/s (10^9 flops), Tflop/s (10^12), Pflop/s (10^15). Linpack is the world's most widely used benchmark for the floating-point performance of high-performance computers: it rates a machine by having it solve a dense N×N system of linear equations with Gaussian elimination. SMP (Symmetric Multi-Processor): a shared-memory symmetric multiprocessor system in which any processor can directly access any memory address with identical latency, bandwidth, and probability; the system is symmetric. Cluster: multiple computer systems connected by a network that serve as if they were a single system, providing high parallel-processing capability, high availability, load balancing, and ease of management.

Clusters are the inevitable product of technological progress: high-performance processors, fast networks, cluster operating systems and management systems, and parallel/distributed computing tools and software. Parallel computing: a single system whose many cores work on the same task. Distributed computing: multiple systems loosely coupled by a scheduler, working on related tasks. Grid computing: multiple systems and processors tightly coupled by software and networks to work on the same or related tasks. The two main advantages of parallel computing: greater aggregate processor performance and larger aggregate memory. Classification of parallel computers: 1) shared memory, subdivided into Uniform Memory Access (UMA), where all processors access memory identically, and Non-Uniform Memory Access (NUMA), where access latency depends on where the data is stored; 2) distributed memory, subdivided into the Massively Parallel Processor (MPP) and the Cluster.

In an SMP, the symmetric processors and the global memory are interconnected through a bus or a crossbar. Advantage: a simple programming model. Problems: the bus bandwidth saturates, and the crossbar grows as the number of processors increases. Drawback: poor scalability, which limits how large an SMP can get. Advantages of clusters: general-purpose high performance, high availability, high scalability, and high performance per cost.

Distributed-memory programming model: MPI. Shared-memory programming models: OpenMP, Pthreads. Granularities of parallelism: PVM/MPI, threads, compilers, CPU. Message passing is a very important style of parallel programming in today's parallel-computing field. MPI is a library, not a language; it is a message-passing programming model whose goal is a practical, portable, efficient, and flexible standard interface for message passing. Message-delivery mechanisms: in blocking mode, later statements may execute only after the message has left the local side; in non-blocking mode, later statements may execute before the message has left the local side, so reuse of the message buffer is not guaranteed to be safe. The "weakest-link" rule of parallel speedup (Amdahl's law): the speedup of a given problem is limited by its serial part; with parallel fraction f on N processors, S(N) = 1 / ((1 − f) + f/N). For parallel computing, the most dangerous pitfall is turning a computation problem into a communication problem: this generally happens when the time the nodes spend transferring data to stay synchronized exceeds the time the CPUs spend computing.
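A minimal sketch of the blocking and non-blocking delivery modes just described, using standard MPI calls (the message tags and the two-rank setup are our own choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, peer, x = 42, y = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                 /* assumes exactly 2 ranks */

    if (rank == 0) {
        /* Blocking: MPI_Send may not return until the buffer is reusable. */
        MPI_Send(&x, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

        /* Non-blocking: MPI_Isend returns immediately; the buffer must not
           be touched until MPI_Wait reports the send has completed. */
        MPI_Isend(&x, 1, MPI_INT, peer, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&y, 1, MPI_INT, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}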

Common networks: InfiniBand, 10GE, GE, Myrinet; plus GPUs and hybrid CPU-GPU architectures. Second lecture (蔣運(yùn)宏): VMM, the Virtual Machine Monitor. Basic properties of a VMM: equivalence, isolation, and efficiency. The VMM must be able to control the entire physical platform; CPU control is achieved through "ring deprivileging". Instruction classes relevant to virtualizability: privileged instructions and sensitive instructions.

劉通: What is supercomputing? Biggest and fastest; it is about size and speed. Supercomputing is used for simulating physical phenomena, data mining, and visualization. The components of HPC: hardware, software, applications, and people. Remote DMA: zero-copy and kernel bypass. TCP/IP networks: overhead and latency. InfiniBand's high performance: highest throughput, lowest latency, and lowest CPU overhead, as itemized in question 37 above.

The key elements that affect scalability: the hardware, the software, and the program itself. As the system size grows, the share of time spent in communication keeps increasing. Most used MPI functions: MPI_Wait, MPI_Allreduce, and MPI_Bcast. InfiniBand provides higher utilization, performance, and scalability.
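Since MPI_Allreduce recurs throughout these notes (question 10 and here), a minimal usage sketch may help; the per-rank contribution value is our own choice:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1.0;   /* each rank contributes one value */
    /* every rank receives the sum of all contributions */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: global sum = %g\n", rank, global);
    MPI_Finalize();
    return 0;
}

Because every rank must receive the same result, the collective's cost grows with the cluster size, which is exactly what question 10 observes.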

王璟: Basic concepts: parallel computing, high-end parallel computing, high performance computing (HPC), and supercomputing. Why do HPC? Numerical modeling and simulation of science and engineering problems, which must finish within a reasonable time limit. How do we meet the demand for high-precision computation? Parallel computing: it lowers the time to solve a single problem, increases the solvable problem size, and raises throughput (many machines executing many serial programs at once).

High-performance computer: a computer system composed of multiple computing units, with high speed, large storage capacity, and high reliability. The three pillars of scientific innovation: theoretical analysis, computational simulation, and observation/experiment. HPC applications: automobile manufacturing, weather forecasting, biopharmaceuticals, aircraft manufacturing, animation rendering, financial computing, and oil exploration. Parallel hardware: a parallel computer is a system of multiple processing units that communicate and cooperate to solve large, complex problems quickly and efficiently. Structural models: a) PVP; b) SMP; c) MPP (Massively Parallel Processor); d) DSM (distributed shared memory); e) Cluster/COW.

Memory-access models: multiprocessors (a single address space with shared memory): UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access); multicomputers (multiple address spaces, no shared memory): NORMA (No-Remote Memory Access). Programming models: a) implicit parallelism, i.e., parallel programming languages such as Fortran 90 and HPF (1992); b) shared variables, e.g., the POSIX threads model and OpenMP; c) message passing, e.g., MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).

InfiniBand: switching is its core, and the switch is the basic InfiniBand component; the point-to-point switched fabric solves the shared-bus, fault-tolerance, and scalability problems; it offers a low-power physical layer and out-of-the-box bandwidth connectivity. InfiniBand features: high speed, remote direct memory access, and transport offload. CPU acceleration: GPUs; network acceleration: InfiniBand; memory acceleration: virtual memory. GPU (Graphics Processing Unit): a dedicated image-display device used in personal computers, workstations, and game consoles, either on a graphics card or integrated on the motherboard. CPUs devote more resources to caches; GPUs devote more resources to data computation, which suits applications with predictable computation patterns.

Challenges facing HPC: a) compute per watt, i.e., finding a balance between generality and efficiency; b) higher degrees of parallelism; c) exascale applications of sufficient value; d) fault tolerance; e) when the device revolution HPC depends on will happen; f) the relationship with emerging applications; g) an industry for high-performance application software. 83. Advantages of cluster technology: general-purpose high performance (nodes are conventional server platforms with commodity hardware and operating systems, so they adapt well); high availability (a high degree of redundancy in CPUs, memory, disks, and nodes); high scalability (switching equipment at the core, so nodes and storage can be added and removed flexibly); better performance per cost (commodity equipment and unified standards). 84. MPI (Message Passing Interface): the standard specification for message-passing function libraries.

MPI is a library, not a language; MPI is a message-passing programming model, and has become the representative and de facto standard of that model; MPI stands for a standard or specification, not any particular implementation of it. Its goal: a practical, portable, efficient, and flexible standard message-passing interface. MPI provides bindings for C/C++ and Fortran. Review outline: basic abbreviations (five abbreviations related to HPC); concurrency; pipelining; RISC (be able to draw an illustration); how to improve performance; coding ("How do I speed up my code?", a trivial example); load balance; threads (PThread: POSIX thread). I. Glossary of terms: speedup, plus the abbreviations below.

HPCC: High Performance Computing and Communications
RISC: Reduced Instruction Set Computing
ILP: Instruction Level Parallelism
SMP: Symmetric Multi-Processors
SMT: Simultaneous Multi-Threading
MPP: Massively Parallel Processor
SISD: single instruction, single data
SIMD: single instruction, multiple data
MIMD: multiple instructions, multiple data
MISD: multiple instructions, single data
MSP: Multi-Streaming vector Processor
MIPS: millions of instructions per second
DAGs: Directed Acyclic Graphs
FCFS: First Come First Serve
EASY: Extensible Argonne Scheduling sYstem
CUDA: Compute Unified Device Architecture (a general-purpose parallel-computing architecture)

Reasons parallel computing was proposed: 1) to raise performance and storage capacity; 2) to let users and computers cooperate; 3) to capture the logical structure of a problem; 4) to cope with independent physical devices. The three big problems of parallelism: performance, correctness, and programmability. MPI (Message Passing Interface): the standard specification for message-passing function libraries. Parallel computing: a single system whose many cores work on the same task; temporal parallelism means pipelining, while spatial parallelism means multiple processors executing computations concurrently. Advantages of parallel computing: greater aggregate processor performance and larger aggregate memory.

HPC cluster peak compute capability: a cluster of 256 compute nodes, each with two X5560 processors (X5560: 2.8 GHz Intel Nehalem quad-core; mainstream processors deliver 4 double-precision flops per clock cycle). Peak performance: 2.8 GHz × 4 Flops/Hz × 4 cores × 2 CPUs × 256 nodes = 22,937.6 GFlops ≈ 22.937 TFlops of peak performance. For a cluster of 128 nodes, each with two 2.66 GHz Intel Nehalem quad-core processors, the peak is 128 × 2 × 2.66 GHz × 4 cores × 4 Flops/Hz = 10,895 GFlops.

How do I speed up my code? By hand: inlining, eliminating common subexpressions, eliminating redundant code, identifying loop invariants, replacing arrays with pointers, unrolling loops, moving code, replacing function calls with variables, replacing multiplication with addition, and using variables directly. One option to make code faster is basically to deal with the code; let's look at some examples of what one can do by hand. These techniques were very popular before compilers were any good; of course, compilers can do much of this nowadays. Technique #1: identify loop constants. Technique #2: replace array accesses by pointer dereferences. Technique #3: loop unrolling (fewer comparisons per unit of useful work). Technique #4: code motion.
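A small before/after sketch of the four hand-optimization techniques named above; the original slide code is not recoverable, so the array and function names here are our own:

#include <stdio.h>

#define N 1000
static double a[N], b[N];

/* Stand-in for an expensive function whose value is loop-invariant. */
static double f(int n) { return n * 0.5; }

/* Before: recomputes f(n) each time, indexes arrays, tests i < N per element. */
static double slow_sum(int n)
{
    double sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i] * f(n) + b[i];
    return sum;
}

/* After: techniques #1/#4 hoist the loop constant f(n) out of the loop,
   technique #2 walks pointers instead of indexing, and technique #3
   unrolls by 4 so the i < N comparison runs a quarter as often. */
static double fast_sum(int n)
{
    double sum = 0;
    const double fn = f(n);
    const double *pa = a, *pb = b;
    for (int i = 0; i < N; i += 4) {   /* assumes N % 4 == 0 */
        sum += pa[0] * fn + pb[0];
        sum += pa[1] * fn + pb[1];
        sum += pa[2] * fn + pb[2];
        sum += pa[3] * fn + pb[3];
        pa += 4; pb += 4;
    }
    return sum;
}

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1; }
    printf("%g %g\n", slow_sum(2), fast_sum(2));  /* identical results */
    return 0;
}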

Static load balancing: before execution we know how much load is given to each processor, as opposed to a dynamic algorithm that assigns work to processors when they are ready (we'll come back to that idea when we talk about scheduling). We will look at: two simple load-balancing algorithms; their application to our 1-D stencil application; their application to the 1-D distributed LU factorization; and a discussion of load balancing for 2-D data distributions.

We assume a homogeneous network and heterogeneous compute nodes in this lecture. Consider p processors. Let t_i be the cycle time of processor i, i.e., the time it needs to process one elemental unit of computation (one work unit) for the application. Let B be the number of (identical) work units, and let c_i be the number of work units processed by processor i, so that c_1 + … + c_p = B.

Perfect load balancing happens when c_i × t_i is constant across processors, i.e., c_i = B × (1/t_i) / (Σ_j 1/t_j); if B is a multiple of lcm(t_1, …, t_p) we can achieve it exactly, but in general this formula does not give an integer solution. There is a simple algorithm that gives the optimal (integer) allocation of work units to processors in O(p²):

// initialize with fractional values rounded down
for i = 1 to p:
    c_i = floor( B × (1/t_i) / (Σ_j 1/t_j) )
// iteratively increase the c_i values
while (c_1 + … + c_p < B):
    find k in {1, …, p} such that t_k × (c_k + 1) = min_j { t_j × (c_j + 1) }
    c_k = c_k + 1

(Slide example: 3 processors, 10 work units.) Note that the previous algorithm can be modified slightly to record the optimal allocation for B = 1, B = 2, etc. One can write the result as a list of processors, each allocation extending the previous one by one unit, e.g. B=1: P1; B=2: P1, P2; B=3: P1, P2, P1; B=4: P1, P2, P1, P3; and so on.
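A compact C sketch of this incremental allocation, in the "start from zero and add one unit at a time" form that also produces the nested lists above; the example cycle times are our own stand-ins for the slide's unlabeled figure:

#include <stdio.h>

/* Give each of B identical work units to the processor that would finish
   it earliest: pick k minimizing t[k] * (c[k] + 1).  t[i] is the cycle
   time of processor i; c[i] counts the units assigned to it. */
static void allocate(int p, const double *t, int B, int *c)
{
    for (int i = 0; i < p; i++) c[i] = 0;
    for (int unit = 0; unit < B; unit++) {
        int k = 0;
        for (int j = 1; j < p; j++)
            if (t[j] * (c[j] + 1) < t[k] * (c[k] + 1))
                k = j;
        c[k]++;
        printf("B=%d: unit -> P%d\n", unit + 1, k + 1);
    }
}

int main(void)
{
    double t[3] = {3, 5, 8};  /* hypothetical cycle times */
    int c[3];
    allocate(3, t, 10, c);    /* the slide's 3 processors, 10 work units */
    printf("c = (%d, %d, %d)\n", c[0], c[1], c[2]);  /* (5, 3, 2) */
    return 0;
}

With these cycle times the finish times come out as 15, 15, and 16, so no processor is left far behind, which is the optimality property the notes claim.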

We will see how this comes in handy for some load-balancing problems, e.g., LU factorization. (Slide: 4 processors, each handling many row blocks.) What if the processors are heterogeneous? Just give rows to processors proportionally to their speed, still in a cyclic pattern. This should have perfect efficiency, but there can of course be the usual rounding-off issues; e.g., what if one processor is 1.0001 times faster than another?

(LU factorization step, per the slides: finding the max a_ij is a reduction; computing the scaling factor is an independent computation; every update needs the scaling factor and the element from the pivot row, hence broadcasts; the updates themselves are independent computations.)

LU factorization requires load balancing: our original cyclic distribution doesn't work well in a heterogeneous setting. One option: start with a non-cyclic distribution and rebalance at each step (or every k steps), so that at each step all processors have the same amount of work to do; but redistribution is expensive in terms of communications.

Alternatively, just do what we did for the stencil application: use the distribution obtained with the incremental algorithm we saw before, reversed (for B = 10, the 10-unit processor list from above, read backwards). But this is not great: (figure: the optimal load balancing for 10 columns, for 7 columns, and for 4 columns each look different, so one fixed pattern cannot be optimal for every remaining-column count).

Of course, this should be done for blocks of columns rather than individual columns. Also, it should be done in a "motif" that spans some number of columns (B = 10 in our example) and is repeated cyclically throughout the matrix. Provided that B is large enough, we get a good approximation of the optimal load balance.

2-D Data Distributions (Sec 6.2). What we've seen so far works well for 1-D data distributions: use the simple algorithm, and use the allocation pattern over blocks in a cyclic fashion. We have seen that a 2-D distribution is what's most appropriate, for instance for matrix multiplication, so we use matrix multiplication as our driving example: C = A × B.

Let us assume a p × q processor grid on which all 3 matrices are distributed identically (the slide's figure shows a 3 × 3 grid of processors P_{1,1} … P_{3,3} holding blocks A_{1,1} … A_{3,3}). We have seen 3 algorithms for matrix multiplication (Cannon, Fox, Snyder); they are pretty difficult to generalize to a heterogeneous platform. The outer product is much simpler, and thus easier to adapt and modify, so let's look at the outer-product algorithm on a heterogeneous 2-D distribution.

Outer-product algorithm: proceeds in steps k = 1, …, n. Horizontal broadcasts: each P_{i,k}, for all i = 1, …, p, broadcasts a_{ik} to the processors in its processor row. Vertical broadcasts: each P_{k,j}, for all j = 1, …, q, broadcasts b_{kj} to the processors in its processor column.

50、k to processors in its processor row Vertical broadcasts PK j, for all J = broadcasts akj to processors in its processor columnIndependent computations processor can update cd =+ aik x akjOuter-product AlgorithmLet 10be the cycle time of processor PlrWe assign to processor Pj_. a rectangle of siex q

First, let us note that it is not always possible to achieve perfect load balancing: there are theorems showing that it is possible only if the matrix of processor cycle times t_{ij}, each placed in its grid position, has rank 1. Each processor computes for r_i × c_j × t_{ij} time, so the total execution time is T = max_{i,j} r_i × c_j × t_{ij}.

Load balancing can then be expressed as a constrained minimization problem: minimize max_{i,j} r_i × c_j × t_{ij} subject to Σ_{i=1}^{p} r_i = n and Σ_{j=1}^{q} c_j = n. The load-balancing problem is in fact much more complex: one can place the processors anywhere in the processor grid, so one must look for the optimum over all possible arrangements.
