Overview of High-Performance Computing Development

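High-Performance Computing and Applications

Instructors
- Wang Yunlan, EMAIL: wangyl
- Zhao Tianhai, EMAIL: zhaothnwpu.
High-Performance Computing Research and Development Center
Office: 3rd floor, Yongzi Building. Phone: 88493434 (O)

Course objective
Master the programming tools of high-performance computing (HPC) and use them to solve related problems.

Main course content
The course introduces HPC system architecture, methods for designing high-performance parallel programs, and the latest directions in HPC technology. Topics include:
- High-performance processors and multiprocessor systems
- Cluster computing systems, Linux cluster configuration, cluster resource management and job scheduling, multithreaded programming and performance optimization
- Parallel programming tools: OpenMP, MPI, CUDA, MapReduce, and others

Discussion platform
2013 HPC course QQ group: 158463721

Reference books
- John L. Hennessy, David A. Patterson; Jia Hongfeng (translator). Computer Architecture: A Quantitative Approach (5th edition).
- Li Jingmei (editor), Wu Yanxia (editor). New-Generation Computer Architecture.
- Yang Xiaodong, Lu Song, Mou Shengmei. Parallel Computer Architecture: Technology and Analysis. Science Press, January 2009.
- Liu Peng. Cloud Computing (2nd edition). Publishing House of Electronics Industry, May 2011.
- Zeng Yu et al. High-Productivity Computer Systems: An Analysis of Several Key Technologies. Higher Education Press, January 2010.
- Zhang Wusheng, Xue Wei, Li Jianjiang, Zheng Weimin. MPI Parallel Programming: A Tutorial with Examples. Tsinghua University Press, 2009.
- Michael J. Quinn; Chen Wenguang, Wu Yongwei et al. (translators). Parallel Programming in C with MPI and OpenMP. Tsinghua University Press, October 2004.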

Assignments
- Technical report on a current HPC research topic: cloud computing, CPU/GPU technology, or virtualization
- Lab reports: building a cluster environment; parallel application programming with MPI, OpenMP, and CUDA

High-Performance Computing and Applications
Lecture 1: Overview of High-Performance Computing Development

Lecture outline
- Application demand
- Development of computer architecture
- The core technology of HPC: parallel computing
- The importance of parallel programming

Application demand

HPC, research, and industry: demand and significance
- Computing demand in basic research: physics, chemistry, biology, materials
- Demand in industry: banking, computer-aided design, pharmaceuticals, petroleum, meteorology, online services, information security

Traditional scientific research can be
- difficult, e.g., building large wind tunnels
- expensive, e.g., building prototypes
- slow, e.g., waiting for climate change or the evolution of celestial bodies
- dangerous, e.g., weapons development, drugs, atmospheric tests, power-system analysis

Scientific research based on computational science
- Physical principles and numerical methods
- Theoretical analysis
- Experiment design

Challenging computational problems span every area of science and engineering
- Science: global climate modeling; astrophysical modeling; biology (genomics, protein folding, drug design); computational chemistry; computational material sciences and nanosciences
- Engineering: crash simulation; semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design); oil field applications
- Business: financial and economic modeling; transaction processing, web services, and search engines
- Defense: nuclear weapons (test by simulation); cryptography

Units of High Performance Computing
- Compute capability
- Storage capacity

Global climate simulation
- Problem: compute f(longitude, latitude, elevation, time) -> temperature, pressure, humidity, wind velocity
- Approach: discretize the domain, e.g., a measurement point every 10 km; then devise an algorithm to predict the weather at time t+dt given t
- Uses: predict major events, e.g., El Nino; use in setting air emissions standards
- Source: /chammp/chammp.html

Simulating atmospheric circulation
- Requires solving the Navier-Stokes equations
- 1-minute time step, about 100 floating-point operations per grid point
- Computational requirements:
  - To keep up with real time, 5 x 10^11 flops must execute in 1 minute = 8 Gflop/s
  - A 7-day weather forecast computed within 1 day needs 56 Gflop/s
  - A 50-year climate prediction computed within 1 month needs 4.8 Tflop/s
  - A 50-year prediction computed within 12 hours needs 288 Tflop/s
- Refining the grid resolution increases the computational cost by 8x to 16x
- More accurate prediction models must couple atmosphere, ocean, glaciers, and land, plus factors such as geochemistry
- Millennium-scale climate model analysis cannot currently be computed effectively

HPC has become an essential tool for complex systems engineering

Aviation
- High-end HPC demand in aviation is concentrated in CAE: aerodynamic computation, structural computation, aeroelastic analysis, multidisciplinary design optimization, flight-load computation, stealth design computation, stability and control computation, flight simulation
- Other HPC needs: digital assembly, digital prototyping
- Main characteristics: compute capability vs. problem scale; exploratory research vs. engineering application
- Example problems: supersonic cruise; high-angle-of-attack maneuvering; launching weapons from internal bays

The ultimate goal of CFD: virtual flight testing
[Figure: virtual wind tunnel (CFD), design experience, wind-tunnel testing, virtual flight testing]

[Figure: computing devices / users / content, today vs. 2015. Source: IDF 2012]

The big-data phenomenon
- "Data are becoming the new raw material of business: an economic input almost on a par with capital and labor." The Economist, 2010
- "Information will be the oil of the 21st century." Gartner, 2010
- Source: IDF 2012

2015 Cloud Vision: coexistence of opportunities and challenges
[Figure. Source: IDF 2012]

Trends to exascale performance
- Roughly 10x performance every 4 years, which predicts that we'll hit exascale performance in 2018-19
- Source: IDF 2012

Development of computer architecture

Trends in computer architecture
- Architectural improvements turn technological innovation into processing performance
- History of computer architecture: vacuum tubes, transistors, integrated circuits, large-scale integration
- The VLSI (Very Large Scale Integration) era can be viewed as an exploration of parallel processing
- Parallel processing is the core technique for raising processing performance

Architectural development: exploring forms of parallelism
The greatest trend in the VLSI generation is the increase in parallelism.
- 1970 - 1985: bit-level parallelism
  - 4-bit -> 8-bit -> 16-bit
  - slows after 32-bit
  - adoption of 64-bit now under way; 128-bit is far off (not a performance issue)
- Mid-1980s to mid-1990s: instruction-level parallelism
  - pipelining and simple instruction sets, plus compiler advances (RISC)
  - on-chip caches and functional units => superscalar execution
  - greater sophistication: out-of-order execution, speculation, and prediction, to deal with control transfer and latency problems
- Now: thread-level parallelism

Three phases of VLSI: bit-level, instruction-level, thread-level
[Figure]

VLSI technology trends
- Intel announced that they have reached 1.7 billion transistors with the Itanium processor
- Gigascale Integration (GSI) = 1 billion transistors per chip
- Source: /jeff/ece4420/technology.pdf

Growth in single-processor performance
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ?%/year, 2002 to present

Trend in processor power
- Clock frequency is no longer being raised; the shift is to more CPUs per chip
- The bottleneck is the maximum power of an air-cooled chip

Recent Intel processors
"We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry." Intel President Paul Otellini, IDF 2005

Processor | Year | Fabrication (nm) | Clock (GHz) | Power (W)
Pentium 4 | 2000 | 180 | 1.80-4.00 | 35-115
Pentium M | 2003 | 90/130 | 1.00-2.26 | 5-27
Core 2 Duo | 2006 | 65 | 2.60-2.90 | 10-65
Core 2 Quad | 2006 | 65 | 2.60-2.90 | 45-105
Core i7 (Quad) | 2008 | 45 | 2.93-3.60 | 95-130
Core i5 (Quad) | 2009 | 45 | 3.20-3.60 | 73-95
Pentium Dual-Core | 2010 | 45 | 2.80-3.33 | 65-130
Core i3 (Duo) | 2010 | 32 | 2.93-3.33 | 18-73
2nd Gen i3 (Duo) | 2011 | 32 | 2.50-3.40 | 35-65
2nd Gen i5 (Quad) | 2011 | 32 | 3.10-3.80 | 45-95
2nd Gen i7 (Quad/Hexa) | 2011 | 32 | 3.80-3.90 | 65-130
3rd Gen i3 (Duo) | 2012 | 22/32 | 2.80-3.40 | 35-55
3rd Gen i5 (Quad) | 2012 | 22/32 | 3.20-3.80 | 35-77
3rd Gen i7 (Quad/Hexa) | 2012 | 22/32 | 3.70-3.90 | 45-77
Xeon E5 (8 cores) | 2013 | 22 | 1.80-2.90 | 60-130
Xeon Phi (60 cores) | 2013 | 22 | 1.10 | 300

Intel's many-core and multi-core
- Intel 80-core TeraScale Processor (Vangal et al. 2008)
- Developed a solver (single precision) for this chip that ran at 1 TFLOP with only 97 watts
- Source: Tim Mattson, Intel Labs

Trends are putting everything onto one chip
- The future belongs to heterogeneous, many-core SOC as the standard building block of computing
- SOC = system on a chip
- Source: Tim Mattson, Intel Labs

Development trends of cluster systems

Large-scale cluster computing systems
- Franklin (NERSC-5): Cray XT4
  - 9,532 compute nodes; 38,128 cores
  - Each node has an AMD quad-core processor and 8 GB of memory
  - 25 Tflop/s on applications; 352 Tflop/s peak
- HPSS archival storage: 40 PB capacity; 4 tape libraries
- NERSC Global Filesystem (NGF): uses IBM's GPFS; 1.5 PB; 5.5 GB/s
- Clusters, 105 Tflops total
  - Carver: IBM iDataplex cluster
  - PDSF (HEP/NP): Linux cluster (1K cores)
  - Magellan cloud testbed: IBM iDataplex cluster
- Analytics: Euclid (512 GB shared memory); Dirac GPU testbed (48 nodes)
- Hopper (NERSC-6): Cray XE6
  - Phase 1: Cray XT5, 668 nodes, 5,344 cores
  - Phase 2: 1 Pflop/s peak (2 sockets/node, 12 cores/socket)

Tianhe-I(A)
- 6,144 compute nodes; 24,576 cores
- 2,560 AMD Radeon HD 4870*2 GPUs
- 98 TB memory in total
- Rpeak: 4.700 Pflop/s; Rmax: 2.566 Pflop/s

Jaguar (Cray XT5)
- 224,256 x86-based AMD Opteron processor cores
- Rpeak: 2.331 Pflop/s; Rmax: 1.759 Pflop/s

Cluster equipment at the Northwestern Polytechnical University (NWPU) HPC Center
- Inspur Tiansuo TS10000 with NX5440 blade compute nodes
- Inspur TS10K cluster:
  - Compute capability: 73 Tflops total
  - 153 compute blades
  - 3 MIC accelerator nodes
  - 4 GPU accelerator nodes
  - Parallel storage: 179 TB
  - Fiber storage system: 40 TB
  - Linux operating system

Basic components of a cluster
- Fiber storage system
- Management, login, and I/O nodes
- Compute nodes
- Parallel storage

Top 10 list, June 2012

Rank | Site | Computer
1 | DOE/NNSA/LLNL, United States | Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM)
2 | RIKEN Advanced Institute for Computational Science (AICS), Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect (Fujitsu)
3 | DOE/SC/Argonne National Laboratory, United States | Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom (IBM)
4 | Leibniz Rechenzentrum, Germany | SuperMUC - iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR (IBM)
5 | National Supercomputing Center in Tianjin, China | Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050 (NUDT)
6 | DOE/SC/Oak Ridge National Laboratory, United States | Jaguar - Cray XK6, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA 2090 (Cray Inc.)
7 | CINECA, Italy | Fermi - BlueGene/Q, Power BQC 16C 1.60GHz, Custom (IBM)
8 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN - BlueGene/Q, Power BQC 16C 1.60GHz, Custom (IBM)
9 | CEA/TGCC-GENCI, France | Curie thin nodes - Bullx B510, Xeon E5-2680 8C 2.700GHz, Infiniband QDR (Bull)
10 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050 (Dawning)

Chinese systems in the TOP500, June 2011

Rank | Site | Manufacturer | Interconnect family | Interconnect
2 | National Supercomputing Center in Tianjin | NUDT | Proprietary | Proprietary
4 | National Supercomputing Centre in Shenzhen (NSCS) | Dawning | Infiniband | Infiniband QDR
33 | Institute of Process Engineering, Chinese Academy of Sciences | IPE, Nvidia, Tyan | Infiniband | Infiniband QDR
40 | Shanghai Supercomputer Center | Dawning | Infiniband | Infiniband DDR
82 | Computer Network Information Center, Chinese Academy of Science | Lenovo | Infiniband | Infiniband
97 | Tsinghua University | Inspur | Infiniband | Infiniband QDR
143 | Network Company | IBM | Gigabit Ethernet | Gigabit Ethernet
164 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet
199 | Web Company (C) | Hewlett-Packard | Gigabit Ethernet | Gigabit Ethernet
201 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet
202 | Internet Service | IBM | Gigabit Ethernet | Gigabit Ethernet

IPE: Institute of Process Engineering, Chinese Academy of Sciences (formerly the Institute of Chemical Metallurgy)

Some of the Chinese supercomputers in the TOP500, June 2012

Rank | Site | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW)
10 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050 (Dawning) | 120640 | 1271.0 | 2984.3 | 2580
26 | National Supercomputing Center in Jinan, China | Sunway Blue Light - Sunway BlueLight MPP, ShenWei processor SW1600 975.00 MHz, Infiniband QDR (National Research Center of Parallel Computer Engineering & Technology) | 137200 | 795.9 | 1070.2 | 1074
37 | Institute of Process Engineering, Chinese Academy of Sciences, China | Mole-8.5 - Mole-8.5 Cluster, Xeon X5520 4C 2.27 GHz, Infiniband QDR, NVIDIA 2050 (IPE, Nvidia, Tyan) | 29440 | 496.5 | 1012.6 | 540
94 | Shanghai Supercomputer Center, China | Magic Cube - Dawning 5000A, QC Opteron 1.9 GHz, Infiniband, Windows HPC 2008 (Dawning) | 30720 | 180.6 | 233.5 |
122 | Government, China | Sunway 4000H Cluster, Xeon X56xx (Westmere-EP) 2.93 GHz, Infiniband QDR (National Research Center of Parallel Computer Engineering & Technology) | 14280 | 145.6 | 167.4 |
127 | Research Center, China | Cluster Platform SL250s Gen8, Xeon E5-2660 8C 2.200GHz, Infiniband FDR, NVIDIA 2090 (Hewlett-Packard) | 8064 | 135.4 | 270.7 |
132 | Internet Service, China | xSeries x3650 Cluster, Xeon E5649 6C 2.530GHz, Gigabit Ethernet (IBM) | 23316 | 131.4 | 236.0 | 707.3

Source: /sublist

Clusters in the TOP500 (June 2011)
- Constellations: include a very-high-capacity switching system that can manage high-speed data transfer among thousands of compute engines at the same time
- Massively parallel processors (MPP): built from many loosely coupled processing units; the CPUs in each unit have their own private resources such as bus, memory, and disk, and each processing unit runs only a microkernel
- Cluster: every node runs a complete operating system

Architecture | Count | Share % | Rmax Sum (GF) | Rpeak Sum (GF) | Processor Sum
Constellations | 2 | 0.40 % | 94970 | 112947 | 17648
MPP | 87 | 17.40 % | 19293725 | 25550429 | 2984630
Cluster | 411 | 82.20 % | 39541331 | 59516573 | 4777646
Totals | 500 | 100% | 58930025.59 | 85179949.00 | 7779924

- June 2012 data: 407 of the TOP500 systems are clusters
- June 2013: 417 clusters, 83 MPP

[Figure: TOP500 distribution by country]
[Figure: evolution of TOP500 architectures over the past 19 years]

What the TOP500 says about cluster systems
- In HPC, the cluster has become the mainstream system architecture and will continue to expand its share
- Clusters make up the overwhelming majority of the TOP500, which shows that the cluster is the main way to build very large computing systems

Development trends of cluster systems
- 64-bit systems are gradually becoming mainstream
- Multiple commercial high-speed interconnects
- SAN systems as cluster storage

64-bit: breaking the 2 GB system memory barrier
- Scientific computing
  - Large-scale simulation: the memory needed by a 3-D grid simulation can easily exceed 2 GB
  - Bioinformatics: applications such as gene assembly need large amounts of memory, and insufficient memory is one of the main problems in practice
  - Prime-number computation: needs heavy 64-bit integer arithmetic and large memory
- Commercial applications
  - Massive data processing
  - In-memory databases
  - Media streaming servers
- Large memory with high memory bandwidth reduces the number of disk accesses and can improve performance by nearly an order of magnitude

64-bit: new design ideas
- Many existing algorithms were designed around a shortage of memory, so much effort went into trading time for space
- 64-bit systems make much larger memory accessible, so many applications may need to be redesigned around new ideas to benefit from 64-bit

64-bit: not a panacea
- Not every user needs to move to 64-bit right now
- Code size grows, and performance may actually drop
- Analyze the characteristics of your own application:
  - Does it need more than 2 GB of memory?
  - Does it perform a large amount of 64-bit integer arithmetic?
- If the answer to both questions is no, the expected benefit of a 64-bit system may not materialize
- Some applications gain a great deal from a particular 64-bit processor, but that is a property of the specific processor rather than of 64-bit itself, and has to be analyzed case by case

Cluster interconnects

Metrics for evaluating an interconnect
- Latency
- Bandwidth
- Functional support
- Price

Interconnect | Interface | MPI latency (us) | Uni-directional bandwidth (MB/s) | Notes
Gigabit Ethernet | PCI | 30-50 | 100 | Cheapest
Myrinet | PCI-X | 6 | 248 |
SCI | PCI | 1.4 | 326 | Lowest latency
Quadrics III | PCI | 5 | 340 |
InfiniBand 4x | PCI-X | 7.5 | 805 | Highest bandwidth

Functional support
- All of them support MPI, and all except Gigabit Ethernet implement an efficient communication protocol
- SCI and Quadrics also support shared memory, but their remote communication latency is still on the order of microseconds, so fine-grained shared-memory programs are still poorly served (by comparison, remote access latency on the SGI Altix series is below 200 ns)

Challenges facing cluster systems
- Energy consumption
  - Not only a cluster problem
  - Has to be addressed jointly at several levels: chip, single machine, and cluster
- Manageability
  - Monitoring
  - Self-healing
  - Filtering and extracting management information
  - Partitioning

Execution is not just about hardware
- The modern programmer does not see assembly language
- Many do not even see "low-level" languages like C

What is parallel programming? Why parallel programming?

What is parallel computing?
Traditionally, software has been written for serial computation:
- To be run on a single computer having a single Central Processing Unit (CPU)
- A problem is broken into a discrete series of instructions
- Instructions are executed one after another
- Only one instruction may execute at any moment in time
- For example: a payroll program

Parallel computing
Parallel computing is the simultaneous use of multiple compute resources to handle one computational task:
- To be run using multiple CPUs
- A problem is broken into discrete parts that can be solved concurrently
- Each part is further broken down into a series of instructions
- Instructions from each part execute simultaneously on different CPUs
[Figure: example]

The compute resources might be
- A single computer with multiple processors
- An arbitrary number of computers connected by a network
- A combination of both

The computational problem should be able to
- Be broken apart into discrete pieces of work that can be solved simultaneously
- Execute multiple program instructions at any moment in time
- Be solved in less time with multiple compute resources than with a single compute resource

Speedup
- Goal of applications in using parallel machines: speedup
- For a fixed problem size (input data set), performance = 1/time

The importance of parallel programming
- Now we can get: a single-source approach to multi- and many-core (Source: IDF 2012)

However, the parallelizing compilers
After 30 years of intensive research, only limited success in parallelism detection and program transformations:
- instruction-level parallelism at the basic-block level can be detected
- parallelism in nested for-loops containing arrays with simple index expressions can be analyzed
- analysis techniques, such as data dependence analysis, pointer analysis, flow sensitive analysis, abstract interpretation, ... when applied across procedure boundaries often take far too long and tend to be fragile, i
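The slides contrast a serial payroll program with a problem "broken into discrete parts that can be solved concurrently", and they name speedup as the goal. The sketch below illustrates that decomposition by hand, which is the kind of explicit parallel programming the slides argue is still needed. Everything specific in it is an assumption made for illustration: the function names, the flat 20% withholding rule, the made-up salary data, and the use of Python's standard `concurrent.futures` thread pool in place of the course's tools (OpenMP, MPI, CUDA). The slides give only performance = 1/time, so the `speedup` helper uses the conventional definition, serial time divided by parallel time.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def net_pay_cents(gross_cents):
    """Hypothetical payroll rule: withhold a flat 20% (integer cents keep sums exact)."""
    return gross_cents - gross_cents * 20 // 100


def payroll_serial(gross_list):
    """Serial version: one instruction stream handles every employee in turn."""
    return sum(net_pay_cents(g) for g in gross_list)


def payroll_parallel(gross_list, workers):
    """Parallel version: each worker totals one chunk, then partial sums are combined."""
    chunks = split_into_chunks(gross_list, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(payroll_serial, chunks))
    return sum(partial_totals)


def speedup(time_serial, time_parallel):
    """Conventional speedup: with performance = 1/time, speedup = T_serial / T_parallel."""
    return time_serial / time_parallel


if __name__ == "__main__":
    salaries = [300_000 + 1_000 * i for i in range(1_000)]  # made-up gross pay, in cents
    # Both versions must agree; only the order of the work differs.
    assert payroll_parallel(salaries, workers=4) == payroll_serial(salaries)
    print(payroll_serial(salaries))
    print(speedup(8.0, 2.0))  # hypothetical timings: 8 s serial, 2 s parallel
```

The thread pool here only demonstrates the partition, compute, and combine pattern. In CPython, threads do not run pure-Python arithmetic on several cores at once, so a measurable speedup on this workload would need separate processes or native code such as the MPI and OpenMP programs covered later in the course.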
