High Performance Numerical Linear Algebra Computing (Part II)
Zhaojun Bai
Department of Computer Science & Department of Mathematics
University of California, Davis (/bai)

Outline of Dense Matrix Computing
1. Motivation
2. Benchmarks
3. Review of Gaussian Elimination (GE) for solving Ax=b
4. Optimizing GE for caches on sequential machines
5. LAPACK and ScaLAPACK (overview)

Motivation
Three basic linear algebra problems:
- Linear equations: solve Ax = b for x
- Least squares: find x that minimizes the sum of r_i^2, where r = Ax - b
- Eigenvalues: find λ and x where Ax = λx
- Lots of variations
Why dense A, as opposed to sparse A? Aren't "most" large matrices sparse?
- Dense algorithms are easier to understand
- Some applications yield large dense matrices:
  - Ax = b: computational electromagnetics
  - Ax = λx: quantum chemistry
- Benchmarking: "How fast is your computer?" becomes "How fast can you solve dense Ax = b?"
- Large sparse matrix algorithms often yield smaller (but still large) dense problems

Computational Electromagnetics
- Solve Ax = b
- Developed during the 1980s, driven by defense applications:
  - Determine the RCS (radar cross section) of an airplane
  - Reduce the signature of a plane (stealth technology)
- Other applications: antenna design, medical equipment
- Two fundamental numerical approaches:
  - Method of moments (MOM), in the frequency domain: large dense matrices
  - Finite differences, in the time domain: even larger sparse matrices

Computational Electromagnetics
(image: NW Univ. Computational Electromagnetics Laboratory, http://nueml.ece.nwu.edu/)
- Discretize the surface into triangular facets using standard modeling tools
- The amplitudes of the currents on the surface are the unknowns
- The integral equation is discretized into a set of linear equations

Computational Electromagnetics
After discretization, the integral equation has the form A x = b, where A is the (dense) impedance matrix, x is the unknown vector of amplitudes, and b is the excitation vector.
(See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta", IEEE Supercomputing '92, pp. 538-542.)

Results for Parallel Implementation on Intel Delta

    Task                                     Time (hours)
    Fill (compute n^2 matrix entries)        9.20   (embarrassingly parallel but slow)
    Factor (Gaussian elimination, O(n^3))    8.25   (good parallelism with right algorithm)
    Solve (O(n^2))                           2.17   (reasonable parallelism with right algorithm)
    Field calc. (O(n))                       0.12   (embarrassingly parallel and fast)

The problem solved was for a matrix of size 48,672. The 2.6 Gflops for Factor was the world record in 1991.

Current Records for Solving Dense Systems

    Year   System Size   Machine          # Procs   Gflops (Peak)
    1950s  O(100)
    1995   128,600       Intel Paragon    6768      281   (338)
    1996   215,000       Intel ASCI Red   7264      1068  (1453)
    1998   148,000       Cray T3E         1488      1127  (1786)
    1998   235,000       Intel ASCI Red   9152      1338  (1830)   (200 MHz Ppro)
    1999   374,000       SGI ASCI Blue    5040      1608  (2520)
    2000   362,000       Intel ASCI Red   9632      2380  (3207)   (333 MHz Xeon)
    2001   518,000       IBM ASCI White   8000      7226  (12000)  (375 MHz Power 3)

source: Alan Edelman, /edelman/records.html; LINPACK Benchmark, /performance/html/PDSreports.html

Current Records for Solving Small Dense Systems (Megaflops)

    Machine                              n=100   n=1000   Peak
    Fujitsu VPP 5000 (1 proc, 300 MHz)   1156    8784     9600
    Cray T90 (32 proc, 450 MHz)                  29360    57600
             (4 proc, 450 MHz)           1129    5735     7200
    IBM Power 4 (1 proc, 1300 MHz)       1074    2394     5200
    Dell Itanium (4 proc, 800 MHz)               7358     12800
                 (2 proc, 800 MHz)               4504     6400
                 (1 proc, 800 MHz)       600     2382     3200

source: LINPACK Benchmark, /performance/html/PDSreports.html

Review of Gaussian Elimination (GE)
Basic idea: add multiples of each row to later rows to make A upper triangular.
Initial version:

    ... for each column i,
    ... zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
        ... for each row j below row i
        for j = i+1 to n
            ... add a multiple of row i to row j
            for k = i to n
                A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
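
As a minimal runnable sketch (mine, not part of the original slides; the name ge_naive is hypothetical), the initial version transcribes almost directly into Python with NumPy. The one change needed for correctness is reading the multiplier once per row: the k = i step zeroes A(j,i), so recomputing A(j,i)/A(i,i) inside the k loop would corrupt the remaining updates. The first refinement below makes exactly this hoisting explicit.

    import numpy as np

    def ge_naive(A):
        """Reduce A to upper triangular form (no pivoting); assumes nonzero pivots."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for i in range(n - 1):              # for each column i
            for j in range(i + 1, n):       # for each row j below row i
                m = A[j, i] / A[i, i]       # multiplier, read before A[j,i] is zeroed
                for k in range(i, n):       # add a multiple of row i to row j
                    A[j, k] -= m * A[i, k]
        return A

    print(ge_naive([[2., 1.], [4., 5.]]))   # [[2. 1.] [0. 3.]]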

Refine GE Algorithm (1)
Remove the computation of the constant A(j,i)/A(i,i) from the inner loop:

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i to n
                A(j,k) = A(j,k) - m * A(i,k)

Refine GE Algorithm (2)
Don't compute what we already know: the zeros below the diagonal in column i (start k at i+1):

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - m * A(i,k)

Refine GE Algorithm (3)
Store the multipliers m below the diagonal, in the entries just zeroed, for later use:

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Refine GE Algorithm (4)
Split the j loop:

    for i = 1 to n-1
        for j = i+1 to n
            A(j,i) = A(j,i)/A(i,i)
        for j = i+1 to n
            for k = i+1 to n
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

Refine GE Algorithm (5)
Express using matrix operations (BLAS):

    for i = 1 to n-1
        A(i+1:n,i) = A(i+1:n,i) * ( 1 / A(i,i) )
        A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)
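
A NumPy rendering of this BLAS form (again my sketch; ge_blas2 is a hypothetical name): the column scaling is the BLAS1 operation and the outer-product update is the BLAS2 rank-1 update. It returns the packed factorization introduced in refinement (3): U on and above the diagonal, multipliers strictly below.

    import numpy as np

    def ge_blas2(A):
        """LU by rank-1 updates (no pivoting); multipliers overwrite the strict lower triangle."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # BLAS1: scale the pivot column
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # BLAS2: rank-1 update
        return A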

What GE Really Computes
- Call the strictly lower triangular matrix of multipliers M, and let L = I + M
- Call the upper triangle of the final matrix U
- Lemma (LU factorization): if the above algorithm terminates (does not divide by zero), then A = L*U
Solving A*x = b using GE:
- Factorize A = L*U using GE (cost = 2/3 n^3 flops)
- Solve L*y = b for y, using substitution (cost = n^2 flops)
- Solve U*x = y for x, using substitution (cost = n^2 flops)
Thus A*x = (L*U)*x = L*(U*x) = L*y = b, as desired. (A substitution sketch follows.)
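
To make the two substitution steps concrete, here is a small illustration (mine; lu_substitute is a hypothetical name, and it reuses the packed LU produced by the ge_blas2 sketch above, assuming no pivoting was needed):

    import numpy as np

    def lu_substitute(LU, b):
        """Solve A x = b given packed LU (unit lower L below the diagonal, U above)."""
        n = LU.shape[0]
        x = np.array(b, dtype=float)
        for i in range(1, n):               # forward substitution: L y = b
            x[i] -= LU[i, :i] @ x[:i]
        for i in range(n - 1, -1, -1):      # back substitution: U x = y
            x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
        return x

    A = np.array([[4., 3.], [6., 3.]])
    print(lu_substitute(ge_blas2(A), np.array([10., 12.])))   # [1. 2.]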

Problems with the Basic GE Algorithm
1. What if some A(i,i) is zero, or very small?
   - The result may not exist, or may be "unstable", so we need to pivot.
2. The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (earlier lectures):

    for i = 1 to n-1
        A(i+1:n,i) = A(i+1:n,i) / A(i,i)              ... BLAS 1 (scale a vector)
        A(i+1:n,i+1:n) = A(i+1:n , i+1:n )            ... BLAS 2 (rank-1 update)
                         - A(i+1:n , i) * A(i , i+1:n)

(figure: speeds of BLAS 1, BLAS 2, and BLAS 3 relative to machine peak)

Pivoting in Gaussian Elimination

    A = [ 0  1 ]   fails completely, even though A is "easy"
        [ 1  0 ]

Illustrate the problem in 3-decimal-digit arithmetic:

    A = [ 1e-4  1 ]   and   b = [ 1 ] ;   correct answer to 3 places is x = [ 1 ]
        [ 1     1 ]             [ 2 ]                                       [ 1 ]

Result of the LU decomposition:

    L = [ 1            0 ]  =  [ 1    0 ]      ... no roundoff error yet
        [ fl(1/1e-4)   1 ]     [ 1e4  1 ]

    U = [ 1e-4  1           ]  =  [ 1e-4  1    ]   ... error in the 4th decimal place
        [ 0     fl(1-1e4*1) ]     [ 0    -1e4  ]

    Check: L*U = [ 1e-4  1 ]      ... the (2,2) entry is entirely wrong
                 [ 1     0 ]

- The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5
- Numerical instability: the computed solution x is totally inaccurate
- Cure: pivot (swap rows of A) so that the entries of L and U stay bounded
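
The same failure can be reproduced numerically (my illustration, using a pivot of 1e-20 so the effect shows up in ordinary double precision):

    import numpy as np

    A = np.array([[1e-20, 1.0],
                  [1.0,   1.0]])
    b = np.array([1.0, 2.0])

    m = A[1, 0] / A[0, 0]             # huge multiplier: 1e20
    u22 = A[1, 1] - m * A[0, 1]       # fl(1 - 1e20) = -1e20: the original entry is "forgotten"
    y2 = b[1] - m * b[0]
    x2 = y2 / u22
    x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
    print(x1, x2)                     # 0.0 1.0  -- totally inaccurate
    print(np.linalg.solve(A, b))      # [1. 1.]  -- LAPACK pivots, and gets it right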

Gaussian Elimination with Partial Pivoting (GEPP)
Partial pivoting: swap rows so that each multiplier satisfies |L(j,i)| = |A(j,i)/A(i,i)| <= 1.

    for i = 1 to n-1
        find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
            ... i.e. the largest entry in the rest of column i
        if |A(k,i)| = 0
            exit with a warning that A is singular, or nearly so
        elseif k != i
            swap rows i and k of A
        end if
        A(i+1:n,i) = A(i+1:n,i) / A(i,i)          ... each quotient lies in [-1,1]
        A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)

- Lemma: this algorithm computes A = P*L*U, where P is a permutation matrix
- Since each entry satisfies |L(j,i)| <= 1, the algorithm is considered numerically stable
- For details, see the LAPACK code at /lapack/single/sgetf2.f
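
A compact NumPy version of this loop (my sketch; gepp is a hypothetical name). It records the pivot choices and packs L and U as before:

    import numpy as np

    def gepp(A):
        """Factor P*A = L*U with partial pivoting; returns (piv, packed LU)."""
        LU = np.array(A, dtype=float)
        n = LU.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            k = i + np.argmax(np.abs(LU[i:, i]))    # largest entry in the rest of column i
            if LU[k, i] == 0.0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                LU[[i, k], :] = LU[[k, i], :]       # swap rows i and k
                piv[[i, k]] = piv[[k, i]]
            LU[i+1:, i] /= LU[i, i]                 # multipliers, all in [-1, 1]
            LU[i+1:, i+1:] -= np.outer(LU[i+1:, i], LU[i, i+1:])
        return piv, LU

    # Solve A x = b via the earlier substitution sketch: x = lu_substitute(LU, b[piv])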

Converting BLAS2 to BLAS3 in GEPP
- Blocking
  - Used to optimize matrix multiplication
  - Harder here because of the data dependencies in GEPP
- Delayed updates
  - Save the updates to the "trailing matrix" from several consecutive BLAS2 updates
  - Apply many saved updates simultaneously in one BLAS3 operation
- The same idea works for much of dense linear algebra; open questions remain
- Need to choose a block size b:
  - The algorithm will save and apply b updates
  - b must be small enough that the active submatrix of b columns of A fits in cache
  - b must be large enough to make BLAS3 fast

Blocked GEPP (/lapack/single/sgetrf.f)

    for ib = 1 to n-1 step b               ... process the matrix b columns at a time
        end = ib + b - 1                   ... point to the end of the block of b columns
        apply the BLAS2 version of GEPP to get A(ib:n , ib:end) = P * L * U
        let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
        A(ib:end , end+1:n) = LL^-1 * A(ib:end , end+1:n)
                                           ... update the next b rows of U
        A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
                               - A(end+1:n , ib:end) * A(ib:end , end+1:n)
                                           ... apply the delayed updates with a single
                                           ... matrix multiply with inner dimension b

(For a correctness proof, see the online notes.)
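
Here is a compact NumPy/SciPy sketch of that blocked structure (my illustration; blocked_lu is a hypothetical name, and it omits the pivoting/row-swap bookkeeping so the shape of the algorithm stays visible):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=32):
        """Blocked right-looking LU (no pivoting), mirroring the sgetrf structure."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            for i in range(ib, end):        # BLAS2 factorization of the panel A[ib:, ib:end]
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                # triangular solve with LL: update the next b rows of U
                A[ib:end, end:] = solve_triangular(A[ib:end, ib:end], A[ib:end, end:],
                                                   lower=True, unit_diagonal=True)
                # delayed updates: one matrix multiply with inner dimension b
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A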

Gaussian Elimination
(figure: the three update patterns)
- Standard way: subtract a multiple of a row
- LINPACK: apply a sequence of updates to one column (a2 = L^-1 * a2), then a3 = a3 - a1*a2
- LAPACK: apply a sequence of updates to a block of nb columns, then apply the block update to the rest of the matrix

Efficiency of Blocked GEPP
(figure: speed of blocked GEPP across machines)

LAPACK
Standard library for dense/banded linear algebra:
- Linear systems: A*x = b
- Least squares problems: min_x || A*x - b ||_2
- Eigenvalue problems: Ax = λx, Ax = λBx
- Singular value decomposition (SVD): A = U Σ V^T
Combines algorithms from LINPACK and EISPACK into a single package (user interface similar to LINPACK):
- Single, double, complex, and double complex precision
- Built on the level 1, 2, and 3 BLAS; algorithms reorganized to use BLAS3 as much as possible
- Efficient on a wide range of computers
- Basis of the math libraries on many computers; incorporated in Matlab 6.x

LAPACK
Most of the parallelism is in the BLAS. Advantages of using the BLAS for parallelism:
- Clarity
- Modularity
- Performance
- Portability

Performance of LAPACK (n=1000)
(figure)

Performance of LAPACK (n=100)
(figure)
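
These routines are what most environments call under the hood. For instance (an aside of mine, assuming SciPy is installed), scipy.linalg.lu_factor and lu_solve wrap LAPACK's getrf and getrs:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[4., 3.], [6., 3.]])
    b = np.array([10., 12.])
    lu, piv = lu_factor(A)        # LAPACK dgetrf: blocked GEPP, packed LU + pivot indices
    x = lu_solve((lu, piv), b)    # LAPACK dgetrs: the two triangular solves
    print(x)                      # [1. 2.]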

Derivation of Blocked Algorithms: Cholesky Decomposition
Cholesky factorization: A = U^T U, with U upper triangular. Let a_j denote the entries of column j of A above the diagonal, a_jj the diagonal entry, u_j and u_jj the corresponding entries of U, and U11 the leading (j-1) x (j-1) block of U. Equating coefficients of the j-th column in A = U^T U, we obtain

    a_j  = U11^T u_j
    a_jj = u_j^T u_j + u_jj^2

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations

    U11^T u_j = a_j
    u_jj^2    = a_jj - u_j^T u_j

LINPACK Implementation
Here is the body of the LINPACK routine SPOFA, which implements the method:

          DO 30 J = 1, N
             INFO = J
             S = 0.0E0
             JM1 = J - 1
             IF( JM1.LT.1 ) GO TO 20
             DO 10 K = 1, JM1
                T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
                T = T / A( K, K )
                A( K, J ) = T
                S = S + T*T
       10    CONTINUE
       20    CONTINUE
             S = A( J, J ) - S
    C     ......EXIT
             IF( S.LE.0.0E0 ) GO TO 40
             A( J, J ) = SQRT( S )
       30 CONTINUE

BLAS 2 Implementation

          DO 10 J = 1, N
             CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
         $               A( 1, J ), 1 )
             S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
             IF( S.LE.ZERO ) GO TO 20
             A( J, J ) = SQRT( S )
       10 CONTINUE

This change by itself is sufficient to significantly improve performance on a number of machines: from 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz. However, the peak is 1,700 Mflop/s, which suggests further work is needed.
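
The same column-by-column scheme in NumPy/SciPy (my sketch; cholesky_jth_column is a hypothetical name): each step is one triangular solve for u_j followed by a square root for u_jj.

    import numpy as np
    from scipy.linalg import solve_triangular

    def cholesky_jth_column(A):
        """Upper Cholesky A = U^T U built one column at a time (the SPOFA scheme).
        Assumes A is symmetric positive definite."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        U = np.zeros((n, n))
        for j in range(n):
            u = solve_triangular(U[:j, :j], A[:j, j], trans='T') if j else np.zeros(0)
            s = A[j, j] - u @ u              # a_jj - u_j^T u_j
            if s <= 0.0:
                raise ValueError("matrix is not positive definite")
            U[:j, j] = u                     # u_j solves U11^T u_j = a_j
            U[j, j] = np.sqrt(s)             # u_jj
        return U

    A = np.array([[4., 2.], [2., 5.]])
    U = cholesky_jth_column(A)               # U.T @ U reproduces A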

Derivation of Blocked Algorithms: Cholesky Decomposition
Partition A and U into blocks:

    [ A11    A12    A13 ]   [ U11^T  0      0     ] [ U11  U12  U13 ]
    [ A12^T  A22    A23 ] = [ U12^T  U22^T  0     ] [ 0    U22  U23 ]
    [ A13^T  A23^T  A33 ]   [ U13^T  U23^T  U33^T ] [ 0    0    U33 ]

Equating coefficients of the second block of columns, we obtain

    A12 = U11^T U12
    A22 = U12^T U12 + U22^T U22

Hence, if U11 has already been computed, we can compute U12 as the solution of the equation

    U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM, and then U22 from

    U22^T U22 = A22 - U12^T U12

Derivation of Blocked Algorithms: Cholesky Decomposition

       DO 10 J = 1, N, NB
          JB = MIN( NB, N-J+1 )
          CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
      $               ONE, A, LDA, A( 1, J ), LDA )
          CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
      $               LDA, ONE, A( J, J ), LDA )
          CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
          IF( INFO.NE.0 ) GO TO 20
    10 CONTINUE

(Here JB, the current block width, is taken as MIN(NB, N-J+1); that definition is implicit on the slide, following LAPACK's SPOTRF.)

On a Pentium 4, the Level 3 BLAS squeezes a lot more out of one processor:

    Intel Pentium 4, 1.7 GHz, N = 500      Rate of Execution
    LINPACK variant (Level 1 BLAS)         238 Mflop/s
    Level 2 BLAS variant                   312 Mflop/s
    Level 3 BLAS variant                   1262 Mflop/s
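
A NumPy analogue of those three calls (my sketch; blocked_cholesky is a hypothetical name, and it reuses cholesky_jth_column from the earlier sketch in place of SPOTF2):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_cholesky(A, nb=32):
        """Blocked A = U^T U mirroring the STRSM / SSYRK / SPOTF2 steps above."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        U = np.zeros((n, n))
        for j in range(0, n, nb):
            jb = min(nb, n - j)
            if j:   # STRSM: solve U11^T U12 = A12 for U12
                U[:j, j:j+jb] = solve_triangular(U[:j, :j], A[:j, j:j+jb], trans='T')
            # SSYRK: form A22 - U12^T U12, then SPOTF2: factor the diagonal block
            S = A[j:j+jb, j:j+jb] - U[:j, j:j+jb].T @ U[:j, j:j+jb]
            U[j:j+jb, j:j+jb] = cholesky_jth_column(S)
        return U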

LAPACK Routines: Blocked Partitioned Algorithms
Existing block algorithms:
- LU factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or Q^T C
- Orthogonal reduction to (upper) Hessenberg, symmetric tridiagonal, or bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue problems
Early algorithms involved the use of small main memories, with tapes as secondary storage. Recent work centers on the use of vector registers, level 1 and 2 caches, main memory, and "out of core" memory.

Recent Development: GE via a Recursive Algorithm (F. Gustavson and S. Toledo)
LU algorithm:
1. Split the matrix into two rectangles (m x n/2); if there is only 1 column, scale it by the reciprocal of the pivot and return
2. Apply the LU algorithm to the left part
3. Apply the transformations to the right part (triangular solve A12 = L^-1 * A12, then matrix multiplication A22 = A22 - A21 * A12)
4. Apply the LU algorithm to the right part

    [ L    A12 ]
    [ A21  A22 ]

Most of the work is in the matrix multiply: matrices of size n/2, n/4, n/8, ...

Recursive Factorizations
- Just as accurate as the conventional method
- Same number of operations
- Automatic variable blocking: level 1 and 3 BLAS only!
- Extreme clarity and simplicity of expression
- Highly efficient
- The recursive formulation is just a rearrangement of the pointwise LINPACK algorithm
- The standard error analysis applies (assuming the matrix operations are computed the "conventional" way)
Ref: F. Gustavson, "New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms"
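
The four-step splitting above fits in a few lines of NumPy (my sketch; recursive_lu is a hypothetical name, it expects a float array and works in place on views, and it omits the pivoting that the Gustavson/Toledo formulation does carry):

    import numpy as np
    from scipy.linalg import solve_triangular

    def recursive_lu(A):
        """Recursive LU (no pivoting): overwrites A with the packed factors."""
        n = A.shape[1]
        if n == 1:                       # step 1: one column -- scale by 1/pivot
            A[1:, 0] /= A[0, 0]
            return A
        k = n // 2
        recursive_lu(A[:, :k])           # step 2: factor the left part
        # step 3: A12 = L^-1 * A12, then A22 = A22 - A21 * A12
        A[:k, k:] = solve_triangular(A[:k, :k], A[:k, k:], lower=True, unit_diagonal=True)
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]
        recursive_lu(A[k:, k:])          # step 4: factor the right part
        return A

    A = np.array([[4., 3.], [6., 3.]])
    print(recursive_lu(A))               # same packed LU as the pointwise algorithm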

(figures: "DGEMM ATLAS & DGETRF Recursive" on a 1 GHz AMD Athlon ($1100 system), Mflop/s vs. matrix order; LU factorization on a dual-processor 550 MHz Pentium III, recursive LU vs. LAPACK, uniprocessor and dual-processor)

LAPACK Ongoing Work
- Add functionality: updating/downdating, divide-and-conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
- Move to the new generation of high performance machines: IBM SPs, Cray T3E, SGI Origin, clusters of workstations
- New challenges:
  - New languages: FORTRAN 90, HP FORTRAN, ...
  - Many flavors of message passing (CMMD, MPL, NX, ...); a standard is needed (PVM, MPI): BLACS
  - Highly varying ratio of computational speed to communication speed
  - Many ways to lay out data
  - The fastest parallel algorithm is sometimes less stable numerically

ScaLAPACK
- Library of software dealing with dense & banded matrix computations
- Distributed memory, message passing
- MIMD computers and networks of workstations
- Clusters of SMPs

Choosing a Data Distribution
The main issues are:
- Load balancing
- Use of the Level 3 BLAS

Different Data Layouts for Parallel GE (on 4 procs)
(figure: the candidate layouts, with annotations)
- 1-D column blocked: bad load balance; P0 is idle after the first n/4 steps
- 1-D column cyclic: load balanced, but can't easily use BLAS2 or BLAS3
- 1-D column block cyclic: can trade off load balance against BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck (and the addressing is complicated)
- 2-D row and column block cyclic: the winner!

Distribution and Storage
- The matrix is block-partitioned, and blocks are mapped to processes
- Distributed 2-D block-cyclic scheme
- Example: a 5x5 matrix partitioned in 2x2 blocks, from the 2x2 process grid's point of view
- Routines are available to distribute/redistribute data

    Global view:              2x2 process grid view:
    A11 A12 A13 A14 A15       A11 A12 A15 | A13 A14
    A21 A22 A23 A24 A25       A21 A22 A25 | A23 A24
    A31 A32 A33 A34 A35       A51 A52 A55 | A53 A54
    A41 A42 A43 A44 A45       ------------+--------
    A51 A52 A53 A54 A55       A31 A32 A35 | A33 A34
                              A41 A42 A45 | A43 A44
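
The owning process of a global entry follows from simple index arithmetic. A small sketch (mine; owner is a hypothetical helper) that reproduces the example above, where 0-based rows and columns 0, 1, and 4 land on process row/column 0:

    def owner(i, j, mb=2, nb=2, prows=2, pcols=2):
        """Process-grid coordinates owning global entry (i, j), 0-based,
        under a 2-D block-cyclic distribution with mb x nb blocks."""
        return (i // mb) % prows, (j // nb) % pcols

    for i in range(5):
        print([owner(i, j) for j in range(5)])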

Row and Column Block Cyclic Layout
- Processors and matrix blocks are distributed in a 2-D array
- pcol-fold parallelism in any column, and calls to BLAS2 and BLAS3 on matrices of size brow-by-bcol
- The serial bottleneck is eased
- The layout need not be symmetric in rows and columns

PDGESV = ScaLAPACK Parallel LU Routine
Since it can run no faster than its inner loop (PDGEMM), we measure:
Efficiency = Speed(PDGESV) / Speed(PDGEMM)
Observations:
- Efficiency is well above 50% for large enough problems
- For fixed N, as P increases, efficiency decreases (just as for PDGEMM)
- For fixed P, as N increases, efficiency increases (just as for PDGEMM)
- From the bottom table, the cost of solving Ax=b is about half that of matrix multiply for large enough matrices. From the flop counts we would expect it to be (2*n^3)/(2/3*n^3) = 3 times faster, but communication makes it a little slower.

Parallelism in ScaLAPACK
- Level 3 BLAS block operations: all the reduction routines
- Pipelining: QR algorithm, triangular solvers, classic factorizations
- Redundant computations: condition estimators
- Static work assignment: bisection
- Task parallelism: sign function eigenvalue computations
- Divide and conquer: tridiagonal and band solvers, symmetric eigenvalue problem, and the sign function
- Cyclic reduction: the reduced system in the band solver
- Data parallelism: sign function

(figures: ScaLAPACK performance plots, with annotations)
- Scales well, nearly full machine speed
- Old version, pre-1998 Gordon Bell Prize; still have ideas to accelerate (project available!); old algorithm, plan to abandon

57、don48Have good ideas to speedupProject available!Hardest of all to parallelize495051Outline of Sparse Matrix Computing1)The landscape of sparse Ax = b solvers2)Sparse direct solvers 3)Iterative methods 4)Parallel matrix-vector multiplication5)Templates for sparse linear systems and eigenproblems52Th

58、e Landscape of Sparse Ax=b Solvers Pivoting LU GMRES, QMR, Cholesky Conjugate gradient DirectA = LUIterativey = AyNon-symmetricSymmetricpositivedefiniteMore RobustLess StorageMore RobustMore GeneralD53Sparse direct solversGaussian elimination (LU, LLT,LDLT factorization), followed by lower/upper tri

59、angular solutions: Dense: PA = LU permutation for stabilitySparse: Pr*A*PcT = LU permutation for stability and sparsity of L, UDistinct steps for sparse matrices: 1. Ordering step: order equations and variables to minimize fill in L, U -Heuristics based on combinatorics 2. Analysis step (Symbolic fa

60、ctorization): set up data structures and allocate memory of L and U3. Numerical factorization -Usually dominates total time 4. Triangular solves- Usually approx optimal fill and flop count Practice: often wins for very large problemsBanded orderings (Reverse Cuthill-McKee, Sloan, . . .): Try to keep all nonzeros clos
