容錯計算英文版課件:17 Malfunction Diagnosis_第1頁
容錯計算英文版課件:17 Malfunction Diagnosis_第2頁
容錯計算英文版課件:17 Malfunction Diagnosis_第3頁
容錯計算英文版課件:17 Malfunction Diagnosis_第4頁
容錯計算英文版課件:17 Malfunction Diagnosis_第5頁
已閱讀5頁,還剩67頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領

文檔簡介

1、 Robust Parallel Processing Resilient AlgorithmsNov. 20121Part V Malfunctions: Architectural AnomaliesAbout This PresentationThis presentation is intended to support the use of the textbook Dependable Computing: A Multilevel Approach (traditional print or on-line open publication, TBD). It is update

2、d regularly by the author as part of his teaching of the graduate course ECE 257A, Fault-Tolerant Computing, at Univ. of California, Santa Barbara. Instructors can use these slides freely in classroom teaching or for other educational purposes. Unauthorized uses, including distribution for profit, a

3、re strictly prohibited. Behrooz ParhamiEditionReleasedRevisedRevisedRevisedRevisedFirstSep. 2006Oct. 2007Nov. 2009Nov. 2012Nov. 20122Part V Malfunctions: Architectural Anomalies17 Malfunction DiagnosisNov. 20123Part V Malfunctions: Architectural AnomaliesNov. 20124Part V Malfunctions: Architectural

4、Anomalies Robust Parallel Processing Resilient AlgorithmsNov. 20125Part V Malfunctions: Architectural Anomalies17.1 Self-Diagnosis in SubsystemsLayered approach: A small part of a unit is tested, which then forms a trusted kernel The trusted kernel is used to test the next layer of subsystems Region

5、 of trust is gradually extended, until it covers the entire unitOne approach to go/no-go testing based on self-diagnosis Tester supplies a random seed to the built-in test routine The test routine steps through a long computation that exercises nearly all parts of the system, producing a final resul

6、t The tester compares the final result to the expected resultIdeally, if a properly designed self-test routine returns a 32-bit value, the value will match the expected result despite the presence of faults with probability 232 109.6 test coverage = 1 109.6 KernelSelf-diagnosis initiationUnit AUnit

7、BUnit CNov. 20126Part V Malfunctions: Architectural Anomalies17.2 Malfunction Diagnosis ModelDiagnosis of one unit by another The tester sends a self-diagnosis request, expecting a response The unit under test eventually sends some results to the tester The tester interprets the results received and

8、 issues a verdictTesting capabilities among units is represented by a directed graphi Testerj Testee Test capabilityI think j is okay(passed test)0i Testerj Testee Test capabilityI think j is bad(failed test1The verdict of unit i about unit j is denoted by Dij 0, 1All the diagnosis verdicts constitu

9、te the n n diagnosis matrix D The diagnosis matrix D is usually quite sparse M0M1M3M2 Nov. 20127Part V Malfunctions: Architectural AnomaliesMore on Terminology and AssumptionsMalfunction diagnosis in our terminology corresponds to “system-level fault diagnosis” in the literatureThe qualification “sy

10、stem-level” implies that the diagnosable units are subsystems with significant computational capabilities (as opposed to gates or other low-level components)We do not use the entries on the main diagonal of the diagnosis matrix D (a unit does not judge itself) and we usually do not let two units tes

11、t one anotherA good unit always issues a correct verdict about another unit (i.e., tests have perfect coverage), but the verdict of a bad unit is arbitrary and cannot be trustedThis is known as the PMC model (Preparata, Metze, & Chien)We consider the PMC model only, but other models also exist(e.g.,

12、 in comparison-based models, verdicts are derived from comparing the outputs of unit pairs) - D01 - - - - D12 D13D20 - - - D30 - D32 -Nov. 20128Part V Malfunctions: Architectural Anomalies17.3 One-Step DiagnosabilityConsider this system, with the test outcomes shownWe say that the system above is 1-

13、step 1-diagnosable (we can correctly diagnose up to 1 malfunctioning unit in a single round of tests)M0M1M3M2D01D12D30D23D20 D13Diagnosis syndromesMalfnD01D12 D13 D20D30 D32 None 0 0 0 0 0 0M00/1 0 0 1 1 0M1 10/10/1 0 0 0M2 0 1 00/1 0 1M3 0 0 1 00/10/1M0,M10/10/10/1 1 1 0M1,M2 10/10/10/1 0 1Syndrome

14、 dictionary:0 0 0 0 0 0 OK0 0 0 1 1 0 M0 0 0 1 0 0 0 M30 0 1 0 0 1 M30 0 1 0 1 0 M30 0 1 0 1 1 M30 1 0 0 0 1 M20 1 0 1 0 1 M21 0 0 0 0 0 M11 0 0 1 1 0 M01 0 1 0 0 0 M11 1 0 0 0 0 M11 1 1 0 0 0 M1Nov. 20129Part V Malfunctions: Architectural AnomaliesRequirements for One-Step t-DiagnosabilityAn n-unit

15、 system is 1-step t-diagnosable if the diagnosis syndromes for conditions involving up to t malfunctions are all distinctNecessary conditions:1. n 2t + 1; i.e., a majority of units must be good2. Each unit must be tested by at least t other unitsM0M1M3M2D01D12D30D23D20 D13Sufficient condition:An n-u

16、nit system in which no two units test one another is 1-step t-diagnosable iff each unit is tested by at least t other unitsThe system above, has each unit tested by 1 or 2 units; it is 1-step 1-diagnosableIt cannot be made 1-step 2-diagnosable via adding more test connectionsSo, each unit being test

17、ed by t other units is both necessary and sufficientNov. 201210Part V Malfunctions: Architectural AnomaliesAnalogy: “Liars and Truth-Tellers” PuzzlesYou visit an island whose inhabitants are from two tribesMembers of one tribe (“l(fā)iars”) consistently lieMembers of the other tribe (“truth-tellers”) al

18、ways tell the truthYou encounter a person on the islandWhat single yes/no question would you ask him to determine his tribe?More generally, how can you derive correct conclusions from info provided by members of these tribes, without knowing their tribes?How would the problem change if the two tribe

19、s were “truth-tellers” and “randoms” (whose members give you random answers)M0M1M3M200 x x 0 1In the context of malfunction diagnosis:Truth-tellers are akin to good modulesRandoms correspond to bad modulesYou do not know whether a module is good or badModule “opinions” about other modules must be us

20、ed to derive correct diagnosesNov. 201211Part V Malfunctions: Architectural Anomalies1-Step Diagnosability: Analysis & SynthesisA degree-t directed chordal ring, in which node i tests the t nodes i + 1, i + 2, . . . , i + t (all mod n) has the required propertySynthesis problem:Specify the test link

21、s (connection assignment) that makes an n-unit system 1-step t-diagnosable; use as few test links as possibleAnalysis problems:1. Given a directed graph defining the test links, find the largest value of t for which the system is 1-step t-diagnosable (easy if no two units test one another; fairly di

22、fficult, otherwise)2. Given a directed graph and its associated test outcomes, identify all the malfunctioning units, assuming there are no more than tThere is a vast amount of published work dealing with Problem 1Problem 2 arises when we want to repair or reconfigure a system using test outcomes (s

23、olved via table lookup or analytical methods)Nov. 201212Part V Malfunctions: Architectural AnomaliesAn O(n3)-Step Diagnosis AlgorithmInput: The diagnosis matrixOutput: Every unit labeled G or Bwhile some unit remains unlabeled repeat choose an unlabeled unit and label it G or B use labeled units to

24、label other units if the new label leads to a contradiction then backtrack endifendwhileM0M1M3M210 1 0 1 01-step 1-diagnosable systemM0 is G (arbitrary choice)M1 is BM2 is B (contradiction, 2 Bs)M0 is B (change label)M1 is G M2 is GM3 is G More efficient algorithms existNov. 201213Part V Malfunction

25、s: Architectural AnomaliesAn O(n2.5)-Step Diagnosis AlgorithmFrom the original testing graph, derive an L-graphThe L-graph has the same nodesThere is a link from node i to node j in the L-graph iff node i can be assumed to be malfunctioning when node j is known to be goodM0M1M3M210 1 0 1 0Theorem: T

26、he unique minimal vertex cover of the L-graph is the set of t or fewer malfunctioning unitsTesting graph and test resultsMiMjMj good Mi malfunctioningDefinition Vertex cover of a graph: A subset of vertices that contains at least one of the two endpoints of each edge M0M1M3M2Corresponding L-graphNov

27、. 201214Part V Malfunctions: Architectural Anomalies17.4 Sequential DiagnosabilityAn n-unit system is sequentially t-diagnosable if the diagnosis syndromes when there are t or fewer malfunctions are such that they always identify, unambiguously, at least one malfunctioning unitNecessary condition:n

28、2t + 1; i.e., a majority of units must be goodThis is useful because some systems that are not 1-step t-diagnosable are sequentially t-diagnosable, and they can be restored by removing the identified malfunctioning unit(s) and repeating the processThis system is sequentially 2-diagnosableIn one step

29、, it is only 1-diagnosableSequential diagnosability of directed rings:An n-node directed ring is sequentially t-diagnosable for any t that satisfies (t2 1)/4 + t + 2 n M0M1M4M2M3Nov. 201215Part V Malfunctions: Architectural AnomaliesSyndromes for M0 bad:0 0 0 0 10 0 0 1 00 0 0 1 10 0 1 0 10 0 1 1 10

30、 1 0 0 10 1 1 0 11 0 0 0 11 0 0 1 01 0 0 1 11 0 1 0 11 0 1 1 11 1 0 0 11 1 1 0 1Sequential 2-Diagnosability ExampleConsider this system, with the test outcomes shownThe system above is sequentially 2-diagnosable (we can correctly diagnose up to two malfunctioning units, but only one at a time)Malfun

31、ction syndromes (x means 0 or 1)MalfnD01D12 D23 D34D40 M0 x 0 0 0 1M1 1 x 0 0 0M2 0 1 x 0 0M3 0 0 1 x 0M4 0 0 0 1 xM0,M1 x x 0 0 1M0,M2 x 1 x 0 1M0,M3 x 0 1 x 1M0,M4 x 0 0 1 xM0M1M4M2M3D01D12D34D23D40Nov. 201216Part V Malfunctions: Architectural AnomaliesSequential Diagnosability: Analysis & Synthes

32、isAn n-node ring, with n 2t + 1, with added test links from 2t 2 other nodes to node 0 (besides node n 1 which already tests it) has the required propertySynthesis problem:Specify the test links (connection assignment) that makes an n-unit system 1-step t-diagnosable; use as few test links as possib

33、leAnalysis problems:1. Given a directed graph defining the test links, find the largest value of t for which the system is sequentially t-diagnosable2. Given a directed graph and its associated test outcomes, identify at least one malfunctioning unit (preferably more), assuming there are no more tha

34、n tThese problems have been extensively studiedNov. 201217Part V Malfunctions: Architectural Anomalies17.5 Diagnostic Accuracy and ResolutionAn n-unit system is 1-step t/s-diagnosable if a set of no more than t malfunctioning units can always be identified to within a set of s units, where s t An n-

35、unit system is sequentially t/r-diagnosable if from a set of up to t malfunctioning units, r can be identified in one step, where r t The special case of 1-step t/t-diagnosability has been widely studiedGiven the values of t and s, the problem of deciding whether a system is t/s-diagnosable is co-NP

36、-completeHowever, there exist efficient, polynomial-time, algorithms to find the largest integer t such that the system is t/t- or t/(t + 1)-diagnosableSafe diagnosability: Up to t malfunctions are correctly diagnosed and up to u detected (no danger of incorrect diagnosis for up to u malfunctions; r

37、eminiscent of combo error-correcting/detecting codes)Nov. 201218Part V Malfunctions: Architectural Anomalies17.6 Other Topics in DiagnosabilityDiagnosability results have been published for a variety of regular interconnection networksNov. 201219Part V Malfunctions: Architectural AnomaliesWhat Comes

38、 after Malfunction Diagnosis?When one or more malfunctioning units have been identified, the system must be reconfigured to allow it to isolate those units and to function without the unavailable resourcesIn a bus-based system, we isolate malfunctioning units, remove them, and plug in good modules (

39、standby spares or repaired ones)In a system having point-to-point connectivity, we reconfigure by rearranging the connections in order to switch in (shared) spares, using methods similar to those developed for defect circumventionReconfiguration may involve:1. Recovering state info from removed modu

40、les or back-up storage2. Reassigning tasks and reallocating data3. Restarting the computation from last checkpoint or from scratchNov. 201220Part V Malfunctions: Architectural Anomalies18 Malfunction ToleranceNov. 201221Part V Malfunctions: Architectural AnomaliesNov. 201222Part V Malfunctions: Arch

41、itectural Anomalies Robust Parallel Processing Resilient AlgorithmsNov. 201223Part V Malfunctions: Architectural AnomaliesSD18.1 System-Level ReconfigurationOvercoming the effect of link malfunctions requires the availability of multiple paths from each source to every possible destinationA system c

42、onsists of modular resources (processors, memory banks, disk storage, . . . ) and interconnectsRedundant resources can mitigate the effect of module malfunctionsA main challenge in reconfiguration is dealing with interconnectsAssumption: Module/interconnect malfunctions are promptly diagnosedIn grap

43、h-theoretic terms, we need “edge-disjoint” pathsExistence of k edge-disjoint paths between two nodes provides the means for tolerating k 1 link malfunctionsThis particular interconnection scheme (torus) is 4-connected and tolerates 3 link/node losses without becoming disconnectedNov. 201224Part V Ma

44、lfunctions: Architectural AnomaliesReconfiguration Switching, RevisitedQuestion: How do we know which cells/nodes must be bypassed?Must devise a scheme in which healthy nodes set the switchesNov. 201225Part V Malfunctions: Architectural AnomaliesReconfiguration via Programmable ConnectionsInterconne

45、ction switch with 4 ports (horizontal lines) and 3 channels(vertical lines)Each port can be connected to 2 of the 3 channels1122If each module port were connected to every channel, the maximum flexibility would result (leads to complex hardware & control, though)The challenge lies in using more limi

46、ted connections effectivelyNov. 201226Part V Malfunctions: Architectural Anomalies18.2 Isolating a Malfunctioning UnitIsolation is needed to prevent malfunctioning units from interfering with the operation of the remaining good unitsSlide to be completed with other examplesNotion of “bus guardian” f

47、rom the SIFT systemFrom moduleTo busPermission 1Permission 2Nov. 201227Part V Malfunctions: Architectural AnomaliesBus-Based ReconfigurationFailed units can be isolated from the busesNo single bus failure can isolate a module from the rest of the systemThe vertical channels may be viewed as buses an

48、d the heavy dots as controllable bus connections, making this method applicable to fault-tolerant multiprocessorsRead / Write dataWrite enableQFFWrite pathRead pathConnection FFBusIf we have extra buses, then faults in the bus connection logic can be tolerated by avoiding the particular busFor relia

49、bility analysis, lump the failure rate of reconfiguration logic with that of its associated busNov. 201228Part V Malfunctions: Architectural AnomaliesMalfunction-Stop ModulesMalfunction tolerance would be much easier if modules simply stopped functioning, rather than engage in arbitrary behaviorUnpr

50、edictable (Byzantine) malfunctions are notoriously hard to handleAssuming the availability of a reliable stable storage along with its controlling s-process and (approximately) synchronized clocks, a k-malfunction-stop module can be implemented from k + 1 unitsOperation of s-process to decide whethe

51、r the module has stopped:R := bag of received requests with appropriate timestampsif |R| = k+1 all requests identical and from different sources stopthen if request is a write then perform the write operation in stable storage else if request is a read, send value to all processeselse set variable s

52、top in stable storage to TRUENov. 201229Part V Malfunctions: Architectural Anomalies18.3 Data and State RecoveryLog-based recovery is performed via undo/redo: Undoing the effects of incomplete transactions Redoing transactions whose effects are not reflected in stable storageLogs maintain redundant

53、info (in stable storage, of course) for the sole purpose of recovery from malfunctionsThe write-ahead log (WAL) protocol requires that a transaction:Write an undo-log entry before it overwrites an object in stable storage with uncommitted updates Write both undo-log and redo-log entries before commi

54、tting an update to an object in stable storageNot safe to write logs after overwriting or committingResearch is being done at Microsoft and elsewhere to allow querying a database on its state at any desired time instant in the pastNov. 201230Part V Malfunctions: Architectural Anomalies18.4 Regular A

55、rrays of ModulesRegularity refers to the interconnection pattern, not physical layout (the latter may be the case for on-chip systems)Many of the methods of malfunction tolerance in regular arrays are similar to those used for circumventing defects to improve yieldNov. 201231Part V Malfunctions: Arc

56、hitectural AnomaliesRow/Column Bypassing in 2D ArraysSimilar mechanisms needed for northward links in columns and for the eastward and westward links in rowsRow iBypass row i0 10 1Question: What types of mechanisms do we need at the edges of this array to allow the row and column edge connections to

57、 be directed to the appropriate (nonbypassed) rows and columns?0 11 0 0 1100 10 10 10 1Bypass row i 1 Nov. 201232Part V Malfunctions: Architectural AnomaliesChoosing the Rows/Columns to BypassIn the adjacent diagram, can we choose up to 2 rows and 2 columns so that they contain all the bad nodes?0 1

58、 2 3 4 5 6 701234567Spare columnsSpare rowsConvert to graph problem (Kuo-Fuchs):Form bipartite graph, with nodes corresponding to bad rows and columnsFind a cover for the bipartite graph (set of nodes that touch every edge)Question: In a large array, with r spare rows and c spare columns, what is th

59、e smallest number of bad nodes that cannot be reconfigured around with row/column bypassing?0236Rows with faults0135Columns withfaults70317Rowswith badnodesColumnswith badnodesNov. 201233Part V Malfunctions: Architectural AnomaliesSwitch Modules in FPGAsInterconnection switch with 8 ports and four c

60、onnection choices for each port:0 No connection1 Straight through2 Right turn3 Left turn8 control bits (why?)1212334455887667Nov. 201234Part V Malfunctions: Architectural AnomaliesAn Array Reconfiguration SchemeThree-state switchNov. 201235Part V Malfunctions: Architectural AnomaliesOne-Track and Tw

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
  • 6. 下載文件中如有侵權或不適當內容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

最新文檔

評論

0/150

提交評論