




Chapter 3 Principal Component Analysis
3.1 Introductory Example
3.2 Theory
3.3 History of PCA
3.4 Practical Aspects
3.5 Sklearn PCA
3.6 Principal Component Regression
3.7 Subspace Methods for Dynamic Model Estimation in PAT Applications
3.1 Introductory Example
Table 3.1 Chemical Parameters Determined on the Wine Samples (data from http://www.models.life.ku.dk/Wine_GCMS_FTIR)
Hence, a data set is obtained which consists of 44 samples and 14 variables. The actual measurements can be arranged in a table or a matrix of size 44 × 14. A portion of this table is shown in Fig. 3.1.
Fig. 3.1 A subset of the wine data set
With 44 samples and 14 columns, it is quite complicated to get an overview of what kind of information is available in the data. A good starting point is to plot individual variables or samples. Three of the variables are shown in Fig. 3.2. It can be seen that total acid as well as methanol tends to be higher in samples from Australia and South Africa, whereas there are less pronounced regional differences in the ethanol content.
Fig. 3.2 Three variables coloured according to the region
Even though Fig. 3.2 may suggest that there is little relevant regional information in ethanol, it is dangerous to rely too much on univariate analysis. In univariate analysis, any co-variation with other variables is explicitly neglected, and this may lead to important features being ignored. For example, plotting ethanol versus glycerol (Fig. 3.3) shows an interesting correlation between the two. This is difficult to deduce from plots of the individual variables. If glycerol and ethanol were completely correlated, it would, in fact, be possible to simply use e.g. the average or the sum of the two as one new variable that could replace the two original ones. No information would be lost, as it would always be possible to go from e.g. the average back to the two original variables.
Fig. 3.3 A plot of ethanol versus glycerol
This concept of using suitable linear combinations of the original variables will turn out to be essential in PCA and is explained in a bit more detail, and in a slightly unusual way, here. The new variable, say the average of the two original ones, can be defined as a weighted average of all 14 variables; all the other variables simply get weight zero. These 14 weights are shown in Fig. 3.4.
Fig. 3.4 Defining the weights for a variable that includes only ethanol and glycerol information
Fig. 3.5 The concept of a unit vector
Fig. 3.6
As mentioned above, it is possible to go back and forth between the original two variables and the new variable. Multiplying the new variable with the weights provides an estimation of the original variables (Fig. 3.7).
Fig. 3.7 Using the new variable and the weights to estimate the original variables
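The back-and-forth just described can be sketched in a few lines of numpy. The column indices chosen here for ethanol and glycerol are purely illustrative, and a random matrix stands in for the wine data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(44, 14))   # random stand-in for the 44 x 14 wine matrix

w = np.zeros(14)
w[[6, 7]] = 1 / np.sqrt(2)    # non-zero weights only for 'ethanol' and 'glycerol'; norm(w) = 1 (Fig. 3.5)

t = X @ w                     # the new variable: one weighted value per sample
X_hat = np.outer(t, w)        # going back: t times the weights estimates the original variables
```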
This is a powerful property: it is possible to use weights to condense several variables into one and vice versa. To generalize this, notice that the current concept only works perfectly when the two variables are completely correlated. Think of an average grade in a school system. Many combinations of individual grades can lead to the same average grade, so it is not in general possible to go back and forth. To make an intelligent new variable, it is natural to ask for a new variable that will actually provide a nice model of the data; that is, a new variable which, when multiplied with the weights, will describe as much as possible of the whole matrix (Fig. 3.8). Such a variable will be an optimal representative of the whole data in the sense that no other weighted average simultaneously describes as much of the information in the matrix.
Fig. 3.8 Defining weights (w's) that will give a new variable which leads to a good model of the data
It turns out that PCA provides a solution to this problem. Principal component analysis provides the weights needed to get the new variable that best explains the variation in the whole data set in a certain sense. This new variable, including the defining weights, is called the first principal component.
With this pre-processing of the data, PCA can be performed. The technical details of how to do that will follow, but the first principal component is shown in Fig. 3.9. In the lower plot, the weights are shown. Instead of the quite sparse weights in Fig. 3.4, these weights are non-zero for all variables. This first component does not explain all the variation, but it does explain 25% of what is happening in the data. As there are 14 variables, it would be expected that if every variable showed variation independent of the others, then each original variable would explain 100%/14 ≈ 7% of the variation. Hence, this first component is wrapping up information which can be said to correspond to approximately 3-4 variables.
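A minimal scikit-learn sketch of how such a first component can be computed and inspected (a random stand-in is used for the 44 × 14 wine matrix, so the numbers will not reproduce Fig. 3.9):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(44, 14))   # stand-in for the 44 x 14 wine matrix

Xa = StandardScaler().fit_transform(X)                # autoscale: centre and scale to unit variance
pca = PCA(n_components=1).fit(Xa)

t1 = pca.transform(Xa)[:, 0]                          # scores of the first principal component
w1 = pca.components_[0]                               # the 14 weights defining the component
print(f"explained variation: {100 * pca.explained_variance_ratio_[0]:.1f}%")
```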
3.2 Theory
3.2.1 Taking Linear Combinations
Consider again a new variable t formed as a weighted sum of the J original variables, t = Xw, with the weights w1, w2, …, wJ collected in the vector w. The variation in t can be measured by its variance, var(t), defined in the usual way in statistics. The problem then translates to maximizing this variance by choosing optimal weights w1, w2, …, wJ. There is one caveat, however, since multiplying an optimal w with an arbitrarily large number will make the variance of t also arbitrarily large. Hence, to have a proper problem, the weights have to be normalized. This is done by requiring that their norm, i.e. the sum of squared values, is one (Fig. 3.5). Throughout we will use the symbol ‖·‖² to indicate the squared Frobenius norm (sum of squares). Thus, the formal problem becomes

$$\mathbf{w} = \arg\max_{\|\mathbf{w}\|=1} \operatorname{var}(\mathbf{t}) \quad (3.2.1)$$
which should be read as the problem of finding the w of length one that maximizes the variance of t (note that requiring ‖w‖ = 1 is the same as requiring ‖w‖² = 1). The function arg max is the mathematical notation for returning the argument w of the maximization. This can be made more explicit by using the fact that t = Xw:

$$\mathbf{w} = \arg\max_{\|\mathbf{w}\|=1} \operatorname{var}(\mathbf{X}\mathbf{w}) \quad (3.2.2)$$
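As a quick numerical illustration of equation (3.2.2): for a centred matrix, the weight vector given by the first right singular vector attains a variance that no randomly drawn unit-norm weight vector exceeds. A sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(44, 14))        # random stand-in for a preprocessed data matrix
X = X - X.mean(axis=0)               # centre the columns

w_pca = np.linalg.svd(X, full_matrices=False)[2][0]   # first right singular vector

W = rng.normal(size=(1000, 14))                       # many random weight vectors ...
W /= np.linalg.norm(W, axis=1, keepdims=True)         # ... normalized to unit length

var_pca = np.var(X @ w_pca, ddof=1)
var_rand = np.var(X @ W.T, axis=0, ddof=1)
print(var_pca >= var_rand.max())                      # True: no random unit w does better
```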
3.2.2 Explained Variation
The variance of t can now be calculated, but a more meaningful assessment of the summarizing capability of t is obtained by calculating how representative t is in terms of replacing X. This can be done by projecting the columns of X on t and calculating the residuals of that projection. This is performed by regressing all variables of X on t using the ordinary regression equation

$$\mathbf{X} = \mathbf{t}\mathbf{p}^{T} + \mathbf{E} \quad (3.2.3)$$
where p is the vector of regression coefficients and E is the matrix of residuals. Interestingly, p equals w, and the whole machinery of regression can be used to judge the quality of the summarizer t. Traditionally, this is done by calculating

$$100 \times \left(1 - \frac{\|\mathbf{E}\|^{2}}{\|\mathbf{X}\|^{2}}\right) \quad (3.2.4)$$

which is referred to as the percentage of explained variation of t.
In Fig. 3.10, it is illustrated how the explained variation is calculated, as also explained around equation (3.2.4).
Fig. 3.10 Exemplifying how explained variation is calculated using the data and the residuals
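A small numpy sketch of equations (3.2.3) and (3.2.4): regress the columns of X on a given score vector t and compute the percentage of explained variation from the residuals (X and t are assumed to be available, e.g. from the earlier sketches):

```python
import numpy as np

def explained_variation(X, t):
    """Percentage of variation in X explained by a single score vector t (eq. (3.2.4))."""
    p = X.T @ t / (t @ t)         # regression coefficients of every column of X on t
    E = X - np.outer(t, p)        # residual matrix of eq. (3.2.3)
    return 100 * (1 - np.sum(E**2) / np.sum(X**2))
```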
3.2.3 PCA as a Model
Equation (3.2.3) highlights an important interpretation of PCA: it can be seen as a modelling activity (Fig. 3.11). Rewriting equation (3.2.3) as

$$\hat{\mathbf{X}} = \mathbf{t}\mathbf{p}^{T}$$

shows that the (outer) product tp^T serves as a model of X (indicated with a hat). In this equation, vector t was a fixed regressor and vector p the regression coefficient to be found. It can be shown that actually both t and p can be established from such an equation by solving

$$\min_{\mathbf{t},\,\mathbf{p}} \|\mathbf{X} - \mathbf{t}\mathbf{p}^{T}\|^{2} \quad (3.2.5)$$
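One way to solve equation (3.2.5) for a single component is alternating least squares (essentially the NIPALS iteration used in chemometrics); a minimal sketch, assuming X has already been preprocessed:

```python
import numpy as np

def rank_one_pca(X, n_iter=100):
    """Alternating least squares for min ||X - t p^T||^2 over t and p (eq. (3.2.5))."""
    t = X[:, 0].copy()                    # crude starting guess for the score vector
    for _ in range(n_iter):
        p = X.T @ t / (t @ t)             # fix t, regress the columns of X on t
        p /= np.linalg.norm(p)            # keep the loading vector at unit norm
        t = X @ p                         # fix p, regress the rows of X on p
    return t, p
```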
3.2.4 Taking More Components
If the percentage of explained variation of equation (3.2.4) is too small, then the t,p combination is not a sufficiently good summarizer of the data. Equation (3.2.5) suggests an extension by writing

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E} \quad (3.2.6)$$

where T = [t1, t2, …, tR] (I × R) and P = [p1, p2, …, pR] (J × R) are now matrices containing, respectively, R score vectors and R loading vectors. If R is (much) smaller than J, then T and P still amount to a considerably more parsimonious description of the variation in X. To identify the solution, P can be taken such that P^T P = I and T can be taken such that T^T T is a diagonal matrix. This corresponds to the normalization of the loadings mentioned above. Each loading vector thus has norm one and is orthogonal to the other loading vectors (an orthogonal basis). The constraint on T implies that the score vectors are orthogonal to each other. This is the usual way to perform PCA in chemometrics. Due to the orthogonality in P, the R components have independent contributions to the overall explained variation, and the term "explained variation per component" can be used, similarly as in equation (3.2.4).
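In practice, T and P satisfying exactly these constraints can be obtained from the singular value decomposition; a minimal sketch:

```python
import numpy as np

def pca_model(X, R):
    """Rank-R PCA model X ~ T P^T from the SVD, with P^T P = I and T^T T diagonal."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:R].T                  # J x R loading matrix with orthonormal columns
    T = U[:, :R] * s[:R]          # I x R score matrix with orthogonal columns
    return T, P

# With X centred, np.allclose(P.T @ P, np.eye(R)) holds and T.T @ T equals diag(s[:R]**2).
```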
3.3 History of PCA
PCA has been (re-)invented several times. The earliest presentation was in terms of equation (3.2.6). This interpretation stresses the modelling properties of PCA and is very much rooted in regression thinking: variation explained by the principal components (Pearson's view). Later, in the thirties, the idea of taking linear combinations of variables was introduced and the variation of the principal components was stressed (equation (3.2.1); Hotelling's view).
This is a more multivariate statistical approach. Later, it was realized that the two approaches were very similar. Similar, but not the same. There is a fundamental conceptual difference between the two approaches, which is important to understand. In the Hotelling approach, the principal components are taken seriously in their specific direction. The first component explains the most variation, the second component the second most, etc. This is called the principal axis property: the principal components define new axes which should be taken seriously and have a meaning.
PCA finds these principal axes. In contrast, in the Pearson approach it is the subspace which is important, not the axes as such. The axes merely serve as a basis for this subspace. In the Hotelling approach, rotating the principal components destroys the interpretation of these components, whereas in the Pearson conceptual model rotations merely generate a different basis for the (optimal) subspace.
3.4 Practical Aspects
3.4.1 Preprocessing
Often a PCA performed on the raw data is not very meaningful. In regression analysis, often an intercept or offset is included, since it is the deviation from such an offset which represents the interesting variation. In terms of the prototypical example, the absolute level of the pH is not that interesting, but the variation in pH of the different Cabernets is relevant. For PCA to focus on this type of variation, it is necessary to mean-center the data. This is simply performed by subtracting from every variable in X the corresponding mean level.
Sometimes it is also necessary to think about the scales of the data. In the wine example, there were measurements of concentrations and of pH. These are not on the same scales (not even in the same units), and to make the variables more comparable, the variables are scaled by dividing them by the corresponding standard deviations. The combined process of centering and scaling in this way is often called autoscaling. For a more detailed account of centering and scaling, see the references.
Centering and scaling are the two most common types of preprocessing, and they almost always have to be decided upon. There are many other types of preprocessing methods available, though. The appropriate preprocessing typically depends on the nature of the data investigated.
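A minimal numpy sketch of the two operations; scikit-learn's StandardScaler does essentially the same, except that it divides by the standard deviation computed with ddof = 0:

```python
import numpy as np

def autoscale(X):
    """Centre every column of X and scale it to unit standard deviation (autoscaling)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std

# Mean-centering alone would simply be X - X.mean(axis=0).
```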
3.4.2 Choosing the Number of Components
A basic rationale in PCA is that the informative rank of the data is less than the number of original variables. Hence, it is possible to replace the original J variables with R (R ≪ J) components and gain a number of benefits. The influence of noise is minimized as the original variables are replaced with weighted averages, and interpretation and visualization are greatly aided by having a simpler (fewer variables) view of all the variation. Furthermore, the compression of the variation into fewer components can yield statistical benefits in further modelling with the data. Hence, there are many good reasons to use PCA. In order to use PCA, though, it is necessary to be able to decide on how many components to use. The answer to that problem depends a little bit on the purpose of the analysis, which is why the following three sections will provide different answers to that question.
Eigenvalues and Their Relation to PCA
Before the methods are described, it is necessary to explain the relation between PCA and eigenvalues. An eigenvector of a (square) matrix A is defined as a nonzero vector z with the following property:

$$\mathbf{A}\mathbf{z} = \lambda\mathbf{z}$$

where λ is called the eigenvalue. If matrix A is symmetric (semi-)positive definite, then the full eigenvalue decomposition of A becomes:

$$\mathbf{A} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T}$$

where the columns of Z hold the eigenvectors and the diagonal matrix Λ holds the eigenvalues. For PCA, the relevant matrix A is the cross-product of the preprocessed data, X^T X: its eigenvectors correspond to the loadings, and each eigenvalue is proportional to the variation explained by the corresponding component.
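This link can be checked numerically: the eigenvalues of the cross-product of the preprocessed data equal the per-component variances reported by a PCA implementation, up to the factor I − 1. A sketch with a random stand-in for the data:

```python
import numpy as np
from sklearn.decomposition import PCA

Xa = np.random.default_rng(0).normal(size=(44, 14))   # random stand-in for the preprocessed data
Xa -= Xa.mean(axis=0)

eig = np.sort(np.linalg.eigvalsh(Xa.T @ Xa))[::-1]    # eigenvalues of the cross-product, descending
pca = PCA().fit(Xa)
print(np.allclose(eig, pca.explained_variance_ * (len(Xa) - 1)))   # True
```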
Scree Test
The scree test was developed by R. B. Cattell in 1966. It is based on the assumption that relevant information is larger than random noise and that the magnitude of the variation of random noise seems to level off quite linearly with the number of components. Traditionally, the eigenvalues of the cross-product of the preprocessed data are plotted as a function of the number of components, and when only noise is modelled, it is assumed that the eigenvalues are small and decline gradually. In practice, it may be difficult to see this in the plot of eigenvalues due to the huge leading eigenvalues, and often the logarithm of the eigenvalues is plotted instead.
Both are shown in Fig. 3.12 for a simulated data set of rank four and with various amounts of noise added. It is seen that the eigenvalues level off after four components, but the details are difficult to see in the raw eigenvalues unless zoomed in. It is also seen that the distinction between 'real' and noise eigenvalues is difficult to discern at high noise levels.
For the wine data, it is not easy to firmly assess the number of components based on the scree test (Fig. 3.13). One may argue that seven or maybe nine components seem feasible, but this would imply incorporating components that explain very little variation. A more obvious choice would probably be to assess three components as suitable based on the scree plot and then be aware that further components may also contain useful information.
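A small helper for producing such a scree plot from a descending vector of eigenvalues (for example those computed in the previous sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

def scree_plot(eigvals):
    """Plot the eigenvalues and their logarithm against component number."""
    comps = np.arange(1, len(eigvals) + 1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(comps, eigvals, "o-")
    ax1.set(xlabel="component", ylabel="eigenvalue")
    ax2.semilogy(comps, eigvals, "o-")
    ax2.set(xlabel="component", ylabel="eigenvalue (log scale)")
    fig.tight_layout()
    plt.show()
```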
Eigenvalue below One
If the data is autoscaled, each variable has a variance of one. If all variables are orthogonal to each other, then every component in a PCA model would have an eigenvalue of one, since the preprocessed cross-product matrix (the correlation matrix) is the identity. It is then fair to say that if a component has an eigenvalue larger than one, it explains the variation of more than one variable. This has led to the rule of selecting all components with eigenvalues exceeding one (see the full line in Fig. 3.13).
It is sometimes also referred to as Kaiser's rule or the Kaiser-Guttman rule, and many additional arguments have been provided for this method. While it remains a very ad hoc approach, it is nevertheless a useful rule-of-thumb for getting an idea about the complexity of a data set. For the wine data (Fig. 3.13), the rule suggests that around four or five components are reasonable. Note that for very precise data, it is perfectly possible that even components with eigenvalues far below one can be real and significant. Real phenomena can be small in variation, yet accurate.
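A sketch of the rule, computed directly from the correlation matrix of the raw data (the correlation matrix is the cross-product of the autoscaled data divided by I − 1):

```python
import numpy as np

def kaiser_rule(X):
    """Number of components whose correlation-matrix eigenvalue exceeds one (Kaiser-Guttman)."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return int(np.sum(eigvals > 1))
```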
Broken Stick
A more realistic cutoff for the eigenvalues is obtained with the so-called broken stick rule. A line is added to the scree plot that shows the eigenvalues that would be expected for random data (the dotted line in Fig. 3.13). This line is calculated assuming that random data will follow a so-called broken stick distribution. The broken stick distribution hypothesizes how random variation will partition and uses the analogy of how the lengths of the pieces of a stick will be distributed when it is broken at random places into J pieces.
It can be shown that for autoscaled data, this theoretical distribution can be calculated as

$$b_{k} = \sum_{j=k}^{J} \frac{1}{j}, \qquad k = 1, \ldots, J$$

where b_k is the expected eigenvalue of the k-th component for random data (the b_k sum to J, the total variance of autoscaled data).
As seen in Fig. 3.13, the broken stick would seem to indicate that three to four components are reasonable.
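A minimal sketch of the broken stick line:

```python
import numpy as np

def broken_stick(J):
    """Expected eigenvalues of autoscaled random data under the broken stick model."""
    return np.array([np.sum(1.0 / np.arange(k, J + 1)) for k in range(1, J + 1)])

# Components whose observed eigenvalue exceeds broken_stick(J) are candidates to retain.
```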
High Fraction of Variation Explained
If the data measured have e.g. one percent noise, it is expected that PCA will describe all the variation down to around one percent. Hence, if a two-component model describes only 50% of the variation and is otherwise sound, it is probable that more components are needed. On the other hand, if the data are very noisy, coming e.g. from process monitoring or consumer preference mapping, and have an expected noise fraction of maybe 40%, then an otherwise sound model fitting 90% of the variation would imply overfitting, and fewer components should be used.
Having knowledge of the quality of the data can help in assessing the number of components. In Fig. 3.14, the variation explained is shown. The plot is equivalent to the eigenvalue plot, except that it is cumulative and on a different scale. For the wine data, the uncertainty is different for each variable and varies from approximately 5% up to 50% relative to the variation in the data. This is quite variable and makes it difficult to estimate how much variation should be explained, but most certainly explaining less than 50% would mean that not all the systematic variation is captured, and explaining more than, say, 90-95% of the variation would be meaningless and just modelling of noise. Therefore, based on variation explained, it is likely that there are more than two but fewer than, say, seven components.
Fig. 3.14 Cumulated percentage variation explained
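Such a cumulative curve is straightforward to compute from a fitted PCA model; a sketch, assuming the data have already been preprocessed:

```python
import numpy as np
from sklearn.decomposition import PCA

def cumulative_explained(X):
    """Cumulative percentage of variation explained per component (X assumed preprocessed)."""
    return 100 * np.cumsum(PCA().fit(X).explained_variance_ratio_)
```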
Valid Interpretation
As indicated by the results, the different rules above seldom agree. This is not as big a problem as it might seem. Quite often, the only thing needed is to know the neighbourhood of how many components are needed. Using the above methods informally and critically will often provide that answer. Furthermore, one of the most important strategies for selecting the number of components is to supplement such methods with interpretations of the model. For the current data, it may be questioned whether e.g. three or four components should be used.
In Fig. 3.15, it is shown that there is distinct structure in the scores of component four. For example, the wines from Argentina all have positive scores. Such a structure or grouping will not happen accidentally unless unfortunate confounding has occurred. Hence, as long as the Argentinian wines were not measured separately on a different system or something similar, the mere fact that component four (either scores or loadings) shows distinct behaviour is an argument in favour of including that component. This holds regardless of what other measures might indicate.
Fig. 3.15 Left: score number four of the wine data; Right: score two versus score four
The loadings may also provide similar validation by highlighting correlations expected from a priori knowledge. In the case of continuous data such as time series or spectral data, it is also instructive to look at the shape of the residuals. An example is provided in Fig. 3.16. A data set consisting of visual and near-infrared spectra of 40 beer samples is shown in grey. After one component, the residuals are still fairly big and quite structured from a spectral point of view. After six components, there is very little information left, indicating that most of the systematic variation has been modelled. Note from the title of the plot that 95% of the variation explained is quite low for this data set, whereas that would be critically high for the wine data as discussed above.
Cross-validation
The idea in cross-validation is to leave out part of the data and then estimate the left-out part. If this is done wisely, the prediction of the left-out part is independent of the actual left-out part. Hence, overfitting leading to too optimistic models is not possible. Conceptually, a single element (typically more than one element) of the data matrix is left out. A PCA model handling missing data can then be fitted to the data set, and based on this PCA model, an estimate of the left-out element can be obtained. Hence, a set of residuals is obtained where there are no problems with overfitting.
Taking the sum of squares of these yields the so-called Predicted REsidual Sum of Squares (PRESS)

$$\mathrm{PRESS}(r) = \sum_{i=1}^{I} \sum_{j=1}^{J} x_{ij}(r)^{2}$$

where x_ij(r) is the residual of sample i and variable j after r components. From the PRESS, the Root Mean Squared Error of Cross-Validation (RMSECV) is obtained as

$$\mathrm{RMSECV}(r) = \sqrt{\frac{\mathrm{PRESS}(r)}{IJ}}$$
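Many cross-validation schemes exist for PCA. The sketch below implements one element-wise variant in which every left-out entry is treated as missing and imputed with an EM-style iteration; it is slow and purely illustrative, and not necessarily the scheme behind Fig. 3.17:

```python
import numpy as np

def rank_r_reconstruct(X, r):
    """Rank-r PCA reconstruction of a preprocessed matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def rmsecv(X, r, n_iter=30):
    """Element-wise cross-validated RMSECV for an r-component PCA model (illustrative only)."""
    I, J = X.shape
    press = 0.0
    for i in range(I):
        for j in range(J):
            Xm = X.copy()
            Xm[i, j] = 0.0                          # initial guess for the left-out element
            for _ in range(n_iter):                 # EM-style imputation of the 'missing' entry
                Xm[i, j] = rank_r_reconstruct(Xm, r)[i, j]
            press += (X[i, j] - Xm[i, j]) ** 2      # cross-validated squared residual
    return np.sqrt(press / (I * J))
```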
In Fig. 3.17, the results of cross-validation are shown. As shown in Fig. 3.15, the fit to the data will trivially improve with the number of components, but the RMSECV gets worse after four components, indicating that no more than four components should be used. In fact, the improvement going from three to four components is so small that three is likely a more feasible choice from that perspective.
Fig. 3.17 A plot of RMSECV for PCA models with different numbers of components
3.4.3 When Using PCA for Other Purposes
It is quite common to use PCA as a preprocessing step in order to get a nicely compact representation of a data set. Instead of the original many (J) variables, the data set can be expressed in terms of the few (R) principal components. These components can then in turn be used for many different purposes (Fig. 3.18).
Fig. 3.18 Using the scores of PCA for further modelling
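A sketch of this pattern with scikit-learn, chaining autoscaling, PCA compression and a subsequent model; clustering is used here only as a placeholder for "further modelling", and a random matrix stands in for the data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(44, 14))    # random stand-in for the raw data

# Autoscale, compress to a few PCA scores, then hand the scores to a further model.
model = make_pipeline(StandardScaler(), PCA(n_components=4), KMeans(n_clusters=3, n_init=10))
labels = model.fit_predict(X)
```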
3.4.4 Detecting Outliers
Outliers are samples that are somehow disturbing or unusual. Often, outliers are downright wrong samples. For example, in determining the height of persons, five samples are obtained ([1.78, 1.92, 1.83, 167, 1.87]). The values are in meters, but accidentally the fourth sample has been measured in centimeters. If the sample is not either corrected or removed, the subsequent analysis is going to be detrimentally disturbed by this outlier. Outlier detection is about identifying and handling such samples. An alternative or supplement to outlier handling is the use of robust methods, which will, however, not be treated in detail here.
This section is mainly going to focus on identifying outliers, but understanding the outliers is really the critical aspect. Often outliers are mistakenly taken to mean wrong samples, and nothing could be more wrong! Outliers can be absolutely right, but e.g. just badly represented. In such a case, the solution is not to remove the outlier but to supplement the data with more of the same type. The bottom line is that it is imperative to understand why a sample is an outlier. This section will give the tools to identify the samples and see in what way they differ. It is then up to the data analyst to decide how the outliers should be handled.
Data Inspection
An often forgotten, but important, first step in data analysis is to inspect the raw data. Depending on the type of data, many kinds of plots can be relevant, as already mentioned. For spectral data, line plots may be nice. For discrete data, histograms, normal probability plots, or scatter plots could be feasible. In short, any kind of visualization that will help elucidate aspects of the data can be useful. Several such plots have already been shown throughout this chapter. It is also important, and frequently forgotten, to look at the preprocessed data. While the raw data are important, they actually never enter the modelling. It is the preprocessed data that will be modelled, and there can be big differences in the interpretations of the raw and the preprocessed data.
Score Plots
While raw and preprocessed data should always be investigated, some types of outliers will be difficult to identify from there. The PCA model itself can provide further information. There are two places where outlying behaviour will show up most evidently: in the scores and in the residuals. It is appropriate to go through all selected scores and look for samples that have strange behaviour. Often, it is only components one and two that are investigated, but it is necessary to look at all the relevant components.
As for the data, it is a good idea to plot the scores in many ways, using different combinations of scatter plots, line plots, histograms, etc. Also, it is often useful to go through the same plot but coloured by all the various types of additional information available. This could be any kind of information such as temperature, storage time of the sample, operator, or any other kind of either qualitative or quantitative information available. For the wine data model, it is seen in Fig. 3.19 that one sample is behaving differently from the others in the score plot of component one versus two (upper left corner).
Fig. 3.19 Score plot of a four-component PCA model of the wine data
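A small helper for such score plots (the optional colour argument is meant to carry the additional information, e.g. region codes):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def score_plot(Xa, a=1, b=2, colour=None):
    """Scatter plot of score a versus score b from a PCA model of the preprocessed data Xa."""
    scores = PCA(n_components=max(a, b)).fit_transform(Xa)
    plt.scatter(scores[:, a - 1], scores[:, b - 1], c=colour)
    plt.xlabel(f"score {a}")
    plt.ylabel(f"score {b}")
    plt.show()
```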
Looking at the loading plot (Fig. 3.20) indicates that the sample must be (relatively) high in volatile and lactic acid and low in malic acid. This should then be verified in the raw data. After removing this sample, the model is rebuilt and re-evaluated. No more extreme samples are observed in the scores.
Fig. 3.20 Scatter plot of loading 1 versus loading 2
Hotelling's T²
Looking at scores is helpful, but it is only possible to look at a few components at a time. If the model has many components, it can be laborious, and the risk of accidentally missing something increases. In addition, in some cases, outlier detection has to be automated in order to function e.g. in an on-line process monitoring system. There are ways to do so, and a common way is to use the so-called Hotelling's T², which was introduced in 1931. This diagnostic can be seen as an extension of the t-test and can also be applied to the scores of a PCA model. It is calculated as

$$T_{i}^{2} = \mathbf{t}_{i}^{T} \left(\frac{\mathbf{T}^{T}\mathbf{T}}{I-1}\right)^{-1} \mathbf{t}_{i}$$
where T is the matrix of scores (I × R) from all the calibration samples and t_i is an R × 1 vector holding the R scores of the ith sample. Assuming that the scores are normally distributed, confidence limits for T² can be computed.
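A minimal sketch of this computation from an I × R matrix of calibration scores (for example the scores of a fitted PCA model):

```python
import numpy as np

def hotellings_t2(scores):
    """Hotelling's T2 for each calibration sample, computed from the I x R score matrix."""
    I = scores.shape[0]
    cov = scores.T @ scores / (I - 1)                       # covariance of the scores
    return np.einsum("ir,rs,is->i", scores, np.linalg.inv(cov), scores)
```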