
Chapter 3 Principal Component Analysis

3.1 Introductory Example
3.2 Theory
3.3 History of PCA
3.4 Practical Aspects
3.5 Sklearn PCA
3.6 Principal Component Regression
3.7 Subspace Methods for Dynamic Model Estimation in PAT Applications

3.1 Introductory Example

Table 3.1 Chemical Parameters Determined on the Wine Samples (data from http://www.models.life.ku.dk/Wine_GCMS_FTIR)

Hence, a dataset is obtained which consists of 44 samples and 14 variables. The actual measurements can be arranged in a table or a matrix of size 44×14. A portion of this table is shown in Fig. 3.1.

Fig. 3.1 A subset of the wine dataset

With 44 samples and 14 columns, it is quite complicated to get an overview of what kind of information is available in the data. A good starting point is to plot individual variables or samples. Three of the variables are shown in Fig. 3.2. It can be seen that total acid as well as methanol tends to be higher in samples from Australia and South Africa, whereas there are less pronounced regional differences in the ethanol content.

Fig. 3.2 Three variables coloured according to the region

Even though Fig. 3.2 may suggest that there is little relevant regional information in ethanol, it is dangerous to rely too much on univariate analysis. In univariate analysis, any co-variation with other variables is explicitly neglected, and this may lead to important features being ignored. For example, plotting ethanol versus glycerol (Fig. 3.3) shows an interesting correlation between the two. This is difficult to deduce from plots of the individual variables. If glycerol and ethanol were completely correlated, it would, in fact, be possible to simply use e.g. the average or the sum of the two as one new variable that could replace the two original ones. No information would be lost, as it would always be possible to go from e.g. the average back to the two original variables.

Fig. 3.3 A plot of ethanol versus glycerol

This concept of using suitable linear combinations of the original variables will turn out to be essential in PCA and is explained in a bit more detail, and in a slightly unusual way, here. The new variable, say, the average of the two original ones, can be defined as a weighted average of all 14 variables; the other variables simply have weight zero. These 14 weights are shown in Fig. 3.4.

Fig. 3.4 Defining the weights for a variable that includes only ethanol and glycerol information

Fig. 3.5 The concept of a unit vector

Fig. 3.6

As mentioned above, it is possible to go back and forth between the original two variables and the new variable. Multiplying the new variable with the weights provides an estimation of the original variables (Fig. 3.7).

Fig. 3.7 Using the new variable and the weights to estimate the original variables
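As a small numerical sketch of this back-and-forth (simulated numbers, not the actual wine measurements; the column positions used for ethanol and glycerol are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the wine matrix: 44 samples x 14 variables (simulated data).
X = rng.normal(size=(44, 14))

# Weights in the spirit of Fig. 3.4: zero everywhere except at the two
# (hypothetical) column positions of ethanol and glycerol.
w = np.zeros(14)
w[[3, 7]] = 0.5              # the new variable is the average of the two columns

t = X @ w                    # one value per sample: the new variable (length 44)
X_hat = np.outer(t, w)       # Fig. 3.7: the new variable times the weights
                             # estimates the two original variables
```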

This is a powerful property: that it is possible to use weights to condense several variables into one and vice versa. To generalize this, notice that the current concept only works perfectly when the two variables are completely correlated. Think of an average grade in a school system. Many combinations of particular grades can lead to the same average grade, so it is not in general possible to go back and forth. To make an intelligent new variable, it is natural to ask for a new variable that will actually provide a nice model of the data. That is, a new variable which, when multiplied with the weights, will describe as much as possible of the whole matrix (Fig. 3.8). Such a variable will be an optimal representative of the whole data in the sense that no other weighted average simultaneously describes as much of the information in the matrix.

Fig. 3.8 Defining weights (w's) that will give a new variable which leads to a good model of the data

It turns out that PCA provides a solution to this problem. Principal component analysis provides the weights needed to get the new variable that best explains the variation in the whole dataset in a certain sense. This new variable, including the defining weights, is called the first principal component.

With this pre-processing of the data, PCA can be performed. The technical details of how to do that will follow, but the first principal component is shown in Fig. 3.9. In the lower plot, the weights are shown. Instead of the quite sparse weights in Fig. 3.4, these weights are non-zero for all variables. This first component does not explain all the variation, but it does explain 25% of what is happening in the data. As there are 14 variables, it would be expected that if every variable showed variation independent of the others, then each original variable would explain 100%/14 ≈ 7% of the variation. Hence, this first component is wrapping up information which can be said to correspond to approximately 3-4 variables.

3.2 Theory

3.2.1 Taking Linear Combinations

As in the introductory example, a new variable t is built as a linear combination of the J original variables, t = Xw, where w = [w1, w2, …, wJ]ᵀ holds the weights. The variation in t can be measured by its variance, var(t), defined in the usual way in statistics. The problem then translates to maximizing this variance by choosing optimal weights w1, w2, …, wJ. There is one caveat, however, since multiplying an optimal w with an arbitrarily large number will make the variance of t also arbitrarily large. Hence, to have a proper problem, the weights have to be normalized. This is done by requiring that their norm, i.e. the sum-of-squared values, is one (Fig. 3.5). Throughout we will use the symbol ‖·‖² to indicate the squared Frobenius norm (sum-of-squares). Thus, the formal problem becomes

$$\mathbf{w} = \underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\; \operatorname{var}(\mathbf{t}) \tag{3.2.1}$$

which should be read as the problem of finding the w of length one that maximizes the variance of t (note that ‖w‖ = 1 is the same as requiring ‖w‖² = 1). The function argmax is the mathematical notation for returning the argument w of the maximization function. This can be made more explicit by using the fact that t = Xw:

$$\mathbf{w} = \underset{\|\mathbf{w}\|=1}{\operatorname{argmax}}\; \operatorname{var}(\mathbf{X}\mathbf{w}) \tag{3.2.2}$$
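The maximization in equation (3.2.2) can be checked numerically. Below is a sketch on simulated data (numpy assumed); it uses the fact, made explicit later via the eigenvalue decomposition, that for mean-centered data the optimal w is the first right singular vector of X, which random unit-norm weight vectors never beat:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(44, 14))
Xc = X - X.mean(axis=0)          # mean-centering, see Section 3.4.1

def var_t(w):
    w = w / np.linalg.norm(w)    # enforce ||w|| = 1
    return np.var(Xc @ w)

# The solution of equation (3.2.2): the first right singular vector of Xc.
w_opt = np.linalg.svd(Xc, full_matrices=False)[2][0]

print(var_t(w_opt))                                           # maximal variance
print(max(var_t(rng.normal(size=14)) for _ in range(1000)))   # always smaller
```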

3.2.2 Explained Variation

The variance of t can now be calculated, but a more meaningful assessment of the summarizing capability of t is obtained by calculating how representative t is in terms of replacing X. This can be done by projecting the columns of X on t and calculating the residuals of that projection. This is performed by regressing all variables of X on t using the ordinary regression equation

$$\mathbf{X} = \mathbf{t}\mathbf{p}^{T} + \mathbf{E} \tag{3.2.3}$$

where p is the vector of regression coefficients and E is the matrix of residuals. Interestingly, p equals w, and the whole machinery of regression can be used to judge the quality of the summarizer t. Traditionally, this is done by calculating

$$\text{Explained variation} = 100 \times \left(1 - \frac{\|\mathbf{E}\|^{2}}{\|\mathbf{X}\|^{2}}\right)\% \tag{3.2.4}$$

which is referred to as the percentage of explained variation of t.

In Fig. 3.10, it is illustrated how the explained variation is calculated, as also explained around equation (3.2.4).

Fig. 3.10 Exemplifying how explained variation is calculated using the data and the residuals
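A sketch of equations (3.2.3) and (3.2.4) on simulated data (numpy assumed; not the wine data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(44, 14))
Xc = X - X.mean(axis=0)

w = np.linalg.svd(Xc, full_matrices=False)[2][0]  # weights of the first component
t = Xc @ w                                        # scores
p = w                                             # p equals w, as noted above
E = Xc - np.outer(t, p)                           # residual matrix of equation (3.2.3)

explained = 100 * (1 - np.sum(E**2) / np.sum(Xc**2))  # equation (3.2.4)
print(f"explained variation: {explained:.1f}%")
```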

3.2.3 PCA as a Model

Equation (3.2.3) highlights an important interpretation of PCA: it can be seen as a modelling activity (Fig. 3.11). Rewriting equation (3.2.3) as

$$\hat{\mathbf{X}} = \mathbf{t}\mathbf{p}^{T} \tag{3.2.5}$$

shows that the (outer-) product tpᵀ serves as a model of X (indicated with a hat). In this equation, vector t was a fixed regressor and vector p the regression coefficient to be found. It can be shown that actually both t and p can be established from such an equation by solving

$$\min_{\mathbf{t},\,\mathbf{p}} \|\mathbf{X} - \mathbf{t}\mathbf{p}^{T}\|^{2}$$

3.2.4 Taking More Components

If the percentage of explained variation of equation (3.2.4) is too small, then the t, p combination is not a sufficiently good summarizer of the data. Equation (3.2.5) suggests an extension by writing

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E} \tag{3.2.6}$$

where T = [t1, t2, …, tR] (I×R) and P = [p1, p2, …, pR] (J×R) are now matrices containing, respectively, R score vectors and R loading vectors. If R is (much) smaller than J, then T and P still amount to a considerably more parsimonious description of the variation in X. To identify the solution, P can be taken such that PᵀP = I and T can be taken such that TᵀT is a diagonal matrix. This corresponds to the normalization of the loadings mentioned above. Each loading vector thus has norm one and is orthogonal to the other loading vectors (an orthogonal basis). The constraint on T implies that the score vectors are orthogonal to each other. This is the usual way to perform PCA in chemometrics.

Due to the orthogonality in P, the R components have independent contributions to the overall explained variation, and the term "explained variation per component" can be used, similarly as in equation (3.2.4).
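A numerical sketch of the R-component model (3.2.6) and the two orthogonality constraints, computed via the SVD on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(44, 14))
Xc = X - X.mean(axis=0)

R = 3
U, s, Vh = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :R] * s[:R]          # scores (I x R): T'T is diagonal
P = Vh[:R].T                  # loadings (J x R): P'P = I

print(np.allclose(P.T @ P, np.eye(R)))       # orthonormal loadings
print(np.round(T.T @ T, 1))                  # diagonal matrix (squared singular values)

# Explained variation per component, as in equation (3.2.4):
print(100 * s[:R]**2 / np.sum(Xc**2))
```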

3.3 History of PCA

PCA has been (re-)invented several times. The earliest presentation was in terms of equation (3.2.6). This interpretation stresses the modelling properties of PCA and is very much rooted in regression-thinking: variation explained by the principal components (Pearson's view). Later, in the thirties, the idea of taking linear combinations of variables was introduced and the variance of the principal components was stressed (equation (3.2.1); Hotelling's view).

This is a more multivariate statistical approach. Later, it was realized that the two approaches were very similar. Similar, but not the same. There is a fundamental conceptual difference between the two approaches, which is important to understand. In the Hotelling approach, the principal components are taken seriously in their specific direction. The first component explains the most variation, the second component the second most, etc. This is called the principal axis property: the principal components define new axes which should be taken seriously and have a meaning.

PCA finds these principal axes. In contrast, in the Pearson approach it is the subspace which is important, not the axes as such. The axes merely serve as a basis for this subspace. In the Hotelling approach, rotating the principal components destroys the interpretation of these components, whereas in the Pearson conceptual model rotations merely generate a different basis for the (optimal) subspace.

3.4 Practical Aspects

3.4.1 Preprocessing

Often a PCA performed on the raw data is not very meaningful. In regression analysis, an intercept or offset is often included, since it is the deviation from such an offset which represents the interesting variation. In terms of the prototypical example, the absolute level of the pH is not that interesting, but the variation in pH across the different Cabernets is relevant. For PCA to focus on this type of variation it is necessary to mean-center the data. This is simply performed by subtracting from every variable in X the corresponding mean level.

Sometimes it is also necessary to think about the scales of the data. In the wine example, there were measurements of concentrations and of pH. These are not on the same scales (not even in the same units) and, to make the variables more comparable, the variables are scaled by dividing them by the corresponding standard deviations. The combined process of centering and scaling in this way is often called autoscaling. For a more detailed account of centering and scaling, see the references.
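A minimal autoscaling sketch (numpy; sklearn's StandardScaler performs the equivalent centering and scaling, though it divides by the ddof=0 standard deviation):

```python
import numpy as np

def autoscale(X):
    """Mean-center every column, then divide it by its standard deviation."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)

# Equivalent with scikit-learn (uses the ddof=0 standard deviation):
# from sklearn.preprocessing import StandardScaler
# X_auto = StandardScaler().fit_transform(X)
```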

Centering and scaling are the two most common types of preprocessing, and a decision on them normally always has to be made. There are many other types of preprocessing methods available, though. The appropriate preprocessing typically depends on the nature of the data investigated.

3.4.2 Choosing the Number of Components

A basic rationale in PCA is that the informative rank of the data is less than the number of original variables. Hence, it is possible to replace the original J variables with R (R ≪ J) components and gain a number of benefits. The influence of noise is minimized as the original variables are replaced with weighted averages, and the interpretation and visualization are greatly aided by having a simpler (fewer variables) view of all the variation. Furthermore, the compression of the variation into fewer components can yield statistical benefits in further modelling with the data. Hence, there are many good reasons to use PCA. In order to use PCA, though, it is necessary to be able to decide on how many components to use. The answer to that problem depends a little bit on the purpose of the analysis, which is why the following sections will provide different answers to that question.

Eigenvalues and Their Relation to PCA

Before the methods are described, it is necessary to explain the relation between PCA and eigenvalues. An eigenvector of a (square) matrix A is defined as a nonzero vector z with the following property:

$$\mathbf{A}\mathbf{z} = \lambda\mathbf{z}$$

where λ is called the corresponding eigenvalue. If matrix A is symmetric (semi-)positive definite, then the full eigenvalue decomposition of A becomes

$$\mathbf{A} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^{T}$$

where the columns of Z hold the eigenvectors (ZᵀZ = I) and Λ is a diagonal matrix holding the non-negative eigenvalues.
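The link to PCA can be verified numerically (a sketch with simulated data): taking A to be the cross-product of the preprocessed data, its eigenvectors are the PCA loadings and its eigenvalues equal the squared singular values of the data matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(44, 14))
Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaled data

A = Xa.T @ Xa                            # symmetric, positive semi-definite
eigvals, Z = np.linalg.eigh(A)           # A = Z diag(eigvals) Z'
eigvals, Z = eigvals[::-1], Z[:, ::-1]   # reorder from largest to smallest

s = np.linalg.svd(Xa, compute_uv=False)  # singular values of the data matrix
print(np.allclose(eigvals, s**2))        # eigenvalues = squared singular values
```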

Scree Test

The scree test was developed by R. B. Cattell in 1966. It is based on the assumption that relevant information is larger than random noise, and that the magnitude of the variation of random noise seems to level off quite linearly with the number of components. Traditionally, the eigenvalues of the cross-product of the preprocessed data are plotted as a function of the number of components; when only noise is modelled, it is assumed that the eigenvalues are small and decline gradually. In practice, it may be difficult to see this in the plot of eigenvalues because the first, very large eigenvalues dominate the scale, and often the logarithm of the eigenvalues is plotted instead.

Both are shown in Fig. 3.12 for a simulated dataset of rank four with various amounts of noise added. It is seen that the eigenvalues level off after four components, but the details are difficult to see in the raw eigenvalues unless zoomed in. It is also seen that the distinction between 'real' and noise eigenvalues is difficult to discern at high noise levels.

For the wine data, it is not easy to firmly assess the number of components based on the scree test (Fig. 3.13). One may argue that seven or maybe nine components seem feasible, but this would imply incorporating components that explain very little variation. A more obvious choice would probably be to assess three components as suitable based on the scree plot, and then be aware that further components may also contain useful information.
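A scree plot of this kind is straightforward to produce; a sketch with simulated data (matplotlib assumed), showing the eigenvalues on both the raw and the log scale:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.normal(size=(44, 14))
Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals = np.sort(np.linalg.eigvalsh(Xa.T @ Xa))[::-1]
comps = np.arange(1, len(eigvals) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(comps, eigvals, "o-")
ax1.set(xlabel="component", ylabel="eigenvalue")
ax2.semilogy(comps, eigvals, "o-")   # the log scale makes the level-off easier to see
ax2.set(xlabel="component", ylabel="eigenvalue (log scale)")
plt.tight_layout()
plt.show()
```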

Eigenvalue below One

If the data is autoscaled, each variable has a variance of one. If all variables were orthogonal to each other, then every component in a PCA model would have an eigenvalue of one, since the preprocessed cross-product matrix (the correlation matrix) is the identity. It is then fair to say that if a component has an eigenvalue larger than one, it explains the variation of more than one variable. This has led to the rule of selecting all components with eigenvalues exceeding one (see the full line in Fig. 3.13).

It is sometimes also referred to as Kaiser's rule or the Kaiser-Guttman rule, and many additional arguments have been provided for this method. While it remains a very ad hoc approach, it is nevertheless a useful rule-of-thumb to get an idea about the complexity of a dataset. For the wine data (Fig. 3.13), the rule suggests that around four or five components are reasonable. Note that for very precise data, it is perfectly possible that even components with eigenvalues far below one can be real and significant. Real phenomena can be small in variation, yet accurate.
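A sketch of the eigenvalue-one rule on autoscaled, simulated data (the eigenvalues of the correlation matrix sum to J, so one is the "fair share" of a single variable):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(44, 14))
Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix of the raw data = cross-product of autoscaled data over I - 1.
corr = Xa.T @ Xa / (X.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

n_components = int(np.sum(eigvals > 1))   # Kaiser-Guttman suggestion
print(n_components)
```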

Broken Stick

A more realistic cutoff for the eigenvalues is obtained with the so-called broken stick rule. A line is added to the scree plot that shows the eigenvalues that would be expected for random data (the dotted line in Fig. 3.13). This line is calculated assuming that random data will follow a so-called broken stick distribution. The broken stick distribution hypothesizes how random variation will partition, and uses the analogy of how the lengths of the pieces of a stick will be distributed when the stick is broken at random places into J pieces.

It can be shown that for auto-scaled data, this theoretical distribution can be calculated as

$$b_r = \frac{100}{J}\sum_{j=r}^{J}\frac{1}{j}\,\%$$

where $b_r$ is the expected percentage of variation for component r.

As seen in Fig. 3.13, the broken stick would seem to indicate that three to four components are reasonable.
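Assuming the broken-stick expression given above, the cutoff line is easy to compute and compare with the percentage of variation explained per component:

```python
import numpy as np

J = 14
r = np.arange(1, J + 1)
# b_r = (100 / J) * sum_{j=r}^{J} 1/j, in percent, for r = 1..J
broken_stick = np.array([100.0 / J * np.sum(1.0 / np.arange(k, J + 1)) for k in r])
print(broken_stick.round(1))   # decreasing thresholds; keep components above them
```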

High Fraction of Variation Explained

If the data measured has e.g. one percent noise, it is expected that PCA will describe all the variation down to around one percent. Hence, if a two-component model describes only 50% of the variation and is otherwise sound, it is probable that more components are needed. On the other hand, if the data are very noisy, coming e.g. from process monitoring or consumer preference mapping, and have an expected noise fraction of maybe 40%, then an otherwise sound model fitting 90% of the variation would imply overfitting, and fewer components should be used.

Having knowledge of the quality of the data can help in assessing the number of components. In Fig. 3.14, the variation explained is shown. The plot is equivalent to the eigenvalue plot except that it is cumulative and on a different scale. For the wine data, the uncertainty is different for each variable and varies from approximately 5% up to 50% relative to the variation in the data. This is quite variable and makes it difficult to estimate how much variation should be explained, but most certainly explaining less than 50% would mean that not all real variation is captured, and explaining more than, say, 90%-95% of the variation would be meaningless and just modelling of noise. Therefore, based on variation explained, it is likely that there are more than two but fewer than, say, seven components.

Fig. 3.14 Cumulated percentage variation explained

Valid Interpretation

As indicated by the results, the different rules above seldom agree. This is not as big a problem as it might seem. Quite often, the only thing needed is to know the neighbourhood of how many components are needed. Using the above methods "informally" and critically will often provide that answer. Furthermore, one of the most important strategies for selecting the number of components is to supplement such methods with interpretations of the model. For the current data, it may be questioned whether e.g. three or four components should be used.

In Fig. 3.15, it is shown that there is distinct structure in the scores of component four. For example, the wines from Argentina all have positive scores. Such a structure or grouping will not happen accidentally unless unfortunate confounding has occurred. Hence, as long as the Argentinian wines were not measured separately on a different system or something similar, the mere fact that component four (either scores or loadings) shows distinct behaviour is an argument in favour of including that component. This holds regardless of what other measures might indicate.

Fig. 3.15 Left: score number four of wine data; Right: score two versus score four

The loadings may also provide similar validation by highlighting correlations expected from a priori knowledge. In the case of continuous data such as time series or spectral data, it is also instructive to look at the shape of the residuals. An example is provided in Fig. 3.16. A dataset consisting of visual and near-infrared spectra of 40 beer samples is shown in grey. After one component, the residuals are still fairly big and quite structured from a spectral point of view. After six components, there is very little information left, indicating that most of the systematic variation has been modelled. Note from the title of the plot that 95% of the variation explained is quite low for this dataset, whereas that would be critically high for the wine data, as discussed above.

Cross-validation

The idea in cross-validation is to leave out part of the data and then estimate the left-out part. If this is done wisely, the prediction of the left-out part is independent of the actual left-out part. Hence, overfitting leading to too optimistic models is not possible. Conceptually, a single element (typically more than one element) of the data matrix is left out. A PCA model handling missing data can then be fitted to the dataset and, based on this PCA model, an estimate of the left-out element can be obtained. Hence, a set of residuals is obtained where there are no problems with overfitting.

Taking the sum of squares of these yields the so-called Predicted REsidual Sum of Squares (PRESS)

$$\mathrm{PRESS}(r) = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}(r)^{2}$$

where xij(r) is the residual of sample i and variable j after r components. From the PRESS, the Root Mean Squared Error of Cross-Validation (RMSECV) is obtained as

$$\mathrm{RMSECV}(r) = \sqrt{\frac{\mathrm{PRESS}(r)}{IJ}}$$

In Fig. 3.17, the results of cross-validation are shown. As shown in Fig. 3.14, the fit to the data will trivially improve with the number of components, but the RMSECV gets worse after four components, indicating that no more than four components should be used. In fact, the improvement going from three to four components is so small that three is likely a more feasible choice from that perspective.

Fig. 3.17 A plot of RMSECV for PCA models with different numbers of components
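A simplified sketch of element-wise cross-validation follows (one common scheme, using an iterative, EM-style re-estimation of the left-out elements; it is not necessarily the exact algorithm behind Fig. 3.17):

```python
import numpy as np

def rmsecv(X, r, n_folds=7, n_iter=50, seed=0):
    """Element-wise cross-validated RMSECV for an r-component PCA (sketch)."""
    I, J = X.shape
    rng = np.random.default_rng(seed)
    folds = rng.permutation(I * J).reshape(I, J) % n_folds
    press = 0.0
    for f in range(n_folds):
        mask = folds == f                      # True where an element is left out
        Xw = np.where(mask, X.mean(), X)       # crude initial fill-in
        for _ in range(n_iter):                # EM-style re-estimation
            mu = Xw.mean(axis=0)
            U, s, Vh = np.linalg.svd(Xw - mu, full_matrices=False)
            Xhat = (U[:, :r] * s[:r]) @ Vh[:r] + mu
            Xw = np.where(mask, Xhat, X)       # refill only the left-out spots
        press += np.sum((X[mask] - Xhat[mask]) ** 2)   # PRESS contribution
    return np.sqrt(press / (I * J))            # RMSECV(r)
```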

3.4.3 When Using PCA for Other Purposes

It is quite common to use PCA as a preprocessing step in order to get a nicely compact representation of a dataset. Instead of the original many (J) variables, the dataset can be expressed in terms of the few (R) principal components. These components can then in turn be used for many different purposes (Fig. 3.18).

Fig. 3.18 Using the scores of PCA for further modelling
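With scikit-learn (the subject of Section 3.5), such a compression step is conveniently chained into a pipeline. A sketch with simulated data; regressing a response on the PCA scores in this way is, in essence, principal component regression (Section 3.6):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(44, 14))                       # simulated predictors
y = X @ rng.normal(size=14) + rng.normal(scale=0.1, size=44)

# Autoscale, compress J = 14 variables to R = 4 scores, then model on the scores.
model = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
model.fit(X, y)
print(model.score(X, y))                            # R^2 on the training data
```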

3.4.4 Detecting Outliers

Outliers are samples that are somehow disturbing or unusual. Often, outliers are downright wrong samples. For example, in determining the height of persons, five samples are obtained ([1.78, 1.92, 1.83, 167, 1.87]). The values are in meters but, accidentally, the fourth sample has been measured in centimeters. If the sample is not either corrected or removed, the subsequent analysis is going to be detrimentally disturbed by this outlier. Outlier detection is about identifying and handling such samples. An alternative or supplement to outlier handling is the use of robust methods, which will, however, not be treated in detail here.

This section is mainly going to focus on identifying outliers, but understanding the outliers is really the critical aspect. Often outliers are mistakenly taken to mean wrong samples, and nothing could be more wrong! Outliers can be absolutely right, but e.g. just badly represented. In such a case, the solution is not to remove the outlier, but to supplement the data with more samples of the same type. The bottom line is that it is imperative to understand why a sample is an outlier. This section will give the tools to identify such samples and see in what way they differ. It is then up to the data analyst to decide how the outliers should be handled.

Data Inspection

An often forgotten, but important, first step in data analysis is to inspect the raw data. Depending on the type of data, many kinds of plots can be relevant, as already mentioned. For spectral data, line plots may be nice. For discrete data, histograms, normal probability plots, or scatter plots could be feasible. In short, any kind of visualization that will help elucidate aspects of the data can be useful. Several such plots have already been shown throughout this chapter. It is also important, and frequently forgotten, to look at the preprocessed data. While the raw data are important, they actually never enter the modeling. It is the preprocessed data that will be modeled, and there can be big differences in the interpretations of the raw and the preprocessed data.

Score Plots

While raw and preprocessed data should always be investigated, some types of outliers will be difficult to identify from there. The PCA model itself can provide further information. There are two places where outlying behavior will show up most evidently: in the scores and in the residuals. It is appropriate to go through all selected scores and look for samples that show strange behaviour. Often, it is only components one and two that are investigated, but it is necessary to look at all the relevant components.

As for the data, it is a good idea to plot the scores in many ways, using different combinations of scatter plots, line plots, histograms, etc. Also, it is often useful to go through the same plot coloured by the various types of additional information available. This could be any kind of information, such as temperature, storage time of the sample, operator, or any other kind of either qualitative or quantitative information available. For the wine data model, it is seen in Fig. 3.19 that one sample behaves differently from the others in the score plot of component one versus two (upper left corner).

Fig. 3.19 Score plot of a four-component PCA model of the wine data

Looking at the loading plot (Fig. 3.20) indicates that the sample must be (relatively) high in volatile and lactic acid and low in malic acid. This should then be verified in the raw data. After removing this sample, the model is rebuilt and reevaluated. No more extreme samples are observed in the scores.

Fig. 3.20 Scatter plot of loading 1 versus loading 2

Hotelling's T²

Looking at scores is helpful, but it is only possible to look at a few components at a time. If the model has many components, it can be laborious, and the risk of accidentally missing something increases. In addition, in some cases, outlier detection has to be automated in order to function e.g. in an on-line process monitoring system. There are ways to do so, and a common way is to use the so-called Hotelling's T², which was introduced in 1931. This diagnostic can be seen as an extension of the t-test and can also be applied to the scores of a PCA model. It is calculated as

$$T_i^{2} = \mathbf{t}_i^{T}\left(\frac{\mathbf{T}^{T}\mathbf{T}}{I-1}\right)^{-1}\mathbf{t}_i$$

where T is the matrix of scores (I×R) from all the calibration samples and tᵢ is an R×1 vector holding the R scores of the ith sample. Assuming that the scores are normally distributed, confidence limits for T² can then be established.
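A sketch of the computation from calibration scores (simulated data; the formula above, with the score covariance TᵀT/(I-1), is assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(44, 14))
Xc = X - X.mean(axis=0)

R = 4
U, s, Vh = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :R] * s[:R]                         # calibration scores (I x R)

# T2_i = t_i' (T'T / (I - 1))^{-1} t_i for every sample i
S_inv = np.linalg.inv(T.T @ T / (X.shape[0] - 1))
T2 = np.einsum("ir,rs,is->i", T, S_inv, T)
print(T2[:5])                                # large values flag potential outliers
```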
