Searching Single Nucleotide Polymorphism Markers to Complex Diseases using Genetic Algorithm Framework and a BoostMode Support Vector Machine

Khantharat Anekboon, Suphakant Phimoltares, and Chidchanok Lursinsap
AVIC, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand
Khantharat.A@Student.chula.ac.th, suphakant.p@chula.ac.th, and lchidcha@chula.ac.th

Sissades Tongsima
Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand
sissades@biotec.or.th

Suthat Fucharoen
Thalassemia Research Center, Institute of Molecular Biosciences, Mahidol University, Salaya Campus, Nakhonpathom, Thailand
grsfc@mahidol.ac.th

Abstract

With the advent of large-scale, high-density single nucleotide polymorphism (SNP) arrays, case-control association studies have been performed to identify predisposing genetic factors that influence many common complex diseases. These genotyping platforms provide very dense SNP coverage on a single chip. Much research has focused on multivariate genetic models to identify genes that can predict disease status. However, increasing the number of SNPs generates a large number of combined genetic outcomes to be tested. This work presents a new mathematical algorithm for SNP analysis, called IFGA, that uses a "Boost Mode" support vector machine (SVM) to select the best set of SNP markers for predicting the state of a complex disease. The proposed algorithm was applied to association studies of two diseases, namely Crohn's disease and the severity spectrum of β0/HbE thalassemia. The results show that the selected SNPs classify thalassemia and Crohn's disease with 71.57% and 71.06% accuracy, respectively, under 10-fold cross validation, compared with the optimum random forest (ORF) and classification and regression trees (CART) techniques.

Keywords: Single Nucleotide Polymorphism; Support Vector Machine; Genetic Algorithm

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

I. INTRODUCTION

Scientists have long been interested in identifying genetic factors that influence the occurrence of complex diseases. With the advent of parallel genotyping technology, the cost and time of finding SNPs are no longer prohibitive. Large case-control cohorts generated from very dense SNP arrays (a DNA chip contains a dense array of SNPs) challenge researchers to search for SNPs that are associated with the diseases. In contrast to single-gene disorders, the state of a complex disease can be triggered by multiple genes when they are exposed to certain environmental factors [1], [2]. However, searching for multiple-marker interactions in a large pool of SNPs imposes high computational and memory complexity. A technique for selecting a subset of relevant features, named Feature Selection [3], has been widely used in almost all fields, including bioinformatics. It improves learning accuracy and helps to understand the importance of the features by removing irrelevant or redundant ones.

II. THE PROPOSED IFGA METHOD

In this section, we introduce a new encoding method called IFGA. Fig. 1 summarizes the IFGA method. The first population is constructed by our proposed integer encoding approach. In the genetic algorithm (GA) context, each chromosome represents a set of selected features. After the population is generated, each chromosome is evaluated by a fitness score, which is obtained with the BoostMode-SVM approach. The IFGA then generates the next population by IFGA selection, IFGA cross-over, and IFGA mutation until a termination criterion is satisfied.

[Figure 1. The overall IFGA flowchart.]

A. The Integer Encoding Method

Feature selection with a GA can be done by converting the input data with binary encoding [4]. The length of a chromosome then equals the number of all features, so the size of the encoded chromosome corresponds directly to the number of input features. This presents a problem for two reasons. First, the running time depends heavily on the length of the chromosome. Second, a general binary encoding does not fix the number of selected features; it fixes only the length of the chromosome. The IFGA integer encoding method is proposed to solve these problems.

Assume that the case-control data used in this study have m genotypes. Let Q_i be the i-th chromosome processed by the algorithm. The length of Q_i, denoted by |Q_i|, is set to a constant less than or equal to m. Then |Q_i| random numbers give the locations at which the corresponding genotypes are selected from the given feature sequence. During the IFGA, the chromosomes do not necessarily all have the same length. For example, suppose m = 7, the chromosome size |Q_i| is set to 3, and the randomly selected locations are 1, 5, and 6. Then the chromosome is Q_i = {1, 5, 6}.

B. IFGA Selection

Each individual chromosome is selected into a mating pool, based on its fitness score, by the stochastic universal sampling (SUS) method [5]. The IFGA also uses an elitism technique [6], in which the best chromosome of the current generation is carried over to the next generation.

C. IFGA Cross-Over

The cross-over function of a traditional GA randomly selects a recombination point and swaps the parts of the two chromosomes flanking this point. This operator cannot be applied to the IFGA approach, because it assumes that all chromosomes have the same size and because the same feature (locus) must not appear twice on one chromosome. An IFGA cross-over technique is therefore devised to overcome this problem. Assume that parent1 and parent2 are the parental chromosomes, where each locus is the position of a selected feature. At least one of parent1 and parent2 must have more than one locus, and both parents must have at least one locus. The outputs of the algorithm are offspring1 and offspring2.

    x ← ∅
    y ← ∅
    tmp1 ← parent1
    for i = 1 to |parent1| do
        v ← |tmp1|
        sel ← random(1, 2, ..., v)
        x ← x ∪ sel
        tmp1 ← tmp1 \ parent1_sel
    end for
    tmp2 ← parent2
    for i = 1 to |parent2| do
        v ← |tmp2|
        sel ← random(1, 2, ..., v)
        y ← y ∪ sel
        tmp2 ← tmp2 \ parent2_sel
    end for
    c ← random[1, min(|parent1|, |parent2|) - 1]
    offspring1 ← x_1, x_2, ..., x_c, y_{c+1}, ..., y_{|parent2|}
    offspring2 ← y_1, y_2, ..., y_c, x_{c+1}, ..., x_{|parent1|}

D. IFGA Mutation

The mutation function alters the value of a specified locus. It occurs far less often than cross-over. Let m denote the length of a given genotype sequence, let input_chrom be the chromosome to be mutated, and let output_chrom be the mutated chromosome. Each element of a chromosome is a selected feature.

    pos_out ← random[1, |input_chrom|]
    pos_in ← random[1, m]
    for i = 1 to |input_chrom| do
        if i = pos_out then
            output_chrom_i ← pos_in
        else
            output_chrom_i ← input_chrom_i
        end if
    end for

E. Generating a Population

There are two kinds of population: the initial population and the next-generation population. To generate the initial population of P chromosomes, where P is a user-defined number of chromosomes in the population, the algorithm repeatedly generates chromosomes by the integer encoding method and adds them to the population until the number of chromosomes equals P.

The population of the next generation, on the other hand, consists of the chromosome b with the best fitness score in the current generation; e groups of features produced by evolution (cross-over and mutation); and r groups of newly re-selected features. After b and e are added to the next generation, the chromosomes are checked for redundancy. Each chromosome in the next generation must be unique, so duplicated chromosomes are removed. If the number of chromosomes in the next generation is less than the number in the current generation, new subsets of features, r, are randomly created and added to the next generation.

F. Termination

The IFGA algorithm consists of a set of repeated steps: generating the population, evaluation by a BoostMode-SVM, IFGA selection, IFGA cross-over, and IFGA mutation. These steps are executed until the best result remains unchanged for 300 consecutive iterations.

III. THE PROPOSED BOOSTMODE-SVM METHOD

The goal of an SVM [7] is to find a maximum-margin separating hyperplane, either for the linearly separable case (1) or for the nonlinearly separable case (2). Here w^T is the transpose of the weight vector, x_i is an input vector, φ is a mapping function, and b is a bias value.

    y_i = \mathrm{sign}(w^T x_i + b)            (1)
    y_i = \mathrm{sign}(w^T \phi(x_i) + b)      (2)

Both formulations face the same problem when the input data are imbalanced. The separating hyperplane learned from an imbalanced data set may shift too far towards the smaller group compared with the true separating hyperplane [8]. To solve this problem, the decision hyperplane should be adjusted. It can be seen from (1) and (2) that the parameter w affects the classification output. Modifying w therefore adjusts the decision hyperplane, which may improve the classifier.

A. BoostMode-SVM

A new oversampling technique for nominal features is proposed to improve the performance of the SVM. The BoostMode-SVM (Fig. 1) builds two SVMs, namely SVM1 and SVM2. SVM1 is constructed to generate the scores of the training data set, whereas SVM2 is the final SVM model for classifying the test set.

First, only the training set is used to construct SVM1 and to find the Boost Mode. The Boost Mode is the indicator vector of the minority data set, and it is tested with SVM1. Two scoring methods, Unbiased Scoring (US) and Bias Scoring (BS), are proposed to find the scoring values. The US method is used when SVM1 classifies the Boost Mode correctly; otherwise the BS method is used. After that, a Scored Over-Sampling (SOS) approach adds artificial data to the minority group by sampling from the minority group until both groups contain the same number of data points. In this paper, the minority group is the group with fewer elements. The new SVM2 is then constructed for classification from the previous training data set together with the new data produced by the SOS technique. Finally, the test set is run through SVM2 for evaluation. The error rate on the test set is the fitness score used in the IFGA section above.

B. Finding the Boost Mode

To balance the sizes of the two classes, some additional data must be generated in the minority group. The generating method that is selected (either US or BS) depends on a Boost Mode vector. The following procedure describes how to compute the Boost Mode vector. Let n_minor be the number of data points in the minority group. Bootstrap sampling with replacement is applied to the minority group to generate t data sets, BoostGroup_1, ..., BoostGroup_t. Each BoostGroup_i contains n_minor data points.

    for i = 1 to t do
        allmode_i ← mode(BoostGroup_i)
    end for
    BoostMode ← mode_i(allmode_i)

C. The Unbiased Scoring Method

This technique is used when SVM1 classifies the Boost Mode correctly. All data points have equal chances (equal scoring values) of being selected by the over-sampling technique. The following algorithm computes the scoring values by the US technique. Its output is scoreVal.

    for i = 1 to n_minor do
        scoreVal_i = 1 / n_minor
    end for

D. The Bias Scoring Method

The BS technique is used when SVM1 classifies the Boost Mode incorrectly. The scoring value of a point is calculated from its distance to the decision hyperplane, by (3) for linear separability or by (4) for nonlinear separability.

    distance_i = w^T x_i + b            (3)
    distance_i = w^T \phi(x_i) + b      (4)

A data point that is correctly classified has a smaller chance (a smaller scoring value) of being selected by the over-sampling process than one that is wrongly classified. Therefore, a larger number of incorrect classifications gives the samples a higher chance of being chosen, and vice versa. The scoring value for the BS method is computed by the following algorithm. Let distance be the set of distances of all data points in the minority group. The output of the algorithm is the set scoreVal.

    sumSV1 ← 0
    minVal ← min_i(distance_i)
    addVal = absolute(minVal) + 1
    for i = 1 to n_minor do
        tmp_i = distance_i + addVal
        sumSV1 = sumSV1 + tmp_i
    end for
    if the minority group is the control group then
        for i = 1 to n_minor do
            tmp_i = 2 tmp_i
        end for
    end if
    for i = 1 to n_minor do
        sumSV2 = 0
        for j = 1 to i do
            sumSV2 = sumSV2 + tmp_j
        end for
        scoreVal_i ← sumSV2 / sumSV1
    end for

E. The Scored Over-Sampling Method

The objective of the SOS algorithm is to select data from the minority group according to scoreVal, computed by either the US algorithm or the BS algorithm. Let MD_i denote the i-th data point, for 1 ≤ i ≤ n_minor. The numbers of data points in the minority and majority groups are n_minor and n_major, respectively. The output of the algorithm is the set of additional data added to the minority group, samp_data.

    z = n_major - n_minor
    sumSV1 = 0
    for i = 1 to |scoreVal| do
        sumSV1 = sumSV1 + scoreVal_i
    end for
    for i = 1 to |scoreVal| do
        sumSV2 = 0
        for j = 1 to i do
            sumSV2 = sumSV2 + scoreVal_j
        end for
        mapScore_i = sumSV2 / sumSV1
    end for
    for i = 1 to z do
        selectPos = rand(1)
        if selectPos ≥ 0 and selectPos ≤ mapScore_1 then
            samp_data_i = MD_1
        else
            for j = 2 to |scoreVal| do
                if selectPos > mapScore_{j-1} and selectPos ≤ mapScore_j then
                    samp_data_i = MD_j
                end if
            end for
        end if
    end for

IV. EXPERIMENTS AND RESULTS

Table I compares the IFGA-BoostMode-SVM, ORF [9], and CART [10] by 10-fold cross validation on the thalassemia and Crohn's disease data sets. Our IFGA-BoostMode-SVM classifies better than the standard ORF and CART methods. Note that no feat., acc., sen., and spec. in Table I are the number of features, accuracy, sensitivity, and specificity, respectively. The thalassemia data set (503 patients with 835 SNPs) was obtained from the Thalassemia Research Center, Mahidol University, and the Crohn's data set (357 patients with 103 SNPs) was obtained from [11]. Missing data were inferred by the 2SNP phasing method [12].

For the SVM, a soft-margin RBF kernel function with its kernel parameter set to 0.5 was used to analyze both the Crohn's and thalassemia data sets. Dummy encoding is applied for the SVM: a genotype is represented by the vector 100, 010, or 001 when it is a major homozygote, a minor homozygote, or a heterozygote, respectively. In the IFGA, the chromosome size is varied from 1 to 10, so feature selection is performed for 1 to 10 features. The IFGA parameters were set as follows: the number of chromosomes is 1000, the cross-over rate is 0.7 for thalassemia and 0.8 for Crohn's disease, and the mutation rate is 0.035 for thalassemia and 0.001 for Crohn's disease.

TABLE I. THE EXPERIMENTAL RESULTS

    Dataset + Algorithm           no feat.   acc. (%)   sen. (%)   spec. (%)
    Thal. + IFGA-BoostMode-SVM    6          71.57      76.39      64.14
    Thal. + ORF                   6          54.27      69.84      30.30
    Thal. + CART                  6          69.38      76.07      59.09
    Crohn + IFGA-BoostMode-SVM    8          71.06      64.58      74.90
    Crohn + ORF                   8          57.88      20.14      80.25
    Crohn + CART                  8          63.31      23.61      86.83

V. CONCLUSION

A new IFGA with BoostMode-SVM was proposed to identify susceptibility loci in case-control association studies. The IFGA technique encodes chromosomes as integer sequences of varying size. The proposed SOS technique samples the minority data set using two scoring approaches (US and BS). The method is well suited to case-control association studies. The experimental results on two real data sets, Crohn's disease and thalassemia, show that feature selection and classification by the IFGA with BoostMode-SVM outperform the standard ORF and CART techniques.

REFERENCES

[1] J. Marchini, P. Donnelly, and L. R. Cardon, "Genome-wide strategies for detecting multiple loci that influence complex diseases," Nature Genetics, vol. 37, pp. 413-417, March 2005.
[2] D. J. Weatherall, "Science, medicine, and the future: Single gene disorders or complex traits: Lessons from the thalassaemias and other monogenic diseases," BMJ, v
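
The integer encoding, cross-over, and mutation operators of Section II can be illustrated with a short Python sketch. It is not the authors' implementation. The function names `random_chromosome`, `ifga_crossover`, and `ifga_mutate` are hypothetical, loci are taken to be 1-based integers, the two selection loops of the cross-over listing are read as a random permutation of each parent, the cut point is clamped to 1 when the shorter parent has a single locus, and repeated loci in an offspring are simply dropped. The last two choices are assumptions, since the paper does not spell out how these cases are handled.

```python
import random


def random_chromosome(m, size, rng):
    """Integer encoding: pick `size` distinct feature locations from 1..m."""
    return rng.sample(range(1, m + 1), size)


def ifga_crossover(parent1, parent2, rng):
    """Recombine two chromosomes that may differ in length."""
    # Drawing loci one by one without replacement is a random permutation.
    x = rng.sample(parent1, len(parent1))
    y = rng.sample(parent2, len(parent2))
    # Cut point c in [1, min(|parent1|, |parent2|) - 1]; clamped to 1
    # (assumption) when the shorter parent has only one locus.
    upper = max(1, min(len(x), len(y)) - 1)
    c = rng.randint(1, upper)
    child1 = x[:c] + y[c:]
    child2 = y[:c] + x[c:]
    # Assumption: a locus must not repeat, so later duplicates are dropped.
    return list(dict.fromkeys(child1)), list(dict.fromkeys(child2))


def ifga_mutate(chrom, m, rng):
    """Replace one randomly chosen locus by a random location in 1..m."""
    pos_out = rng.randrange(len(chrom))   # 0-based index of the locus to replace
    pos_in = rng.randint(1, m)            # new feature location
    mutated = list(chrom)
    # As in the listing, pos_in may already occur elsewhere in the chromosome.
    mutated[pos_out] = pos_in
    return mutated


rng = random.Random(0)
q = random_chromosome(7, 3, rng)          # three distinct loci out of m = 7
c1, c2 = ifga_crossover([1, 5, 6], [2, 5, 3, 7], rng)
mut = ifga_mutate([1, 5, 6], 7, rng)
```

Because the offspring take their tails from the other parent, their lengths can differ from each other and from the parents, which is consistent with the statement that chromosome lengths need not be identical during the IFGA.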
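
The Boost Mode computation of Section III-B can be sketched as follows. This is a minimal illustration under stated assumptions: each data point is a tuple of nominal genotype codes, the mode is taken feature by feature, and ties are broken in favor of the value seen first. None of these details is specified in the paper, and the names `column_mode` and `boost_mode` are hypothetical.

```python
import random
from collections import Counter


def column_mode(rows):
    """Feature-wise mode of a list of equal-length nominal vectors."""
    # most_common(1) returns the most frequent value; ties go to the
    # value encountered first (assumed tie-breaking rule).
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def boost_mode(minority, t, rng):
    """Mode of the modes of t bootstrap samples of the minority group."""
    n_minor = len(minority)
    all_modes = []
    for _ in range(t):
        # Bootstrap sampling with replacement, n_minor points per group.
        group = [rng.choice(minority) for _ in range(n_minor)]
        all_modes.append(column_mode(group))
    return column_mode(all_modes)


minority = [(0, 1, 2)] * 5                 # toy minority group, 3 SNPs each
bm = boost_mode(minority, t=10, rng=random.Random(1))
```

The resulting vector would then be classified by SVM1 to decide between the unbiased and the bias scoring method.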
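
The scoring and over-sampling steps of Sections III-C to III-E can be sketched in the same way. The code follows the printed listings literally, including the fact that the bias scores are already cumulative before the SOS step accumulates them again. The adjustment applied when the minority group is the control group is omitted, because its operator cannot be recovered from the text. The function names are hypothetical, and the binary search is an implementation choice that selects the same index as the linear scan in the SOS listing.

```python
import bisect
import random
from itertools import accumulate


def unbiased_scores(n_minor):
    """US method: every minority point gets the same score."""
    return [1.0 / n_minor] * n_minor


def bias_scores(distances):
    """BS method without the control-group adjustment (omitted, see above)."""
    add_val = abs(min(distances)) + 1          # makes every shifted value >= 1
    shifted = [d + add_val for d in distances]
    total = sum(shifted)
    # scoreVal_i is the running sum of shifted values divided by the total.
    return [s / total for s in accumulate(shifted)]


def scored_oversample(minority, score_val, n_major, rng):
    """SOS: draw n_major - n_minor extra points from the minority group."""
    z = n_major - len(minority)
    total = sum(score_val)
    map_score = [s / total for s in accumulate(score_val)]
    samples = []
    for _ in range(z):
        select_pos = rng.random()
        # First j with select_pos <= map_score[j]; the clamp guards against
        # floating-point rounding in the last cumulative value.
        j = min(bisect.bisect_left(map_score, select_pos), len(minority) - 1)
        samples.append(minority[j])
    return samples


us = unbiased_scores(4)
bs = bias_scores([-2.0, 0.0, 1.0])
extra = scored_oversample(["a", "b", "c"], unbiased_scores(3), 10, random.Random(0))
```

With the unbiased scores, `map_score` is an evenly spaced grid, so every minority point is equally likely to be duplicated until both groups have the same size.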