
Logistic Regression in Rare Events Data

Gary King and Langche Zeng

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. [...]

Authors' note: We thank James Fowler, Ethan Katz, and Mike Tomz for research assistance; Jim Alt, John Freeman, Kristian Gleditsch, Guido Imbens, Chuck Manski, Peter McCullagh, Walter Mebane, Jonathan Nagler, Bruce Russett, Ken Scheve, Phil Schrodt, Martin Tanner, and Richard Tucker for helpful suggestions; Scott Bennett, Kristian Gleditsch, Paul Huth, and Richard Tucker for data; and the National Science Foundation (SBR-9729884 and SBR-9753126), the Centers for Disease Control and Prevention (Division of Diabetes Translation), the National Institute on Aging (P01 AG17625-01), the World Health Organization, and the Center for Basic Research in the Social Sciences for research support. Software we wrote to implement the methods in this paper, called “ReLogit: Rare Events Logistic Regression,” is available for Stata and for Gauss from http://GKing.Harvard.Edu. We have written a companion piece to this article that overlaps this one: it excludes the mathematical proofs and other technical material, and has less general notation, but it includes empirical examples and more pedagogically oriented material (see King and Zeng 2000b; copy available at http://GKing.Harvard.Edu).

Copyright 2001 by the Society for Political [...]

1 Introduction

We address problems in the statistical analysis of rare events data: binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, coups, presidential vetoes, decisions of citizens to run for political office, or infections by uncommon diseases) than zeros (“nonevents”). (Of course, by trivial recoding, this definition covers either rare or very common events.) Such data are common in political science and related social sciences and perhaps most prevalent in international conflict (and other [...]). In these literatures, rare events have proven difficult to explain and predict, a problem we believe has a multiplicity of sources, including the two we address here: most popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events, and commonly used data collection strategies are grossly inefficient.

First, although the statistical properties of linear regression models are invariant to the (unconditional) mean of the dependent variable, the same is not true for binary dependent variable models. The mean of a binary variable is the relative frequency of events in the data, which, in addition to the number of observations, constitutes the information content of the data set. We show that this often overlooked property of binary variable models has important consequences for rare events data. For example, that logit coefficients are biased in small samples (under about 200) is well documented in the statistical literature, but not as widely understood is that in rare events data the biases in probabilities can be substantively meaningful with sample sizes in the thousands and are in a predictable direction: estimated event probabilities are too small. A separate, and also overlooked, problem is that the almost universally used method of computing probabilities of events in logit analysis is suboptimal in finite samples of rare events data, leading to errors in the same direction as biases in the coefficients. Applied researchers virtually never correct for the underestimation of event probabilities. These problems will be innocuous in some applications, but we offer simple Monte Carlo examples where the biases are as large as some estimated effects reported in the literature. We demonstrate how to correct for these problems and provide software to make the computation straightforward.

A second source of the difficulties in analyzing rare events lies in data collection. Given fixed resources, a trade-off always exists between gathering more observations and including better or additional variables. In rare events data, fear of collecting data sets with no events (and thus without variation on Y) has led researchers to choose very large numbers of observations with few, and in most cases poorly measured, explanatory variables. This is a reasonable choice, given the perceived constraints, but it turns out that far more efficient data collection strategies exist. For example, researchers can collect all (or all available) ones and a small random sample of zeros and not lose consistency or even much efficiency relative to the full sample. This result drastically changes the optimal trade-off between more observations and better variables, enabling scholars to focus data collection efforts where they matter most.

As an example, we use all dyads (pairs of countries) for each year since World War II to generate a data set below with 303,814 observations, of which only 0.34%, or 1042 dyads, were at war. Data sets of this size are not uncommon in international relations, but they make data management difficult, statistical analyses time-consuming, and data collection expensive.[1] (Even the more common 5000–10,000 observation data sets are inconvenient to deal with if one has to collect variables for all the cases.) Moreover, most dyads involve countries with little relationship at all (say Burkina Faso and St. Lucia), much less with some realistic probability of going to war, and so there is a well-founded perception that many of the data are “nearly irrelevant” (Maoz and Russett 1993, p. 627). Indeed, many of the data have very little information content, which is why we can avoid collecting the vast majority of observations without much loss of efficiency. Existing approaches in political science designed to cope with this problem, such as selecting dyads that are “politically relevant” (Maoz and Russett 1993), are reasonable and practical approaches to a difficult problem, but they necessarily change the question asked, alter the population to which we are inferring, or require conditional analysis (such as only contiguous dyads or only those involving a major power). Less careful uses of these types of data selection, such as trying to make inferences to the set of all dyads, are biased. With appropriate easy-to-apply corrections, nearly 300,000 observations with zeros need not be collected or could even be deleted with only a minor impact on substantive conclusions.

With these procedures, scholars who wish to add new variables to an existing collection can save approximately 99% of the nonfixed costs in their data collection budget or can reallocate data collection efforts to generate a larger number of more informative and meaningful variables than would otherwise be possible.[2] Relative to some other fields in political science, international relations scholars have devoted considerable attention to issues of measurement over many years and have generated a large quantity of data. Selecting on the dependent variable in the way we suggest has the potential to build on these efforts [...].

This procedure of selection on Y also addresses a long-standing controversy in the international conflict literature whereby qualitative scholars devote their efforts where the action is (the conflicts) but are criticized for selecting on the dependent variable. In contrast, quantitative scholars are criticized for spending time analyzing very crude measures [...] (Bueno de Mesquita 1981; Geller and Singer 1998; Levy 1989; Rosenau 1976; Vasquez 1993). It turns out that both sides have some of the right intuition: the information in the data lies much more with the ones than the zeros, but researchers must be careful to avoid bias. Fortunately, the corrections are easy, and so the goals of both camps can be met.

The main intended contribution of this paper is to integrate these two types of corrections, which have been studied mostly in isolation, and to clarify the largely unnoticed consequences of rare events data in this context. We also try to forge a critical link between the two by developing corrections for finite sample and rare events bias, and standard error inconsistency, in a popular method of correcting selection on Y. This is useful when selecting on Y leads to smaller samples. We also provide an improved method of computing probability estimates, proofs of the equivalence of some leading econometric methods, and software to implement the methods developed. We offer evidence in the form of analytical results and Monte Carlo experiments. Empirical examples appear in our companion paper (King and Zeng 2000b).[3]

[1] Bennett and Stam (1998b) analyze a data set with 684,000 dyad-years and (1998a) have even developed sophisticated software for managing the larger, 1.2 million-dyad data set they distribute.

[2] The fixed costs involved in gearing up to collect data would be borne with either data collection strategy, and so selecting on the dependent variable as we suggest saves something less in research dollars than the fraction of observations not collected.

[3] We have found no discussion in political science of the effects of finite samples and rare events on logistic regression or of most of the methods we discuss that allow selection on Y. There is a brief discussion of one method of correcting selection on Y in asymptotic samples by Bueno de Mesquita and Lalman (1992, Appendix) and in an unpublished paper they cite that has recently become available (Achen 1999).

2 Logistic Regression: Model and Notation

In logistic regression, a single outcome variable $Y_i$ ($i = 1, \ldots, n$) follows a Bernoulli probability function that takes on the value 1 with probability $\pi_i$ and 0 with probability $1 - \pi_i$. Then $\pi_i$ varies over the observations as an inverse logistic function of a vector $x_i$, which includes a constant and $k - 1$ explanatory variables:

\[ Y_i \sim \text{Bernoulli}(Y_i \mid \pi_i), \qquad \pi_i = \frac{1}{1 + e^{-x_i \beta}} \tag{1} \]

The Bernoulli has probability function $P(Y_i \mid \pi_i) = \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}$. The unknown parameter $\beta = (\beta_0, \beta_1')'$ is a $k \times 1$ vector, where $\beta_0$ is a scalar constant term and $\beta_1$ is a vector with elements corresponding to the explanatory variables.
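To make Eq. (1) concrete, the following is a minimal sketch in plain Python of how event probabilities and Bernoulli outcomes could be generated. The function names, the coefficient values ($\beta_0 = -5$, $\beta_1 = 1$), the single standard-normal covariate, and the sample size are all assumptions chosen for illustration; none of them come from the paper.

```python
import math
import random


def inverse_logit(eta):
    """Inverse logistic function, 1 / (1 + exp(-eta)), as in Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def event_probability(x, beta):
    """pi_i for one observation; x must include a leading 1 for the constant."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return inverse_logit(eta)


def draw_outcome(x, beta, rng):
    """Draw Y_i ~ Bernoulli(pi_i)."""
    return 1 if rng.random() < event_probability(x, beta) else 0


# Hypothetical coefficients: a strongly negative constant makes events rare.
beta = [-5.0, 1.0]
rng = random.Random(0)

# One hypothetical standard-normal explanatory variable plus the constant.
xs = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(20000)]
ys = [draw_outcome(x, beta, rng) for x in xs]

share_of_ones = sum(ys) / len(ys)
print("share of ones in the simulated data:", share_of_ones)
```

With a constant this far below zero, only a small share of the simulated outcomes are ones, which is the rare events setting the paper is concerned with.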

An alternative way to define the same model is by imagining an unobserved continuous variable $Y_i^*$ (e.g., the health of an individual or the propensity of a country to go to war) that follows a logistic density with mean $\mu_i$, where $\mu_i$ varies over the observations as a linear function of $x_i$. The model would be very close to a linear regression if $Y_i^*$ were observed:

\[ Y_i^* \sim \text{Logistic}(Y_i^* \mid \mu_i), \qquad \mu_i = x_i \beta \tag{2} \]

where $\text{Logistic}(Y_i^* \mid \mu_i)$ is the one-parameter logistic probability density,

\[ P(Y_i^*) = \frac{e^{-(Y_i^* - \mu_i)}}{\left(1 + e^{-(Y_i^* - \mu_i)}\right)^2} \]

Unfortunately, instead of observing $Y_i^*$, we see only its dichotomous realization, $Y_i$, where $Y_i = 1$ if $Y_i^* > 0$ and $Y_i = 0$ if $Y_i^* \le 0$. For example, if $Y_i^*$ measures health, $Y_i$ might be dead (1) or alive (0). If $Y_i^*$ were the propensity to go to war, $Y_i$ could be at war (1) or at peace (0). The model remains the same because

\[ \Pr(Y_i = 1 \mid \beta) = \pi_i = \Pr(Y_i^* > 0 \mid \beta) = \int_0^\infty \text{Logistic}(Y_i^* \mid \mu_i)\, dY_i^* = \frac{1}{1 + e^{-x_i \beta}} \]

which is exactly as in Eq. (1). We also know that the observation mechanism, which turns the continuous $Y_i^*$ into the dichotomous $Y_i$, generates most of the mischief. That is, we ran simulations trying to estimate $\beta$ from an observed $Y^*$ and the model in Eq. (2) and found that maximum-likelihood estimation of $\beta$ is approximately unbiased in small samples.

The parameters are estimated by maximum likelihood, with the likelihood function

\[ L(\beta \mid y) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i} \]
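The equivalence between the latent-variable formulation in Eq. (2) and the event probability in Eq. (1) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' simulation: it draws $Y^*$ from a logistic density by inverting its CDF, dichotomizes at zero, and compares the share of ones with $1/(1 + e^{-\mu})$. The value $\mu = -3$, the sample size, and the function names are hypothetical.

```python
import math
import random


def logistic_draw(mu, rng):
    """Draw Y* from a logistic density with mean mu using the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # guard against log(0); random() never returns 1.0
        u = rng.random()
    return mu + math.log(u / (1.0 - u))


def simulate_share_of_ones(mu, n, seed=0):
    """Share of draws with Y* > 0, i.e. the share of observed Y = 1."""
    rng = random.Random(seed)
    ones = sum(1 for _ in range(n) if logistic_draw(mu, rng) > 0.0)
    return ones / n


mu = -3.0  # hypothetical value of x_i * beta for a rare event
simulated = simulate_share_of_ones(mu, 200000)
analytic = 1.0 / (1.0 + math.exp(-mu))

print("simulated Pr(Y = 1):", simulated)
print("analytic  Pr(Y = 1):", analytic)
```

The two numbers should agree up to simulation noise, which is all the sketch is meant to show; it says nothing about the small-sample bias results discussed in the text.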

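By taking logs and using Eq. (1), the log-likelihood simplifies to

\[ \ln L(\beta \mid y) = \sum_{\{Y_i = 1\}} \ln(\pi_i) + \sum_{\{Y_i = 0\}} \ln(1 - \pi_i) = -\sum_{i=1}^{n} \ln\left(1 + e^{(1 - 2Y_i) x_i \beta}\right) \tag{5} \]

(e.g., Greene 1993, p. 643). Maximum-likelihood logit analysis then works by finding the value of $\beta$ that gives the maximum value of this function, which we label $\hat\beta$. The asymptotic variance matrix, $V(\hat\beta)$, is also retained to compute standard errors. When observations are selected randomly, or randomly within strata defined by some or all of the explanatory variables, $\hat\beta$ is consistent and asymptotically efficient (except under perfect collinearity among the columns in X or perfect discrimination between zeros and ones).

That in rare events data ones are more statistically informative than zeros can be seen by studying the variance matrix,

\[ V(\hat\beta) = \left[ \sum_{i=1}^{n} \pi_i (1 - \pi_i)\, x_i' x_i \right]^{-1} \]

The part of this matrix affected by rare events is the factor $\pi_i(1 - \pi_i)$. Most rare events applications yield small estimates of $\Pr(Y_i = 1 \mid x_i) = \pi_i$ for all observations. However, if the logit model has some explanatory power, the estimate of $\pi_i$ among observations for which rare events are observed (i.e., for which $Y_i = 1$) will usually be larger [and closer to 0.5, because probabilities in rare event studies are normally very small (see Beck 2000)] than among observations for which $Y_i = 0$. The result is that $\pi_i(1 - \pi_i)$ will usually be larger for ones than for zeros, and so the variance (its inverse) will be smaller. In this situation, additional ones will cause the variance to drop more and hence are more informative than additional zeros (see Imbens 1992, pp. 1207, 1209; Cosslett 1981a; Lancaster and Imbens 1996b).

Finally, we note that the quantity of interest in logistic regression is rarely the raw $\hat\beta$ output by most computer programs. Instead, scholars are normally interested in more direct functions of the probabilities. For example, absolute risk is the probability that an event occurs given chosen values of the explanatory variables, $\Pr(Y = 1 \mid X = x)$. The relative risk is the same probability relative to the probability of an event given some baseline values of X, e.g., $\Pr(Y = 1 \mid X = 1) / \Pr(Y = 1 \mid X = 0)$, the fractional increase in the risk. This quantity is frequently reported in the popular media (e.g., the probability of some forms of cancer increases by 50% if one stops exercising) and is common in many scholarly literatures. In political science, the term is not often used, but the measure is usually computed directly or studied implicitly. Also of considerable interest is the first difference (or attributable risk), the change in probability as a function of a change in a covariate, such as $\Pr(Y = 1 \mid X = 1) - \Pr(Y = 1 \mid X = 0)$. The first difference is usually most informative when measuring effects, whereas relative risk is dimensionless and so tends to be easier to compare across applications or time periods. Although scholars often argue about their relative merits (see Breslow and Day 1980, Chap. 2; and Manski 1999), reporting the two probabilities that make up each relative risk and each first difference is best when convenient.

3 How to Select on the Dependent Variable

We first distinguish among alternative data collection strategies and show how to adapt the logit model for each. Then, in Section 5, we build on these models to also allow rare event and finite sample corrections. This section discusses research design issues, and Section 4 considers the specific statistical corrections necessary.

3.1 Data Collection Strategies

The usual strategy, as known in econometrics, is either random sampling, where all observations (X, Y) are selected at random, or exogenous stratified sampling, which allows Y to be randomly selected within categories defined by X. Optimal statistical models are identical under these two sampling schemes. Indeed, in epidemiology, both are known under one name, cohort (or cross-sectional, to distinguish it from a panel) study.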
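The three quantities of interest defined at the end of Section 2 (absolute risk, relative risk, and first difference) are simple functions of the logit probabilities. The sketch below shows one way they might be computed from a coefficient vector. The coefficient values and covariate profiles are hypothetical, and the sketch plugs in point values only; it does not implement the improved probability estimates the paper goes on to develop.

```python
import math


def absolute_risk(x, beta):
    """Pr(Y = 1 | X = x) under the logit model; x includes the constant."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))


def relative_risk(x_new, x_base, beta):
    """Pr(Y = 1 | X = x_new) / Pr(Y = 1 | X = x_base)."""
    return absolute_risk(x_new, beta) / absolute_risk(x_base, beta)


def first_difference(x_new, x_base, beta):
    """Pr(Y = 1 | X = x_new) - Pr(Y = 1 | X = x_base)."""
    return absolute_risk(x_new, beta) - absolute_risk(x_base, beta)


# Hypothetical coefficients (constant, one explanatory variable).
beta = [-4.0, 1.0]
x_base = [1.0, 0.0]  # baseline profile, X = 0
x_new = [1.0, 1.0]   # comparison profile, X = 1

p_base = absolute_risk(x_base, beta)
p_new = absolute_risk(x_new, beta)
rr = relative_risk(x_new, x_base, beta)
fd = first_difference(x_new, x_base, beta)

print("absolute risks:", p_base, p_new)
print("relative risk:", rr)
print("first difference:", fd)
```

Printing both absolute risks alongside the relative risk and first difference follows the text's advice to report the two underlying probabilities when convenient: with rare events a relative risk well above 1 can coexist with a first difference of only a few percentage points.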

When one of the values of Y is rare in the population, considerable resources in data collection can be saved by randomly selecting within categories of Y. This is known in econometrics as choice-based or endogenous stratified sampling and in epidemiology as a case-control design (Breslow 1996); it is also useful for choosing qualitative case studies (King et al. 1994, Sect. 4.4.2). The strategy is to select on Y by collecting observations (randomly or all those available) for which Y = 1 (the “cases”) and a random selection of observations for which Y = 0 (the “controls”). This sampling method is often supplemented with known or estimated prior knowledge of the population fraction of ones [...]. Finally, case-cohort studies begin with some variables collected on a large cohort, and then subsample using all the ones and a random selection of zeros. This design is especially appropriate when adding an expensive variable to an existing collection, such as the dyadic data discussed above and analyzed below, or Verba et al.'s (1995) detailed study of activists, each of whom was drawn from a larger random sample, with very few variables, of the entire U.S. population. In this paper, we use information on the population fraction of ones when it is available, and so the same models we describe apply to both case-control and case-cohort studies.

Many other hybrid data collection strategies have also been tried. For example, Bueno de Mesquita and Lalman’s (1992) design is fairly close to a case-control study with “contaminated controls,” meaning that the “control” sample was from the whole population rather than only those observations for which Y = 0 (see Lancaster and Imbens 1996a). Although we do not analyze hybrid designs in this paper, our view is not that pure case-control sampling is appropriate for all political science studies of rare events. (For example, additional efficiencies might be gained by modifying a data collection strategy to fit variables that are easier to collect within regional or language clusters.) Rather, our argument is that scholars should consider a much wider range of potential sampling strategies, and associated statistical methods, than is now common. This paper focuses only on the leading alternative design which we believe has the potential to see widespread use in political science.

3.2 Problems to Avoid

Selecting on the dependent variable in the way we suggest has several pitfalls that should be carefully avoided. First, the sampling design for which the prior correction and weighting methods are appropriate requires independent random (or complete) selection of observations for which Y = 1 and Y = 0. This encompasses the case-control and case-cohort studies, but other designs, such as sampling in stages, with nonrandom selection, or via hybrid approaches, require different statistical methods.

Second, when selecting on Y, we must be careful not to select on X differently for the two samples. The classic example is selecting all people in the local hospital with liver cancer (Y = 1) and a random selection of the U.S. population without liver cancer (Y = 0). The problem is that the sample of cancer patients selects on Y = 1 and implicitly on the inclination to seek health care, find the right medical specialist, have the right tests, etc. Not recognizing the implicit selection on X is the problem here. Since the Y = 0 sample does not similarly select on the same explanatory variables, these data would induce selection bias. One solution in this example might be to select the Y = 0 sample from those who received the same liver cancer test but turned out not to have the disease. This design would yield valid inferences, albeit only for the health-conscious population with liver cancer-like symptoms. Another solution would be to measure and control for the omitted variables. This type of inadvertent selection on X can be a serious problem in endogenous designs, just as selection on Y can bias inferences in exogenous designs. Moreover, although in the social sciences random (or experimenter control over) assignment of the values of the explanatory variables for each unit is occasionally possible in exogenous or random sampling (and with a large n is generally desirable since it rules out omitted variable bias), random assignment on X is impossible in endogenous sampling. Fortunately, bias due to selection on X is much easier to avoid in applications such as international conflict and related fields, since a clearly designated census of cases is normally available from which to draw a sample. Instead of relying on the decisions of subjects about whether to come to a hospital and take a test, the selection into the data set in our field can often be entirely determined by the investigator. See Holland and Rubin (1988).

Third, another problem with intentional selection on Y is that valid exploratory data analysis can be more hazardous. In particular, one cannot use an explanatory variable as a dependent variable in an auxiliary analysis without special precautions (see Nagelkerke et al. 1995).

Finally, the optimal trade-off between collecting more observations and collecting better or more explanatory variables is application-specific, and so decisions will necessarily involve judgment calls and qualitative assessments. Fortunately, to help guide these decisions in fields like international relations we have large bodies of work on methods of quantitative measurement and, also, many qualitative studies that measure hard-to-collect variables for a small number of cases (such as leaders’ perceptions).

Some formal statistical results also bear on the optimal trade-off between more observations and better variables. First, when zeros and ones are equally easy to collect, and an unlimited number of each are available, an “equal shares sampling design” (i.e., $\bar y = 0.5$) is optimal in a limited number of situations and close to optimal in a large number (Cosslett 1981b; Imbens 1992). This is a useful fact, but in fields like international relations, the number of observable ones (such as wars) is strictly limited, and so in most of our applications collecting all available or a large sample of ones is best. The only real decision, then, is how many zeros to collect in addition. If collecting zeros were costless, we should collect as many as we can get, since more data are always better. If collecting zeros is not costless, but not (much) more expensive than collecting ones, then one should collect more zeros than ones. However, since the marginal information gained from each additional zero starts to drop as the number of zeros passes the number of ones, we will not often want to collect more than (roughly) two to five times more zeros than ones. In general, the optimal number of zeros depends on how much more valuable the explanatory variables become with the resources saved by collecting fewer observations. Finally, a useful practice is sequential, involving first the collection of all ones and (say) an equal number of zeros. Then, if the standard errors and confidence intervals are narrow enough, stop. Otherwise, continue to sample zeros randomly and stop when the confidence intervals are sufficiently narrow. In some data collections it might even be efficient to collect explanatory variables sequentially as well, but this is not often the case.

4 Correcting Estimates for Selection on Y

Designs that select on Y can be consistent and efficient but only with the appropriate statistical corrections. Sections 4.1 and 4.2 introduce the prior correction and weighting methods. [...] Hsieh et al. (1985) have shown that two of these econometric methods are equivalent to prior correction for the logit model. In Appendix A, we explicate this result and then prove that the best econometric estimator in this tradition also reduces to the method of prior correction when the model is logit and the sampling probability, $E(\bar y)$, is unknown. To our knowledge, this result has not appeared previously in the literature.

4.1 Prior Correction

Prior correction involves computing the usual logistic regression MLE and correcting the estimates based on prior information about the fraction of ones in the population, $\tau$, and the observed fraction of ones in the sample (or sampling probability), $\bar y$. Knowledge of $\tau$ can come from census data, a random sample from the population measuring Y only, a case-cohort sample, or other sources. In Appendix B, we try to elucidate this method by presenting a derivation of the method of prior correction for logit and most other statistical models. For the logit model, in any of the above sampling designs, the MLE $\hat\beta_1$ is a statistically consistent estimate of $\beta_1$ and the following corrected estimate is consistent for $\beta_0$:

\[ \hat\beta_0 - \ln\left[ \left(\frac{1 - \tau}{\tau}\right) \left(\frac{\bar y}{1 - \bar y}\right) \right] \tag{7} \]

which equals $\hat\beta_0$ only in randomly selected cross-sectional data. Of course, scholars are not normally interested in $\beta$ but rather in the probability that an event occurs, $\Pr(Y_i = 1 \mid \beta) = \pi_i = (1 + e^{-x_i \beta})^{-1}$, which requires good estimates of both $\beta_1$ and $\beta_0$. Epidemiologists and biostatisticians usually attribute prior correction to Prentice and Pyke (1979); econometricians attribute the result to Manski and Lerman (1977), who in turn credit an unpublished comment by Daniel McFadden. The result was well known previously in the special case of all discrete covariates (e.g., Bishop et al. 1975, p. 63) and has been shown to apply to other multiplicative intercept models (Hsieh et al. 1985, p. 659).

Prior correction requires knowledge of the fraction of ones in the population, $\tau$. Fortunately, $\tau$ is straightforward to determine in international conflict data since the number of conflicts is the subject of the study and the denominator, the population of countries or dyads, is easy to count even if not entirely in the analysis.[4]

A key advantage of prior correction is ease of use. Any statistical software that can estimate logit coefficients can be used, and Eq. (7) is easy to apply to the intercept. If the functional form and explanatory variables are correct, estimates are consistent and asymptotically efficient. The chief disadvantage of prior correction is that if the model is misspecified, estimates of both $\beta_0$ and $\beta_1$ are slightly less robust than weighting (Xie and Manski 1989), a method to which we now turn.

[4] King and Zeng (2000a), building on results of Manski (1999), modify the methods in this paper for the situation when $\tau$ is unknown or partially known. King and Zeng use “robust Bayesian analysis” to specify classes of prior distributions on $\tau$, representing full or partial ignorance. For example, the user can specify that $\tau$ is completely unknown or known, with some probability, to lie only in a given interval. The result is classes of posterior distributions (instead of a single posterior) that, in many cases, provide informative estimates of quantities of interest.

4.2 Weighting

An alternative procedure is to weight the data to compensate for differences in the sample ($\bar y$) and population ($\tau$) fractions of ones induced by choice-based sampling. The resulting weighted exogenous sampling maximum-likelihood estimator (due to Manski and Lerman 1977) is relatively simple. Instead of maximizing the log-likelihood in Eq. (5), we maximize the weighted log-likelihood

\[ \ln L_w(\beta \mid y) = w_1 \sum_{\{Y_i = 1\}} \ln(\pi_i) + [...] \]
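As an illustration of the prior correction in Eq. (7), the sketch below applies it to the intercept of a constant-only logit model, using the dyad counts quoted in the introduction (1042 ones among 303,814 observations). The function name and the choice to sample exactly as many zeros as ones are assumptions made for the example. In a constant-only logit the MLE of the intercept is the log-odds of $\bar y$, so the corrected intercept should reproduce the population log-odds of $\tau$.

```python
import math


def prior_corrected_intercept(beta0_hat, tau, ybar):
    """Apply Eq. (7): beta0_hat - ln[((1 - tau) / tau) * (ybar / (1 - ybar))]."""
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return beta0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


# Population counts quoted in the introduction.
n_population = 303814
n_ones = 1042
tau = n_ones / n_population

# Hypothetical case-control sample: all ones plus an equal number of zeros.
n_zeros_sampled = 1042
ybar = n_ones / (n_ones + n_zeros_sampled)

# In a constant-only logit the MLE of the intercept is the sample log-odds.
beta0_hat = math.log(ybar / (1.0 - ybar))
beta0_corrected = prior_corrected_intercept(beta0_hat, tau, ybar)
implied_probability = 1.0 / (1.0 + math.exp(-beta0_corrected))

print("uncorrected intercept:", beta0_hat)
print("corrected intercept:  ", beta0_corrected)
print("implied Pr(Y = 1):    ", implied_probability, "vs tau =", tau)
```

Without the correction, the equal-shares sample would imply an event probability of 0.5; after the correction the implied probability matches $\tau$. With explanatory variables in the model, the same adjustment is applied to the estimated intercept only, leaving the slope estimates unchanged, as the text describes.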
