
arXiv:2305.00118v1 [cs.CL] 28 Apr 2023

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Kent K. Chang, Mackenzie Cramer, Sandeep Soni and David Bamman*

University of California, Berkeley

{kentkchang, mackenzie.hanh, sandeepsoni, dbamman}@

Abstract

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.

1 Introduction

Research in cultural analytics at the intersection of NLP and narrative is often focused on developing algorithmic devices to measure some phenomenon of interest in literary texts (Piper et al., 2021; Yoder et al., 2021; Coll Ardanuy et al., 2020; Evans and Wilkens, 2018). The rise of large pretrained language models such as ChatGPT and GPT-4 has the potential to radically transform this space by both reducing the need for large-scale training data for new tasks and lowering the technical barrier to entry (Underwood, 2023).

At the same time, however, these models also present a challenge for establishing the validity of results, since few details are known about the data used to train them. As others have shown, the accuracy of such models is strongly dependent on the frequency with which a model has seen information in the training data, calling into question their ability to generalize (Razeghi et al., 2022; Kandpal et al., 2022a; Elazar et al., 2022); in addition, this phenomenon is exacerbated for larger models (Carlini et al., 2022; Biderman et al., 2023). Knowing what books a model has been trained on is critical to assess such sources of bias (Gebru et al., 2021), which can impact the validity of results in cultural analytics: if evaluation datasets contain memorized books, they provide a false measure of future performance on non-memorized books; without knowing what books a model has been trained on, we are unable to construct evaluation benchmarks that can be sure to exclude them.

* Details of author contributions listed in the appendix.

Wow. I sit down, fish the questions from my backpack, and go through them, inwardly cursing [MASK] for not providing me with a brief biography. I know nothing about this man I'm about to interview. He could be ninety or he could be thirty. → Kate (James, Fifty Shades of Grey).

Some days later, when the land had been moistened by two or three heavy rains, [MASK] and his family went to the farm with baskets of seed-yams, their hoes and machetes, and the planting began. → Okonkwo (Achebe, Things Fall Apart).

Figure 1: Name cloze examples. GPT-4 answers both of these correctly.

In this work, we carry out a data archaeology to infer books that are known to several of these large language models. This archaeology is a membership inference query (Shokri et al., 2017) in which we probe the degree of exact memorization (Tirumala et al., 2022) for a sample of passages from 571 works of fiction published between 1749 and 2020. This difficult name cloze task, illustrated in figure 1, has 0% human baseline performance.

This archaeology allows us to uncover a number of findings about the books known to OpenAI models which can impact downstream work in cultural analytics:

1. OpenAI models, and GPT-4 in particular, have memorized a wide collection of in-copyright books.

2. There are systematic biases in what books they have seen and memorized, strongly preferring science fiction/fantasy novels and bestsellers.

3. This bias aligns with that present in the general web, as reflected in search results from Google, Bing and C4. This confirms prior findings that duplication encourages memorization (Carlini et al., 2023) but also provides a rough diagnostic for assessing knowledge about a book.

4. Disparity in memorization leads to disparity in downstream tasks. GPT models perform better on memorized books than non-memorized books at predicting the year of first publication for a work and the duration of narrative time for a passage, and are more likely to generate character names from books they have seen.

While our work is focused on ChatGPT and GPT-4, we also uncover surprising findings about BERT: BookCorpus (Zhu et al., 2015), one of BERT's training sources, contains in-copyright materials by published authors, including E. L. James's Fifty Shades of Grey, Diana Gabaldon's Outlander and Dan Brown's The Lost Symbol, and BERT has memorized this material as well.

As researchers in cultural analytics are poised to use ChatGPT and GPT-4 for the empirical analysis of literature, our work both sheds light on the underlying knowledge in these models and illustrates the threats to validity in using them.

2 Related Work

Knowledge production and critical digital humanities. The archaeology of data that this work presents can be situated in the tradition of tool critiques in critical digital humanities (Berry, 2011; Fitzpatrick, 2012; Hayles, 2012; Ramsay and Rockwell, 2012; Ruth et al., 2022), where we critically approach the closedness and opacity of LLMs, which can pose significant issues if they are used to reason about literature (Elkins and Chun, 2020; Goodlad and Dimock, 2021; Henrickson and Meroño-Peñuela, 2022; Schmidgen et al., 2023; Elam, 2023). In this light, our archaeology of books shares the Foucauldian impulse to "find a common structure that underlies both the scientific knowledge and the institutional practices of an age" (Gutting, 1989, p. 79; Schmidgen et al., 2023).

LLMs figure in our knowledge production in significant ways, from the training of those models to the inference tasks that involve them. Investigating data membership helps us reflect on how to best use LLMs for large-scale historical and cultural analysis and evaluate the validity of its findings.

LLMs for cultural analysis. While they are more often the object of critique, LLMs are gaining prominence in large-scale cultural analysis as part of the methodology. Some focus on what GPTs can and cannot do: Henrickson and Meroño-Peñuela (2022) fine-tuned GPT-2 on scholarly texts related to literary hermeneutics; Elkins and Chun (2020) test the capabilities of GPT-2 and GPT-3 in the context of a college-level writing class. Some use GPTs to tackle problems related to interpretation, whether of real events (Hamilton and Piper, 2022) or of character roles (hero, villain, and victim; Stammbach et al., 2022). Others leverage LLMs on classic NLP tasks that can be used to shed light on literary and cultural phenomena, or otherwise maintain an interest in those humanistic domains (Ziems et al., 2023). Through our archaeology of data, this work shows what tasks and what types of questions LLMs are more suited to answer than others.

Documenting training data. Training on large text corpora such as BookCorpus (Zhu et al., 2015), C4 (Raffel et al., 2020) and the Pile (Gao et al., 2020) has been instrumental in extending the capability of large language models. Yet, in contrast to smaller datasets, these large corpora are less carefully curated. Besides a few attempts at documenting large datasets (e.g., C4; Dodge et al., 2021) or laying out diagnostic techniques for their quality (Swayamdipta et al., 2020), these corpora and their use in large models are less well understood. Our focus on books in this work is an attempt to empirically map the information about books that is present in these models.

Memorization. Large language models have shown impressive zero-shot or few-shot ability, but they also suffer from memorization (Elangovan et al., 2021; Lewis et al., 2021). While memorization is shown in some cases to improve generalization (Khandelwal et al., 2020), it has generally been shown to have negative consequences, including security and privacy risks (Carlini et al., 2021, 2023; Huang et al., 2022). Studies have quantified the level of memorization in large language models (e.g., Carlini et al., 2022; Mireshghallah et al., 2022) and have highlighted the role of both verbatim (e.g., Lee et al., 2022; Ippolito et al., 2022) and subtler forms of memorization (Zhang et al., 2021). Our analysis and experimental findings add to this scholarship but differ significantly in one key respect: our focus is on ChatGPT and GPT-4, black-box models whose training data or protocols are not fully specified.

You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: <name>Ponyboy</name>

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: <name>Gerty</name>

Input: My back's to the window. I expect a stranger, but it's [MASK] who pushes open the door, flicks on the light. I can't place that, unless he's one of them. There was always that possibility.

Output:

Figure 2: Sample name cloze prompt.

Data contamination. A related issue noted upon critical scrutiny of uncurated large corpora is data contamination (e.g., Magar and Schwartz, 2022), raising questions about the zero-shot capability of these models (e.g., Blevins and Zettlemoyer, 2022) and worries about security and privacy (Carlini et al., 2021, 2023). For example, Dodge et al. (2021) find that text from NLP evaluation datasets is present in C4, attributing performance gains partly to train-test leakage. Lee et al. (2022) show that C4 contains repeated long-running sentences and near-duplicates whose removal mitigates data contamination; similarly, deduplication has also been shown to alleviate the privacy risks in models trained on large corpora (Kandpal et al., 2022b).

3 Task

We formulate our task as a cloze: given some context, predict a single token that fills in a mask. To account for different texts being more predictable than others, we focus on a hard setting of predicting the identity of a single name in a passage of 40-60 tokens that contains no other named entities. Figure 1 illustrates two such examples.

In the following, we refer to this task as a name cloze, a version of exact memorization (Tirumala et al., 2022). Unlike other cloze tasks that focus on entity prediction for question answering/reading comprehension (Hill et al., 2015; Onishi et al., 2016), no names at all appear in the context to inform the cloze fill. In the absence of information about each particular book, this name should be nearly impossible to predict from the context alone; it requires knowledge not of English, but rather about the work in question. Predicting the most frequent name in the dataset ("Mary") yields an accuracy of 0.6%, and human performance (on a sample of 100 random passages) is 0%.

We construct this evaluation set by running BookNLP[1] over the dataset described below in §4, extracting all passages between 40 and 60 tokens with a single proper person entity and no other named entities. Each passage contains complete sentences, and does not cross sentence boundaries. We randomly sample 100 such passages per book, and exclude any books from our analyses with fewer than 100 such passages.
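The passage-selection step can be sketched as follows. This is a simplified illustration that assumes entity annotations in a BookNLP-like form (token spans with prop/cat labels such as PROP and PER); the function and field names are ours, not the authors' code, and the paper's exact filtering rules may differ in detail.

```python
import random
from dataclasses import dataclass

@dataclass
class Entity:
    start: int  # token index of the first token in the mention
    end: int    # token index of the last token in the mention (inclusive)
    prop: str   # BookNLP-style mention type: "PROP", "NOM", or "PRON"
    cat: str    # entity category, e.g. "PER", "LOC", "ORG"

def candidate_passages(sentences, entities, min_len=40, max_len=60, per_book=100, seed=0):
    """sentences: list of (start_tok, end_tok) spans for complete sentences, in order.
    Returns up to per_book passages of 40-60 tokens built from whole sentences that
    contain exactly one proper-name person mention and no other entity mentions."""
    keep = []
    for i in range(len(sentences)):
        for j in range(i, len(sentences)):
            start, end = sentences[i][0], sentences[j][1]
            length = end - start + 1
            if length > max_len:
                break  # adding more sentences only makes the span longer
            if length < min_len:
                continue
            inside = [e for e in entities if e.start >= start and e.end <= end]
            if len(inside) == 1 and inside[0].prop == "PROP" and inside[0].cat == "PER":
                keep.append((start, end, inside[0]))
    random.Random(seed).shuffle(keep)
    return keep[:per_book]
```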

We pass each passage through the prompt listed in figure 2, which is designed to elicit a single-word, proper-name response wrapped in XML tags; two short input/output examples are provided to illustrate the expected structure of the response.
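The instruction and the two in-context examples below are taken from Figure 2; the template string, the helper names, and the regex used to pull the answer out of the <name> tags are our own illustrative choices.

```python
import re

PROMPT_TEMPLATE = """You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: <name>Ponyboy</name>

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: <name>Gerty</name>

Input: {passage}
Output:"""

NAME_RE = re.compile(r"<name>(.*?)</name>", re.IGNORECASE | re.DOTALL)

def build_prompt(passage: str) -> str:
    # Insert the 40-60 token passage (containing its single [MASK]) as the final Input.
    return PROMPT_TEMPLATE.format(passage=passage)

def parse_name(response: str) -> str | None:
    # The prompt asks for the answer wrapped in <name>...</name> tags.
    m = NAME_RE.search(response)
    return m.group(1).strip() if m else None
```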

4 Data

We evaluate six sources of English-language fiction:

• 91 novels from LitBank, published before 1923.
• 90 Pulitzer Prize nominees from 1924–2020.
• 95 bestsellers from the NY Times and Publishers Weekly from 1924–2020.
• 101 novels written by Black authors, either from the Black Book Interactive Project[2] or Black Caucus American Library Association award winners from 1928–2018.
• 95 works of Global Anglophone fiction (outside the U.S. and U.K.) from 1935–2020.
• 99 works of genre fiction, containing science fiction/fantasy, horror, mystery/crime, romance and action/spy novels from 1928–2017.

[1] /booknlp/booknlp
[2] /novel-collections

Pre-1923 LitBank texts are born digital on Project Gutenberg and are in the public domain in the United States; all other sources were created by purchasing physical books, scanning them and OCR'ing them with Abbyy FineReader. As of the time of writing, books published after 1928 are generally in copyright in the U.S.

GPT-4  ChatGPT  BERT  Date  Author               Title
0.98   0.82     0.00  1865  Lewis Carroll        Alice's Adventures in Wonderland
0.76   0.43     0.00  1997  J. K. Rowling        Harry Potter and the Sorcerer's Stone
0.74   0.29     0.00  1850  Nathaniel Hawthorne  The Scarlet Letter
0.72   0.11     0.00  1892  Arthur Conan Doyle   The Adventures of Sherlock Holmes
0.70   0.10     0.00  1815  Jane Austen          Emma
0.65   0.19     0.00  1823  Mary W. Shelley      Frankenstein
0.62   0.13     0.00  1813  Jane Austen          Pride and Prejudice
0.61   0.35     0.00  1884  Mark Twain           Adventures of Huckleberry Finn
0.61   0.30     0.00  1853  Herman Melville      Bartleby, the Scrivener
0.61   0.08     0.00  1897  Bram Stoker          Dracula
0.61   0.18     0.00  1838  Charles Dickens      Oliver Twist
0.59   0.13     0.00  1902  Arthur Conan Doyle   The Hound of the Baskervilles
0.59   0.22     0.00  1851  Herman Melville      Moby Dick; Or, The Whale
0.58   0.35     0.00  1876  Mark Twain           The Adventures of Tom Sawyer
0.57   0.30     0.00  1949  George Orwell        1984
0.54   0.10     0.00  1908  L. M. Montgomery     Anne of Green Gables
0.51   0.20     0.01  1954  J. R. R. Tolkien     The Fellowship of the Ring
0.49   0.16     0.13  2012  E. L. James          Fifty Shades of Grey
0.49   0.24     0.01  1911  Frances H. Burnett   The Secret Garden
0.49   0.12     0.00  1883  Robert L. Stevenson  Treasure Island
0.49   0.16     0.00  1847  Charlotte Brontë     Jane Eyre: An Autobiography
0.49   0.22     0.00  1903  Jack London          The Call of the Wild

Table 1: Top 20 books by GPT-4 name cloze accuracy.

5 Results

5.1 ChatGPT/GPT-4

We pass all passages with the same prompt through both ChatGPT and GPT-4, using the OpenAI API. The total cost of this experiment with current OpenAI pricing ($0.002/thousand tokens for ChatGPT; $0.03/thousand tokens for GPT-4) is approximately $400. We measure the name cloze accuracy for a book as the fraction of 100 samples from it where the model being tested predicts the masked name correctly.
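A minimal sketch of the querying and scoring loop is below. The excerpt says only that the OpenAI API was used; the openai>=1.0 Python client, the model identifiers "gpt-4" and "gpt-3.5-turbo", and the temperature setting are our assumptions rather than details from the paper.

```python
import re
from collections import defaultdict
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_name(prompt: str, model: str = "gpt-4") -> str | None:
    resp = client.chat.completions.create(
        model=model,    # "gpt-3.5-turbo" for the ChatGPT condition (assumed identifiers)
        temperature=0,  # assumed; decoding settings are not given in this excerpt
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content or ""
    m = re.search(r"<name>(.*?)</name>", text, re.DOTALL)
    return m.group(1).strip() if m else None

def name_cloze_accuracy(examples, model: str = "gpt-4"):
    """examples: iterable of (book_id, prompt, gold_name); returns per-book accuracy,
    i.e. the fraction of that book's sampled passages answered correctly."""
    correct, total = defaultdict(int), defaultdict(int)
    for book, prompt, gold in examples:
        pred = predict_name(prompt, model)
        total[book] += 1
        correct[book] += int(pred is not None and pred.lower() == gold.lower())
    return {book: correct[book] / total[book] for book in total}
```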

Table 1 presents the top 20 books with the highest GPT-4 name cloze accuracy. While works in the public domain dominate this list, Table 6 in the Appendix presents the same for books published after 1928.[3] Of particular interest in this list is the dominance of science fiction and fantasy works, including Harry Potter, 1984, Lord of the Rings, Hunger Games, Hitchhiker's Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune: 12 of the top 20 most memorized books in copyright fall in this category. Table 2 explores this in more detail by aggregating the performance by the top-level categories described above, including the specific genre for genre fiction.

GPT-4 and ChatGPT are widely knowledgeable about texts in the public domain (included in pre-1923 LitBank); they know little about works of Global Anglophone fiction, works in the Black Book Interactive Project and Black Caucus American Library Association award winners.

Source                GPT-4  ChatGPT
pre-1923 LitBank      0.244  0.072
Genre: SF/Fantasy     0.235  0.108
Genre: Horror         0.054  0.028
Bestsellers           0.033  0.016
Genre: Action/Spy     0.032  0.007
Genre: Mystery/Crime  0.029  0.014
Genre: Romance        0.029  0.011
Pulitzer              0.026  0.011
Global                0.020  0.009
BBIP/BCALA            0.017  0.011

Table 2: Name cloze performance by book category.

[3] For complete results on all books, see https://github.com/bamman-group/gpt4-books.

5.2 BERT

For comparison, we also generate predictions for the masked token using BERT (only passing the passage through the model and not the prefaced instructions) to provide a baseline for how often a model would guess the correct name when simply functioning as a language model (unconstrained to generate proper names). As Table 1 illustrates, BERT's performance is near 0 for all books, except for Fifty Shades of Grey, for which it guesses the correct name 13% of the time, suggesting that this book was known to BERT during training.
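The BERT baseline can be sketched with the Hugging Face fill-mask pipeline. The bert-base-uncased checkpoint and the exact-match scoring below are assumptions, since the excerpt does not say which BERT variant or comparison rule was used.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")  # assumed checkpoint

def bert_cloze_correct(passage: str, gold_name: str) -> bool:
    # The passages already contain BERT's literal mask token, [MASK], and no
    # instructions are prepended: the model simply fills in the blank.
    top = unmasker(passage, top_k=1)[0]
    # bert-base-uncased has a lowercased vocabulary, so compare case-insensitively;
    # a name that needs several wordpieces can never be recovered from a single mask.
    return top["token_str"].strip().lower() == gold_name.lower()

print(bert_cloze_correct("Stay gold, [MASK], stay gold.", "Ponyboy"))
```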

Devlin et al. (2019) note that BERT was trained on Wikipedia and the BookCorpus, which Zhu et al. (2015) describe as "free books written by yet unpublished authors."[4] Manual inspection of the BookCorpus hosted by Hugging Face[5] confirms that Fifty Shades of Grey is present within it, along with several other published works, including Diana Gabaldon's Outlander and Dan Brown's The Lost Symbol.

[4] Fifty Shades of Grey was originally self-published online ca. 2009 before being published by Vintage Books in 2012.
[5] https://huggingface.co/datasets/bookcorpus

6 Analysis

6.1 Error analysis

We analyze examples on which ChatGPT and GPT-4 make errors to assess the impact of memorization. Specifically, we test the following question: when a model makes a name cloze error, is it more likely to offer a name from a memorized book than a non-memorized one?

To test this, we construct sets of seen (S) and unseen (U) character names by the models. To do this, we divide all books into three categories: M as books the model has memorized (top decile by GPT-4 name cloze accuracy), ¬M as books the model has not memorized (bottom decile), and H as books held out to test the hypothesis. We identify the true masked names that are most associated with the books in M, by calculating the positive pointwise mutual information between a name and book pair, to obtain set S, and the masked names most associated with books in ¬M to obtain set U. We also ensure that S and U are of the same size and have no overlap. Next, we calculate the observed statistic as the log-odds ratio on examples from H:

o = log( P(ĉ ∈ S) / P(ĉ ∈ U) ),

where ĉ is the predicted character. To test for statistical significance, we perform a randomization test (Dror et al., 2018), where the observed statistic o is compared to a distribution of the same statistic calculated by randomly shuffling the names between S and U.
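A minimal sketch of the observed statistic and the randomization test follows, assuming the model's predictions on passages from the held-out books H are available as a list of predicted names; the add-one smoothing and the number of permutations are our choices, not details given in the paper.

```python
import math
import random

def log_odds(preds, S, U):
    # o = log( P(pred in S) / P(pred in U) ), estimated from counts over H;
    # add-one smoothing (an assumption) avoids division by zero.
    in_s = sum(p in S for p in preds)
    in_u = sum(p in U for p in preds)
    return math.log((in_s + 1) / (in_u + 1))

def randomization_test(preds, S, U, n_perm=10_000, seed=0):
    """preds: predicted names on held-out examples; S, U: disjoint, equal-sized name sets."""
    rng = random.Random(seed)
    observed = log_odds(preds, S, U)
    names = list(S | U)
    k = len(S)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(names)
        # Reassign names to S and U at random and recompute the statistic.
        if log_odds(preds, set(names[:k]), set(names[k:])) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```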

We find that for both ChatGPT (o = 1.34, p < 0.0001) and GPT-4 (o = 1.37, p < 0.0001), the null hypothesis can be rejected, indicating that both models are more likely to predict a character name from a book they have memorized than a character from a book they have not. This has important consequences: these models do not simply perform better on a set of memorized books, but the information from those books bleeds out into other narrative contexts.

6.2 Extrinsic analysis

Why do the GPT models know about some books more than others? As others have shown, duplicated content in training is strongly correlated with memorization (Carlini et al., 2023). While the training data for ChatGPT and GPT-4 is unknown, it likely involves data scraped from the web, as with prior models' use of WebText and C4. To what degree is a model's performance for a book in our name cloze task correlated with the number of copies of that book on the open web? We assess this using four sources: Google and Bing search engine results, C4, and the Pile.

For each book in our dataset, we sample 10 passages at random from our evaluation set and select a 10-gram from each; we then query each platform to find the number of search results that match that exact string. We use the custom search API for Google,[6] the Bing Web Search API[7] and indexes to C4 and the Pile by AI2 (Dodge et al., 2021).[8]

Table 3 lists the results of this analysis, displaying the correlation (Spearman ρ) between GPT-4 name cloze accuracy for a book and the average number of search results for all query passages from it.

[6] /custom-search/v1/overview
[7] /en-us/bing/apis/bing-web-search-api
[8]
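For the Google source, one way to obtain a hit count for a sampled 10-gram is the Custom Search JSON API; the sketch below is an assumption about how such a lookup could be done (the key, engine id, and the use of the exactTerms parameter are ours), and the Bing, C4 and Pile lookups would follow the same pattern against their respective APIs or indexes.

```python
import requests

API_KEY = "..."    # placeholder credential for the Custom Search JSON API
ENGINE_ID = "..."  # placeholder programmable search engine id (cx)

def exact_phrase_hits(ten_gram: str) -> int:
    """Return the reported number of results containing the exact 10-gram."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": ten_gram, "exactTerms": ten_gram},
        timeout=30,
    )
    resp.raise_for_status()
    return int(resp.json()["searchInformation"]["totalResults"])
```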

Date       Google  Bing  C4    Pile
pre-1928   0.74    0.70  0.71  0.84
post-1928  0.37    0.41  0.36  0.21

Table 3: Correlation (Spearman ρ) between GPT-4 name cloze accuracy and number of search results in Google, Bing, C4 and the Pile.

For works in the public domain (published before 1928), we see a strong and significant (p < 0.001) correlation between GPT-4 name cloze accuracy and the number of search results across all sources. Google, for instance, contains an average of 2,590 results for 10-grams from Alice in Wonderland, 1,100 results from Huckleberry Finn and 279 results from Anne of Green Gables. Works in copyright (published after 1928) show up frequently on the web as well. While the correlation is not as strong as for public domain texts (in part reflecting the smaller number of copies), it is strongly significant as well (p < 0.001). Google again has an average of 3,074 search results across the ten 10-grams we query for Harry Potter and the Sorcerer's Stone, 92 results for The Hunger Games and 41 results for A Game of Thrones.
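The statistic in Table 3 is a rank correlation between two per-book quantities, which scipy computes directly; the three toy inputs below reuse accuracies from Table 1 and the average Google hit counts quoted above, only to show the call.

```python
from scipy.stats import spearmanr

accuracy  = {"alice": 0.98, "huck_finn": 0.61, "anne": 0.54}       # GPT-4 name cloze accuracy (Table 1)
mean_hits = {"alice": 2590.0, "huck_finn": 1100.0, "anne": 279.0}  # mean Google results per queried 10-gram

books = sorted(accuracy)
rho, p = spearmanr([accuracy[b] for b in books], [mean_hits[b] for b in books])
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```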

Domain          Hits
                337
                257
goodreads.com   234
                197
                181
                148
fliphtml5.com   124
                118
                109
                98

Table 4: Sources for copyrighted material.

Table 4 lists the most popular sources for copyrighted material. Notably, this list does not only include sources where the full text is freely available for download as a single document, but also sources where smaller snippets of the texts appear.
