
The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic

Abstract

We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves state-of-the-art results on evaluations like GPQA [1], MMLU [2], MMMU [3] and many more. Claude 3 Haiku performs as well or better than Claude 2 [4] on most pure-text tasks, while Sonnet and Opus significantly outperform it. Additionally, these models exhibit improved fluency in non-English languages, making them more versatile for a global audience. In this report, we provide an in-depth analysis of our evaluations, focusing on core capabilities, safety, societal impacts, and the catastrophic risk assessments we committed to in our Responsible Scaling Policy [5].

1 Introduction

This model card introduces the Claude 3 family of models, which set new industry benchmarks across reasoning, math, coding, multi-lingual understanding, and vision quality.

Like its predecessors, Claude 3 models employ various training methods, such as unsupervised learning and Constitutional AI [6]. These models were trained using hardware from Amazon Web Services (AWS) and Google Cloud Platform (GCP), with core frameworks including PyTorch [7], JAX [8], and Triton [9].

A key enhancement in the Claude 3 family is multimodal input capabilities with text output, allowing users to upload images¹ (e.g., tables, graphs, photos) along with text prompts for richer context and expanded use cases, as shown in Figure 1 and Appendix B.1.
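
To make the multimodal input path concrete, here is a minimal sketch of sending an image alongside a text prompt, assuming the Anthropic Python SDK and a hypothetical local file chart.png; it illustrates the request shape rather than any specific integration described in this report.

```python
import base64

import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode a hypothetical local image as base64 for the request body.
with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

# One user turn can mix an image block and a text block.
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
)

print(message.content[0].text)
```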

The model family also excels at tool use, also known as function calling, allowing seamless integration of Claude's intelligence into specialized applications and custom workflows.
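
Below is a minimal sketch of this tool-use (function-calling) pattern, assuming the Anthropic Messages API and a hypothetical get_weather tool; the model may respond with a tool_use block that the calling application is expected to execute and answer.

```python
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()

# A hypothetical tool definition; the model only sees this name, description, and schema.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is the weather in Tokyo right now?"}],
)

# If the model decided to call the tool, its request appears as a tool_use content block;
# the application runs the tool and returns the result in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```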

Claude 3 Opus, our most intelligent model, sets a new standard on measures of reasoning, math, and coding. Both Opus and Sonnet demonstrate increased proficiency in nuanced content creation, analysis, forecasting, accurate summarization, and handling scientific queries. These models are designed to empower enterprises to automate tasks, generate revenue through user-facing applications, conduct complex financial forecasts, and expedite research and development across various sectors. Claude 3 Haiku is the fastest and most affordable option on the market for its intelligence category, while also including vision capabilities. The entire Claude 3 family improves significantly on previous generations for coding tasks and fluency in non-English languages like Spanish and Japanese, enabling use cases like translation services and broader global utility.

Developed by Anthropic and announced in March 2024, the Claude 3 model family will be available in our consumer offerings (Claude.ai, Claude Pro) as well as enterprise solutions like the Anthropic API, Amazon Bedrock, and Google Vertex AI. The knowledge cutoff for the Claude 3 models is August 2023.

This model card is not intended to encompass all of our research. For comprehensive insights into our training and evaluation methodologies, we invite you to explore our research papers (e.g., Challenges in Evaluating AI Systems [10], Red Teaming Language Models to Reduce Harms [11], Capacity for Moral Self-Correction in Large Language Models [12], Towards Measuring the Representation of Subjective Global Opinions in Language Models [13], Frontier Threats Red Teaming for AI Safety [14], and our Responsible Scaling Policy [5] to address catastrophic risks). In addition to our public research, we are also committed to sharing findings and best practices across industry, government, and civil society, and regularly engage with these stakeholders to share insights and best practices. We expect to release new findings as we continue our research and evaluations of frontier models.

¹ We support JPEG/PNG/GIF/WebP, up to 10 MB and 8000x8000 px. We recommend avoiding small or low-resolution images.

2 Model Details

2.1 Intended Uses

Claude is trained to be a helpful, honest, and harmless assistant. Claude models excel at open-ended conversation and collaboration on ideas, and also perform exceptionally well in coding tasks and when working with text, whether searching, writing, editing, outlining, or summarizing.²

² For more information and advice on prompt design, please see our documentation at /claude/docs/introduction-to-prompt-design.

The Claude 3 family's multi-modal features can interpret visual input (e.g., charts, graphs, and photos) to support additional use cases and productivity. Claude models have a helpful, conversational tone and can take direction on "personality." Users have described them as feeling steerable, adaptive, and engaging.

Claude uses all the text that users input (the prompt) and all the text it has generated so far within the conversation to predict the next words or tokens that would be most helpful. This means that Claude constructs its responses one set of characters at a time, in order. It cannot go back and edit its responses after they have been constructed unless users give it a chance to do so in a subsequent prompt. Claude can also only see (and make predictions on) what appears in its context window. It can't remember previous separate conversations unless users reinsert such material in the prompt, nor can it open links.
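
The toy sketch below illustrates this autoregressive, window-limited decoding loop in the abstract; model_next_token and the tiny window size are stand-ins for illustration only, not a description of Claude's internals.

```python
from collections import deque

CONTEXT_WINDOW = 8  # toy limit; production models use vastly larger windows

def model_next_token(context: list[str]) -> str:
    """Stand-in for the model: chooses a next token from the visible context."""
    return "indeed." if context and context[-1].endswith("?") else "words"

def generate(prompt_tokens: list[str], max_new_tokens: int) -> list[str]:
    # Only the most recent CONTEXT_WINDOW tokens are ever visible to the model.
    window = deque(prompt_tokens, maxlen=CONTEXT_WINDOW)
    output: list[str] = []
    for _ in range(max_new_tokens):
        token = model_next_token(list(window))  # predict from the visible context
        output.append(token)   # tokens are emitted strictly in order...
        window.append(token)   # ...and join the context; earlier tokens are never revised
    return output

print(generate(["Is", "this", "helpful?"], max_new_tokens=3))
```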

2.2 Unintended Uses

The models should not be used on their own in high-stakes situations where an incorrect answer could cause harm. For example, while Claude models could support a lawyer or doctor, they should not be deployed instead of one, and any responses should still be reviewed by a human. Claude models do not currently search the web (though users can ask them to interact with a document that they share directly), and the models only answer questions using data up to mid-2023. Claude models can be connected to search tools and are thoroughly trained to utilize them (over the web or other databases), but unless specifically indicated, it should be assumed that Claude models are not using this capability. Claude models have multilingual capabilities but perform less strongly on low-resource languages (see our multilingual evaluations in Section 5.6 below for more details).

2.3 Prohibited Uses

Our Acceptable Use Policy (AUP) [15] includes details on prohibited use cases. These prohibited uses include, but are not limited to, political campaigning or lobbying, surveillance, social scoring, criminal justice decisions, law enforcement, and decisions related to financing, employment, and housing. The AUP also outlines additional safety requirements for business uses, such as requiring disclosure that an AI system is being used and outlining what its capabilities and limitations are. The AUP also details which use cases require implementing human-in-the-loop measures.

The AUP applies to both image and text prompts, and all Anthropic users must read and affirmatively acknowledge the AUP before accessing Claude models. We regularly review and update the AUP to ensure that our product is as safe and trustworthy as possible.

2.4 Safeguarding Against Misuse

Detecting and mitigating prohibited uses of our technology are essential to preventing bad actors from misusing our models to generate abusive, deceptive, or misleading content. We use automated systems to detect violations of our AUP as they occur in real time. User prompts that are flagged as violating the AUP trigger an instruction to our models to respond even more cautiously. In cases where the user prompt is particularly severe or harmful, we will block the model from responding altogether, and in the case of repeated violations, we may terminate the user's Claude access.

2.5 Training Data

Claude 3 models are trained on a proprietary mix of publicly available information on the Internet as of August 2023, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, and data we generate internally. We employ several data cleaning and filtering methods, including deduplication and classification. The Claude 3 suite of models has not been trained on any user prompt or output data submitted to us by users or customers, including free users, Claude Pro users, and API customers.

When Anthropic obtains data by crawling public web pages, we follow industry practices with respect to robots.txt instructions and other signals that website operators use to indicate whether they permit crawling of the content on their sites. In accordance with our policies, Anthropic's crawler does not access password-protected or sign-in pages or bypass CAPTCHA controls, and we conduct diligence on the data that we use. Anthropic operates its crawling system transparently, which means website operators can easily identify Anthropic visits and signal their preferences to Anthropic.
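
For illustration only, the generic snippet below shows one standard way a crawler can consult robots.txt before fetching a page, using Python's built-in urllib.robotparser; it is a sketch of the general practice, not a description of Anthropic's crawler.

```python
from urllib import robotparser

# Generic example: consult a site's robots.txt before fetching a URL.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's crawling rules

url = "https://example.com/some/page.html"
if rp.can_fetch("ExampleCrawler", url):
    print("robots.txt permits fetching:", url)
else:
    print("robots.txt disallows fetching:", url)
```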

2.6 Training Process

Claude was trained with a focus on being helpful, harmless, and honest. Training techniques include pretraining on large diverse data to acquire language capabilities through methods like word prediction, as well as human feedback techniques that elicit helpful, harmless, honest responses. Anthropic used a technique called Constitutional AI [16] to align Claude with human values during reinforcement learning by explicitly specifying rules and principles based on sources like the UN Declaration of Human Rights.

With Claude 3 models, we have added an additional principle to Claude's constitution to encourage respect for disability rights, sourced from our research on Collective Constitutional AI [17]. Some of the human feedback data used to finetune Claude was made public [18] alongside our RLHF [19] and red-teaming research.

Once our models are fully trained, we run a suite of evaluations for safety. Our Trust and Safety team also runs continuous classifiers to monitor prompts and outputs for harmful, malicious use cases that violate our AUP. See more on both in the evaluations sections below.

2.7 Release Decisions and Maintenance

We take a number of concrete steps to responsibly develop and deploy AI systems, drawing on guidance from the NIST AI Risk Management Framework and its Map, Measure, Manage, and Govern subcategories [20]. We clearly document the ways in which our products may and may not be used, as well as the limitations and potential risks of using our products. We regularly evaluate our systems through interactive red teaming, as well as assessments against benchmarks for both product performance and potential safety risks. To manage potential risks, we incrementally roll out access to our products to ensure their safety and reliability; use a combination of automated monitoring for potential harms and violations of our AUP, as well as human review to audit the accuracy of our classifiers; and regularly update our models to versions that have been hardened against newly identified risks and potential vulnerabilities.

We also treat sensitive data and the personal information of the end users of our products and services with great care. We implement retention policies to ensure that our storage of personal and sensitive information is proportionate to the need for the data, such as to monitor and improve our Trust and Safety processes. For our consumer products and use of our website, our privacy policy [21] shares additional details on data privacy, use, and retention.

We also follow our Responsible Scaling Policy, which guides our development and deployment of increasingly capable AI systems, as described below. As a Public Benefit Corporation (PBC), we are focused on the safe development and deployment of AI systems at all levels of the organization, up to and including our executive leadership team.


3 Security

We protect the security of the environment of our models to help ensure their integrity using a variety of connection authentication and authorization techniques; people are required to use multi-factor authentication at all times. Our advanced models are protected by two-party controls. Access to AI model infrastructure is granted explicitly per user and validated per access attempt. All accounts with access to the serving infrastructure hosting our services are protected via rigorous password requirements and multi-factor authentication. Each account is provisioned with the minimum privilege levels needed by its owner. Additional layers of defense include continuous systems monitoring, 24/7 alert response, endpoint hardening, data storage and sharing controls, personnel vetting, and physical security hardening. We take significant care in testing any code changes prior to deployment to production environments, including code review. Finally, we engage with penetration testers to exercise our detection systems and improve our defense posture.

4 Social Responsibility

As a PBC, Anthropic is committed to developing safe and responsible AI systems throughout each stage of the development process. Claude 3 models show a more nuanced understanding of requests, recognize real harm, and refuse to answer harmless prompts less often than prior models. That said, they can still make mistakes, and our work to make Claude more helpful, harmless, and honest is ongoing. Ethical considerations also shape both our AUP, which delineates permissible and impermissible uses of Claude, and the Trust and Safety processes that enforce it.

4.1 Constitutional AI

Our core research focus has been training Claude models to be helpful, honest, and harmless. Currently, we do this by giving models a Constitution – a set of ethical and behavioral principles that the model uses to guide its outputs. The majority of the principles in Claude's constitution are the same as those we published in May 2023 [6]. Using this Constitution, models are trained to avoid sexist, racist, and toxic outputs, as well as to avoid helping a human engage in illegal or unethical activities. In response to our work on Collective Constitutional AI [17], we added an additional principle informed by our public input process, which instructs Claude to be understanding of and accessible to individuals with disabilities, resulting in lower model stereotype bias.

4.2 Labor

Anthropic works with several data work platforms, which are responsible for engaging and managing data workers who work on Anthropic's projects.

Data work tasks include selecting preferred model outputs in order to train AI models to align with those preferences; evaluating model outputs according to a broad range of criteria (e.g., accuracy, helpfulness, harmlessness, etc.); and adversarially testing (i.e., red teaming) our models to identify potential safety vulnerabilities. This data work is primarily used in our technical safety research, and select aspects of it are also used in our model training.

4.3 Sustainability

We offset our emissions (including from our cloud computing usage) and work with cloud providers that prioritize renewable energy and carbon neutrality. Anthropic works to fully offset our operational carbon emissions each year, partnering with external experts to conduct a rigorous analysis of our company-wide carbon footprint. Once measured, we invest in verified carbon credits to fully offset our annual footprint. Our credits directly fund emissions reduction projects. Our goal is to maintain net-zero climate impact on an annual basis through such initiatives and offsets.

5 Core Capabilities Evaluations

We conducted a comprehensive evaluation of the Claude 3 family to analyze trends in their capabilities across various domains. Our assessment included several broad categories:


• Reasoning: Benchmarks in this category require mathematical, scientific, and commonsense reasoning, testing the models' ability to draw logical conclusions and apply knowledge to real-world scenarios.

• Multilingual: This category comprises tasks for translation, summarization, and reasoning in multiple languages, evaluating the models' linguistic versatility and cross-lingual understanding.

• Long Context: These evaluations are focused on question answering and retrieval, assessing the models' performance in handling extended texts and extracting relevant information.

• Honesty / Factuality: Questions in this category assess the models' ability to provide accurate and reliable responses, either in terms of factual accuracy or fidelity to provided source materials. When unsure, the models are expected to be honest about their limitations, expressing uncertainty or admitting that they do not have sufficient information to provide a definitive answer.

• Multimodal: Evaluations include questions on science diagrams, visual question answering, and quantitative reasoning based on images.

These capabilities evaluations helped measure the models' skills, strengths, and weaknesses across a range of tasks. Many of these evaluations are industry standard, and we have invested in additional evaluation techniques and topics described below. We also present internal benchmarks we've developed over the course of training to address issues with harmless refusals.

5.1 Reasoning, Coding, and Question Answering

We evaluated the Claude 3 family on a series of industry-standard benchmarks covering reasoning, reading comprehension, math, science, and coding. The Claude 3 models demonstrate superior capabilities in these areas, surpassing previous Claude models, and in many cases achieving state-of-the-art results. These improvements are highlighted in our results presented in Table 1.

We tested our models on challenging domain-specific questions in GPQA [1], MMLU [2], ARC-Challenge [22], and PubMedQA [23]; math problem solving in both English (GSM8K, MATH) [24, 25] and multilingual settings (MGSM) [26]; common-sense reasoning in HellaSwag [27] and WinoGrande [28]; reasoning over text in DROP [29]; reading comprehension in RACE-H [30] and QuALITY [31] (see Table 6); coding in HumanEval [32], APPS [33], and MBPP [34]; and a variety of tasks in BIG-Bench-Hard [35, 36].

GPQA (A Graduate-Level Google-Proof Q&A Benchmark) is of particular interest because it is a new evaluation released in November 2023 with difficult questions focused on graduate-level expertise and reasoning. We focus mainly on the Diamond set, as it was selected by identifying questions where domain experts agreed on the solution, but experts from other domains could not successfully answer the questions despite spending more than 30 minutes per problem, with full internet access. We found the GPQA evaluation to have very high variance when sampling with chain-of-thought at T=1. In order to reliably evaluate scores on the Diamond set 0-shot CoT (50.4%) and 5-shot CoT (53.3%), we compute the mean over 10 different evaluation rollouts. In each rollout, we randomize the order of the multiple-choice options. We see that Claude 3 Opus typically scores around 50% accuracy. This improves greatly on prior models but falls somewhat short of graduate-level domain experts, who achieve accuracy scores in the 60-80% range [1] on these questions.
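
A minimal sketch of this rollout-averaging procedure is shown below; ask_model is a stand-in for drawing one chain-of-thought answer at T=1, and scoring is reduced to exact option matching.

```python
import random
import statistics

def ask_model(question: str, options: list[str]) -> int:
    """Stand-in: returns the index of the option chosen by one CoT sample at T=1."""
    return random.randrange(len(options))

def one_rollout(dataset, seed: int) -> float:
    """Score one pass over the question set with freshly shuffled answer options."""
    rng = random.Random(seed)
    correct = 0
    for question, options, answer_idx in dataset:
        order = list(range(len(options)))
        rng.shuffle(order)                       # randomize option order per rollout
        shuffled = [options[i] for i in order]
        pick = ask_model(question, shuffled)
        if order[pick] == answer_idx:            # map the pick back to the original key
            correct += 1
    return correct / len(dataset)

def mean_accuracy(dataset, n_rollouts: int = 10) -> float:
    # Report the mean over independent rollouts to reduce sampling variance.
    return statistics.mean(one_rollout(dataset, seed) for seed in range(n_rollouts))
```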

We leverage majority voting [37] at test time to evaluate performance by asking models to solve each problem using chain-of-thought reasoning (CoT) [38] N different times, sampling at T=1, and then we report the answer that occurs most often. When we evaluate in this way in a few-shot setting, Maj@32 Opus achieves a score of 73.7% for MATH and 59.5% for GPQA. For the latter, we averaged over 10 iterations of Maj@32, as even with this evaluation methodology there was significant variance (with some rollouts scoring in the low 60s, and others in the mid-to-high 50s).
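
The counting step of Maj@N can be sketched as follows, with the sampler passed in as a stand-in for drawing one chain-of-thought answer at T=1.

```python
import random
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[str], str], problem: str, n: int = 32) -> str:
    """Maj@N: draw n chain-of-thought samples and return the most frequent final answer."""
    answers = [sample_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage: a sampler that is right only ~60% of the time still usually wins the vote.
toy_sampler = lambda problem: "42" if random.random() < 0.6 else random.choice(["41", "43"])
print(majority_vote(toy_sampler, "toy problem", n=32))
```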


| Benchmark | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku | GPT-4³ | GPT-3.5³ | Gemini 1.0 Ultra⁴ | Gemini 1.5 Pro⁴ | Gemini 1.0 Pro⁴ |
|---|---|---|---|---|---|---|---|---|
| MMLU (General reasoning), 5-shot | 86.8% | 79.0% | 75.2% | 86.4% | 70.0% | 83.7% | 81.9% | 71.8% |
| MMLU (General reasoning), 5-shot CoT | 88.2% | 81.5% | 76.7% | | | | | |
| MATH⁵ (Mathematical problem solving), 4-shot | 61% | 40.5% | 40.9% | 52.9%⁶,⁷ | 34.1% | 53.2% | 58.5% | 32.6% |
| MATH, 0-shot | 60.1% | 43.1% | 38.9% | 42.5% (from [39]) | | | | |
| MATH, Maj@32, 4-shot | 73.7% | 55.1% | 50.3% | | | | | |
| GSM8K (Grade school math) | 95.0% (0-shot CoT) | 92.3% (0-shot CoT) | 88.9% (0-shot CoT) | 92.0% (SFT, 5-shot CoT) | 57.1% (5-shot) | 94.4% (Maj1@32) | 91.7% (11-shot) | 86.5% (Maj1@32) |
| HumanEval (Python coding tasks), 0-shot | 84.9% | 73.0% | 75.9% | 67.0%⁶ | 48.1% | 74.4% | 71.9% | 67.7% |
| GPQA (Diamond) (Graduate-level Q&A), 0-shot CoT | 50.4% | 40.4% | 33.3% | 35.7% (from [1]) | 28.1% (from [1]) | | | |
| GPQA (Diamond), Maj@32, 5-shot CoT | 59.5% | 46.3% | 40.1% | | | | | |
| MGSM (Multilingual math) | 90.7% (0-shot) | 83.5% (0-shot) | 75.1% (0-shot) | 74.5%⁷ (8-shot) | | 79.0% (8-shot) | 88.7% (8-shot) | 63.5% (8-shot) |
| DROP (Reading comprehension, arithmetic), F1 score | 83.1 (3-shot) | 78.9 (3-shot) | 78.4 (3-shot) | 80.9 (3-shot) | 64.1 (3-shot) | 82.4 (variable shots) | 78.9 (variable shots) | 74.1 (variable shots) |
| BIG-Bench-Hard (Mixed evaluations), 3-shot CoT | 86.8% | 82.9% | 73.7% | 83.1%⁷ | 66.6% | 83.6% | 84.0% | 75.0% |
| ARC-Challenge (Common-sense reasoning), 25-shot | 96.4% | 93.2% | 89.2% | 96.3% | 85.2% | | | |
| HellaSwag (Common-sense reasoning), 10-shot | 95.4% | 89.0% | 85.9% | 95.3% | 85.5% | 87.8% | 92.5% | 84.7% |
| PubMedQA⁸ (Biomedical questions), 5-shot | 75.8% | 78.3% | 76.0% | 74.4% | 60.2% | | | |
| PubMedQA, 0-shot | 74.9% | 79.7% | 78.5% | 75.2% | 71.6% | | | |
| WinoGrande (Common-sense reasoning), 5-shot | 88.5% | 75.1% | 74.2% | 87.5% | | | | |
| RACE-H (Reading comprehension), 5-shot | 92.9% | 88.8% | 87.0% | | | | | |
| APPS (Python coding tasks), 0-shot | 70.2% | 55.9% | 54.8% | | | | | |
| MBPP (Code generation), Pass@1 | 86.4% | 79.4% | 80.4% | | | | | |

Table 1: We show evaluation results for reasoning, math, coding, reading comprehension, and question answering. More results on GPQA are given in Table 8.

³ All GPT scores reported in the GPT-4 Technical Report [40], unless otherwise stated.

⁴ All Gemini scores reported in the Gemini Technical Report [41] or the Gemini 1.5 Technical Report [42], unless otherwise stated.

⁵ Claude 3 models were evaluated using chain-of-thought prompting.

⁶ Researchers have reported higher scores [43] for a newer version of GPT-4T.

⁷ GPT-4 scores on MATH (4-shot CoT), MGSM, and BIG-Bench-Hard were reported in the Gemini Technical Report [41].

⁸ PubMedQA scores for GPT-4 and GPT-3.5 were reported in [44].


| | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku | GPT-4³ | GPT-3.5³ |
|---|---|---|---|---|---|
| LSAT, 5-shot CoT | 161 | 158.3 | 156.3 | 163 | 149 |
| MBE, 0-shot CoT | 85% | 71% | 64% | 75.7% (from [51]) | 45.1% (from [51]) |
| AMC 12⁹, 5-shot CoT | 63/150 | 27/150 | 48/150 | 60/150 | 30/150 |
| AMC 10⁹, 5-shot CoT | 72/150 | 24/150 | 54/150 | 36/150¹⁰ | 36/150 |
| AMC 8⁹, 5-shot CoT | 84/150 | 54/150 | 36/150 | | |
| GRE (Quantitative), 5-shot CoT | 159 | | | 163 | 147 |
| GRE (Verbal), 5-shot CoT | 166 | | | 169 | 154 |
| GRE (Writing), k-shot CoT | 5.0 (2-shot) | | | 4.0 (1-shot) | 4.0 (1-shot) |

Table 2: This table shows evaluation results for the LSAT, the MBE (multistate bar exam), high school math contests (AMC), and the GRE General Test. The number of shots used for GPT evaluations is inferred from Appendix A.3 and A.8 of [40].

5.2 Standardized Tests

We evaluated the Claude 3 family of models on the Law School Admission Test (LSAT) [45], the Multistate Bar Exam (MBE) [46], the American Mathematics Competition [47] 2023 math contests, and the Graduate Record Exam (GRE) General Test [48]. See Table 2 for a summary of results.

We obtained LSAT scores for Claude 3 family models by averaging the scaled score of 3 Official LSAT Practice Tests: PT 89 from Nov 2019, and PT 90 and PT 91 from May 2020. We generated few-shot examples using PT 92 and PT 93 from June 2020. For the MBE, or bar exam, we used NCBE's official 2021 MBE practice exam [49].

We tested our models on all 150 official AMC 2023 problems (50 each from AMC 8, 10, and 12) [47]. Because of high variance, we sampled answers to each question five times at T=1, and report the overall percent answered correctly for each exam multiplied by 150. Official AMC exams have 25 questions, and contestants earn 6 points for correct answers, 1.5 points for skipped questions, and 0 points for incorrect answers, for a maximum possible score of 150.
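
As a small worked example of that conversion, using assumed counts rather than real results: the reported score is simply the fraction of sampled answers that are correct, multiplied by 150.

```python
# Assumed counts for illustration only (not real results).
num_questions = 25          # one official AMC exam
samples_per_question = 5    # each question sampled 5 times at T=1
correct_samples = 60        # hypothetical number of correct answers across all samples

fraction_correct = correct_samples / (num_questions * samples_per_question)  # 0.48
reported_score = fraction_correct * 150                                      # 72.0
print(reported_score)
```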

Our score for Claude Opus was obtained on the Educational Testing Service's official GRE Practice Test 2, with few-shot examples from the official GRE Practice Test 1 [50].

5.3 Vision Capabilities

The Claude 3 family of models are multimodal (image and video-frame input) and have demonstrated significant progress in tackling complex multimodal reasoning challenges that go beyond simple text comprehension.

A prime example is the models' performance on the AI2D science diagram benchmark [52], a visual question answering evaluation that involves diagram parsing and answering corresponding questions in a multiple-choice format. Claude 3 Sonnet reaches the state of the art with 89.2% in the 0-shot setting, followed by Claude 3 Opus (88.3%) and Claude 3 Haiku (80.6%) (see Table 3).

All the results in Table 3 have been obtained by sampling at temperature T=0. For AI2D, some images were upsampled such that their longer edges span 800 pixels while preserving their aspect ratios. This upsampling method yielded a 3-4% improvement in performance. For MMMU, we also report Claude 3 models' performance per discipline in Table 3.
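
A minimal sketch of that preprocessing step, assuming Pillow, is shown below: the longer edge is scaled up to 800 pixels while preserving the aspect ratio, and images already at or above the target are left unchanged.

```python
from PIL import Image  # assumes Pillow is installed

def upsample_long_edge(image: Image.Image, target: int = 800) -> Image.Image:
    """Upsample so the longer edge spans `target` pixels, preserving aspect ratio."""
    width, height = image.size
    long_edge = max(width, height)
    if long_edge >= target:
        return image  # only upsample; larger images are left unchanged
    scale = target / long_edge
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size)  # default bicubic resampling

# Example: a 400x300 diagram becomes 800x600 before evaluation.
print(upsample_long_edge(Image.new("RGB", (400, 300))).size)  # (800, 600)
```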

Figure 1 shows Claude 3 Opus reading and analyzing a chart, and Appendix B includes some additional vision examples.

⁹ For AMC 10 and 12, we evaluated our models on Set A and B for the 2023 exam. For AMC 8, we evaluated our models on the 25-question 2023 exam. GPT scores are for the 2022 exams.

¹⁰ GPT-4 outperforms GPT-4V on AMC 10 [40]; we report the higher score here.


| | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku | GPT-4V¹¹ | Gemini 1.0 Ultra⁴ | Gemini 1.5 Pro⁴ | Gemini 1.0 Pro⁴ |
|---|---|---|---|---|---|---|---|
| MMMU [3] (val) | | | | | | | |
| → Art & Design | 67.5% | 61.7% | 60.8% | 65.8% | 70.0% | | |
| → Business | 67.2% | 58.2% | 52.5% | 59.3% | 56.7% | | |
| → Science | 48.9% | 37.1% | 37.1% | 54.7% | 48.0% | | |
| → Health & Medicine | 61.1% | 57.1% | 52.3% | 64.7% | 67.3% | | |
| → Humanities & | | | | | | | |
