Technical Analysis of OpenAI's Sora Video Generation Model (2024)
OpenAI explored large-scale training of generative models on video data. Specifically, the researchers jointly trained a text-conditional diffusion model on videos and images of variable duration, resolution, and aspect ratio, using a transformer architecture that operates on spacetime patches of video and image latent codes. The largest model, Sora, is capable of generating up to a minute of high-fidelity video. OpenAI argues that these results suggest scaling video generation models is a promising path toward building general-purpose simulators of the physical world.

The technical report focuses on (1) a method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) a qualitative evaluation of Sora's capabilities and limitations. Unfortunately, the report does not include model or implementation details.

Video generation has recently become an important direction in AI. Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora, by contrast, is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.

Turning visual data into patches

OpenAI takes inspiration from large language models, which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text: code, math, and various natural languages. In this work, OpenAI considers how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data, and OpenAI finds that they are a highly scalable and effective representation for training generative models on diverse types of videos and images. At a high level, videos are turned into patches by first compressing them into a lower-dimensional latent space and then decomposing the representation into spacetime patches.

Video compression network

OpenAI trains a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on, and subsequently generates videos within, this compressed latent space. OpenAI also trains a corresponding decoder model that maps generated latents back to pixel space.

Spacetime latent patches

Given a compressed input video, OpenAI extracts a sequence of spacetime patches that act as transformer tokens. This scheme works for images too, since an image is simply a single-frame video. The patch-based representation enables Sora to train on videos and images of variable resolutions, durations, and aspect ratios. At inference time, the size of generated videos can be controlled by arranging randomly initialized patches in an appropriately sized grid.

Scaling transformers for video generation

Sora is a diffusion model: given input noisy patches (and conditioning information such as text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation. In this work, OpenAI finds that diffusion transformers scale effectively as video models as well. The report shows a comparison of video samples with fixed seeds and inputs as training progresses: sample quality improves markedly as training compute increases.

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop, or trim videos to a standard size, e.g. 4-second videos at 256x256 resolution. OpenAI finds that training on data at its native size instead provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also allows quick prototyping of content at lower sizes before generating at full resolution, all with the same model.

Improved framing and composition

OpenAI empirically finds that training on videos at their native aspect ratios improves composition and framing. Sora is compared against a version of the model that crops all training videos to be square, a common practice when training generative models. The model trained on square crops sometimes generates videos where the subject is only partially in view; in comparison, videos from Sora have improved framing.

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. OpenAI applies the re-captioning technique introduced in DALL·E 3 to videos: it first trains a highly descriptive captioner model and then uses it to produce text captions for all videos in the training set. Training on highly descriptive video captions improves text fidelity as well as the overall quality of videos. Similar to DALL·E 3, OpenAI also leverages GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.

Prompting with images and videos

All of the results above are text-to-video samples, but Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks: creating perfectly looping video, animating static images, extending videos forwards or backwards in time, and so on.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. The report shows example videos generated from DALL·E 2 and DALL·E 3 images.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. The report shows four videos that were all extended backward in time starting from a segment of a generated video: each of the four videos starts differently from the others, yet all four lead to the same ending. This method can be used to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. The report applies one of these methods, SDEdit, to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

Connecting videos

Sora can also be used to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the report's examples, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. This is done by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes, up to 2048x2048 resolution. The report shows images generated from prompts such as:

Portrait shot of a woman in autumn, extreme detail, shallow depth of field

Close-up coral reef teeming with colorful fish and sea creatures

Vibrant digital art of a young tiger under an apple tree in a matte painting style with gorgeous details

A snowy mountain village with cozy cabins and a northern lights display, high detail and photorealistic DSLR, 50mm f/1.2

Emerging simulation capabilities

OpenAI finds that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals, and environments from the physical world. The properties emerge without any explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, the model can persist people, animals, and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes; one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning "Minecraft."
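The report never specifies how the spacetime patchification described earlier actually works. The following is a minimal sketch, assuming a ViT-style non-overlapping split of a compressed latent video of shape T×H×W×C into patches of size pt×ph×pw; all sizes and names here are illustrative, not from the report.

```python
def spacetime_patchify(latent, pt, ph, pw):
    """Split a latent video (nested lists, shape T x H x W x C) into
    non-overlapping spacetime patches, each flattened to one token vector.

    Hypothetical sketch: the report does not give patch sizes or layout.
    """
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "latent must tile evenly"
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Flatten one pt x ph x pw x C block into a single token.
                patch = [c
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)
                         for c in latent[t][y][x]]
                tokens.append(patch)
    return tokens  # the transformer's token sequence

# A 4-frame 8x8 latent with 4 channels, split into 2x4x4 spacetime patches:
latent = [[[[0.0] * 4 for _ in range(8)] for _ in range(8)] for _ in range(4)]
tokens = spacetime_patchify(latent, pt=2, ph=4, pw=4)
assert len(tokens) == (4 // 2) * (8 // 4) * (8 // 4)   # 8 tokens
assert len(tokens[0]) == 2 * 4 * 4 * 4                 # 128 values per token

# An image is just a single-frame video: the same code applies with T = pt = 1.
image = [[[[0.0] * 4 for _ in range(8)] for _ in range(8)]]
assert len(spacetime_patchify(image, pt=1, ph=4, pw=4)) == 4
```

This also illustrates how output size could be controlled at inference time, as the report describes: choose the desired T×H×W grid, fill it with randomly initialized patches, and run the diffusion process over that token sequence.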
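The training objective stated above ("given input noisy patches, predict the original clean patches") can be illustrated with a toy one-dimensional diffusion step. The noise schedule, parameterization, and stand-in "models" below are hypothetical; the report gives no such details.

```python
import math
import random

def add_noise(x0, alpha_bar, eps):
    """Forward diffusion: mix a clean patch vector with Gaussian noise.
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + b * e for v, e in zip(x0, eps)]

def mse(pred, target):
    """Mean squared error between a predicted and a clean patch vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

random.seed(0)
x0 = [random.uniform(-1, 1) for _ in range(16)]   # one "clean" patch token
eps = [random.gauss(0, 1) for _ in range(16)]     # Gaussian noise
x_t = add_noise(x0, alpha_bar=0.5, eps=eps)       # the noisy input patch

# In Sora the denoiser is the diffusion transformer, conditioned on the text
# prompt; here two fake predictors show what the loss rewards and penalizes.
perfect_pred = x0    # a perfect denoiser recovers the clean patch exactly
trivial_pred = x_t   # echoing the noisy input leaves residual error

assert mse(perfect_pred, x0) == 0.0
assert mse(trivial_pred, x0) > 0.0
```

Training minimizes this reconstruction loss over many patches and noise levels; at sampling time the model starts from pure noise patches and iteratively denoises them into a video.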
