Big Data Mining: Translated Foreign-Language Literature




Document summary

Source information
Title: A Study of Data Mining with Big Data
Authors: V H Shastri, V Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
Word count: English 2,291 words, 12,196 characters; Chinese 3,868 characters

Original text:

A Study of Data Mining with Big Data

Abstract

Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets, typically those whose size is larger than the typical database. Big Data introduces unique computational and statistical challenges. Big Data is at present expanding in most of the domains of engineering and science. Data mining helps to extract useful data from data sets that are huge in volume, variability and velocity. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to the enormous amount of structured and unstructured data that overflows the organization. If this data is properly used, it can lead to meaningful information. Big Data includes a large amount of data which requires a lot of processing in real time. It provides room to discover new values and to gain in-depth knowledge from hidden values, and it provides a space to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge such as associations, patterns, changes, anomalies and significant structures from large amounts of data stored in databases or other repositories.

Big Data has three V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second. The data is in a state of rest. It is also known for its scale characteristics. Velocity is the speed with which the data is generated. Big Data should have high-speed data; the data generated from social media is an example. Variety means that different types of data can be taken, such as audio, video or documents. It can be numerals, images, time series, arrays, etc.

Data mining analyses the data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of searching large volumes of data automatically for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.

Big Data is expanding in all domains, including science and engineering fields such as the physical, biological and biomedical sciences.

II. BIG DATA with DATA MINING

Generally, Big Data refers to a collection of large volumes of data, and these data are generated from various sources such as the internet, social media, business organizations, sensors, etc. We can extract useful information with the help of data mining. It is a technique for discovering patterns as well as descriptive, understandable models from large-scale data.

Volume is the size of the data, which is larger than petabytes and terabytes. The scale and rise in size make it difficult to store and analyse using traditional tools. Big [...]

Category | Algorithms
Classification | Decision trees, SVM
Regression | Multivariate linear regression
Table 1. Classification of Algorithms

Data mining algorithms can be converted into MapReduce algorithms on a parallel computing basis.

Big Data | Data Mining
It is everything in the world now. | It is the old Big Data.
Size of the data is larger. | Size of the data is smaller.
Involves storage and processing of large data sets. | Interesting patterns can be found.
Big Data is the term for a large data set. | Data mining refers to the activity of going through a big data set to look for relevant information.
Big Data is the asset. | Data mining is the handler which provides beneficial results.
"Big Data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyse the data. | Data mining refers to operations that involve relatively sophisticated search.
Table 2. Differences between Data Mining and Big Data

VI. Challenges in BIG DATA

Meeting the challenges of Big Data is difficult. The volume is increasing every day. The velocity is increasing because of internet-connected devices. The variety is also expanding, and organizations' capability to capture and process the data is limited.

The following are the challenges in the area of Big Data when it is handled:
1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization

According to [...], the challenges of big data mining are divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes:
1. Information sharing and data privacy.
2. Domain and application knowledge.
The third tier includes:
1. Local learning and model fusion for multiple information sources.
2. Mining from sparse, uncertain and incomplete data.
3. Mining complex and dynamic data.

Figure 2: Phases of Big Data Challenges

Generally, mining data from different data sources is tedious because the size of the data is large. Big Data is stored in different places; collecting those data is a tedious task, and applying basic data mining algorithms to them is an obstacle. Next, we need to consider the privacy of data. The third case is the mining algorithms: when we apply data mining algorithms to these subsets of data, the result may not be very accurate.

VII. Forecast of the future

There are some challenges that researchers and practitioners will have to deal with during the next years:

Analytics architecture: It is not yet clear what an optimal architecture for analytics systems should be in order to deal with historic data and real-time data at the same time. An interesting proposal is the Lambda Architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general and extensible, allows ad hoc queries, minimal maintenance, and debuggable.

Statistical significance: It is important to achieve significant statistical results and not be fooled by randomness. As Efron explains in his book about Large Scale Inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.

Distributed mining: Many data mining techniques are not trivial to parallelize. To have distributed versions of some methods, a lot of research with practical and theoretical analysis is needed to provide new methods.

Time-evolving data: Data may be evolving over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.

Compression: When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.

Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, such as the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".

Hidden Big Data: Large quantities of useful data are getting lost, since new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

VIII. CONCLUSION

The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications. Data mining techniques can be applied to Big Data to acquire useful information from large data sets. They can be used together to obtain a useful picture from the data. Big Data analysis tools like MapReduce over Hadoop and HDFS help organizations.

Chinese translation:

A Study of Data Mining with Big Data

Abstract

Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets, usually larger than a typical database. Big Data introduces unique computational and statistical challenges. Big Data is currently extending into most areas of engineering and science. Because Big Data is so large in volume, so fast in velocity and so varied in kind, data mining can be used to help extract useful data from huge data sets. This article introduces the HACE theorem, which describes the characteristics of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, data mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to the large amount of structured and unstructured data that pervades the whole organization. If this data is used properly, it produces meaningful information. Big Data includes a large amount of data that requires a great deal of real-time processing. It provides two spaces: one for discovering new value and gaining in-depth knowledge from hidden value, and one for managing the data effectively. A database is a logically organized collection of related data that can be conveniently managed, updated and accessed. Data mining is the process of discovering interesting knowledge (such as associations, patterns, changes, anomalies and significant structures) from large amounts of data stored in databases or other repositories.

Big Data has the characteristics of the three V's: volume, velocity and variety. Volume means the amount of data generated every second. The data is at rest, and its scale characteristics are also well known. Velocity is the speed at which data is generated. Big Data should have high-speed data; data produced by social media is an example. Variety means that different types of data can be taken, such as audio, video or documents. It can be numbers, images, time series, arrays and so on.

Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also known as Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time. Depending on the type of pattern to be mined, data mining tasks can be divided into summarization, classification, clustering, association and trend analysis.

Big Data is extending into all fields, including science and engineering fields such as physics, biology and biomedicine.

II. Big Data Mining

In general, Big Data refers to a collection of large volumes of data that come from various sources such as the internet, social media, business organizations and sensors. We can extract useful information with the help of data mining techniques. Data mining is a technique for discovering patterns and descriptive, understandable models from large amounts of data.

Volume is the size of the data, which is larger than petabytes and terabytes. The increase in scale and volume makes the data difficult to store and analyse with traditional tools. Big Data should be used to mine large amounts of data within a predetermined period of time. Traditional database systems are designed to handle small amounts of structured, consistent data, whereas Big Data includes all kinds of data, such as geospatial data, audio, video and unstructured text.

Big data mining refers to the activity of going through big data sets to find relevant information. Hadoop is used to process large amounts of data from different sources quickly. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates between nodes and allows the system to keep running without interruption when a node fails. It performs MapReduce for distributed data processing, for both structured and unstructured data.

III. Characteristics of Big Data: the HACE Theorem

We have large volumes of heterogeneous data. There are complex relationships among the data. We need to discover useful information from this huge volume of data.

Let us imagine a scene in which blind men are asked to draw an elephant. From the information each blind man collects, he may think that the trunk is like a wall, the leg is like a tree, the body is like a wall and the tail is like a rope. The blind men can exchange information with one another.

Figure 1: The blind men and the elephant

Some of these characteristics include:

1. Huge data with heterogeneous and diverse sources: one of the fundamental characteristics of Big Data is the large volume of heterogeneous and diverse data. For example, in the biomedical world, an individual is represented by name, age, gender, family medical history and so on, and by images and videos for X-ray and CT scans. Heterogeneity refers to different representations of the same individual, and diversity refers to the variety of features used to represent a single piece of information.

2. Autonomous sources with distributed and decentralized control: the sources are autonomous, that is, they generate information automatically, without any centralized control. This can be compared with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

3. Complex and evolving relationships: as the volume of data grows without bound, the relationships within it grow as well. In the early stages, when the data was small, the relationships among data were not complex. Data generated by social media and other sources has complex relationships.

IV. Tools: the Open Source Revolution

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from open source projects and contribute to them. There are many open source initiatives in big data mining. The most popular are:

Apache Mahout: open source software for scalable machine learning and data mining, based mainly on Hadoop. It implements a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA: open source software for stream data mining that can perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, which is famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions, and is able to use MOA, Android and Storm.

SAMOA: a new, upcoming software project for distributed stream mining that combines S4 and Storm with MOA.

Vowpal Wabbit: an open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from data sets with very large numbers of features. When doing linear learning via parallel learning, it can exceed the throughput of any single machine network interface.

V. Data Mining for Big Data

Data mining is the process of discovering useful information by analysing data from different sources. Data mining contains many algorithms, which fall into four categories:
1. Association rules
2. Clustering
3. Classification
4. Regression

Association is used to search for relationships between variables. It is applied in searching for frequently visited items. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data.

The different data mining algorithms are:

Category | Algorithms
Association | Apriori, FP growth
Clustering | K-Means, Expectation
Classification | Decision trees, SVM
Regression | Multivariate linear regression
Table 1. Classification of Algorithms

Data mining algorithms can be converted into MapReduce algorithms based on parallel computing.

Big Data | Data Mining
It is everything in the world now. | It is the old Big Data.
The size of the data is larger. | The size of the data is smaller.
Involves storage and processing of large data sets. | Interesting patterns can be found.
Big Data is the term for large data sets. | Data mining refers to the activity of going through big data sets to look for relevant information.
Big Data is the asset. | Data mining is the handler that provides beneficial results.
Big Data depends on the capabilities of the organization managing the set, and on the capabilities of the applications traditionally used to process and analyse the data. | Data mining refers to activities that involve relatively sophisticated search operations.
Table 2. Differences between Big Data and Data Mining

VI. Challenges in Big Data

Meeting the challenges of Big Data is difficult. The volume is increasing every day. The velocity is increasing because of internet-connected devices. The variety ...
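The paper states that data mining algorithms can be converted into MapReduce algorithms, and Table 1 lists Apriori under association. As a minimal sketch of that idea, the support-counting step that Apriori depends on can be written as a map phase, a shuffle, and a reduce phase. The code below runs on one machine in plain Python and only imitates the data flow; it is not the authors' implementation and does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `frequent_pairs`) and the sample transactions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Map step: emit (item_pair, 1) for every pair of items in one transaction."""
    items = sorted(set(transaction))  # de-duplicate and fix an order so pairs are canonical
    return [(pair, 1) for pair in combinations(items, 2)]


def shuffle(mapped):
    """Shuffle step: group all emitted values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts for one item pair."""
    return key, sum(values)


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur in at least min_support transactions."""
    mapped = [kv for t in transactions for kv in map_phase(t)]
    grouped = shuffle(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy data: each list is one shopping basket.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_pairs(transactions, min_support=2)
print(result)
```

Because each call to `map_phase` looks at a single transaction and each call to `reduce_phase` looks at a single key, both steps could in principle be spread across many nodes, which is the property the paper relies on when it mentions parallel computing.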

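The "Compression" item in the forecast section contrasts lossless compression with sampling, where information is lost but the space savings can be very large. One standard way to sample from data too large to hold in memory is reservoir sampling, sketched below. This technique is not named in the paper (which cites coresets instead); it is shown here only as an assumed, simple illustration of sampling a fixed-size subset from a stream. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are stored at any time, so memory use does not grow
    with the size of the stream.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Usage: draw 5 records from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), k=5, seed=0)
print(sample)
```

The trade-off matches the one described in the text: the sample is orders of magnitude smaller than the data, but any analysis run on it is only an approximation of the result on the full data set.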