Big Data Collection and Cleaning courseware (PPT 12)

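Data Collection and Cleaning
2019-02-15

Contents
  1. What is big data
  2. Main characteristics of big data
  3. Big data processing flow
  4. The concept of big data collection
  5. Big data collection applications

1. What is big data

Example: Taobao recommendations
  • Recommendations based on your shopping preferences
  • Recommendations based on your recent browsing and purchasing behavior
  • Continuous inference of your characteristics from the devices you use
  • Recommendations that follow seasonal changes

Policy milestones
  • 2014-03: Big data is written into the Government Work Report for the first time.
  • 2015-08: The State Council issues the Action Outline for Promoting the Development of Big Data.
  • 2016-03: The Outline of the 13th Five-Year Plan proposes implementing a national big data strategy.
  • 2017-10: The 19th National Congress calls for advancing the big data strategy and integrating it deeply with the real economy.
  • 2018: The Government Work Report calls for a big data development initiative and for using the internet and big data to improve regulatory effectiveness.

Industry status and outlook
In 2019 the Ministry of Human Resources and Social Security planned to publish 15 new occupations, including:
  1. Big data engineering technician
  2. Cloud computing engineering technician
  3. Artificial intelligence engineering technician
  4. Internet of Things engineering technician

Definition
Big data refers to data sets that cannot be acquired, managed, and processed within a given time using traditional, commonly used software techniques and tools.

2. Main characteristics of big data
  • Volume: the amount of data is large and keeps growing.
  • Velocity: the speed at which data is created and moved.
  • Variety: data comes from many sources, in many types and formats.
  • Veracity: the pursuit of high-quality data.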

  • Value (low value density): as the amount of data grows, the meaningful information it contains does not grow in proportion.

3. Big data processing flow
  1. Data collection: use several kinds of databases (relational, NoSQL) to store data from different sources.
  2. Data preprocessing: import the collected data from those databases into a large distributed data store (currently mostly HDFS or Hive), performing some simple cleaning and preprocessing along the way.
  3. Statistical analysis: classify and aggregate the data now held in the distributed store; this covers the analysis needs of ordinary scenarios.
  4. Data mining: run algorithm-based analysis on the data in order to make predictions and meet more advanced analysis needs.
  5. Data presentation: analyze the processed results or turn them into reports.

4. The concept of big data collection
  1. What is data collection
     Data collection is data acquisition. Data sources fall mainly into online data and content data.
  2. Traditional data collection versus big data collection
     Traditional data collection: a single source and a fairly small volume; a single structure; relational and parallel databases.
     Big data collection: wide-ranging sources and a huge volume; rich data types; distributed databases.
  3. Big data collection techniques
     Big data collection applies ETL operations to data: extracting, transforming, and loading it so that its latent value can eventually be mined. ETL stands for Extract-Transform-Load.
     Extract: obtain data from the various data sources.
     Transform: convert the source data into target data in the required format.
     Load: load the target data into the data warehouse.

Big data collection systems
  1. Log collection systems (Apache Flume, Scribe)
  2. Network data collection systems (the Scrapy framework, Apache Nutch)
  3. Database collection systems (relational, NoSQL, and other databases)

5. Big data collection applications
Skills to prepare
  • Python basics
  • Basic Linux operations
  • Database basics (working with SQL statements)
Environment to prepare
  • Python
  • JDK (Java environment)
  • A database (MySQL)
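The Extract-Transform-Load steps described under collection techniques can be sketched in Python, the language the deck lists as a prerequisite. This is only a minimal illustration under stated assumptions: the sample CSV text, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, and an in-memory SQLite database from the standard library stands in for the MySQL database or data warehouse the slides mention.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a file exported from a source system.
RAW_CSV = """user_id,item,price
1, Keyboard ,199.0
2,mouse,
1, Keyboard ,199.0
3,Monitor,899.5
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize fields, drop incomplete rows, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        item = (row["item"] or "").strip().lower()
        price = (row["price"] or "").strip()
        if not item or not price:
            continue  # simple cleaning rule: skip rows with missing fields
        record = (int(row["user_id"]), item, float(price))
        if record in seen:
            continue  # simple cleaning rule: skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(user_id INTEGER, item TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order and return the number of rows loaded."""
    records = transform(extract(text))
    load(records, conn)
    return len(records)


if __name__ == "__main__":
    # SQLite in memory is an assumption made to keep the sketch self-contained.
    connection = sqlite3.connect(":memory:")
    print("rows loaded:", run_etl(RAW_CSV, connection))
```

Keeping each step in its own function mirrors the three-part definition in the slides. In a real pipeline the `load` step would presumably target the distributed store or warehouse rather than SQLite.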
