Big Data Technology: Hive
(Author: 大海哥)

Chapter 1: Hive Basics

1.1 What is Hive

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to a table and provides SQL-like query capability. In essence, Hive translates HQL into MapReduce programs: the data Hive analyzes is stored on HDFS, the analysis itself is implemented with MapReduce, and the resulting execution programs run on Yarn.
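To make the "HQL becomes MapReduce" point concrete, here is a minimal sketch using the emp demo table that appears later in this document; the stage breakdown in the comments is a simplified mental model, not Hive's literal execution plan.

-- a simple aggregation that Hive compiles into one MapReduce job:
--   map phase:    read rows and emit (deptno, sal) pairs
--   shuffle:      group the pairs by deptno
--   reduce phase: average sal within each group
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno;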

1.2 Hive architecture

1) User interfaces: the CLI (the hive shell), JDBC/ODBC (Java access to hive), and the WEBUI (browser access to hive).
2) Metastore: stores the metadata, that is, table names, the databases the tables belong to, table owners, column and partition fields, the table type (whether it is an external table), the directory where the table data lives, and so on.
3) Driver:
   - Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is generally done with a third-party tool library such as antlr. It then analyzes the AST, for example whether the table exists, whether the fields exist, and whether the SQL semantics are correct.
   - Compiler (Physical Plan): compiles the AST into a logical execution plan.
   - Optimizer (Query Optimizer): optimizes the logical execution plan.
   - Executor (Execution): converts the logical plan into a physical plan that can actually run, which for Hive means MR/Spark.

1.3 Hive versus databases

Because Hive adopts HQL (Hive Query Language), a query language similar to SQL, it is easy to mistake Hive for a database. In fact, structurally, apart from the similar query language Hive and databases have nothing in common. Databases can be used in online applications, whereas Hive is designed for data warehousing; keeping this in mind helps in understanding Hive's characteristics from the application angle.

1) Query language. Since SQL is so widely used, the SQL-like language HQL was designed for Hive; developers familiar with SQL can use Hive for development very conveniently.
2) Data storage location. Hive is built on Hadoop, so all Hive data is stored in HDFS; databases keep their data on block devices or the local file system.
3) Data updates. Hive is designed for data warehouse applications, whose content is read often and written rarely, so rewriting data in Hive is not recommended: all data is fixed at load time. Data in a database usually does need frequent modification, so INSERT INTO ... VALUES adds rows and UPDATE ... SET modifies them.
4) Indexes. Hive does not process the data in any way while loading it, and does not even scan it, so no index is built on any key in the data. When Hive needs to access particular values that satisfy a condition, it must brute-force scan all of the data; even so, because the scan is parallelized with MapReduce, Hive still shows an advantage on large data volumes without indexes. A database normally builds indexes on one or more columns, so it achieves high efficiency and low latency for small amounts of specific data; Hive's high latency is what makes it unsuitable for online data queries.
5) Execution. Hive executes most queries through Hadoop's MapReduce; a database usually has its own execution engine.
6) Execution latency. When Hive queries data it has no indexes and must scan the whole table, so latency is high. The MapReduce framework itself also carries considerable latency. When the data is large enough to exceed a database's processing capacity, however, Hive's parallel computation clearly shows its advantage.
7) Scalability. Since Hive is built on Hadoop, its scalability matches Hadoop's (the world's largest Hadoop cluster is at Yahoo!, around 4,000 nodes in 2009). Databases scale very poorly because of the strict semantics of ACID; even Oracle, the most advanced parallel database, has a theoretical scaling limit of only about 100 machines.
8) Data scale. With cluster-based parallel computation, Hive supports very large data scales; a database supports comparatively small ones.

Chapter 2: Hive Installation

2.1 Installation addresses: Hive's official site is hive.apache.org (the further documentation and download URLs are truncated in the source).

2.2 Installing and deploying Hive

1) Rename apache-hive-1.2.1-bin to hive:
[atguigu@hadoop102 module]$ mv apache-hive-1.2.1-bin hive
2) Rename the environment template and configure HIVE_CONF_DIR (and the Hadoop home) in it:
[atguigu@hadoop102 conf]$ mv hive-env.sh.template hive-env.sh
3) Hadoop cluster configuration: start HDFS, then create /tmp and /user/hive/warehouse on HDFS and open group write permission on them:
[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir /tmp
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir -p /user/hive/warehouse
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /tmp
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /user/hive/warehouse
4) Basic operations in the hive shell:
hive> use default;
hive> create table student(id int, name string);
hive> insert into student values(...);
hive> select * from student;
hive> quit;

2.3 Loading a local file into Hive

Requirement: load the data in the local file /opt/module/datas/student.txt into Hive's student(id int, name string) table.
1) Create the datas directory and the data file:
[atguigu@hadoop102 module]$ touch student.txt
[atguigu@hadoop102 module]$ vi student.txt
2) Recreate the table, declaring the field separator:
hive> use default;
hive> drop table student;
hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
3) Load the file and query:
hive> load data local inpath '/opt/module/datas/student.txt' into table student;
hive> select * from student;
OK
Time taken: 0.266 seconds, Fetched: 3 row(s)
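For reference, a student.txt matching this schema holds one record per line with the two columns separated by a single tab; the three sample rows below are assumed for illustration, since the source does not show the file's contents.

1001	zhangsan
1002	lisi
1003	zhaoliu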
4) If a second hive client is opened while the first is still running, an exception like the following is thrown:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
Caused by: ...
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
        ... 8 more
The reason is that the Metastore defaults to the embedded Derby database, which allows only one client connection at a time; to share the metadata, switch the Metastore to MySQL.

2.4 Installing MySQL

1) Check for and remove any pre-installed MySQL:
[root@hadoop102 桌面]# rpm -qa | grep mysql
[root@hadoop102 桌面]# rpm -e --nodeps mysql-libs-5.1.73-7.el6.x86_64
2) Unzip mysql-libs.zip in the current directory and make the packages executable:
[root@hadoop102 software]# unzip mysql-libs.zip
[root@hadoop102 software]# ls
-rw-r--r--. 1 root root ... 12月 1 2013 mysql-connector-java-5.1.27.tar.gz
-rw-r--r--. 1 root root ... 3月 26 2015 MySQL-server-5.6.24-1.el6.x86_64.rpm
[root@hadoop102 mysql-libs]# chmod u+x ./*
3) Install the MySQL server, check its status, and start it:
[root@hadoop102 mysql-libs]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm
[root@hadoop102 mysql-libs]# service mysql status
[root@hadoop102 mysql-libs]# service mysql start
4) Install the MySQL client and log in with the generated random password:
[root@hadoop102 mysql-libs]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm
[root@hadoop102 mysql-libs]# mysql -uroot -pOEXaQuS8IWkG19Xs
5) Configure the host entries in MySQL's user table so that the root user can log in to the MySQL database from any host (after resetting the root password to 000000):
[root@hadoop102 mysql-libs]# mysql -uroot -p000000
mysql> show databases;
mysql> use mysql;
mysql> select User, Host, Password from user;
mysql> update user set host='%' where host='localhost';
mysql> delete from user where Host='hadoop102';
mysql> delete from user where Host='127.0.0.1';
mysql> delete from user where Host='::1';
mysql> flush privileges;
mysql> quit;

2.5 Configuring the Hive Metastore to use MySQL

1) Copy the JDBC driver into Hive's lib directory:
[root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz
[root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
2) Configure the Metastore in hive-site.xml. The XML is garbled in the source; a typical configuration, with values reconstructed from the host and password used elsewhere in this document, is:
[root@hadoop102 conf]# touch hive-site.xml
[root@hadoop102 conf]# vi hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>000000</value>
    </property>
</configuration>
3) Multi-window test. Before hive starts, MySQL shows only its own databases (information_schema, mysql, performance_schema, test); after hive starts, a metastore database appears, and several hive clients can now run at the same time without the Derby exception.

2.6 Common Hive interaction commands

[atguigu@hadoop102 hive]$ bin/hive -help
usage: hive
 -d,--define <key=value>      Variable substitution to apply to hive commands. e.g. -d A=B or --define A=B
    --database <databasename> Specify the database to use
 -e <quoted-query-string>     SQL from command line
 -f <filename>                SQL from files
 -H,--help                    Print help information
    --hiveconf <property=value> Use value for given property
    --hivevar <key=value>     Variable substitution to apply to hive commands. e.g. --hivevar A=B
 -i <filename>                Initialization SQL file
 -S,--silent                  Silent mode in interactive shell
1) "-e" runs SQL without entering the interactive shell:
[atguigu@hadoop102 hive]$ bin/hive -e "select id from student;"
2) "-f" runs the SQL statements in a file: write select * from student; into a .sql file and execute it with bin/hive -f <file>, optionally redirecting (>) the result into an output file.

2.7 Other command operations

1) Exit the hive shell: quit;
2) Run HDFS commands inside hive: hive (default)> dfs -ls /;
3) Run local shell commands inside hive: hive (default)> ! ls /opt/module/datas;
4) View the history of hive commands (in /root or the user's home directory):
[atguigu@hadoop102 ~]$ cat ~/.hivehistory

2.8 Common configuration

1) Data warehouse location. The default warehouse is /user/hive/warehouse on HDFS; no folder is created under the warehouse for the default database itself, so tables in default get directories directly under the warehouse root. To change the location, copy the hive.metastore.warehouse.dir setting from hive-default.xml.template into hive-site.xml and modify it, and give the new directory group write permission:
bin/hdfs dfs -chmod g+w /user/hive/warehouse
2) Query display. Configure hive.cli.print.header ("Whether to print the names of the columns in query output") and hive.cli.print.current.db in hive-site.xml to show column headers in results and the current database in the prompt.
3) Hive run logs. Hive's log is stored by default under /tmp/<user>/hive.log; the location can be changed in hive-log4j.properties.
4) Parameter configuration. There are three ways to set configuration values:
(1) the configuration file: user-defined configuration overrides the default configuration, and Hive also reads in Hadoop's configuration; settings here take effect for all Hive processes started on the machine;
(2) command-line parameters (--hiveconf property=value), valid only for that session;
(3) parameter declarations in HQL with set: "set;" lists all parameters, "set <property>;" shows one, and "set <property>=<value>;" assigns it:
hive (default)> set mapreduce.job.reduces;
hive (default)> set mapreduce.job.reduces=3;
The priority of the three increases in that order: configuration file < command-line parameter < parameter declaration. Note that some system-level parameters, such as the log4j settings, must be set with the first two methods, because they are read before the session is established.

Chapter 3: Hive Data Types

3.1 Basic data types

Hive's basic types include the integer types (TINYINT, SMALLINT, INT, BIGINT), BOOLEAN (true/false), FLOAT, DOUBLE, STRING and TIMESTAMP. STRING literals may use single or double quotes ('now is the time', "for all good men"). Hive's STRING corresponds to a database varchar: a variable-length string, except that it cannot declare how many characters it can hold at most; in theory it can store up to 2GB of characters.

3.2 Collection data types

Hive has three complex types: STRUCT, MAP and ARRAY. ARRAY and MAP are analogous to the same-named types in Java.
- STRUCT: like a C struct; element content is accessed with dot notation. For example, if a column's data type is STRUCT{first STRING, last STRING}, the first element is referenced as column.first.
- MAP: a collection of key-value tuples; elements are accessed with array notation, using the key as the index.
- ARRAY: an ordered collection of same-typed elements, accessed by integer index.

Case study:
1) Suppose a table row is described by the following JSON:
{
    "name": "songsong",
    "friends": ["bingbing", "lili"],        // array
    "children": {                           // map (key-value pairs)
        "xiaosong": 18,
        "xiaoxiaosong": 19
    },
    "address": {                            // struct
        "street": "huilongguan",
        "city": "beijing"
    }
}
2) Create the corresponding local test file test.txt, with fields separated by ',', collection items by '_', and map keys by ':'.
3) Create the table:
create table test(
    name string,
    friends array<string>,
    children map<string, int>,
    address struct<street:string, city:string>
)
row format delimited fields terminated by ','   -- column separator
collection items terminated by '_'              -- separator inside MAP, STRUCT and ARRAY items
map keys terminated by ':'                      -- separator between a MAP key and its value
lines terminated by '\n';                       -- row separator
4) Load test.txt and access all three collection types:
hive (default)> select friends[1], children['xiaosong'], address.city from test where name="songsong";
OK
_c0     _c1     _c2
lili    18      beijing
Time taken: 0.076 seconds, Fetched: 1 row(s)

3.3 Type conversion

Hive's atomic data types can be converted implicitly, similar to Java type conversion: for example, a TINYINT is automatically widened where an INT is expected (TINYINT can convert to INT, INT to BIGINT). Hive will not do the reverse conversion implicitly; narrowing requires the CAST operator. An explicit CAST can also fail: CAST('1' AS INT) converts the string '1' to the integer 1, but CAST('X' AS INT) fails, and the expression returns NULL.
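A quick, hedged check of these conversion rules in the hive shell; the results follow from the semantics just described:

hive (default)> select cast('1' as int) + 2;   -- the string '1' casts to 1, so the result is 3
hive (default)> select cast('X' as int);       -- the cast fails, so the result is NULL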
Chapter 4: DDL Data Definition

4.1 Creating a database

1) Create a database; its storage location on HDFS defaults to /user/hive/warehouse/*.db:
hive (default)> create database db_hive;
2) Creating a database that already exists throws an error:
hive> create database db_hive;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database db_hive already exists
3) Avoid the error with if not exists:
hive (default)> create database if not exists db_hive;

4.2 Modifying a database

ALTER DATABASE ... SET DBPROPERTIES sets key-value pairs describing the database; the rest of the database metadata, the database name and the database location, cannot be changed.
hive (default)> alter database db_hive set dbproperties('createtime'='20170830');
Check the result:
hive> desc database extended db_hive;
db_name  comment  location  owner_name  owner_type  parameters
db_hive           hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db  atguigu  USER  {createtime=20170830}

4.3 Querying databases

1) Show databases, optionally filtered:
hive> show databases;
hive> show databases like 'db_hive*';
2) Show database details:
hive> desc database db_hive;
3) desc database extended db_hive; additionally shows the dbproperties.
4) Switch the current database:
hive (default)> use db_hive;

4.4 Dropping a database

1) Drop an empty database:
hive> drop database db_hive2;
2) Use if exists to avoid an error when the database does not exist:
hive> drop database if exists db_hive2;
3) A plain drop fails if the database is not empty:
hive> drop database db_hive;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database db_hive is not empty. One or more tables exist.)
Use cascade to force the drop:
hive> drop database db_hive cascade;

4.5 Creating tables

1) Table-creation syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
2) Field explanations:
- CREATE TABLE creates a table with the given name; if a table of the same name already exists, an exception is thrown, which the IF NOT EXISTS option suppresses.
- EXTERNAL lets the user create an external table while pointing at the actual data's path (LOCATION). When Hive creates an internal table, it moves the data into the path the warehouse points to; for an external table it only records the path without moving anything. When an internal table is dropped, metadata and data are deleted together; when an external table is dropped, only the metadata is deleted.
- PARTITIONED BY creates a partitioned table; CLUSTERED BY creates a bucketed table; SORTED BY sorts within buckets.
- ROW FORMAT: DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]. Users may define their own SerDe when creating a table or use the built-in one; if ROW FORMAT (or ROW FORMAT DELIMITED) is not specified, the built-in SerDe is used. While specifying the table's columns the user also fixes what the SerDe must parse, and Hive uses the SerDe to serialize and deserialize the table's rows.
- STORED AS specifies the file type: use STORED AS TEXTFILE for plain text; if the data needs to be compressed, use STORED AS SEQUENCEFILE.
- LOCATION specifies the table's location on HDFS.

4.5.1 Managed tables

The tables created by default are so-called managed tables, sometimes called internal tables. For these, Hive controls (more or less) the lifecycle of the data, storing it by default in a subdirectory of the directory defined by hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When a managed table is dropped, Hive deletes its data as well, so managed tables are not suitable for sharing data with other tools.
1) A normal create:
create table if not exists student2(
    id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile;
2) Create a table from a query result:
create table if not exists student3 as select id, name from student;
3) Create a table from an existing table's schema:
create table if not exists student4 like student;

4.5.2 External tables

Because the table is external, Hive does not consider itself the owner of the data: dropping the table deletes only the metadata describing it, never the data itself. Typical scenario: collected web logs flow into HDFS text files every day; analysis is done on external tables built over the raw logs, with intermediate and result tables stored as internal tables.
Case study: create external dept and emp tables and import dept.txt and emp.txt.
1) Department table:
create external table if not exists default.dept(
    deptno int, dname string, loc int
)
row format delimited fields terminated by '\t';
2) Employee table (some column names are assumed from the classic emp schema, since the source truncates the list):
create external table if not exists default.emp(
    empno int, ename string, job string, mgr int, hiredate string, sal double, comm double, deptno int
)
row format delimited fields terminated by '\t';
3) Check and import:
hive (default)> show tables;
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept;
hive (default)> load data local inpath '/opt/module/datas/emp.txt' into table default.emp;
hive (default)> select * from emp;
hive (default)> select * from dept;
4) Check the table type:
hive (default)> desc formatted dept;
Table Type: EXTERNAL_TABLE
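As a hedged illustration of the managed-versus-external distinction described above (the table name and location are hypothetical):

hive (default)> create external table ext_demo(id int) location '/ext_demo';
hive (default)> drop table ext_demo;   -- removes only the metadata
hive (default)> dfs -ls /;             -- /ext_demo and any files in it are still on HDFS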
4.6 Partitioned tables

A Hive partition corresponds to an independent folder on HDFS under the table's directory, holding all the data files of that partition. A partition is, in effect, a directory: a large data set is split into smaller data sets according to business needs, and a query that selects only the required partitions through expressions in the WHERE clause is much more efficient.

4.6.1 Basic partition operations
1) Create a partitioned table:
hive (default)> create table dept_partition(
    deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';
2) Load data into individual partitions:
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');
3) Query a single partition, or the union of several:
hive (default)> select * from dept_partition where month='201709';
hive (default)> select * from dept_partition where month='201709'
                union
                select * from dept_partition where month='201708'
                union
                select * from dept_partition where month='201707';
4) Add one or several partitions at once:
hive (default)> alter table dept_partition add partition(month='201706') partition(month='201705');
5) Drop one or several partitions at once:
hive (default)> alter table dept_partition drop partition(month='201705'), partition(month='201706');
6) Show the table's partitions and structure:
hive> show partitions dept_partition;
hive> desc formatted dept_partition;
# Partition Information
# col_name   data_type   comment
month        string

4.6.2 Notes on partitioned tables
1) Create a secondary (two-level) partitioned table:
hive (default)> create table dept_partition2(
    deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
2) Normal load into a secondary partition, then query:
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');
hive (default)> select * from dept_partition2 where month='201709' and day='13';
3) Three ways to make data uploaded directly into a partition directory visible to the table:
(1) Upload the data, then repair the metadata:
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
hive (default)> select * from dept_partition2 where month='201709' and day='12';   -- no result yet
hive> msck repair table dept_partition2;
hive (default)> select * from dept_partition2 where month='201709' and day='12';   -- now returns data
(2) Upload the data, then add the partition:
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> alter table dept_partition2 add partition(month='201709', day='11');
hive (default)> select * from dept_partition2 where month='201709' and day='11';
(3) Create the directory, then load the data into the partition:
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709', day='10');
hive (default)> select * from dept_partition2 where month='201709' and day='10';

4.7 Modifying tables

4.7.1 Renaming a table. Syntax: ALTER TABLE table_name RENAME TO new_table_name. Example (the new name is assumed; the source truncates it):
hive (default)> alter table dept_partition2 rename to dept_partition3;

4.7.2 Adding, changing and replacing columns

1) Change a column:
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
2) Add or replace columns:
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Note: ADD adds a new field positioned after all existing columns (but before the partition columns), while REPLACE replaces all the columns in the table.
3) Examples:
hive> desc dept_partition;
hive (default)> alter table dept_partition add columns(deptdesc string);
hive> desc dept_partition;
hive (default)> alter table dept_partition change column deptdesc desc int;
hive> desc dept_partition;
hive (default)> alter table dept_partition replace columns(deptno string, dname string, loc string);
hive> desc dept_partition;

4.8 Dropping a table
hive (default)> drop table dept_partition;

Chapter 5: DML Data Manipulation

5.1 Data import

5.1.1 Loading data into a table (load)
1) Syntax:
hive> load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1, ...)];
- local: load from the local file system; otherwise load from HDFS
- inpath: the path of the data to load
- overwrite: overwrite the data already in the table; otherwise append
- partition: load into the given partition
2) Examples. Create a table:
hive (default)> create table student(id string, name string) row format delimited fields terminated by '\t';
Load a local file:
hive (default)> load data local inpath '/opt/module/datas/student.txt' into table student;
Load an HDFS file, overwriting the existing data:
hive (default)> load data inpath '/user/atguigu/hive/student.txt' overwrite into table student;

5.1.2 Inserting data with queries (insert)
1) Create a partitioned table:
hive (default)> create table student(id string, name string) partitioned by (month string) row format delimited fields terminated by '\t';
2) Basic insert from a query:
hive (default)> insert overwrite table student partition(month='201708')
                select id, name from student where month='201709';
3) Multi-insert from one source table:
hive (default)> from student
                insert overwrite table student partition(month='201707')
                select id, name where month='201709'
                insert overwrite table student partition(month='201706')
                select id, name where month='201709';

5.1.3 Create a table from a query (as select):
create table if not exists student3 as select id, name from student;

5.1.4 Specify the data path with LOCATION at creation time:
hive (default)> create table if not exists student5(
    id int, name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student5';
hive (default)> dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student5;
hive (default)> select * from student5;

5.1.5 Import data into a specified Hive table (works only on data previously exported with export):
hive> import table <table> from '<export path>';

5.2 Data export

5.2.1 Insert export
1) Export query results to a local directory:
hive (default)> insert overwrite local directory '/opt/module/datas/export/student' select * from student;
2) Export formatted:
hive (default)> insert overwrite local directory '/opt/module/datas/export/student'
                ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                COLLECTION ITEMS TERMINATED BY '\n'
                select * from student;
3) Export to HDFS (no local keyword):
hive (default)> insert overwrite directory '/user/atguigu/student'
                ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                COLLECTION ITEMS TERMINATED BY '\n'
                select * from student;
5.2.2 Hadoop command export: dfs -get the table's files from HDFS.
5.2.3 Hive shell export. Basic syntax: hive -f/-e 'statement or script' > file.
5.2.4 Export to HDFS with export:
hive (default)> export table default.student to '/user/hive/warehouse/export/student';
5.2.5 Sqoop export (covered later).

5.3 Clearing table data (truncate)
Note: truncate only works on managed tables; it cannot delete data in external tables.
hive (default)> truncate table student;

Chapter 6: Queries

Query syntax:
[WITH CommonTableExpression (, CommonTableExpression)*]   (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]

6.1 Basic queries

6.1.1 Full-table and specific-column queries:
hive (default)> select * from emp;
hive (default)> select empno, ename from emp;
Notes: SQL is case-insensitive; statements may be written on one line or across several; keywords cannot be abbreviated or split across lines; writing clauses on their own lines with indentation improves readability.

6.1.2 Column aliases, with AS or by simple juxtaposition:
hive (default)> select ename AS name, deptno dn from emp;

6.1.3 Arithmetic operators: A+B, A-B, A*B, A/B, A%B (modulo), A&B (bitwise AND), A|B (bitwise OR), A^B (bitwise XOR), ~A (bitwise NOT). Example, add 1 to every salary:
hive (default)> select sal + 1 from emp;

6.1.4 Common built-in functions:
1) row count:  hive (default)> select count(*) cnt from emp;
2) maximum:    hive (default)> select max(sal) max_sal from emp;
3) minimum:    hive (default)> select min(sal) min_sal from emp;
4) sum:        hive (default)> select sum(sal) sum_sal from emp;
5) average:    hive (default)> select avg(sal) avg_sal from emp;

6.1.5 LIMIT clause. A typical query returns many rows; LIMIT restricts how many are returned:
hive (default)> select * from emp limit 5;

6.2 WHERE clause

WHERE filters out rows that do not match the condition and follows FROM:
hive (default)> select * from emp where sal > 1000;

6.2.1 Comparison operators:
- A=B: TRUE if equal; NULL if either side is NULL.
- A<=>B: like =, but returns TRUE if both sides are NULL and FALSE if only one is.
- A<>B, A!=B: NULL if A or B is NULL; TRUE if they are unequal.
- A<B, A<=B, A>B, A>=B: NULL if A or B is NULL; otherwise the comparison result.
- A [NOT] BETWEEN B AND C: NULL if any of A, B, C is NULL; TRUE if A lies in [B, C], inverted by NOT.
- A IS NULL / A IS NOT NULL.
- A [NOT] IN (v1, v2, ...): TRUE if A equals one of the listed values.
- A [NOT] LIKE B: B is a simple SQL wildcard pattern; returns TRUE if STRING A matches it, FALSE otherwise. 'x%' means A must begin with the letter 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' at any position; NOT inverts the result.
- A RLIKE B, A REGEXP B: B is a Java-based regular expression; returns TRUE if STRING A matches it. Matching uses the JDK's regex engine, so the full, more powerful Java regex syntax applies.
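A hedged sketch of RLIKE with Java regular expressions on the emp demo table; note that Hive's RLIKE matches when the pattern is found anywhere in the string:

hive (default)> select * from emp where ename rlike '^S';     -- names that begin with S
hive (default)> select * from emp where ename rlike '[AE]';   -- names containing an A or an E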
Examples of the comparison operators (some literal values are assumed where the source truncates them):
hive (default)> select * from emp where sal = 5000;
hive (default)> select * from emp where sal between 500 and 1000;
hive (default)> select * from emp where comm is null;
hive (default)> select * from emp where sal IN (1500, 5000);

6.2.2 LIKE and RLIKE examples
% stands for zero or more characters (any characters); _ stands for exactly one character.
1) Salaries starting with 2:
hive (default)> select * from emp where sal LIKE '2%';
2) Salaries with 2 as the second digit:
hive (default)> select * from emp where sal LIKE '_2%';
3) Salaries containing a 2, using RLIKE:
hive (default)> select * from emp where sal RLIKE '[2]';

6.2.3 Logical operators: AND, OR, NOT.
hive (default)> select * from emp where sal > 1000 and deptno = 30;
hive (default)> select * from emp where sal > 1000 or deptno = 30;
hive (default)> select * from emp where deptno not IN (30, 20);

6.3 Grouping

6.3.1 GROUP BY is usually used together with aggregate functions, grouping by one or more columns and then aggregating each group.
Average salary per department:
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
Highest salary per job within each department:
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

6.3.2 HAVING differs from WHERE: WHERE cannot use aggregate functions, HAVING can, and HAVING is only used with GROUP BY.
hive (default)> select deptno, avg(sal) from emp group by deptno;
Departments whose average salary exceeds 2000:
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

6.4 Joins

6.4.1 Equi-joins. Hive supports the usual SQL JOIN statements, but only equality joins; non-equality join conditions are not supported.
hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

6.4.2 Table aliases shorten queries and can improve execution by limiting how often each table is read:
hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.4.3 Inner join: only rows with a match in both joined tables are kept (the statement above is an inner join).

6.4.4 Left outer join: all rows of the table on the left of JOIN are returned:
hive (default)> select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;

6.4.5 Right outer join: all rows of the table on the right of JOIN are returned:
hive (default)> select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

6.4.6 Full outer join: returns all records from both tables that satisfy the WHERE condition; where either table has no matching value for the join field, NULL is used instead:
hive (default)> select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

6.4.7 Multi-table joins. Create and load the location table:
create table if not exists default.location(
    loc int, loc_name string
)
row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/datas/location.txt' into table default.location;
Join three tables:
hive (default)> SELECT e.ename, d.deptno, l.loc_name
                FROM emp e
                JOIN dept d ON d.deptno = e.deptno
                JOIN location l ON d.loc = l.loc;
In most cases Hive starts one MapReduce task per pair of JOIN objects: here it first starts a MapReduce job joining table e and table d, then a second MapReduce job joining the first job's output with table l. Note that the order is not d and l first: Hive always joins from left to right.

6.4.8 Cartesian products arise when the join condition is omitted or invalid, or when every row of one table matches every row of the other; avoid them:
hive (default)> select empno, deptno from emp, dept;

6.4.9 OR is not supported in join predicates:
hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d
                on e.deptno = d.deptno or e.ename = d.dname;   -- wrong

6.5 Sorting

6.5.1 Global ordering (ORDER BY): a single reducer performs a global sort.
1) ORDER BY comes at the end of the SELECT statement; ASC (ascend) is the default, DESC (descend) sorts descending.
2) Examples:
hive (default)> select * from emp order by sal;
hive (default)> select * from emp order by sal desc;
Order by an alias (twice the salary):
hive (default)> select ename, sal*2 twosal from emp order by twosal;
Order by multiple columns:
hive (default)> select ename, deptno, sal from emp order by deptno, sal;

6.5.2 Sorting within each reducer (SORT BY): sort by produces one sorted file per reducer, sorting within each reducer rather than globally.
1) Set the number of reducers:
hive (default)> set mapreduce.job.reduces=3;
2) Check the setting:
hive (default)> set mapreduce.job.reduces;
3) Sort by employee number, descending:
hive (default)> select * from emp sort by empno desc;
4) Write the result to files to inspect the per-reducer ordering:
hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result' select * from emp sort by deptno desc;

6.5.3 Partitioned sorting (DISTRIBUTE BY): distribute by works like MapReduce's partition, controlling which reducer each row goes to, and is combined with sort by. Note: distribute by must come before sort by, and the effect is only visible with more than one reducer.
hive (default)> set mapreduce.job.reduces=3;
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

6.5.4 CLUSTER BY: when the distribute by and sort by columns are the same, cluster by can be used instead. Besides distributing, cluster by also sorts, but only ascending; ASC or DESC cannot be specified. The following two statements are equivalent:
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;
Note: partitioning by deptno does not pin one department per partition file; for example departments 20 and 30 may land in the same partition.
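As a hedged note on how rows are routed: with distribute by, a row goes to reducer hash(column) % numReducers, so every row sharing a deptno lands in the same output file. The file layout described in the comments is an assumption about a typical run, not guaranteed output.

hive (default)> set mapreduce.job.reduces=3;
-- produces three files (000000_0 .. 000002_0); each holds whole departments,
-- internally sorted by empno descending, but there is no global order
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result'
                select * from emp distribute by deptno sort by empno desc;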
6.6 Bucketing and sampling queries

6.6.1 Bucketed tables

Partitioning targets the data's storage paths; bucketing targets the data files themselves. Bucketing splits a data set into parts that are easier to manage when partitioning alone is not a sensible fit.
1) Create a bucketed table and check its bucket count:
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
hive (default)> desc formatted stu_buck;
Num Buckets: 4
2) Create an ordinary staging table and load the data into it:
create table stu(id int, name string)
row format delimited fields terminated by '\t';
(load /opt/module/datas/student.txt into stu)
3) Import into the bucketed table with a sub-query:
truncate table stu_buck;
hive (default)> select * from stu_buck;
hive (default)> insert into table stu_buck select id, name from stu cluster by(id);
Inspecting the table directory shows the data still went into a single file. Set the bucketing properties and import again:
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck select id, name from stu cluster by(id);
hive (default)> select * from stu_buck;
The table directory now holds four bucket files.

6.6.2 Bucket sampling queries

For very large data sets, a representative sample is sometimes all that is needed rather than the full result; Hive answers this with sampling queries on bucketed tables:
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note the tablesample syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).
y must be a multiple or a factor of the table's bucket count; Hive derives the sampling proportion from y. With 4 buckets, y=2 draws 4/2 = 2 buckets' worth of data, and y=8 draws 4/8 = half of one bucket. x indicates which bucket to start drawing from: tablesample(bucket 1 out of 4) draws (4/4 =) 1 bucket's data, starting from the first bucket. x must be less than or equal to y, otherwise:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

6.6.3 Data block sampling

Hive also provides sampling by percentage of rows, based on the number of data blocks:
hive (default)> select * from stu tablesample(0.1 percent);

Chapter 7: Functions

7.1 Built-in functions
View detailed usage of a built-in function:
hive> desc function extended upper;

7.2 User-defined functions

1) Hive comes with functions such as max/min, but when the built-in functions cannot express your business logic, consider user-defined functions (UDF: user-defined function).
2) There are three kinds:
(1) UDF (user-defined function): one row in, one row out;
(2) UDAF (User-Defined Aggregation Function): many rows in, one row out, an aggregate like count/max/min;
(3) UDTF (User-Defined Table-Generating Functions): one row in, many rows out, as used with lateral view.
3) Programming steps:
(1) extend org.apache.hadoop.hive.ql.exec.UDF;
(2) implement the evaluate function (evaluate supports overloading, and must not return void);
(3) add the jar in hive: add jar linux_jar_path;
(4) create the function: create [temporary] function [dbname.]function_name AS class_name;
(5) drop the function: drop [temporary] function [if exists] [dbname.]function_name;

7.3 Custom UDF case study

1) Create a Java project and write a lower-casing UDF. The source's code is garbled; the version below restores it, with the package name assumed since the source truncates it:
package com.atguigu.hive;   // assumed package name

import org.apache.hadoop.hive.ql.exec.UDF;

public class Lower extends UDF {
    public String evaluate(final String s) {
        if (s == null) {
            return null;        // pass NULL through unchanged
        }
        return s.toLowerCase(); // one row in, one lower-cased row out
    }
}
2) Package the project as udf.jar and place it under /opt/module/datas/.
3) Add the jar to hive's classpath:
hive (default)> add jar /opt/module/datas/udf.jar;
4) Create a temporary function associated with the Java class (class name follows the assumed package above):
hive (default)> create temporary function my_lower as "com.atguigu.hive.Lower";
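A hedged test of the new function; the output of course depends on your emp data:

hive (default)> select ename, my_lower(ename) lowername from emp;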
Chapter 8: Compression and Storage

8.1 Compiling Hadoop with Snappy support

8.1.1 Resource preparation
1) A CentOS virtual machine with a working network connection. Note: compile as the root role to reduce folder-permission problems.
2) Packages: the hadoop source (hadoop-2.7.2-src.tar.gz), the JDK, maven, protobuf and snappy.

8.1.2 Installing the jar packages
1) Install the JDK, configure JAVA_HOME and PATH, and verify each step with java -version:
[root@hadoop101 software]# tar -zxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
2) Install Maven and configure MAVEN_HOME and PATH:
[root@hadoop101 software]# tar -zxvf apache-maven-3.0.5-bin.tar.gz -C /opt/module/
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin

8.1.3 Compiling the source
1) Install the build dependencies:
[root@hadoop101 software]# yum install autoconf automake libtool cmake
[root@hadoop101 software]# yum install ncurses-devel
[root@hadoop101 software]# yum install openssl-devel
[root@hadoop101 software]# yum install gcc*
2) Build and install snappy:
[root@hadoop101 software]# tar -zxvf snappy-1.1.3.tar.gz -C /opt/module/
[root@hadoop101 module]# cd snappy-1.1.3/
[root@hadoop101 snappy-1.1.3]# ./configure
[root@hadoop101 snappy-1.1.3]# make
[root@hadoop101 snappy-1.1.3]# make install
# check the snappy library files
[root@hadoop101 snappy-1.1.3]# ls -lh /usr/local/lib | grep snappy
3) Build and install protobuf:
[root@hadoop101 software]# tar -zxvf protobuf-2.5.0.tar.gz -C /opt/module/
[root@hadoop101 module]# cd protobuf-2.5.0/
[root@hadoop101 protobuf-2.5.0]# ./configure
[root@hadoop101 protobuf-2.5.0]# make
[root@hadoop101 protobuf-2.5.0]# make install
# check the protobuf version to confirm the installation
[root@hadoop101 protobuf-2.5.0]# protoc --version
4) Compile hadoop with the native snappy libraries (the exact mvn command is truncated in the source):
[root@hadoop101 software]# tar -zxvf hadoop-2.7.2-src.tar.gz
[root@hadoop101 software]# cd hadoop-2.7.2-src/
After a successful build, /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz is the newly generated binary package with snappy compression support.

8.2 Hadoop compression configuration

8.2.1 MR-supported compression codings:

format   | hadoop built-in? | algorithm | extension | splittable? | program changes after switching?
DEFLATE  | yes              | DEFLATE   | .deflate  | no          | none, same as plain text
Gzip     | yes              | DEFLATE   | .gz       | no          | none, same as plain text
bzip2    | yes              | bzip2     | .bz2      | yes         | none, same as plain text
LZO      | no               | LZO       | .lzo      | yes         | needs an index and a specified input format
Snappy   | no               | Snappy    | .snappy   | no          | none, same as plain text

For a sense of snappy's speed, its own claim: "On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more."

8.2.2 Compression parameters. Compression in Hadoop is controlled by the following parameters:

parameter                                          | default      | phase          | advice
io.compression.codecs (core-site.xml)              | codec list   | input          | Hadoop uses the file extension to decide whether a codec is supported
mapreduce.map.output.compress                      | false        | mapper output  | set true to enable
mapreduce.map.output.compress.codec                | DefaultCodec | mapper output  | use LZO or Snappy for fast (de)compression at this stage
mapreduce.output.fileoutputformat.compress         | false        | reducer output | set true to enable
mapreduce.output.fileoutputformat.compress.codec   | DefaultCodec | reducer output | use a standard tool or codec such as gzip or bzip2
mapreduce.output.fileoutputformat.compress.type    | RECORD       | reducer output | the compression type used for SequenceFile output: NONE, RECORD or BLOCK

8.3 Enabling Map-output compression

Enabling map-output compression reduces the data transferred between the map and reduce tasks of a job. The settings (the snappy codec is assumed, matching this chapter's theme):
1) Enable hive intermediate-data compression:
hive (default)> set hive.exec.compress.intermediate=true;
2) Enable map-output compression in mapreduce:
hive (default)> set mapreduce.map.output.compress=true;
3) Set the map-output codec:
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Run a query:
hive (default)> select count(ename) name from emp;

8.4 Enabling Reduce-output compression

When Hive writes output to a table or an HDFS file, that output can be compressed too; hive.exec.compress.output controls this feature. Users may keep the default false in the configuration file so the output stays plain text, and set the value to true per query to turn on result compression.
1) Enable hive final-output compression:
hive (default)> set hive.exec.compress.output=true;
2) Enable mapreduce final-output compression:
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
3) Set the final-output codec (snappy assumed):
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4) Set block compression:
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
5) Test whether the output is a compressed file:
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

8.5 File storage formats

Hive's main table-data storage formats are TEXTFILE, SEQUENCEFILE, ORC and PARQUET.

8.5.1 Column storage versus row storage
Row storage: when a query fetches a whole row that satisfies a condition, a column store must visit every field's aggregated block to collect each column's value, while a row store finds one value and the rest are in adjacent places, so whole-row queries are faster.
Column storage: because each field's data is aggregated and stored together, a query that needs only a few columns reads dramatically less data; and since every value in a column has the same type, better-targeted compression algorithms can be designed per column.
TEXTFILE and SEQUENCEFILE are row-based; ORC and PARQUET are column-based.

8.5.2 TextFile
The default format. Data is not compressed, so disk overhead and parsing cost are high. It can be combined with Gzip or Bzip2 (the system checks automatically and decompresses automatically when executing queries), but with those codecs hive cannot split the data and therefore cannot process it in parallel.

8.5.3 ORC
The Optimized Row Columnar format. Each ORC file consists of one or more stripes of about 250MB each; each stripe has three parts: Index Data, Row Data and Stripe Footer.
1) Index Data: a lightweight index that merely records the offsets, within Row Data, of each field of certain rows.
2) Row Data: the concrete data. It takes a batch of rows and stores them column by column; each column is encoded and stored as multiple streams.
3) Stripe Footer: the streams' types, lengths and similar information.
Each file also has a File Footer holding the list of stripes, the row count per stripe, each column's data type and so on; the file's tail is a PostScript recording the file's compression parameters and the File Footer's length. A reader seeks to the end of the file, reads the PostScript, parses the File Footer's length from it, reads the File Footer, parses the stripe information from that, and then reads each stripe; that is, the file is read from back to front.

8.5.4 Parquet
Parquet stores data in binary, so it cannot be read directly; the file contains both the data and its metadata, making a Parquet file self-describing. Usually the row-group size is set to the HDFS block size when writing Parquet data, so that each row group is handled by one Mapper task, increasing execution parallelism. A Parquet file begins and ends with its Magic Code, used to check that it really is a Parquet file; the footer length records the size of the file metadata. The beginning of every page stores that page's metadata. Parquet has three page types: data pages, dictionary pages and index pages. A column chunk contains at most one dictionary page; index pages record the column's index within the current row group, but index pages are not yet supported in Parquet.

8.5.5 Comparison of the main file formats (compressed size and query speed on the same ~18.1MB log data set)

1) TextFile:
create table log_text(
    track_time string, url string, session_id string, referer string, ip string, city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
hive (default)> dfs -du -h /user/hive/warehouse/log_text;
18.1 M  /user/hive/warehouse/log_text/log.data
2) ORC:
create table log_orc(
    track_time string, url string, session_id string, referer string, ip string, city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
insert into table log_orc select * from log_text;
hive (default)> dfs -du -h /user/hive/warehouse/log_orc;
2.8 M  /user/hive/warehouse/log_orc/000000_0
3) Parquet:
create table log_parquet(
    track_time string, url string, session_id string, referer string, ip string, city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
insert into table log_parquet select * from log_text;
hive (default)> dfs -du -h /user/hive/warehouse/log_parquet;
13.1 M  /user/hive/warehouse/log_parquet/000000_0
Compression-ratio summary: ORC > Parquet > TextFile.
4) Query-speed test, select count(*) on each table (two runs each):
hive (default)> select count(*) from log_text;
Time taken: 21.54 seconds / 20.346 seconds, Fetched: 1 row(s)
hive (default)> select count(*) from log_orc;
Time taken: 20.867 seconds / 20.174 seconds, Fetched: 1 row(s)
hive (default)> select count(*) from log_parquet;
Time taken: 22.922 seconds / 20.149 seconds, Fetched: 1 row(s)
Query-speed summary: the three are close, with ORC > TextFile > Parquet in these runs.

8.6 Combining storage and compression

The ORC storage format supports the following table properties:

key                      | default    | notes
orc.compress             | ZLIB       | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size        | 262,144    | number of bytes in each compression chunk
orc.stripe.size          | 67,108,864 | number of bytes in each stripe
orc.row.index.stride     | 10,000     | number of rows between index entries (must be >= 1000)
orc.create.index         | true       | whether to create row indexes
orc.bloom.filter.columns | ""         | comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp     | 0.05       | false positive probability for bloom filter (must >0.0 and <1.0)

1) Create an uncompressed ORC table:
create table log_orc_none(
    track_time string, url string, session_id string, referer string, ip string, city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");
insert into table log_orc_none select * from log_text;
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none;
7.7 M  /user/hive/warehouse/log_orc_none/000000_0
2) Create a SNAPPY-compressed ORC table:
create table log_orc_snappy(
    track_time string, url string, session_id string, referer string, ip string, city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
insert into table log_orc_snappy select * from log_text;
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy;
3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0
3) The default ORC table created in the previous section was 2.8M after loading: ORC compresses with ZLIB by default, which is why it comes out even smaller than the snappy-compressed table.
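Pulling the chapter together: in practice, Hive tables are commonly stored as ORC or Parquet with snappy compression; this is a widely used rule of thumb rather than something the surviving text states. A hedged template, with placeholder names:

create table app_log(
    track_time string, url string, ip string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");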
Chapter 9: Enterprise-level Tuning

9.1 Fetch

Fetch means that for certain queries Hive need not use MapReduce at all. For example, for SELECT * FROM employees, Hive can simply read the files in employees' storage directory and dump the results to the console. The property hive.fetch.task.conversion controls this; its documentation reads:

hive.fetch.task.conversion (default: more)
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
  none: disable hive.fetch.task.conversion
  minimal: SELECT STAR, FILTER on partition columns, LIMIT only
  more: SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)

Case 1: set the property to none; every one of the following queries then runs a mapreduce program:
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
Case 2: set it to more; none of the same queries runs a mapreduce program:
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;

9.2 Local mode

Most Hadoop jobs need the full scalability Hadoop provides to process large data sets. Sometimes, though, Hive's input volume is very small, and the time spent triggering and scheduling execution tasks for the query can far exceed the actual execution time. For most such cases Hive can handle all tasks on a single machine in local mode, noticeably shortening execution time for small data sets:
set hive.exec.mode.local.auto=true;   // turn on local mr
Hive decides automatically, from the job's input size and file count (the thresholds hive.exec.mode.local.auto.inputbytes.max and hive.exec.mode.local.auto.input.files.max), whether local mode applies.
Case 1, local mode on:
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select * from emp cluster by deptno;
Time taken: 1.328 seconds, Fetched: 14 row(s)
Case 2, local mode off:
hive (default)> set hive.exec.mode.local.auto=false;
hive (default)> select * from emp cluster by deptno;
Time taken: 20.09 seconds, Fetched: 14 row(s)

9.3 Table optimizations

9.3.1 Small table joined with a big table

Putting the table with relatively scattered keys and a small data volume, the small dimension table (under roughly 1000 records), on the left of the join lets the small table be read into memory first and the reduce completed on the map side. Note: current hive has already optimized both small-table-JOIN-big-table and big-table-JOIN-small-table, so in practice which side the small table sits on no longer matters much.
Case study:
1) Create the big table, the small table and the result table:
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
2) Load the data:
hive (default)> load data local inpath '/opt/module/datas/bigtable' into table bigtable;
hive (default)> load data local inpath '/opt/module/datas/smalltable' into table smalltable;
3) Small table JOIN big table:
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b
on b.id = s.id;
Time taken: 35.921 seconds
4) Big table JOIN small table:
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;
Time taken: 34.196 seconds

9.3.2 Big table joined with a big table

1) Empty-KEY filtering. A join sometimes times out because certain keys carry too much data: rows with the same key are all sent to the same reducer and exhaust its memory. Often those keys are anomalous, for example rows whose key is NULL; if they are not needed in the result, filter them out up front.
Create the test tables and load the data:
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/datas/ori' into table ori;
Without filtering the null ids:
select n.* from nullidtable n left join ori o on n.id = o.id;
Time taken: 42.038 seconds
With the null ids filtered first:
select n.* from (select * from nullidtable where id is not null) n left join ori o on n.id = o.id;
Time taken: 31.725 seconds
2) Empty-key conversion. Sometimes a key is empty in many rows but the corresponding rows are needed in the join result. Then give table a's empty-key fields a random value so the rows spread randomly and evenly across the reducers. Without randomization:
select n.* from nullidtable n left join ori b on n.id = b.id;
Result: data skew appears, and some reducers consume far more resources than the others. With randomization:
select n.* from nullidtable n full join ori o
on case when n.id is null then concat('hive', rand()) else n.id end = o.id;
Result: the skew is gone, and the reducers' load is balanced.

9.3.3 MapJoin

If MapJoin is not enabled or the query does not meet its conditions, the Hive parser converts the join into a Common Join, completing the join in the reduce stage, which is prone to data skew. MapJoin instead loads the small table fully into memory and joins on the map side, so no reducer has to handle it.
1) Settings:
(1) enable automatic MapJoin selection (default true):
set hive.auto.convert.join = true;
(2) the small-table threshold (defaults to about 25MB; below this a table counts as small):
set hive.mapjoin.smalltable.filesize=25000000;
2) MapJoin's working mechanism (the source's diagram is omitted): the small table is distributed to every map task as an in-memory hash table, so the join completes entirely in the map phase.
3) Small table JOIN big table:
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on s.id = b.id;
Time taken: 24.594 seconds
4) Big table JOIN small table:
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
Time taken: 24.315 seconds
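For completeness, a hedged sketch of requesting a map join explicitly with the older hint syntax; note that since Hive 0.11 the hint is ignored unless hive.ignore.mapjoin.hint is set to false, so the automatic settings above are the usual route:

hive (default)> set hive.ignore.mapjoin.hint=false;
hive (default)> select /*+ MAPJOIN(s) */ b.id, b.time
                from bigtable b join smalltable s on s.id = b.id;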
9.3.4 Group By

By default, the map phase sends all data for one key to a single reducer, which skews processing when one key's data is too large. Not all aggregation has to finish on the reduce side: much of it can be partially aggregated on the map side, with only the final result produced on the reduce side.
1) Enable map-side aggregation (default true):
set hive.map.aggr = true;
2) Enable load balancing when group-by data is skewed (default false):
set hive.groupby.skewindata = true;
When load balancing is enabled, the generated query plan contains two MR Jobs. In the first MR Job the map output is distributed randomly to the reducers, each of which performs a partial aggregation, so rows with the same Group By Key may land in different reducers; this achieves load balance. The second MR Job then distributes the preprocessed results to the reducers by Group By Key (this step guarantees that identical Group By Keys reach the same reducer) and completes the final aggregation.

9.3.5 Count(distinct) deduplicated counting

With small data volumes it does not matter, but with large volumes a COUNT DISTINCT must be completed by a single Reduce Task; the volume that one Reduce has to process then becomes so large that the whole Job struggles to finish. COUNT DISTINCT is generally replaced by GROUP BY followed by COUNT:
1) On the bigtable created earlier:
hive (default)> select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 7.12 sec  SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
Time taken: 23.607 seconds / 34.941 seconds (two runs), Fetched: 1 row(s)
2) Using GROUP BY instead:
hive (default)> select count(id) from (select id from bigtable group by id) a;
Stage-Stage-1: Map: 1  Reduce: 5  Cumulative CPU: 17.53 sec  HDFS Write: 580  SUCCESS
Stage-Stage-2: Map: 3  Reduce: 1  Cumulative CPU: 4.29 sec  HDFS Read: 9409  HDFS Write: 7  SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
Time taken: 50.795 seconds, Fetched: 1 row(s)
Although this spends one extra Job, with large data volumes it is worth it, because the distinct no longer bottlenecks on a single reducer.

9.3.6 Avoid Cartesian products

Do not join without an on condition, or with an invalid on condition: Hive can only use one reducer to complete a Cartesian product.

9.3.7 Row filtering and column filtering

Column handling: in SELECT, take only the columns you need; when there are partitions, use partition filtering as much as possible and avoid select *.
Row handling: in partition pruning with outer joins, if the secondary table's filter condition is written after WHERE, the whole table is scanned first and filtered afterwards; put the filtering into a subquery instead.
1) Filtering after the join:
hive (default)> select o.id from bigtable b join ori o on o.id = b.id where o.id <= 10;
Time taken: 34.406 seconds / 26.043 seconds, Fetched: 100 row(s)
2) Filtering in a subquery before the join:
hive (default)> select b.id from bigtable b join (select id from ori where id <= 10) o on b.id = o.id;
Time taken: 30.058 seconds / 29.106 seconds, Fetched: 100 row(s)

9.3.8 Dynamic partition tuning

When inserting into a partitioned table, a relational database automatically places each row into the proper partition based on the partition-field value; Hive provides a similar mechanism, dynamic partition (Dynamic Partition), except that using it requires configuration:
1) hive.exec.dynamic.partition=true: enable the dynamic-partition function (default true).
2) hive.exec.dynamic.partition.mode=nonstrict: the default strict requires at least one static partition column; nonstrict allows dynamic values for all partition columns.
3) hive.exec.max.dynamic.partitions=1000: the maximum number of dynamic partitions that may be created across all MR nodes.
4) hive.exec.max.dynamic.partitions.pernode: the maximum dynamic partitions per MR node, to be set according to the actual data. For example, if the source data covers a year of days and the day is the partition field, this needs to be set above 365; with the default value of 100 the job errors out.
5) hive.exec.max.created.files=100000: the maximum number of HDFS files that may be created in the whole MR Job.
6) hive.error.on.empty.partition=false: whether to throw an exception when an empty partition is generated.
Case study: insert the data of ori by time into the corresponding partitions of the target table ori_partitioned_target.
1) Create the partitioned source table and load it:
create table ori_partitioned(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
partitioned by (p_time string)
row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/datas/ds1' into table ori_partitioned partition(p_time='20111230000010');
hive (default)> load data local inpath '/opt/module/datas/ds2' into table ori_partitioned partition(p_time='20111230000011');
2) Create the target partitioned table (the source preview cuts off here):
create table ori_partitioned_target(id bigint, ...
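The preview ends mid-DDL. A hedged sketch of how the example presumably continues: the target schema mirrors the source table, and the dynamic-partition insert selects the partition column last. Every name past the truncation point is an assumption.

create table ori_partitioned_target(id bigint, time bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
partitioned by (p_time string)
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=400;   -- must exceed the number of distinct partition values
set hive.exec.max.created.files=100000;
set hive.error.on.empty.partition=false;

-- the dynamic partition column (p_time) must come last in the select list
insert overwrite table ori_partitioned_target partition(p_time)
select id, time, uid, keyword, url_rank, click_num, click_url, p_time
from ori_partitioned;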
