Titanic數(shù)據(jù)集分析.doc

上傳人：工*** IP屬地：浙江上傳時間：2020-04-05 格式：DOC 頁數(shù)：20 大?。?39.83KB 積分：15 舉報 版權(quán)申訴

已閱讀5頁，還剩15頁未讀，繼續(xù)免費(fèi)閱讀

版權(quán)說明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請進(jìn)行舉報或認(rèn)領(lǐng)

文檔簡介

泰坦尼克數(shù)據(jù)集探索1. 簡介：從泰塔尼克數(shù)據(jù)集中，根據(jù)每個乘客的信息，建立模型并進(jìn)行預(yù)測。整篇文章分為三步：1. 特征選擇2. 缺失數(shù)據(jù)處理3. 預(yù)測1.1 導(dǎo)入軟件包并檢查數(shù)據(jù) library(ggplot2) # 可視化 library(ggthemes) # 可視化 library(scales) # 可視化 library(dplyr) # 數(shù)據(jù)處理 library(mice) # 填充缺失數(shù)據(jù) library(randomForest) # 分類算法 #數(shù)據(jù)的導(dǎo)入 setwd(D:/Titanic)#設(shè)置默認(rèn)功過路徑 train test full # check data str(full)我們觀察到一共有1309條數(shù)據(jù)，每一條數(shù)據(jù)有12個相關(guān)變量。2. 特征工程頭銜# 從名稱中挖掘 # 從乘客名字中提取頭銜 #R中的grep、grepl、sub、gsub、regexpr、gregexpr等函數(shù)都使用正則表達(dá)式的規(guī)則進(jìn)行匹配。默認(rèn)是egrep的規(guī)則，sub函數(shù)只實現(xiàn)第一個位置的替換，gsub函數(shù)實現(xiàn)全局的替換。 full$Title # 查看按照性別劃分的頭銜數(shù)量 table(full$Sex, full$Title)我們發(fā)現(xiàn)頭銜的類別太多，并且好多出現(xiàn)的頻次是很低的，我們可以將這些類別進(jìn)行合并 rare_title # 重命名稱呼 full$Titlefull$Title = Mlle full$Titlefull$Title = Ms full$Titlefull$Title = Mme full$Titlefull$Title %in% rare_title # 再次查看按照性別劃分的頭銜數(shù)量 table(full$Sex, full$Title)可以看到頭銜的個數(shù)得到了大量的縮減 #sapply()函數(shù)：根據(jù)傳入?yún)?shù)規(guī)則重新構(gòu)建一個合理的數(shù)據(jù)類型返回 full$Surname # Create a family size variable including the passenger themselves full$Fsize # Create a family variable full$Family #為了直觀顯示，我們可以用ggplot2 畫出家庭成員數(shù)量和生存家庭數(shù)情況的圖形 ggplot(full1:891, aes(x = Fsize, fill = factor(Survived) + geom_bar(stat=count, position=dodge) + scale_x_continuous(breaks=c(1:11) + labs(x = Family Size) + theme_few() full$FsizeDfull$Fsize = 1 full$FsizeDfull$Fsize 1 full$FsizeDfull$Fsize 4 # Show family size by survival using a mosaic plot mosaicplot(table(full$FsizeD, full$Survived), main=Family Size by Survival, shade=TRUE)嘗試創(chuàng)建一些新的特征 # This variable appears to have a lot of missing values full$Cabin1:28 # Create a Deck variable. Get passenger deck A - F: full$Deck # Use ggplot2 to visualize embarkment, passenger class, & median fare ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass) + geom_boxplot() + geom_hline(aes(yintercept=80), + colour=red, linetype=dashed, lwd=2) + scale_y_continuous(labels=dollar_format() + theme_few()可以看到出發(fā)的一級乘客的中位票價與我們的登機(jī)手續(xù)費(fèi)乘客支付的80美元相當(dāng)。我們接近在這里和那里確定了幾個缺失值的位置。 1044行上的乘客的票價是缺失值。 # Since their fare was $80 for 1st class, they most likely embarked from C full$Embarkedc(62, 830) # Show row 1044 full1044, 這是從南安普敦（S）出發(fā)的三級乘客。讓所有其他人分享他們的班級和登機(jī)牌（n = 494）可視化票價。 ggplot(fullfull$Pclass = 3 & full$Embarked = S, , + aes(x = Fare) + geom_density(fill = #99d6ff, alpha=0.4) + + geom_vline(aes(xintercept=median(Fare, na.rm=T),+ colour=red, linetype=dashed, lwd=1) + scale_x_continuous(labels=dollar_format() + theme_few()從這個可視化的角度來看，將NA票價值替換為上課時間為8.05美元的中位數(shù)似乎是相當(dāng)合理的。 # Replace missing fare value with median fare for class/embarkment full$Fare1044 # Show number of missing Age values sum(is.na(full$Age) # Make variables factors into factors factor_vars fullfactor_vars # Set a random seed set.seed(129) # Perform mice imputation, excluding certain less-than-useful variables: mice_mod # Save the complete output mice_output # Plot age distributions par(mfrow=c(1,2) hist(full$Age, freq=F, main=Age: Original Data, + col=darkgreen, ylim=c(0,0.04) hist(mice_output$Age, freq=F, main=Age: MICE Output, + col=lightgreen, ylim=c(0,0.04) # Show new number of missing Age values sum(is.na(full$Age)第二次特征工程現(xiàn)在我們知道每個人的年齡，我們可以創(chuàng)造幾個新的年齡變量：兒童和母親。一個孩子只會是18歲以下的人，母親是1）女性2）18歲以上3）有超過0個孩子4）沒有頭銜 Miss # First well look at the relationship between age & survival ggplot(full1:891, aes(Age, fill = factor(Survived) + + geom_histogram() + + # I include Sex since we know (a priori) its a significant predictor+ facet_grid(.Sex) + + theme_few() # Create the column child, and indicate whether child or adult full$Childfull$Age 18 full$Childfull$Age = 18 # Show counts table(full$Child, full$Survived) # Adding Mother variable full$Mother full$Motherfull$Sex = female & full$Parch 0 & full$Age 18 & full$Title != Miss # Show counts table(full$Mother, full$Survived) # Finish by factorizing our two new factor variables full$Child full$Mother md.pattern(full)預(yù)測最后，我們準(zhǔn)備根據(jù)我們精心策劃和處理缺失值的變量，預(yù)測在泰坦尼克號的乘客中誰能幸存下來。為此，我們將依靠隨機(jī)森林分類算法; 畢竟，我們花了所有的時間來進(jìn)行數(shù)據(jù)處理。拆分測試與訓(xùn)練 # Split the data back into a train set and a test set train test # Set a random seed set.seed(754) # Build the model (note: not all possible variables are used) rf_model # Show model error plot(rf_model, ylim=c(0,0.36) legend(topright, colnames(rf_model$err.rate), col=1:3, fill=1:3)黑線顯示總體錯誤率低于20。紅線和綠線分別顯示“死亡”和“幸存”的錯誤率。我們可以看到，現(xiàn)在我們比預(yù)測死亡更成功，而不是生存。變量重要性 # Get importance importance varImportance # Create a rank variable based on importance rankImportance %+ mutate(Rank = paste0(#,dense_rank(desc(Importance) # Use ggplot2 to visualize the relative importance of variables ggplot(rankImportance, aes(x = reorder(Variables, Importance), + y = Importance, fill = Importance) + geom_bar(stat=identity) + +

人人文庫> 全部分類> 教育資料 > 幼兒教育

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽，若沒有圖紙預(yù)覽就沒有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫網(wǎng)僅提供信息存儲空間，僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理，對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對任何下載內(nèi)容負(fù)責(zé)。
6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容，請與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

Titanic數(shù)據(jù)集分析.doc

文檔簡介

溫馨提示

最新文檔

評論

Titanic數(shù)據(jù)集分析.doc

文檔簡介

溫馨提示

最新文檔

評論

相關(guān)文檔