Big Data Collection and Preprocessing: In-Class Exercises and Answers

Uploaded by: 熊*** | IP location: Shandong | Upload date: 2024-10-17 | Format: DOCX | Pages: 14

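Chapter 1

1. Create a project in PyCharm with a name of your choice, and implement a "Welcome to Python!" program in it.

Creating the project in PyCharm: First, double-click the PyCharm icon on the desktop to start PyCharm and select "File" -> "New Project". In the Location text box of the window that opens, enter your project name and the path where the project should be stored. Then move the mouse to the project root node, right-click, and select "New" -> "Python File". This creates a Python file in PyCharm that uses the Python 3.6 base interpreter as its programming environment. Give the file a name of your choice, enter the following in the code editor on the right, and run the file:

    print("Welcome to Python!")

2. Create a project in PyCharm with a name of your choice. In it, define a list, add data to the list with the list method append(), and finally iterate over the list with a for loop to print each element.

    list1 = ['hello', 'world', 2020]
    list1.append('python')  # add an element with append()
    print("list1[3]:", list1[3])  # print the element that was added
    for e in list1:  # iterate over the list elements
        print(e)

Chapter 2

1. Import the requests library and use it to fetch the page data of the official Python website.

    import requests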

    req = requests.get('***.python***/')
    req.encoding = 'utf-8'
    print(req.text)

2. Import lxml and BeautifulSoup and use them to parse the page data fetched from the official Python website.

    import requests
    from bs4 import BeautifulSoup

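    req = requests.get('***.python***/')
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, 'lxml')  # parse with the lxml parser
    print(soup.title.string)  # text of the <title> element
    # select the link inside the currently selected top navigation entry
    item = soup.select('#top > nav > ul > li.python-meta.current_item.selectedcurrent_branch.selected > a')
    print(item)

Chapter 3

1. Use Python to read and write CSV and JSON data.

Reading a CSV file (data of your choice):

    import csv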

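    file_to_use = '学生信息.csv'  # "student information" file
    with open(file_to_use, 'r', encoding='utf-8', newline='') as f:
        r = csv.reader(f)
        file_header = next(r)  # the first row is the header
        print(file_header)
        for index, file_header_col in enumerate(file_header):
            print(index, file_header_col)
        for row in r:
            if row[2] == '学号':  # third column equals "student ID"
                print(row)

Writing a CSV file (data of your choice):

    import csv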

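    # newline='' keeps the csv module from writing blank rows on Windows
    with open('学生信息.csv', 'a', encoding='utf-8', newline='') as f:
        wr = csv.writer(f)
        # write several rows at once (columns: role, skill, level, name)
        wr.writerows([['大数据运维', 'hadoop', '高级技术员', '张三'], ['大数据开发', 'python', '中级技术员', '李四']])
        # write the same rows one at a time
        wr.writerow(['大数据运维', 'hadoop', '高级技术员', '张三'])
        wr.writerow(['大数据开发', 'python', '中级技术员', '李四'])

2. Use Python to connect to MySQL, create a database and a table, and implement insert, delete, query, and update operations.

    import pymysql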

    # PyMySQL 1.0 and later accept keyword arguments only
    db = pymysql.connect(host="localhost", user="root", password="your_password", database="your_database")
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS employee")

    # Create the table
    sql = """CREATE TABLE `employee` (
      `id` int(10) NOT NULL AUTO_INCREMENT,
      `first_name` char(20) NOT NULL,
      `last_name` char(20) DEFAULT NULL,
      `age` int(11) DEFAULT NULL,
      `sex` char(1) DEFAULT NULL,
      `income` float DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
    cursor.execute(sql)
    print("Created table Successfully.")

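    # Insert (lowercase table name: table names are case-sensitive on Linux by default)
    sql2 = """INSERT INTO employee (first_name, last_name, age, sex, income)
              VALUES ('Mac', 'Su', 20, 'M', 5000)"""
    cursor.execute(sql2)
    print("Insert table Successfully.")

    # Query
    sql3 = """SELECT * FROM employee"""
    cursor.execute(sql3)
    print(cursor.fetchall())
    print("SELECT table Successfully.")

    # Update
    sql4 = """UPDATE employee SET first_name = 'Sam' WHERE id = 3"""
    cursor.execute(sql4)
    print("UPDATE table Successfully.")

    # Delete
    sql5 = """DELETE FROM employee WHERE first_name = 'Mac'"""
    cursor.execute(sql5)
    print("DELETE table Successfully.")

    # Commit the changes (PyMySQL does not autocommit by default) and close
    db.commit()
    db.close()

Chapter 4

1. Use the API provided by a business website to collect, clean, and store data.

    import requests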

    import pymysql

    api_url = '***//api.github***/search/repositories?q=spider'
    req = requests.get(api_url)
    print('Status code:', req.status_code)
    req_dic = req.json()
    print('Total number of repositories related to spider:', req_dic['total_count'])
    print('Results incomplete:', req_dic['incomplete_results'])
    req_dic_items = req_dic['items']
    print('Number of items returned on this page:', len(req_dic_items))

    # Collect the repository names and sort them
    names = []
    for key in req_dic_items:
        names.append(key['name'])
    sorted_names = sorted(names)

    # database_name is a placeholder, used below for both the database and the table
    db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
    cursor = db.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS database_name DEFAULT CHARACTER SET utf8mb4")
    db.close()

    db2 = pymysql.connect(host="localhost", user="root", password="your_password", database="database_name", port=3306)
    cursor2 = db2.cursor()
    cursor2.execute("DROP TABLE IF EXISTS database_name")
    # varchar(100) leaves room for long repository names
    sql1 = """CREATE TABLE `database_name` (
      `id` int(10) NOT NULL AUTO_INCREMENT,
      `full_name` varchar(100) NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
    cursor2.execute(sql1)
    print("Created table successfully.")

    # Start at 1: inserting 0 into an AUTO_INCREMENT column makes MySQL generate
    # the next id instead, which then collides with the following row
    for index, name in enumerate(sorted_names, start=1):
        print('Item index:', index, 'Item name:', name)
        sql2 = 'INSERT INTO database_name (id, full_name) VALUES (%s, %s)'
        try:
            cursor2.execute(sql2, (index, name))
            db2.commit()
        except pymysql.MySQLError:
            db2.rollback()
    db2.close()

2. By analyzing the structure of a specific page and the content of its data, use Python to collect AJAX data and store the results in a MySQL database.

    from urllib.parse import urlencode

    import requests
    import pymysql

    original_url = '***.autohome***.cn/ashx/AjaxIndexHotCarByDsj.ashx?'
    requests_headers = {
        'Referer': '***.autohome***.cn/beijing/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }

    db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
    cursor = db.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS AJAX DEFAULT CHARACTER SET utf8mb4")
    db.close()

    db2 = pymysql.connect(host="localhost", user="root", password="your_password", database="AJAX", port=3306)
    cursor2 = db2.cursor()
    cursor2.execute("DROP TABLE IF EXISTS ajax")
    sql1 = """CREATE TABLE `ajax` (
      `car_name` char(20) NOT NULL,
      `id` int(10) NOT NULL AUTO_INCREMENT,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
    cursor2.execute(sql1)
    print("Created table successfully.")


    def get_one(cityid):
        """Request the AJAX endpoint for one city and return the parsed JSON."""
        p = {'cityid': cityid}
        complete_url = original_url + urlencode(p)
        try:
            # the header dict must be passed as headers=, not params=
            response = requests.get(url=complete_url, headers=requests_headers)
            if response.status_code == 200:
                return response.json()
        except requests.ConnectionError as e:
            print('Error', e.args)


    def parse_three(data):
        """Print each car series name and id, and store them in the ajax table."""
        if data:
            for i in data:
                for b in i.get('SeriesList'):
                    item_list = b.get('Name')
                    item_list2 = b.get('Id')
                    print(item_list + ':' + str(item_list2))
                    sql2 = 'INSERT INTO ajax (car_name, id) VALUES (%s, %s)'
                    try:
                        cursor2.execute(sql2, (item_list, item_list2))
                        db2.commit()
                    except pymysql.MySQLError:
                        db2.rollback()


    if __name__ == '__main__':
        city_list = [{'Beijing': '110100'}, {'Chongqing': '500100'}, {'Shanghai': '310100'}]
        for city in city_list:
            # pass the city id string itself, not the dict_values view
            for cityid in city.values():
                jo = get_one(cityid)
                parse_three(jo)
        db2.close()

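    # jo = get_one(110100)
    # parse_one(jo)
    # parse_two(jo)
    # parse_three(jo)

    # def parse_one(json):
    #     if json:
    #         for i in json:
    #             item_list = i.get('Name')
    #             print(item_list)

    # def parse_two(json):
    #     if json:
    #         for i in json:
    #             for b in i.get('SeriesList'):
    #                 item_list = b.get('Name')
    #                 print(item_list)

Chapter 5

I. Multiple-choice question

1. What is the main purpose of the Selenium library? ( )
A. Storing data
B. Automating browser operations and web page access
C. Data visualization
D. Writing front-end web page code

II. True/false questions

2. WebDriverWait is one of the mechanisms Selenium provides for waiting on a condition; it can wait for a specific element to appear. ( )
3. When automating web pages with Selenium, there is no need to consider page load time or the order in which elements appear. ( )

Answers: 1. B  2. True  3. False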
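Answers 2 and 3 both rest on the idea of an explicit wait: poll a condition until it holds or a timeout expires, rather than assuming the page is ready. The sketch below illustrates that polling loop using only the standard library. It is not Selenium's implementation; `wait_until`, `WaitTimeout`, and `FakePage` are hypothetical names introduced purely for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=2.0, poll_interval=0.05):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose element only appears after a few polls."""

    def __init__(self, polls_until_ready):
        self.polls_left = polls_until_ready

    def find_search_box(self):
        # Returns None while the element is still "loading".
        if self.polls_left > 0:
            self.polls_left -= 1
            return None
        return "search-box"


page = FakePage(polls_until_ready=3)
element = wait_until(page.find_search_box, timeout=2.0, poll_interval=0.01)
print(element)
```

In Selenium itself, the corresponding pattern is `WebDriverWait(browser, 10).until(...)` combined with a condition from `selenium.webdriver.support.expected_conditions`.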

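III. Practical exercise

Write Python code that uses Selenium to open the home page of a business website, enter the keyword "Python编程" ("Python programming") in the search box, and simulate a click on the search button.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Create the browser driver
    browser = webdriver.Chrome()
    # Open the Baidu home page
    browser.get("***.baidu***")
    # Locate the search box and enter the keyword
    # (find_element_by_id was removed in Selenium 4; use find_element with By.ID)
    search_box = browser.find_element(By.ID, "kw")
    search_box.send_keys("Python编程")
    # Locate the search button and click it
    search_button = browser.find_element(By.ID, "su")
    search_button.click()

Chapter 6

Use Scrapy to create a project, crawl page data from a website, and save it to a MySQL database (the website is your choice).

SpiderDemo.py, the main spider code:

    import scrapy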

    # Import the item class defined in this project
    from DemoAuto.items import DemoautoItem


    class MyScr(scrapy.Spider):
        # Globally unique spider name
        name = 'DemoAuto'
        # URL to crawl
        start_urls = ['***.autohome***.cn/all/#pvareaid=3311229']

        # Parsing method
        def parse(self, response):
            # Select the data with XPath; the expressions depend on the page structure.
            # First get the link element of each entry
            for div in response.xpath('//*[@id="auto-channel-lazyload-article"]/ul/li/a'):
                # Create a fresh container per entry so yielded items do not share state
                item = DemoautoItem()
                # Get the title and summary text of the entry
                item['title'] = div.xpath('.//h3/text()').extract()[0].strip()
                item['content'] = div.xpath('.//p/text()').extract()[0].strip()
                # Hand the item to the pipelines
                yield item

Items.py code:

    import scrapy

    class DemoautoItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # Stores the title
        title = scrapy.Field()
        # Stores the summary text
        content = scrapy.Field()

Middlewares.py code:

    # -*- coding: utf-8 -*-

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # ***//doc.scrapy***/en/latest/topics/spider-middleware.html

    from scrapy import signals


    class DemoautoSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.

            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.

            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.

            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.

            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

    class DemoautoDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

Pipelines.py code:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: ***//doc.scrapy***/en/latest/topics/item-pipeline.html

    import json
    import pymysql


    class DemoautoPipeline(object):
        def __init__(self):
            # Open the output file
            self.file = open('data.json', 'w', encoding='utf-8')

        # This method processes each item
        def process_item(self, item, spider):
            # Serialize the item's data as one JSON line
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            # Write it to the file
            self.file.write(line)
            # Return the item so later pipelines can use it
            return item

        # Called when the spider is opened
        def open_spider(self, spider):
            pass

        # Called when the spider is closed
        def close_spider(self, spider):
            # Close the output file so buffered lines are flushed to disk
            self.file.close()

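    def dbHandle():
        # PyMySQL 1.0 and later accept keyword arguments only
        conn = pymysql.connect(host="localhost", user="root", password="your_database_password", database="test")
        return conn


    class MySQLPipeline(object):
        ...  # the source preview ends here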
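The source breaks off at the `MySQLPipeline` class header, so its body is unknown. As a rough sketch of how a database-backed item pipeline of this shape could be organized, the example below uses the standard-library `sqlite3` module in place of MySQL so that it runs without a database server. The class name `SQLiteSketchPipeline`, the `articles` table, and its columns are assumptions made for illustration, not the original author's code.

```python
import sqlite3


class SQLiteSketchPipeline:
    """Hypothetical pipeline: stores each item's title and content in SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "title TEXT NOT NULL, "
            "content TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query: values are passed separately from the SQL text.
        try:
            self.conn.execute(
                "INSERT INTO articles (title, content) VALUES (?, ?)",
                (item["title"], item.get("content")),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Drive the pipeline by hand with plain dicts standing in for scraped items.
pipeline = SQLiteSketchPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "First article", "content": "Summary text"}, spider=None)
pipeline.process_item({"title": "Second article", "content": None}, spider=None)
rows = pipeline.conn.execute("SELECT id, title FROM articles ORDER BY id").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a Scrapy project, a pipeline like this would be registered in the `ITEM_PIPELINES` setting. With MySQL through PyMySQL, the `?` placeholders would become `%s`.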
