Java Development Project Collection with Source Code

Java program for scraping Sina weather-forecast news

package .weather1; // leading package segments are missing in the source

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import .update.Getdata; // leading package segments are missing in the source

/**
 * Scrapes news items from the Sina weather news pages using regular expressions.
 */
public class Newlist {
    private static final Log log = LogFactory.getLog(Newlist.class);

    /**
     * Test entry point.
     * @param args
     */
    public static void main(String[] args) {
        Newlist n = new Newlist();
        String[] k = n.getNewList();
        for (int i = 0; i < k.length; i++) {
            System.out.println(k[i].replace("href=\"", "href=\"newinfo2.jsp?url="));
        }
        String[] m = n.getNewinfo("news/2008/1119/35261.html");
        for (int l = 0; l < m.length; l++) {
            System.out.println(m[l]);
        }
    }

    /**
     * Returns the news content for a URL as an array of strings.
     * Images in the article are downloaded locally and the image
     * addresses in the text are rewritten to the local file names.
     * @param url
     * @return
     */
    public String[] getNewinfo(String url) {
        String URL = "..."; // truncated in the source; presumably a base address combined with url
        // 30 means: take up to 30 strings that match the given regular expression
        String[] s = analysis("<p>(.*?)</p>", getContent(URL), 30);
        for (int i = 0; i < s.length; i++) {
            Pattern sp = Pattern.compile("src=\"(.*?)\"");
            Matcher matcher = sp.matcher(s[i]);
            if (matcher.find()) {
                String imageurl = analysis("src=\"(.*?)\"", s[i], 1)[0];
                if (!imageurl.startsWith("http://")) {
                    imageurl = "..."; // truncated in the source after the opening quote; presumably a host prefix + imageurl
                }
                System.out.println("News item has an image: " + imageurl);
                String content = getContent(imageurl);
                String[] images = imageurl.split("/");
                String imagename = images[images.length - 1];
                System.out.println("Image name: " + imagename);
                try {
                    File fwl = new File(imagename);
                    PrintWriter outl = new PrintWriter(fwl);
                    outl.println(content);
                    outl.close();
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
                System.out.println("si:" + s[i]);
                // Rewrite the image address in the text to the local file name
                s[i] = s[i].replace(analysis("src=\"(.*?)\"", s[i], 1)[0], imagename);
            }
        }
        return s;
    }

    public String[] getNewList() {
        String url = "...r/news/index.html"; // beginning of the address is missing in the source
        return getNewList(getContent(url));
    }

    private String[] getNewList(String content) {
        // String[] s = analysis("align=\"center\" valign=\"top\"><img src=\"./images/a(.*?).gif\" width=\"70\" height=\"65\"></td>", content, 50);
        String[] s = analysis("<li>(.*?)</li>", content, 50);

        return s;
    }

    /**
     * Returns capture group 1 of at most i matches of pattern in match.
     */
    private String[] analysis(String pattern, String match, int i) {
        Pattern sp = Pattern.compile(pattern);
        Matcher matcher = sp.matcher(match);
        String[] content = new String[i];
        int found = 0;
        // Stop at the array bound so that extra matches cannot overflow it
        while (found < i && matcher.find()) {
            content[found++] = matcher.group(1);
        }
        // Drop the unused (null) tail of the array
        String[] content2 = new String[found];
        for (int n = 0; n < found; n++) {
            content2[n] = content[n];
        }
        return content2;
    }

    /*
     * Fetches the content of a web page from its address.
     * @param strUrl
     * @return
    private String getContent(String strUrl) {
        try {
            // URL url = new URL(strUrl);
            // BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            URLConnection uc = new URL(strUrl).openConnection();
            // Set the User-Agent HTTP header so the request looks like it comes from a browser
            uc.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
            System.out.println("-");
            System.out.println("Content-Length: " + uc.getContentLength());
            System.out.println("Set-Cookie: " + uc.getHeaderField("Set-Cookie"));
            System.out.println("-");
            // Print the response headers
            System.out.println("Header" + uc.getHeaderFields().toString());
            System.out.println("-");

            BufferedReader br = new BufferedReader(new InputStreamReader(uc.getInputStream(), "gb2312"));
            String s = "";
            StringBuffer sb = new StringBuffer();
            while ((s = br.readLine()) != null) {
                sb.append(s + "\r\n");
            }
            System.out.println("Length+" + sb.toString().length());
            return sb.toString();
        } catch (Exception e) {
            return "error open url" + strUrl;
        }
    }
    */

    public static String getContent(String strUrl) {
        URLConnection uc = null;
        String all_content = null;
        try {
            all_content = new String();
            URL url = new URL(strUrl);
            uc = url.openConnection();
            uc.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
            System.out.println("-");
            System.out.println("Content-Length: " + uc.getContentLength());
            System.out.println("Set-Cookie: " + uc.getHeaderField("Set-Cookie"));
            System.out.println("-");
            // Print the response headers
            System.out.println("Header" + uc.getHeaderFields().toString());
            System.out.println("-");
            if (uc == null) {
                return null;
            }
            InputStream ins = uc.getInputStream();
            ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
            byte[] str_b = new byte[1024];
            int i = -1;
            while ((i = ins.read(str_b)) > 0) {
                outputstream.write(str_b, 0, i);
            }
            all_content = outputstream.toString();
            // System.out.println(all_content);
        } catch (Exception e) {
            e.printStackTrace();
            log.error("Error fetching page content");
        } finally {
            uc = null;
        }
        // return new String(all_content.getBytes("ISO8859-1"));
        System.out.println(all_content.length());
        return all_content;
    }
}

The current problem: images do not download completely. When I download an image with the latter two getContent methods, the size of the result never matches the Content-Length reported in the response header (that is, the real size of the image), and the file cannot be previewed.

Repeated tests show that each method produces the same size every time, so downloading again does not help.

Testing shows that the length after toString is smaller than the real image size, while the generated image file is larger than the image data: the data grew while it was being stored after download. The size of the image data changes during toString and cannot be restored. This happens because toString decodes the bytes as text in some character set, and arbitrary binary data does not survive that decoding. Ordinary news content is unaffected. Writing the image file directly from the byte stream as it is read solves the problem:

    public int saveImage(String strUrl) {
        URLConnection uc = null;
        try {
            URL url = new URL(strUrl);
            uc = url.openConnection();
            uc.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
            // uc.setReadTimeout(30000);
            // Print the image length
            // System.out.println("Content-Length: " + uc.getContentLength());
            // Print the response headers
            // System.out.println("Header" + uc.getHeaderFields().toString());
            if (uc == null) {
                return 0;
            }
            InputStream ins = uc.getInputStream();
            byte[] str_b = new byte[1024];
            int byteRead = 0;
            String[] images = strUrl.split("/");
            String imagename = images[images.length - 1];
            File fwl = new File(imagename);
            FileOutputStream fos = new FileOutputStream(fwl);
            // Copy the raw bytes to the file without converting them to text
            while ((byteRead = ins.read(str_b)) > 0) {
                fos.write(str_b, 0, byteRead);
            }
            fos.flush();
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
            log.error("Error fetching page content");
        } finally {
            uc = null;
        }
        return 1;
    }

Method 2: First read the search-results page through a stream, then write a regular expression that strips the unwanted content, and finally save the result as an XML file or store it directly in a database, to be loaded when needed. The code here only prints the HTML source of the page; to extract content, adapt the regular expression in public static String regex().

package rssTest;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyRSS {
    /**
     * Fetches the HTML source of the search results.
     */
    public static String getHtmlSource(String url) {
        // Start with an empty buffer so a failed request returns "" instead of throwing
        StringBuffer codeBuffer = new StringBuffer();
        BufferedReader in = null;
        try {
            URLConnection uc = new URL(url).openConnection();
            /*
             * Some sites only accept requests submitted by a browser, to stop
             * clients from reading page content directly. Setting the User-Agent
             * HTTP header disguises the request as a browser request.
             */
            uc.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows XP; DigExt)");
            // Read the content of the URL stream
            in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "gb2312"));
            String tempCode = "";
            // Read the buffered lines and append them to codeBuffer
            while ((tempCode = in.readLine()) != null) {
                codeBuffer.append(tempCode).append("\n");
            }
            in.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return codeBuffer.toString();
    }

    /**
     * The regular expression.
     */
    public static String regex() {
        String googleRegex = "<div class=g>(.*?)href=\"(.*?)\"(.*?)\">(.*?)</a>(.*?)<div class=std>(.*?)<br>";
        return googleRegex;
    }

    /**
     * For testing: searches Google for a keyword and extracts the wanted content.
     */
    public static List<String> GetNews() {
        List<String> newsList = new ArrayList<String>();
        // The beginning of the address is missing in the source
        String allHtmlSource = MyRSS.getHtmlSource("...n/search?complete=1&hl=zh-CN&newwindow=1&client=aff-os-maxthon&hs=SUZ&q=%E8%A7%81%E9%BE%99%E5%8D%B8%E7%94%B2&meta=&aq=f");
        Pattern pattern = Pattern.compile(regex());
        Matcher matcher = pattern.matcher(allHtmlSource);
        while (matcher.find()) {
            String urlLink = matcher.group(2);
            String title = matcher.group(4);
            title = title.replaceAll("<font color=CC0033>", "");
            title = title.replaceAll("</font>", "");
            title = title.replaceAll("<b>.</b>", "");
            String content = matcher.group(6);
            content = content.replaceAll("<font color=CC0033>", "");
            content = content.replaceAll("</font>", "");
            content = content.replaceAll("<b>.</b>", "");
            newsList.add(urlLink);
            newsList.add(title);
            newsList.add(content);
        }
        return newsList;
    }

    /**
     * main method
     */
    public static void main(String[] args) {
        System.out.println(MyRSS.getHtmlSource("...")); // address elided in the source
    }
}

Method 3: automatic news scraping for JSP

package com.news.spider;

import java.io.File;
import java.io.FileFilter;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.db.DBAccess;

public class SpiderNewsServer {
    public static void main(String[] args) throws Exception {
        // Entry page for the scrape
        String endPointUrl = "..."; // address elided in the source
        // Current date
        Calendar calendar = Calendar.getInstance();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        String DateNews = sdf.format(calendar.getTime());

        /*
         * Collect second-level URLs: start
         * URL pattern matched: (elided in the source)
         */
        List<String> listNewsType = new ArrayList<String>();
        // Fetch the HTML of the entry page
        WebHtml webHtml = new WebHtml();
        String htmlDocuemtnt1 = webHtml.getWebHtml(endPointUrl);
        if (htmlDocuemtnt1 == null || htmlDocuemtnt1.length() == 0) {
            return;
        }
        String strTemp1 = "..."; // start marker elided in the source
        String strTemp2 = "</li>";
        int stopIndex = 0;
        int startIndex = 0;
        int dd = 0;
        while (true) {
            dd++;
            startIndex = htmlDocuemtnt1.indexOf(strTemp1, stopIndex);
            System.out.println("=" + startIndex);
            stopIndex = htmlDocuemtnt1.indexOf(strTemp2, startIndex);
            System.out.println("=-" + stopIndex);
            if (startIndex != -1 && stopIndex != -1) {
                String companyType = htmlDocuemtnt1.substring(startIndex, stopIndex);
                System.out.println("-" + companyType);
                System.out.println("-" + companyType.indexOf("\""));
                companyType = companyType.substring(0, companyType.indexOf("\""));
                System.out.println("#-" + companyType);
                listNewsType.add(companyType);
            }
            if (dd > 10) {
                break;
            }
            if (stopIndex == -1 || startIndex == -1) {
                break;
            }
        }
        System.out.println("listCompanyType=" + listNewsType.size());
        /* Collect second-level URLs: end */

        /* Scrape page content: start */
        String title = "";
        String hometext = "";
        String bodytext = "";
        String keywords = "";
        String counter = "221";
        String cdate = "";
        int begainIndex = 0; // start index of the search
        int endIndex = 0;    // end index of the search
        String begainStr;    // marker where the search starts
        String endStr;       // marker where the search ends
        for (int rows = 1; rows < listNewsType.size(); rows++) {
            String strNewsDetail = listNewsType.get(rows);
            System.out.println("strNewsDetail=" + strNewsDetail);
            if (strNewsDetail != null && strNewsDetail.length() > 0) {
                WebHtml newsListHtml = new WebHtml();
                String htmlDocuemtntCom = newsListHtml.getWebHtml(strNewsDetail);
                System.out.println("$-" + htmlDocuemtntCom);
                if (htmlDocuemtntCom == null || htmlDocuemtntCom.length() == 0) {
                    return;
                }
                // Cut out the publication time
                int dateBegainIndex = htmlDocuemtntCom.indexOf("<div>時間:");
                System.out.println("%-" + dateBegainIndex);
                String newTime = htmlDocuemtntCom.substring(dateBegainIndex, dateBegainIndex + 20);
                System.out.println("-" + newTime);
                String newTimeM = newTime.substring(newTime.lastIndexOf("-") + 1, newTime.lastIndexOf("-") + 3);
                String dateM = DateNews.substring(DateNews.lastIndexOf("-") + 1);
                System.out.println("-" + newTimeM);
                System.out.println("-" + dateM);
                // Only process articles whose day of month matches today's
                if (newTimeM.equals(dateM)) {
                    // Find the news title
                    begainStr = "<div class=\"divCon bg008 \">";
                    endStr = "<div>時間:";
                    begainIndex = htmlDocuemtntCom.indexOf(begainStr, 0);
                    System.out.println("&&&&&&-" + begainIndex);
                    endIndex = htmlDocuemtntCom.indexOf(endStr, 0);
                    System.out.println("&&&&&&-" + endIndex);
                    if (begainIndex != -1 && endIndex != -1) {
                        title = htmlDocuemtntCom.substring(begainIndex, endIndex).trim();
                        title = title.substring(title.indexOf("<h1>") + 4, title.indexOf("</h1>"));
                        title = title.replace("'", "");
                        title = title.replace("", ""); // first argument is empty in the source; a character was probably lost in extraction
                        title = title.replace(" ", "");
                    }
                    // Find the news body
                    begainStr = "<div class=\"divCon bg008 \">";
                    endStr = "<!-- page begin -->";
                    begainIndex = htmlDocuemtntCom.indexOf(begainStr, 0);
                    endIndex = htmlDocuemtntCom.indexOf(endStr, 0);
                    if (begainIndex != -1 && endIndex != -1) {
                        bodytext = htmlDocuemtntCom.substring(begainIndex, endIndex).trim();
                        if (bodytext.indexOf("<p>") > 0
                                && bodytext.indexOf("</p>") > bodytext.indexOf("<p>")
                                && bodytext.indexOf("</p>") > 0) {
                            bodytext = bodytext.substring(bodytext.indexOf("<p>") + 3, bodytext.indexOf("</p>"));
                        }
                        bodytext = bodytext.replace("&nbsp;", "");
                        bodytext = bodytext.replace("<br>", "");
                        bodytext = bodytext.replace("\n", "<br>");
                        bodytext = bodytext.replace("'", "");
                        bodytext = bodytext.replace("", ""); // first argument is empty in the source; a character was probably lost in extraction
                        // Summary: the first 40 characters of the body
                        if (bodytext.length() > 40) {
                            hometext = bodytext.substring(0, 40) + ".";
                        } else {
                            hometext = bodytext + ".";
                        }
                        // View count: digits taken from a random number
                        String str = String.valueOf(Math.random());
                        counter = str.substring(str.lastIndexOf(".") + 1, 5);
                        Calendar cal = Calendar.getInstance();
                        cal.setTime(ne...

[The excerpt breaks off here, mid-statement.]
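The image problem described for the first program (image data changing size once it passes through toString) can be reproduced without any network access. The following is a minimal sketch, not the original author's code: the class name BinaryRoundTrip and the helpers copy, viaString and viaBytes are hypothetical, and UTF-8 is assumed as the character set, whereas the original code relied on the platform default or gb2312. It compares a round trip through String with a plain byte-for-byte copy of the kind saveImage performs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRoundTrip {

    /** Copies every byte from in to out without any text decoding. */
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
    }

    /** Decodes the bytes as text and encodes them again (assumption: UTF-8). */
    public static byte[] viaString(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Streams the bytes straight through, never treating them as text. */
    public static byte[] viaBytes(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            copy(new ByteArrayInputStream(data), out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for image data: the first four bytes of a typical JPEG file
        byte[] data = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println("via String unchanged: " + Arrays.equals(data, viaString(data)));
        System.out.println("via bytes unchanged:  " + Arrays.equals(data, viaBytes(data)));
    }
}
```

Bytes that are not valid in the chosen character set are replaced during decoding, so the String round trip cannot return the original data, while the byte copy always does. This is why a byte-stream approach is the safer choice for images.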

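The first program's analysis helper takes capture group 1 from a limited number of regular-expression matches. A compact way to express the same idea is sketched below, under stated assumptions: the class name RegexExtract and the method firstGroups are hypothetical, a List is used in place of the fixed-size array of the original, and Pattern.DOTALL is added so that a match may span line breaks, which the original code does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {

    /** Returns capture group 1 of at most limit matches of regex in text. */
    public static List<String> firstGroups(String regex, String text, int limit) {
        // DOTALL lets (.*?) match across line breaks (an addition, not in the original)
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        List<String> out = new ArrayList<>();
        // Check the limit before calling find() so no extra match is consumed
        while (out.size() < limit && m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<ul><li>first</li>\n<li>second</li><li>third</li></ul>";
        // Prints [first, second]
        System.out.println(firstGroups("<li>(.*?)</li>", html, 2));
    }
}
```

Because the list grows only as matches are found, there is no unused tail to trim afterwards, and a page with no matches simply yields an empty list.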