Design and Implementation of a Web Crawler: Literature Review



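Abstract: With the rapid development of the Internet, search engines play an increasingly important role in Internet search services. A web crawler is an indispensable component of a search engine system: it is a program that automatically gathers information from the Internet, collecting web pages that are then used to build the index that supports the search engine. A web crawler can not only collect web information for a search engine but also serve as a targeted information collector that gathers specific information displayed on certain websites, such as job postings and rental listings. This thesis implements, in Java, a crawler program based on the breadth-first algorithm. Starting from the applications of web crawlers, it discusses the role and position of the crawler within a search engine and sets out the crawler's functional and design requirements. Building on an analysis of the crawler's system structure and working principles, it studies strategies and algorithms for page crawling and parsing, implements a web crawler program in Java, and analyzes its results. The crawler can collect the URLs of a single site or of multiple sites. Once connected to the external network, it can crawl the sites of most of China's large mainstream portals, such as Baidu, Sina, and NetEase.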


Keywords: search engine; Java; breadth-first.
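The breadth-first strategy named in the abstract can be illustrated with a minimal sketch. It is not the thesis's own code: the class name `BfsCrawlSketch`, the method `crawl`, and the `fetchLinks` and `maxPages` parameters are hypothetical, and an in-memory map stands in for real HTTP requests so the example runs offline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative and not taken from the thesis.
final class BfsCrawlSketch {

    /**
     * Visits pages breadth-first from the seed and returns URLs in visit order.
     * fetchLinks stands in for "download the page and extract its links".
     */
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        Set<String> seen = new HashSet<>();           // URLs already queued
        Queue<String> frontier = new ArrayDeque<>();  // FIFO queue gives BFS order
        List<String> visited = new ArrayList<>();

        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {                 // false if already queued
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" (assumption) used instead of real requests.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d", "a"),
            "c", List.of("d"),
            "d", List.of());
        System.out.println(crawl("a", u -> web.getOrDefault(u, List.of()), 10));
    }
}
```

The FIFO queue is what makes the traversal breadth-first: every page linked from the seed is visited before any page two links away. The `seen` set keeps a URL from being queued twice when several pages link to it.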

The Design and Implementation of a Distributed Web Crawler

Abstract: With the rapid development of the Internet, search engines, as the main entrance to the Internet, play an increasingly important role. A web crawler is a very important part of a search engine: it is a program that automatically collects information from the Internet and is responsible for gathering web pages. These pages are used to build the index and provide support for the search engine. A crawler can collect data for search engines, and it can also act as a targeted information collector that gathers specific information from certain websites, such as job postings and rental listings. This paper implements a breadth-first crawler in Java. Starting from the applications of search engines, the paper examines the importance and function of the web crawler within a search engine and sets out its functional and design requirements. Based on an analysis of the crawler's system structure and working principles, it also studies methods and strategies for multithreaded scheduling, web page crawling, and HTML parsing. A Java-based page crawling program is then implemented and analyzed. The crawler can collect the URLs of one site or of multiple sites. When connected to the external network, it can crawl most of China's major portal sites, such as Baidu, Sina, and NetEase.
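The HTML parsing step mentioned in the abstract, which collects the URLs found on a page, might look roughly like the following. This is a simplified sketch and not the thesis's approach: it swaps in a JDK regular expression and `java.net.URI` for a real HTML parsing library, and the names `LinkExtractSketch` and `extractLinks` are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; a regex is an assumption made for brevity and is
// far less robust than a real HTML parser.
final class LinkExtractSketch {

    // Captures the href value of <a ...> tags, stopping before any #fragment.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'#]+)[\"'#]",
        Pattern.CASE_INSENSITIVE);

    /** Returns absolute http/https links found in html, resolved against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Turn relative links such as "/news" into absolute URLs.
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/news/index.html\">News</a>"
                    + "<a href='page2.html#top'>Next</a>";
        System.out.println(extractLinks("http://example.com/a/b.html", html));
    }
}
```

Resolving each link against the page's own URL matters because a breadth-first frontier needs absolute URLs; filtering on the scheme drops entries such as `mailto:` links that a crawler cannot fetch.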

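3.2.3 Functional requirements

3.3 System function implementation

4 Web crawler

4.1 Search strategy used in this system

4.2 HTMLPARSER

4.3 Web crawler program flow

4.3.1 Main crawler flow code

4.3.2 Crawler program flowchart

5 Experimental results and analysis

5.1 System experiment environment and configuration

5.2 System testing

6 Conclusion

6.1 Summary of work

6.2 Future research

Acknowledgments

References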

1 Introduction

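The rapid growth of the network has brought explosive growth in Internet information, pushing the volume of information online to an unprecedented level. People's capacity to obtain information from the Internet is limited, however, and they increasingly need an effective way to obtain information comprehensively, quickly, and accurately. Web search engines emerged to address this problem and have become an indispensable tool for obtaining information online. Yet no one can determine how many web pages the Internet actually holds; by a conservative estimate, it contains at least tens to hundreds of billions of pages. The Internet is enormous: every day countless websites go online, countless pages are published, and countless pages are updated. The most fundamental cause of this explosive growth is that the publication of web content cannot be centrally controlled, which also poses a great challenge for web search engines that index and retrieve this content.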

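The web crawler is a very important component of a search engine system. It is responsible for collecting web pages and gathering information from the Internet, and this page information is used to build the index that supports the search engine. The crawler determines whether the content of the whole engine system is rich and whether its information is up to date, and therefore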


Published: 2023-08-16 07:53:19

Article link: https://patent.en369.cn/xueshu/366295.html
