1·The basic design of this crawler is to load the first link to check onto a queue.
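A minimal sketch of that queue-driven design, assuming a hypothetical `fetch_links()` helper in place of real page retrieval and link extraction:

```python
from collections import deque
from urllib.parse import urljoin

def fetch_links(url):
    """Placeholder: return the links found on `url` (fetching omitted)."""
    return []

def crawl(start_url, max_pages=100):
    queue = deque([start_url])      # load the first link onto the queue
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()       # take the next link to check
        visited.append(url)
        for link in fetch_links(url):
            link = urljoin(url, link)
            if link not in seen:    # only enqueue links not seen before
                seen.add(link)
                queue.append(link)
    return visited
```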
2·E-mail harvesting can be one of the easiest crawling activities, as you'll see in the final crawler example in this article.
3·The behavior policies define which pages the crawler will bring down to the indexer, how often to go back to a Web site to check it again, and something called a politeness policy.
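As a rough illustration of a politeness policy, the sketch below enforces a minimum delay between successive requests to the same host; the class name and delay value are assumptions, not taken from the source.

```python
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_hit = {}          # host -> timestamp of last request

    def wait_if_needed(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[host] = time.time()
```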
4·Click on the Edit button in the query_statistic line to move to the crawler tab.
5·Define the crawler name (UNIX file system crawler 1, for example), as shown in Figure 7, and then click on the Next button.
1·Assuming that the SCA and MDB applications have already been deployed and started, ensure that the ICA crawler and indexer for a particular document collection are running.
2·Next, navigate to the crawler details page and click 'Start full recrawl', as shown at the bottom of Figure 3.
3·Aiming at the practical problems a parallel crawler faces, this paper proposes three optimization policies for ChaoCrawler: collision avoidance, URL indexing, and DNS caching.
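A generic sketch of the DNS-caching idea (not ChaoCrawler's actual implementation): resolve each hostname once and reuse the result, so repeated URLs on the same host do not trigger repeated lookups.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=4096)
def resolve(hostname):
    """Cache hostname -> IP lookups across the crawl."""
    return socket.gethostbyname(hostname)
```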
1·Then, on the basis of the core technologies of search engines, three main modules were designed on a lightweight architecture: the crawler, the indexer, and the searcher.
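A schematic sketch of that three-module split; the class and method names are illustrative and not taken from the cited design.

```python
class Crawler:
    def crawl(self, seed_urls):
        """Fetch pages starting from seed_urls and yield (url, text)."""
        for url in seed_urls:
            yield url, ""            # fetching omitted in this sketch

class Indexer:
    def __init__(self):
        self.index = {}              # term -> set of URLs containing it

    def add(self, url, text):
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(url)

class Searcher:
    def __init__(self, indexer):
        self.indexer = indexer

    def search(self, query):
        return self.indexer.index.get(query.lower(), set())
```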