1.9K+ Stars! Crawlee: an efficient, reliable web scraping and browser automation library that can download HTML, PDF, JPG, PNG, and other file formats


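Project Overview

Crawlee[1] is a Python web scraping and browser automation library for building reliable crawlers. It can download HTML, PDF, JPG, PNG, and other files from websites, and it works with BeautifulSoup, Playwright, and raw HTTP requests.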

Crawlee supports both headful and headless browser modes and has built-in proxy rotation.
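
To make the idea of proxy rotation concrete, here is a minimal standard-library sketch of one possible strategy, round-robin selection. It is not Crawlee's implementation: the class name RoundRobinProxies and the proxy addresses are hypothetical placeholders chosen for illustration.

```python
from itertools import cycle


class RoundRobinProxies:
    """Hand out proxy URLs in a fixed rotating order (illustrative only)."""

    def __init__(self, proxy_urls: list[str]) -> None:
        if not proxy_urls:
            raise ValueError('at least one proxy URL is required')
        # cycle() repeats the list endlessly, which gives the rotation.
        self._pool = cycle(proxy_urls)

    def next_proxy(self) -> str:
        # Each call returns the next URL and wraps around at the end.
        return next(self._pool)


# Placeholder proxy addresses, not real endpoints.
rotation = RoundRobinProxies(['http://proxy-a.example:8000', 'http://proxy-b.example:8000'])
picked = [rotation.next_proxy() for _ in range(3)]
print(picked)
```

A real crawler would typically also track which proxies get blocked and tie them to sessions; this sketch only shows the rotation step.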

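Project Features

Key Highlights

Supports both BeautifulSoup and Playwright to handle different kinds of web pages.
Automatic retries, proxy rotation, and session management keep crawlers stable.
Built on standard Asyncio, so asynchronous code stays concise and efficient.
Rich configuration options make it highly customizable for project-specific needs.
Open source and backed by Apify, so it is easy to deploy and run on the Apify platform.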
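
The combination of automatic retries and Asyncio mentioned above can be pictured with a small standard-library sketch. This is an assumption-laden illustration of the general retry-with-backoff pattern, not Crawlee's internal logic: with_retries, FlakyFetch, and the delay values are all hypothetical names and numbers.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def with_retries(
    task: Callable[[], Awaitable[str]],
    max_retries: int = 3,
    base_delay: float = 0.01,
) -> str:
    """Await task(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await task()
        except ConnectionError:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            # Wait a little longer after each failed attempt.
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError('unreachable')


class FlakyFetch:
    """Hypothetical stand-in for a network call that fails twice, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError('temporary failure')
        return '<html>ok</html>'


fetch = FlakyFetch()
result = asyncio.run(with_retries(fetch))
print(result, fetch.calls)
```

In a library like Crawlee this bookkeeping is handled for you; the sketch only shows why a failed request does not have to end the crawl.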

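Features

A unified interface for HTTP and headless browser crawling.
Automatic parallel crawling based on available system resources.
Written in Python with type hints, which improves the developer experience and reduces bugs.
Automatic retries on errors or when the crawler gets blocked.
Integrated proxy rotation and session management.
Configurable request routing that directs URLs to the appropriate handlers.
A persistent queue of URLs to crawl.
Pluggable storage options for both tabular data and files.
Robust error handling.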
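
Request routing is easier to grasp with a toy example. The sketch below imitates the decorator style of the crawler.router.default_handler call used in the examples later in this article, but it is a simplified stand-in written from scratch: LabelRouter, its handler and dispatch methods, and the 'DETAIL' label are hypothetical and should not be read as Crawlee's actual router API.

```python
from collections.abc import Callable
from typing import Optional

Handler = Callable[[str], str]


class LabelRouter:
    """Map request labels to handler functions, with a default fallback."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        # Decorator factory: register func for requests carrying this label.
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func

        return register

    def default_handler(self, func: Handler) -> Handler:
        # Decorator: register func for requests with no matching label.
        self._default = func
        return func

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        func = self._handlers.get(label) if label else None
        func = func or self._default
        if func is None:
            raise LookupError(f'no handler for label {label!r}')
        return func(url)


router = LabelRouter()


@router.default_handler
def handle_listing(url: str) -> str:
    return f'listing:{url}'


@router.handler('DETAIL')
def handle_detail(url: str) -> str:
    return f'detail:{url}'


print(router.dispatch('https://crawlee.dev'))
print(router.dispatch('https://crawlee.dev/python/', label='DETAIL'))
```

Keeping the dispatch table separate from the handlers means each page type gets its own small function instead of one large if/else chain.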

Usage

Installation

Crawlee is available on PyPI as the crawlee package. The basic installation command is:

pip install crawlee

To use BeautifulSoupCrawler, install crawlee with the beautifulsoup extra:

pip install 'crawlee[beautifulsoup]'

To use PlaywrightCrawler, install crawlee with the playwright extra, then install the Playwright dependencies:

pip install 'crawlee[playwright]'  
playwright install
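
After installing, it can be handy to confirm which optional backends are actually importable in the current environment. The following is a small sketch using only the standard library; the function name available_crawlers is hypothetical, and the mapping assumes that the beautifulsoup extra provides the bs4 module and the playwright extra provides the playwright module.

```python
from importlib.util import find_spec


def available_crawlers() -> dict[str, bool]:
    """Report which optional crawler backends look importable."""
    # Assumed mapping from crawler class to the module its extra provides.
    required_modules = {
        'BeautifulSoupCrawler': 'bs4',
        'PlaywrightCrawler': 'playwright',
    }
    # find_spec() returns None when a top-level module cannot be found.
    return {name: find_spec(module) is not None for name, module in required_modules.items()}


status = available_crawlers()
print(status)
```

Note that this only checks the Python packages; it does not verify that the browsers downloaded by playwright install are present.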

Using the Crawlee CLI

To get started quickly with the Crawlee CLI, first make sure Pipx[2] is installed:

pipx --help

Then run the CLI and choose a template:

pipx run crawlee create my-crawler

If crawlee is already installed, you can run it directly:

crawlee create my-crawler

Examples

Crawlee provides example crawlers of different types, including BeautifulSoupCrawler and PlaywrightCrawler.

BeautifulSoupCrawler example:

import asyncio  
  
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext  
  
  
async def main() -> None:  
    crawler = BeautifulSoupCrawler(  
        # Limit the crawl to max requests. Remove or increase it for crawling all links.  
        max_requests_per_crawl=10,  
    )  
  
    # Define the default request handler, which will be called for every request.  
    @crawler.router.default_handler  
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  
        context.log.info(f'Processing {context.request.url} ...')  
  
        # Extract data from the page.  
        data = {  
            'url': context.request.url,  
            'title': context.soup.title.string if context.soup.title else None,  
        }  
  
        # Push the extracted data to the default dataset.  
        await context.push_data(data)  
  
        # Enqueue all links found on the page.  
        await context.enqueue_links()  
  
    # Run the crawler with the initial list of URLs.  
    await crawler.run(['https://crawlee.dev'])  
  
if __name__ == '__main__':  
    asyncio.run(main())

PlaywrightCrawler example:

import asyncio  
  
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext  
  
  
async def main() -> None:  
    crawler = PlaywrightCrawler(  
        # Limit the crawl to max requests. Remove or increase it for crawling all links.  
        max_requests_per_crawl=10,  
    )  
  
    # Define the default request handler, which will be called for every request.  
    @crawler.router.default_handler  
    async def request_handler(context: PlaywrightCrawlingContext) -> None:  
        context.log.info(f'Processing {context.request.url} ...')  
  
        # Extract data from the page.  
        data = {  
            'url': context.request.url,  
            'title': await context.page.title(),  
        }  
  
        # Push the extracted data to the default dataset.  
        await context.push_data(data)  
  
        # Enqueue all links found on the page.  
        await context.enqueue_links()  
  
    # Run the crawler with the initial list of requests.  
    await crawler.run(['https://crawlee.dev'])  
  
  
if __name__ == '__main__':  
    asyncio.run(main())

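Running on the Apify Platform

Crawlee is open source and runs anywhere, but because it is developed by Apify, it is especially easy to set up on the Apify platform and run in the cloud.

About

Crawlee is a Python web scraping and browser automation library developed by Apify for building reliable crawlers. For more information, visit the Crawlee project website[3].

Note: This article is for reference only. For the project's current features, refer to the latest information on the official GitHub page.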


Resources

[1] GitHub repository: https://github.com/apify/crawlee-python
[2] Pipx: https://pipx.pypa.io/
[3] Crawlee project website: https://crawlee.dev/python/
