
Scrapy Meets MongoDB: Reworking the Source Code for a Robust Fingerprint Storage Mechanism!

未闻Code


2024-07-13

This article shows how to combine Scrapy with MongoDB to build a solid request-fingerprint storage mechanism while also relieving the memory pressure on Redis. We will dig into the Scrapy-Redis source code and rework it so that it can be configured flexibly for different scenarios. Feel free to read along and join the discussion!

**Special note:** the articles on this public account are intended for academic research only and must not be used for any unlawful purpose; if anything infringes your rights, please contact the author to have it removed.


Contents

I. Introduction

II. Architecture Overview

III. Source Code Analysis

IV. Rewriting the Source

V. Summary


I. Introduction

When collecting data with Scrapy-Redis, running out of Redis memory is a common headache: once the number of fingerprints stored in Redis grows too large, Redis may crash and fingerprints may be lost, undermining the stability of the whole crawler. How should we deal with this? In this article I share a solution: by modifying the Scrapy-Redis source code and introducing MongoDB for persistent storage, the problem can be solved at its root. Read on to see how the solution is implemented, along with the benefits and trade-offs it brings.

II. Architecture Overview

1. Before analyzing the source code, we should first understand the architecture diagrams of Scrapy and Scrapy-Redis and ask: compared with Scrapy, what exactly does Scrapy-Redis change? With that question in mind, let's look at the architecture of the two frameworks:

Figure 1: Scrapy architecture

Figure 2: Scrapy-Redis architecture

2. Comparing Figure 2 with Figure 1, we can see that scrapy-redis adds Redis to Scrapy's architecture and, building on Redis, extends four components: the Scheduler, the DupeFilter, the Item Pipeline, and the Base Spider. This is also why three keys appear in Redis: spider:requests, spider:items, and spider:dupefilter. Next, let's move into the source analysis and see how the scrapy-redis fingerprint logic can be reworked.
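As a quick illustration (a sketch, not from the original article; it assumes a spider named `spider` and a local Redis instance), these keys can be listed with the redis Python client:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Keys created by scrapy-redis for a spider named "spider":
# the request queue, the scraped-item list, and the fingerprint set.
print(r.keys("spider:*"))
# e.g. [b'spider:requests', b'spider:items', b'spider:dupefilter']
```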


III. Source Code Analysis

1. Before analyzing the scrapy-redis source, recall that whenever we use scrapy-redis we add configuration like the following to the settings module:
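The settings appeared as a screenshot in the original article; a typical scrapy-redis configuration along these lines would be (reconstructed for illustration):

```python
# Schedule requests through Redis instead of Scrapy's in-memory scheduler.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis for deduplication.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use a Redis-backed priority queue for pending requests.
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"

REDIS_URL = "redis://localhost:6379"
```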

**Summary:** these three settings govern, respectively, how requests enter and leave Redis, how request fingerprints are stored, and how request priorities are handled. If we want to change the fingerprint module, we need to rewrite the RFPDupeFilter class so that large volumes of fingerprints can be stored in MongoDB. With that, let's analyze the source.

2. Let's read through the RFPDupeFilter source. The complete source is listed below:

```python
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings


logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    @classmethod
    def from_spider(cls, spider):
        settings = spider.settings
        server = get_redis_from_settings(settings)
        dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
        key = dupefilter_key % {'spider': spider.name}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
```

3. Our analysis of the scrapy-redis dupefilter.py source is as follows:

**Explanation:** in request_seen, self.request_fingerprint computes a SHA1 hash of the request, producing a 40-character fingerprint fp, which is then added to a Redis set with sadd. If the fingerprint did not exist, sadd returns 1, so `added == 0` evaluates to False and the request is treated as new; if the fingerprint already exists, sadd returns 0, so `added == 0` evaluates to True and the request is treated as a duplicate. Next, let's look at how the scheduler uses this result to decide whether to drop a request.

4. Looking at the Scheduler source, the logic for enqueuing a request is as follows:
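The original article showed this as a screenshot; for reference, the enqueue_request method in scrapy_redis/scheduler.py reads roughly as follows (abridged; details may differ between versions):

```python
# scrapy_redis/scheduler.py (abridged)
def enqueue_request(self, request):
    # Drop the request if it is filterable and the dupefilter has already seen it.
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    # Otherwise record the stat and push the request onto the Redis-backed queue.
    if self.stats:
        self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
    self.queue.push(request)
    return True
```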

**Explanation:** from enqueue_request we can see the logic: if the request is subject to deduplication (dont_filter is False) and request_seen returns True, the request is not enqueued; otherwise it is pushed onto the queue and the corresponding statistics counter is incremented.

**Summary:** at this point it is clear that overriding request_seen is all it takes to rework the scrapy-redis fingerprint logic and, together with MongoDB, persist the fingerprints of any number of crawlers. Without further ado, let's rewrite the source.


IV. Rewriting the Source

1. First, configure the MongoDB connection parameters in settings:

```python
MONGO_URI = "mongodb://localhost:27017"
MONGO_DB = "crawler"
```

2. Next, subclass BaseDupeFilter and implement a custom deduplication module, MongoRFPDupeFilter:

```python
import logging
import time

from pymongo import MongoClient
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
from scrapy_redis import defaults

logger = logging.getLogger(__name__)


class MongoRFPDupeFilter(BaseDupeFilter):
    """MongoDB-based request duplicates filter.
    This class can also be used with default Scrapy's scheduler.
    """

    logger = logger

    def __init__(self, key, debug=False, settings=None):
        self.key = key
        self.debug = debug
        self.logdupes: bool = True
        self.mongo_uri = settings.get('MONGO_URI')
        self.mongo_db = settings.get('MONGO_DB')
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # One collection per dupefilter key; fingerprints are stored in _id,
        # which MongoDB indexes (uniquely) by default.
        self.collection = self.db[self.key]

    @classmethod
    def from_settings(cls, settings):
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(key=key, debug=debug, settings=settings)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns an instance of MongoRFPDupeFilter from a crawler."""
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen."""
        fp = self.request_fingerprint(request)
        # Use the fingerprint as the document _id: if a document exists,
        # the request has been seen before.
        if self.collection.find_one({'_id': fp}):
            return True
        self.collection.insert_one(
            {'_id': fp, "crawl_time": time.strftime("%Y-%m-%d")})
        return False

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    @classmethod
    def from_spider(cls, spider):
        settings = spider.settings
        dupefilter_key = settings.get("SCHEDULER_DUPEFILTER_KEY", defaults.SCHEDULER_DUPEFILTER_KEY)
        key = dupefilter_key % {'spider': spider.name}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(key=key, debug=debug, settings=settings)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler."""
        self.clear()

    def clear(self):
        """Clears fingerprints data by dropping the collection."""
        self.collection.drop()

    def log(self, request, spider):
        """Logs a filtered (duplicate) request."""
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
```

3. Third, point DUPEFILTER_CLASS in the settings file to the rewritten MongoRFPDupeFilter module:

```python
# Make sure every spider instance uses MongoDB for duplicate filtering
DUPEFILTER_CLASS = "test_scrapy.dupfilter.MongoRFPDupeFilter"
```

4. Write a test spider (the spider code itself is omitted) and inspect the fp results directly in the MongoDB collection:
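The result screenshot is not reproduced here; based on the insert_one call in request_seen, each document in the fingerprint collection would look roughly like this (the collection name `myspider:dupefilter` and the fingerprint value are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
# One collection per spider, named "<spider_name>:dupefilter"
doc = client["crawler"]["myspider:dupefilter"].find_one()
print(doc)
# {'_id': '0a79f735a2c23fb42626e6bd4b2f57d22bdd0e27', 'crawl_time': '2024-07-13'}
```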

**Summary:** that completes the whole workflow. From now on, no matter how many spiders we develop, they will all store request fingerprints in MongoDB by default. Finally, let's weigh the pros and cons of the scrapy-redis and scrapy+MongoDB fingerprint approaches:

  • scrapy-redis: fast, but when the fingerprint set grows too large, insufficient memory can bring Redis down, and memory is expensive.

  • scrapy + MongoDB: not as fast as Redis, but it can store very large numbers of fingerprints, and disk is cheap.


