Description

Usage of scrapy.utils.response.response_httprepr inside the DownloaderStats middleware causes the application to make an unnecessary memory allocation. response_httprepr is used only once, to calculate response sizes for the downloader/response_bytes stat in that middleware.

In the current implementation response_httprepr returns bytes (an immutable type), so to calculate downloader/response_bytes the application additionally allocates nearly the same amount of memory as the original response occupies, only to compute len inside the middleware.

The relevant code: scrapy/scrapy/utils/response.py, lines 45 to 60 in 26836c4.
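To make the allocation concrete, here is a minimal sketch of the pattern described above (hypothetical helper names and a simplified HTTP layout, not the actual Scrapy code): building the full raw representation only to take its length creates a second, body-sized bytes object, while summing the component lengths yields the same number without that allocation.

    # Minimal sketch of the allocation pattern (hypothetical helpers,
    # simplified HTTP layout; not the actual Scrapy implementation).

    def size_via_full_repr(status_line: bytes, headers: bytes, body: bytes) -> int:
        # b"".join() materializes a second buffer roughly as large as the
        # body, even though only its length is needed.
        raw = b"".join([status_line, b"\r\n", headers, b"\r\n\r\n", body])
        return len(raw)

    def size_via_component_lengths(status_line: bytes, headers: bytes, body: bytes) -> int:
        # Same count, computed from the component lengths, so no extra
        # body-sized buffer is ever allocated.
        return len(status_line) + 2 + len(headers) + 4 + len(body)

    body = b"x" * (100 * 1024 * 1024)  # stand-in for a 100 MB download
    assert (size_via_full_repr(b"HTTP/1.1 200 OK", b"Content-Length: 104857600", body)
            == size_via_component_lengths(b"HTTP/1.1 200 OK", b"Content-Length: 104857600", body))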
Steps to Reproduce

To demonstrate the impact of this, I made the following spider:
spider code
import sys
from importlib import import_module
import scrapy

class MemoryHttpreprSpider(scrapy.Spider):
    name = 'memory_httprepr'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.stats.DownloaderStats': None
        }
    }

    # the same as in the MemoryUsage extension:
    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.logger.info(f"used memory on start: {str(self.get_virtual_size())}")
        yield scrapy.Request(url='https://speed.hetzner.de/100MB.bin', callback=self.parse)
        # yield scrapy.Request(url='http://quotes.toscrape.com', callback=self.parse)

    def parse(self, response, **kwargs):
        self.logger.info(f"used memory after downloading response: {str(self.get_virtual_size())}")
It includes:
- usage of the get_virtual_size method, directly the same as in the MemoryUsage extension
- disabling of the DownloaderStats middleware that uses response_httprepr and request_httprepr (set to None in custom_settings)

Results:

[memory_httprepr] used memory on start: 61 587 456 / 61 521 920 / 61 558 784
[memory_httprepr] used memory after downloading response: 375 910 400 / 271 179 776 / 61 558 784
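For reference, the same effect can be reproduced outside Scrapy with a standalone sketch (my own illustration, not part of the original report): tracemalloc reports a body-sized peak allocation caused solely by joining the parts into one bytes object in order to measure it.

    import tracemalloc

    # Stand-in for a large downloaded body; allocated before tracing starts,
    # so it is not counted in the peak below.
    body = b"x" * (100 * 1024 * 1024)

    tracemalloc.start()
    # Build the full raw representation only to take its length, mirroring
    # the pattern this issue describes.
    size = len(b"".join([
        b"HTTP/1.1 200 OK\r\n",
        b"Content-Type: application/octet-stream\r\n",
        b"\r\n",
        body,
    ]))
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"reported response size: {size} bytes")
    print(f"peak allocation while measuring: {peak} bytes")  # roughly another 100 MB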
Versions

[scrapy.utils.log] Scrapy 2.4.0 started (bot: httprepr)
[scrapy.utils.log] Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 23 2020, 14:32:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Linux-4.15.0-76-generic-x86_64-with-glibc2.2.5