crawling

Description

When I scrape without a proxy, both https and http URLs work.
Going through a proxy works fine for https URLs. The problem appears when I try http URLs:
at that point I get the twisted.web.error.SchemeNotSupported: Unsupported scheme: b'' error.

From what I can see, most people have this issue the other way around.

Steps to Reproduce

Scrape an http link through a proxy.
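The report does not say how the proxy URL is configured, so the following is only a diagnostic sketch, not a confirmed cause. Because the error mentions an empty scheme (b''), one thing that may be worth ruling out is a proxy URL written without an explicit http:// or https:// prefix. The helper name proxy_has_scheme and the sample addresses are hypothetical.

```python
# Assumption: only http and https proxy schemes matter for this check.
ALLOWED_PROXY_SCHEMES = {"http", "https"}


def proxy_has_scheme(proxy_url: str) -> bool:
    """Return True if proxy_url starts with an explicit http:// or https:// scheme."""
    # Split on the first "://"; separator is empty when it is absent.
    scheme, separator, rest = proxy_url.strip().partition("://")
    return bool(separator) and bool(rest) and scheme.lower() in ALLOWED_PROXY_SCHEMES


if __name__ == "__main__":
    # Hypothetical proxy addresses, used only for illustration.
    for candidate in ("http://127.0.0.1:8080", "127.0.0.1:8080"):
        print(candidate, proxy_has_scheme(candidate))
```

If the check returns False for the proxy value in use, adding the scheme is a cheap thing to try before digging further.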

Expected

Whenever the CLI process is interrupted or killed, the CDP driver must (and used to) close all open tabs.
It no longer does this.

The main examples on the Apify SDK website, GitHub repo, and CLI templates should demonstrate how to manipulate the DOM and retrieve data from it.

Also add one example of scraping with Apify SDK + jQuery to https://sdk.apify.com/docs/examples/basiccrawler

Feedback from: https://medium.com/better-programming/do-i-need-python-scrapy-to-build-a-web-scraper-7cc7cac2081d

I lost an hour trying to make

Branched from #29, added a failing test.

This assumes that nested values are flattened to JSON.
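The excerpt does not show the code in question, so the following is only one possible reading of "flattened to JSON": each nested dict or list value in an item is serialized to a JSON string, while scalar values pass through unchanged. The function name flatten_nested_values and the sample item are hypothetical.

```python
import json


def flatten_nested_values(item: dict) -> dict:
    """Return a copy of item where nested dict/list values become JSON strings."""
    flat = {}
    for key, value in item.items():
        if isinstance(value, (dict, list)):
            # sort_keys keeps the output stable, which makes tests deterministic.
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat


if __name__ == "__main__":
    # Hypothetical item, used only for illustration.
    sample = {"title": "x", "meta": {"b": 2, "a": 1}, "tags": ["p", "q"]}
    print(flatten_nested_values(sample))
```

A failing test under this reading would compare the flattened output against the expected JSON strings for the nested fields.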

For consideration.

We have different mixins in the spidermon/contrib/monitors/mixins directory, but no documentation.

Here are 493 public repositories matching this topic...

scrapy / scrapy

gocolly / colly

codelucas / newspaper

yujiosaka / headless-chrome-crawler

MontFerret / ferret

apify / apify-js

apache / nutch

transitive-bullshit / awesome-puppeteer

iawia002 / Lulu

MorvanZhou / easy-scraping-tutorial

clemfromspace / scrapy-selenium

slotix / dataflowkit

essandess / isp-data-pollution

zhuyingda / webster

oltarasenko / crawly

infinitbyte / gopa

DarkSand / Sasila

scrapinghub / spidermon

rivermont / spidy

stopstalk / stopstalk-deployment

alephdata / memorious

antchfx / antch

forkonlp / N2H4

trandoshan-io / crawler

dimkouv / massivedl

google / corpuscrawler

N0taN3rd / Squidwarc

jvandenaardweg / linkedin-profile-scraper

mehmetozkaya / DotnetCrawler

usc-isi-i2 / dig-etl-engine
