crawler

Description

When I scrape without a proxy, both https and http URLs work.
Using a proxy with https URLs works just fine. My problem is with http URLs:
at that point I get the error twisted.web.error.SchemeNotSupported: Unsupported scheme: b''

From what I can see, most people have this issue the other way around.

Steps to Reproduce

Scrape an http link through a proxy
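
An empty scheme (b'') in this kind of error may mean that a URL reached the HTTP client without its scheme, for example a proxy address written as host:port. That is only an assumption about this report, not something the excerpt confirms. A minimal sketch of a guard, where normalize_proxy_url is a hypothetical helper name and not part of Scrapy or Twisted:

```python
from urllib.parse import urlsplit


def normalize_proxy_url(proxy: str, default_scheme: str = "http") -> str:
    """Return the proxy address with an explicit scheme.

    Hypothetical helper: it only guards against proxy strings such as
    "10.0.0.1:8080" that carry no scheme at all.
    """
    proxy = proxy.strip()
    if "://" not in proxy:
        # No scheme given, so prepend the assumed default.
        proxy = f"{default_scheme}://{proxy}"
    parts = urlsplit(proxy)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a usable proxy URL: {proxy!r}")
    return proxy
```

The normalized value could then be assigned to request.meta["proxy"], which is where Scrapy's HttpProxyMiddleware reads a per-request proxy from.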

**Expected

With the current installation and deployment manual, it is basically impossible to deploy successfully on the first attempt.
It is also recommended to provide an official Docker image based on Python 3.

I want to get the price (the one in the red frame)

	c.OnHTML("div[id=price]", func(e *colly.HTMLElement) {
		fmt.Printf("test----%+v\n",e)
		price,err := strconv.ParseFloat(e.Text,64)
		//price := e.Text
		fmt.Printf("********* price----%+v\n",price)

		if err != nil {

			fmt.Pri

If one opens the docs link provided in the README, the README opens on readthedocs.io. There is no navigation bar for browsing to the quick start or advanced pages. One can only get there by searching for "quick start" and clicking on the result. Only then do navigation links for browsing through the docs appear.

Just for the record:
I'm using Firefox (60.9.0 esr) on Windows 10 Pro.

Really gr

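The problem is this: I want to crawl paginated product information, so I used a for loop that executes document = Jsoup.connect(domain+reviews+String.format(b, p)).get(), changing the value of p to change the review page number.

However, after crawling the first page, when I crawl the second page of reviews (this statement, document = Jsoup.connect(domain+reviews+String.format(b, p)).get();, is executed every time a page of reviews is about to be crawled), the following error occurs: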
java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.s
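
A paginated loop like the one described can be made more tolerant of transient connection timeouts by retrying each page with a growing delay. The following is a language-agnostic sketch written in Python rather than Jsoup. The names fetch_pages, fetch_page, retries and base_delay are all assumptions made for illustration, and whether retrying helps in this particular case is not confirmed by the report:

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_pages(
    url_template: str,
    pages: Iterable[int],
    fetch_page: Callable[[str], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str]]:
    """Yield (page, body) pairs, retrying a page when the connection fails."""
    for page in pages:
        url = url_template.format(page)
        for attempt in range(retries + 1):
            try:
                body = fetch_page(url)
            except OSError:
                # Connection errors and timeouts are OSError subclasses.
                if attempt == retries:
                    raise
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
            else:
                yield page, body
                break
```

Pausing between requests in this way may also help if the server is throttling rapid page requests, though that is a guess.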

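Bug description
Following the tutorial documentation, after installing and starting with docker-compose up -d, running a task directly reports an error.
I don't know where the problem is.
My Docker environment runs on Windows 10.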

`2020-02-15 15:58:04 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: xueqiu)
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19

Reflect this kind of thing:

{
method : 'POST',
form : { key: 'value', key2: 'value'}
}

What is the current behavior?

Crawling a website that uses # (hashes) for url navigation does not crawl the pages that use #

The urls using # are not followed.

If the current behavior is a bug, please provide the steps to reproduce

Try crawling a website like mykita.com/en/
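
One plausible explanation, which the report does not confirm, is that the crawler deduplicates URLs after dropping the fragment, so every # route collapses into a page that was already visited. A minimal Python sketch of that effect, where dedupe_urls and keep_fragments are hypothetical names and example.com stands in for the real site:

```python
from typing import Iterable, List
from urllib.parse import urldefrag


def dedupe_urls(urls: Iterable[str], keep_fragments: bool = False) -> List[str]:
    """Return URLs in order, skipping ones whose dedupe key was already seen."""
    seen = set()
    kept = []
    for url in urls:
        base, _fragment = urldefrag(url)
        # Without fragments, "/en/#/shops" and "/en/" share the same key.
        key = url if keep_fragments else base
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

With keep_fragments left at False, only one URL per fragment-free address survives, which would match the behavior described. Setting it to True treats each hash route as its own page.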

What is the motivation / use case for changing the behavior?

Though hashes are not meant to chan

The quick start section seems to leave out the pipeline setting, and without that setting yielding an item appears to produce a wrong result. As with #137, please update the documentation. If help is needed, I can contribute as well.

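I ran the following 4 commands. IPs are being collected, but I cannot retrieve them when calling through the Python client.

Start the scrapy workers, including the proxy IP crawler and the validator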

python crawler_booter.py --usage crawler
python crawler_booter.py --usage validator
Start the schedulers, including scheduled proxy IP crawling and validation

python scheduler_booter.py --usage crawler
python scheduler_booter.py --usage validator

The developer of the website I intend to scrape information from is sloppy and has left a lot of broken links.
When I execute an otherwise effective Ferret script on a list of pages, it stops altogether at every 404.
Is there a DOCUMENT_EXISTS or anything that would help the script go on?
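
The general pattern being asked for, continuing past pages that return an error, can be sketched in Python as follows. This is not Ferret syntax and says nothing about what Ferret itself offers. The names scrape_all and fetch_page are hypothetical, chosen only for illustration:

```python
from typing import Callable, Dict, Iterable, Tuple
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one page; raises HTTPError for 4xx/5xx responses."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(
    urls: Iterable[str],
    fetch_page: Callable[[str], bytes] = fetch,
) -> Tuple[Dict[str, bytes], Dict[str, int]]:
    """Fetch every URL, recording HTTP failures instead of stopping."""
    pages: Dict[str, bytes] = {}
    failures: Dict[str, int] = {}
    for url in urls:
        try:
            pages[url] = fetch_page(url)
        except HTTPError as exc:
            # Remember the status code (for example 404) and move on.
            failures[url] = exc.code
    return pages, failures
```

Collecting the failures separately keeps the broken links available for a later report while the rest of the list is still processed.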

There are several things not accurately documented/outdated:

-v2 is used in the examples but does not work
duckduckgo is not supported although it is in the list of supported search engines
To get a list of all search engines, --config is suggested, but that just fails

On this gif ( https://raw.githubusercontent.com/constverum/ProxyBroker/master/docs/source/_static/cli_serve_example.gif ) the server prints an info line when a client connects.

The current version doesn't do that, though it would be very useful. I tried the command that is on the GIF.

I copied the examples/sciencenet_spider.py example and tried to run it using python 3.6 - but:

python sciencenet_spider.py
[2018:04:14 22:21:26] Spider started!
[2018:04:14 22:21:26] Using selector: KqueueSelector
[2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
[2018:04:14 22:21:26] Item "Post": 0
[2018:04:14 22:21:26] Requests count: 0
[2018:04:14 22:21:26] Error coun

Here are 4,240 public repositories matching this topic...

scrapy / scrapy

binux / pyspider

gocolly / colly

iawia002 / annie

jhao104 / proxy_pool

codelucas / newspaper

code4craft / webmagic

shengqiangzhang / examples-of-web-crawlers

guyueyingmu / avbook

s0md3v / Photon

crawlab-team / crawlab

injetlee / Python

bda-research / node-crawler

yujiosaka / headless-chrome-crawler

chyroc / WechatSogou

rmax / scrapy-redis

SpiderClub / haipproxy

MontFerret / ferret

BruceDone / awesome-crawler

gaojiuli / toapi

symfony / dom-crawler

imWildCat / scylla

Arachni / arachni

dotnetcore / DotnetSpider

NikolaiT / GoogleScraper

constverum / ProxyBroker

jae-jae / QueryList

gaojiuli / gain

xtuhcy / gecco

PuerkitoBio / gocrawl
