crawler
With the current installation and deployment manual, it is basically impossible to deploy successfully on the first attempt.
It would also be good to provide an official Docker image based on Python 3.
I don't know how to do this.
If one opens the docs link in the README, it opens on readthedocs.io, where there is no navigation bar for browsing to the quick start or advanced pages. You can only get there by searching for "quick start" and clicking the result; from that page, navigation links for browsing the docs do appear.
Just for the record:
I'm using Firefox (60.9.0 ESR) on Windows 10 Pro.
Really gr
My problem is this: I want to crawl paginated product reviews, so I used a for loop that executes document = Jsoup.connect(domain+reviews+String.format(b, p)).get(), changing the value of p to move through the review pages.
The first page crawls fine, but when fetching the second page of reviews (the same document = Jsoup.connect(domain+reviews+String.format(b, p)).get(); statement runs before each page is crawled), I get this error:
java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.s
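A timeout between page 1 and page 2 like this often means the site throttles rapid repeated connections. As a sketch of the usual fix (translated from the Java/Jsoup loop above into Python, with a placeholder URL pattern standing in for domain+reviews+String.format(b, p)): set an explicit timeout, pause between pages, and retry with backoff.

```python
import time
import requests

# Placeholder pattern standing in for domain + reviews + String.format(b, p)
PAGE_URL = "https://example.com/reviews?page={p}"

def fetch_page(p, retries=3):
    """Fetch one review page, retrying on connection errors with backoff."""
    for attempt in range(retries):
        try:
            # Explicit timeout: a stalled connect fails fast instead of hanging.
            resp = requests.get(PAGE_URL.format(p=p), timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before retrying
    return None

for p in range(1, 6):
    html = fetch_page(p)
    print(f"page {p}: {'ok, ' + str(len(html)) + ' bytes' if html else 'failed'}")
    time.sleep(1.0)  # polite delay between pages; rapid reconnects get throttled
```

In Jsoup terms the equivalent knobs are Jsoup.connect(...).timeout(millis) plus a Thread.sleep between pages.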
Task execution fails on a Docker install
Bug description
Following the tutorial docs, I installed and started with docker-compose up -d, and running a task immediately fails with an error.
Where could the problem be?
My Docker host environment is Windows 10.
```
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: xueqiu)
2020-02-15 15:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Nov 7 2019, 10:44:02) - [GCC 8.3.0], pyOpenSSL 19
```
What is the current behavior?
Crawling a website that uses # (hashes) for URL navigation does not crawl the pages behind the hash; the URLs containing # are not followed.
If the current behavior is a bug, please provide the steps to reproduce
Try crawling a website like mykita.com/en/
What is the motivation / use case for changing the behavior?
Though hashes are not meant to chan
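For background: the fragment part of a URL is never sent to the server, so most crawlers normalize it away and dedupe on what remains, which is why # links are not "followed". A quick illustration with Python's standard library (the #collections fragment is invented for the example):

```python
from urllib.parse import urldefrag

url = "https://mykita.com/en/#collections"  # fragment invented for illustration
base, fragment = urldefrag(url)
print(base)      # https://mykita.com/en/
print(fragment)  # collections

# Both URLs normalize to the same page, so a crawler that dedupes on the
# defragmented URL sees nothing new to fetch; the fragment only has meaning
# to client-side JavaScript. Crawling such content needs a headless browser.
```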
The quick start section seems to forget to mention the pipeline setting, and without that setting the yielded items appear to give wrong results. Just like #137, please update the document; if you need help, I can contribute as well.
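Assuming a Scrapy-style project layout (the framework here may differ, so the module path below is hypothetical), the missing quick start step is usually registering the pipeline in the settings; without this mapping, yielded items pass through no pipeline at all, which matches the wrong-result symptom above:

```python
# settings.py -- module path is hypothetical; adjust to your project
ITEM_PIPELINES = {
    # Lower numbers run earlier; values are conventionally 0-1000.
    "myproject.pipelines.MyProjectPipeline": 300,
}
```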
The Python client call returns nothing
I ran these 4 lines of code and IPs are being collected, but the Python client call cannot retrieve any of them.
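A minimal client-side check, assuming the pool exposes an HTTP API on localhost (the port and /get/ path follow proxy_pool's defaults and are an assumption about this setup):

```python
import requests

# Assumed endpoint -- adjust host/port/path to the actual deployment.
resp = requests.get("http://127.0.0.1:5010/get/", timeout=5)
print(resp.status_code, resp.text)
# An empty response here usually means the fetcher has not stored any
# proxies yet, so check the fetch/schedule process and its Redis connection.
```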
The developer of the website I intend to scrape is sloppy and has left a lot of broken links.
When I execute an otherwise working Ferret script on a list of pages, it stops altogether at every 404.
Is there a DOCUMENT_EXISTS check, or anything else, that would let the script carry on?
There are several things inaccurately documented or outdated:
- -v2 is used in the examples but does not work
- "# duckduckgo not supported" is reported, although duckduckgo is in the list of supported search engines
- to get a list of all search engines, --config is suggested, but that just fails
In this gif ( https://raw.githubusercontent.com/constverum/ProxyBroker/master/docs/source/_static/cli_serve_example.gif ) the server prints an info line when a client connects.
The current version doesn't do that, though it would be very useful. I tried the same command shown in the gif.
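Even without the log line, you can confirm the server accepts connections by routing a request through it. A minimal sketch, assuming the serve command from the gif left a proxy listening on 127.0.0.1:8888 (adjust to the actual host/port):

```python
import requests

PROXY = "http://127.0.0.1:8888"  # assumed ProxyBroker serve address
proxies = {"http": PROXY, "https": PROXY}

# If this succeeds, the server is handling client connections even though
# it no longer prints an info line for them.
resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)
```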
I copied the examples/sciencenet_spider.py example and tried to run it with Python 3.6, but:
python sciencenet_spider.py
[2018:04:14 22:21:26] Spider started!
[2018:04:14 22:21:26] Using selector: KqueueSelector
[2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
[2018:04:14 22:21:26] Item "Post": 0
[2018:04:14 22:21:26] Requests count: 0
[2018:04:14 22:21:26] Error coun
Description
When I scrape without a proxy, both https and http URLs work.
Scraping through the proxy over https works just fine. My problem is with http URLs.
That is when I get the
twisted.web.error.SchemeNotSupported: Unsupported scheme: b'' error. As far as I can tell, most people have this issue the other way around.
Steps to Reproduce
Expected behavior
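For reference, the usual way to hand Scrapy a per-request proxy is through request meta, and the proxy URL itself must carry an explicit scheme; an empty scheme is exactly what the b'' in the traceback points at. A minimal sketch with placeholder addresses:

```python
import scrapy

class ProxyCheckSpider(scrapy.Spider):
    """Minimal sketch: per-request proxy via meta; addresses are placeholders."""
    name = "proxy_check"

    def start_requests(self):
        # HttpProxyMiddleware picks up the 'proxy' meta key. The proxy URL
        # must carry an explicit scheme -- an empty scheme (b'') is exactly
        # what twisted's SchemeNotSupported complains about.
        yield scrapy.Request(
            "http://httpbin.org/ip",                  # plain-http target
            meta={"proxy": "http://127.0.0.1:8888"},  # placeholder proxy
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("response body: %s", response.text)
```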