ProxyPool
This tool is under development and this README may be outdated.
A Python implementation of a proxy pool.
ProxyPool is a tool for building a proxy pool with Scrapy and Redis. It automatically adds new available proxies to the pool and removes unusable ones.
The tool currently gets available proxies from 4 sources; more sources will be added in the future.
Compatibility
This tool has been tested on macOS Sierra 10.12.4 and Ubuntu 16.04 LTS successfully.
System Requirements:
- UNIX-like systems (macOS, Ubuntu, etc.)
Fundamental Requirements:
- Redis 3.2.8
- Python 3.0+
Python package requirements:
- Scrapy 1.3.3
- redis 2.10.5
- Flask 0.12
Other versions of the above packages have not been tested, but they should work for most users.
Features
- Automatically add new available proxies
- Automatically delete unusable proxies
- Support new sites by adding a crawl rule rather than writing code, which improves scalability
How-to
This tool requires Redis. Please make sure the Redis service (port 6379) is running before starting.
To start the tool, simply:
$ ./start.sh
It will start the Crawling service, Pool maintenance service, Maintenance schedule service, Rule maintenance service and the Web console.
To monitor the tool, open the Web console in a browser (default port: 5000).
To stop the tool, simply:
$ sudo ./stop.sh
To add support for crawling more proxy sites, this tool provides a generic crawling structure that should work for most free proxy sites:
- Start the tool
- Open the Web console (default port: 5000)
- Switch to the Rule management page
- Click the New rule button
- Fill in the form and submit it
- rule_name is used to distinguish different rules.
- url_fmt is used to generate the pages to crawl; free proxy sites usually number their pages with a pattern like xxx.com/yy/5.
- row_xpath extracts a data row from the page content.
- host_xpath extracts the proxy IP from a data row extracted earlier.
- port_xpath extracts the proxy port.
- addr_xpath extracts the proxy address.
- mode_xpath extracts the proxy mode.
- proto_xpath extracts the proxy protocol.
- vt_xpath extracts the proxy validation time.
- max_page limits how many pages are crawled.
- The xpaths above can be set to null to get a default unknown value.
- Once the form is submitted, the rule is applied automatically and a new crawling process starts.
Data in Redis
All proxy information is stored in Redis.
Rule (hash)
| key | description |
|---|---|
| name | rule name, used to distinguish rules |
| url_fmt | URL template, format: http://www.kuaidaili.com/free/intr/{} |
| row_xpath | XPath of a data row, format: //div[@id="list"]/table//tr |
| host_xpath | XPath of the proxy IP within a row, format: td[1]/text() |
| port_xpath | XPath of the proxy port |
| addr_xpath | XPath of the proxy address |
| mode_xpath | XPath of the proxy mode |
| proto_xpath | XPath of the proxy protocol |
| vt_xpath | XPath of the validation time |
| max_page | an integer, the maximum number of pages to crawl |
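As a minimal sketch of what such a hash might look like from redis-py, assuming a hypothetical key name rule:kuaidaili (the README does not specify the actual key naming, and the td[...] indexes beyond host_xpath are illustrative guesses):

```python
import redis

r = redis.Redis(decode_responses=True)

rule = {
    'name': 'kuaidaili',
    'url_fmt': 'http://www.kuaidaili.com/free/intr/{}',
    'row_xpath': '//div[@id="list"]/table//tr',
    'host_xpath': 'td[1]/text()',
    'port_xpath': 'td[2]/text()',   # illustrative guess
    'proto_xpath': 'td[4]/text()',  # illustrative guess
    'max_page': 10,
}
for field, value in rule.items():
    r.hset('rule:kuaidaili', field, value)  # hypothetical key name

print(r.hgetall('rule:kuaidaili'))
```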
proxy_info (hash)
| key | description |
|---|---|
| proxy | full proxy address, format: 127.0.0.1:80 |
| ip | proxy IP, format: 127.0.0.1 |
| port | proxy port, format: 80 |
| addr | proxy location |
| mode | anonymous or not |
| protocol | HTTP or HTTPS |
| validation_time | last check time reported by the source website |
| failed_times | number of recent failed checks |
| latency | proxy latency to the source website |
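For example, inspecting one entry with redis-py might look like this (the key name proxy_info:127.0.0.1:80 is an assumption; the README does not state the key naming scheme):

```python
import redis

r = redis.Redis(decode_responses=True)

info = r.hgetall('proxy_info:127.0.0.1:80')  # hypothetical key name
print(info.get('protocol'), info.get('latency'), info.get('failed_times'))
```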
rookies_proxies (set)
New proxies that have not been tested yet are stored here. A new proxy is moved to available_proxies after it passes a test, or deleted once the maximum number of retries is reached.
available_proxies (set)
Available proxies are stored here. Each proxy is re-tested periodically to check whether it is still available.
availables_checking (zset)
Test queue for available proxies; each proxy's score is a timestamp that indicates its test priority.
rookies_checking (zset)
Test queue for new proxies, similar to availables_checking.
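A rough sketch of how such a timestamp-scored queue can be used, with the key name taken from this section; the 60-second delay and the "fetch everything that is due" query are assumptions about how the maintainer consumes it:

```python
import time

import redis

r = redis.Redis(decode_responses=True)

# Schedule a proxy to be re-tested 60 seconds from now (redis-py >= 3.0 zadd
# signature; the pinned 2.10.5 release passes score/member arguments instead).
r.zadd('availables_checking', {'127.0.0.1:80': time.time() + 60})

# Fetch every proxy whose scheduled time has already passed.
due = r.zrangebyscore('availables_checking', 0, time.time())
print(due)
```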
Jobs (list)
A FIFO queue whose entries have the format cmd|rule_name. It tells the Rule maintenance service what to do with a rule-specific spider: start, pause, stop or delete.
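For example, a start command for a rule could be pushed and consumed like this (the rule name kuaidaili is only an example):

```python
import redis

r = redis.Redis(decode_responses=True)

# Producer: ask the rule maintenance service to start a rule-specific spider.
r.rpush('Jobs', 'start|kuaidaili')

# Consumer: block until a job arrives, then split it into cmd and rule_name.
_, job = r.blpop('Jobs')
cmd, rule_name = job.split('|', 1)
print(cmd, rule_name)
```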
How it works
Getting new proxies
- Crawl pages
- Extract ProxyItem from the page content
- Use a pipeline to store ProxyItem in Redis
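A minimal sketch of this flow, assuming a rule object that carries the xpaths described earlier; the attribute names and the plain dict standing in for ProxyItem are illustrative, not the project's exact classes:

```python
import scrapy


class ProxySpider(scrapy.Spider):
    name = 'proxy_example'

    def __init__(self, rule, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.rule = rule  # assumed to expose url_fmt, row_xpath, host_xpath, port_xpath, max_page

    def start_requests(self):
        for page in range(1, self.rule.max_page + 1):
            yield scrapy.Request(self.rule.url_fmt.format(page))

    def parse(self, response):
        for row in response.xpath(self.rule.row_xpath):
            # Each dict plays the role of a ProxyItem; an item pipeline then
            # writes it into Redis.
            yield {
                'ip': row.xpath(self.rule.host_xpath).extract_first(),
                'port': row.xpath(self.rule.port_xpath).extract_first(),
            }
```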
Proxy maintenance
New proxies:
- Iterate over each new proxy
  - Available
    - Move it to available_proxies
  - Unavailable
    - Delete the proxy

Proxies in the pool:
- Iterate over each proxy
  - Available
    - Reset its retry count and wait for the next test
  - Unavailable
    - Maximum retry count not yet reached
      - Wait for the next test
    - Maximum retry count reached
      - Delete the proxy
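A simplified sketch of both flows, assuming a check(proxy) helper that returns True when the proxy responds; the key names come from the "Data in Redis" section, while MAX_RETRY and the proxy_info key naming are assumptions:

```python
import redis

r = redis.Redis(decode_responses=True)
MAX_RETRY = 3  # assumed limit


def check(proxy):
    """Placeholder for the real availability test (e.g. a request through the proxy)."""
    raise NotImplementedError


def maintain_rookie(proxy):
    # New proxy: promote it on success, drop it on failure
    # (the real tool may retry a rookie several times before deleting it).
    if check(proxy):
        r.smove('rookies_proxies', 'available_proxies', proxy)
    else:
        r.srem('rookies_proxies', proxy)


def maintain_available(proxy):
    # Pool proxy: reset the failure counter on success, delete after too many failures.
    info_key = 'proxy_info:' + proxy  # assumed key naming
    if check(proxy):
        r.hset(info_key, 'failed_times', 0)
    else:
        if r.hincrby(info_key, 'failed_times', 1) >= MAX_RETRY:
            r.srem('available_proxies', proxy)
            r.delete(info_key)
```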
Rule maintenance
- Listen on the FIFO queue Jobs in Redis
- Fetch action_type and rule_name
  - pause
    - Pause the engine of the crawler that uses rule_name and set the rule status to paused
  - stop
    - If a working crawler is using the rule
      - Stop the engine gracefully
      - Set the rule status to waiting
      - Add a callback that sets the status to stopped once the engine has stopped
    - If no crawler is using the rule
      - Set the rule status to stopped immediately
  - start
    - If a working crawler is using the rule, its status is not waiting, and its engine is paused
      - Unpause the engine and set the rule status to started
    - If no crawler is using the rule
      - Load the rule info from Redis and instantiate a new rule object
      - Instantiate a new crawler with the rule
      - Add a callback that sets the status to finished when the crawler finishes
      - Set the rule status to started
  - reload
    - If a working crawler is using the rule and its status is not waiting
      - Re-assign the rule to the crawler
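A stripped-down sketch of the dispatch loop; only the Jobs queue format and the status names come from this README, while the status field location and the crawler bookkeeping are assumptions:

```python
import redis

r = redis.Redis(decode_responses=True)


def set_status(rule_name, status):
    r.hset('rule:' + rule_name, 'status', status)  # assumed key/field naming


def handle(cmd, rule_name, running):
    """running: set of rule names that currently have a working crawler (hypothetical)."""
    if cmd == 'pause' and rule_name in running:
        set_status(rule_name, 'paused')       # plus: pause the crawler engine
    elif cmd == 'stop':
        if rule_name in running:
            set_status(rule_name, 'waiting')  # plus: stop the engine gracefully;
                                              # a callback later sets it to 'stopped'
        else:
            set_status(rule_name, 'stopped')
    elif cmd == 'start':
        set_status(rule_name, 'started')      # plus: unpause or create a crawler
    elif cmd == 'reload' and rule_name in running:
        pass                                  # re-assign the freshly loaded rule


while True:
    _, job = r.blpop('Jobs')                  # block on the FIFO Jobs queue
    cmd, rule_name = job.split('|', 1)
    handle(cmd, rule_name, running=set())
```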
Scheduling proxy checks
- Iterate over proxies in each status (rookie, available, lost)
- Fetch the proxy's zrank from Redis
  - If zrank is None, which means there is no checking schedule for the proxy
    - Add a new checking schedule
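A sketch of that check for the available-proxy queue; the key names come from this README, while CHECK_DELAY and the redis-py >= 3.0 zadd signature are assumptions:

```python
import time

import redis

r = redis.Redis(decode_responses=True)
CHECK_DELAY = 300  # assumed: re-test an available proxy every five minutes


def schedule(proxy):
    # zrank is None when the proxy has no entry in the checking queue yet.
    if r.zrank('availables_checking', proxy) is None:
        r.zadd('availables_checking', {proxy: time.time() + CHECK_DELAY})


for proxy in r.smembers('available_proxies'):
    schedule(proxy)
```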
Retrieving an available proxy for other programs
To retrieve a currently available proxy, just get one from available_proxies with any Redis client.
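For instance, with redis-py:

```python
import redis

r = redis.Redis(decode_responses=True)
proxy = r.srandmember('available_proxies')  # one random available proxy, or None if the set is empty
print(proxy)
```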
A Scrapy middleware example:
```python
import redis
from random import choice


class RandomProxyMiddleware:
    """Assign a random proxy from the available_proxies set to every request."""

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        s.conn = redis.Redis(decode_responses=True)
        return s

    def process_request(self, request, spider):
        # Only use entries stored with a scheme, e.g. "http://127.0.0.1:80",
        # since request.meta['proxy'] expects a full proxy URL.
        proxies = [p for p in self.conn.smembers('available_proxies')
                   if p.startswith('http')]
        if proxies:
            request.meta['proxy'] = choice(proxies)
```

A JSON API is also available (default port: 5000).