crawling

Apify SDK: the scalable web crawling and scraping library for JavaScript/Node.js. Enables development of data extraction and web automation jobs, with headless Chrome and Puppeteer or without them.

npm automation scraping crawling javascript-library web-scraping web-crawling headless-chrome rpa apify puppeteer

Updated Aug 11, 2020
JavaScript

apache / nutch

Apache Nutch is an extensible and scalable web crawler

java hadoop web-crawler nutch crawling apache

Updated Aug 11, 2020
Java

transitive-bullshit / awesome-puppeteer

A curated list of awesome puppeteer resources.

automation awesome scraping crawling awesome-list headless-chrome puppeteer

Updated Aug 8, 2020

iawia002 / Lulu

[Unmaintained] A simple and clean video/music/image downloader 👾

python crawler scraper downloader video scraping crawling python3

Updated Oct 18, 2019
Python

MorvanZhou / easy-scraping-tutorial

Simple but useful Python web scraping tutorial code.

crawler regex scraping crawling requests asyncio scrapy beautifulsoup distributed-scraper urllib

Updated Oct 22, 2019
Jupyter Notebook

clemfromspace / scrapy-selenium

Scrapy middleware to handle javascript pages using selenium

crawling selenium scrapy

Updated Jul 22, 2020
Python

essandess / isp-data-pollution

ISP Data Pollution to Protect Private Browsing History with Obfuscation

data privacy obfuscation web crawling data-analytics privacy-enhancing-technologies

Updated Dec 16, 2018
Python

slotix / dataflowkit

Extract structured data from websites. Website scraping.

go golang scraper headless scraping crawling golang-library extract-data scraping-websites cdp chrome-fetcher

Updated Jun 12, 2020
Go

zhuyingda / webster

A reliable high-level web crawling & scraping framework for Node.js.

nodejs javascript crawler spider javascript-framework crawling chromium automation-ui nodejs-framework automation-test headless-chrome scraping-framework puppeteer

Updated Apr 29, 2020
JavaScript

oltarasenko / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

crawler scraper erlang elixir spider scraping crawling extract-data scraping-websites

Updated Jul 26, 2020
Elixir

DarkSand / Sasila

A flexible, friendly crawler framework

python http crawler framework scraping crawling requests

Updated Oct 22, 2019
Python

infinitbyte / gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

lightweight elasticsearch crawler spider web-crawler scraping crawling web-scraping web-spider

Updated Nov 24, 2019
Go

scrapinghub / spidermon

Scrapy extension for monitoring spider execution.

testing monitoring scraping crawling spiders monitoring-tool scrapinghub

Updated Aug 3, 2020
Python

rivermont / spidy

The simple, easy-to-use command-line web crawler.

python crawler web-crawler crawling python3 web-spider

Updated Jun 23, 2020
Python

stopstalk / stopstalk-deployment

Stop stalking and start StopStalking 😉

python aws crawling codechef spoj uva competitive-programming hackerrank codeforces web2py materializecss hackerearth atcoder programming-contests timus stopstalk

Updated Aug 2, 2020
Python

alephdata / memorious

Distributed crawling framework for documents and structured data.

scraping crawling scraping-framework

Updated Jul 28, 2020
Python

antchfx / antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

golang crawler framework web-crawler scraping crawling web-spider

Updated May 31, 2020
Go

forkonlp / N2H4

A tool for collecting Naver news

crawler news crawling sort korean naver getcomments

Updated Mar 19, 2020
R

trandoshan-io / crawler

Go process used to crawl websites

go docker golang crawler crawling nats-messaging

Updated Dec 19, 2019
Go

dimkouv / massivedl

Download a large list of files concurrently

golang downloader crawling download-manager

Updated Oct 27, 2019
Go

google / corpuscrawler

Crawler for linguistic corpora

crawling linguistics corpus-linguistics corpus-builder minority-language

Updated Jul 29, 2020
Python

N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

crawler chrome crawling chrome-headless browser-automation headless-chrome webarchiving webarchives high-fidelity-preservation puppeteer

Updated May 19, 2020
JavaScript

jvandenaardweg / linkedin-profile-scraper

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.

Updated Jul 9, 2020
TypeScript

mehmetozkaya / DotnetCrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library built on .NET Core with Entity Framework Core output. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c