# warc

Here are 82 public repositories matching this topic...

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Updated Jul 19, 2021
Python

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Updated Jul 19, 2021
Java

Rhizome-Conifer / conifer

Collect and revisit web pages.

python docker archives warc web-archiving wayback webrecorder pywb

Updated Jul 9, 2021
Python

ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

crawler spider archiving crawl warc

Updated Jul 6, 2021
Python

webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

electron warc web-archiving webrecorder pywb

Updated Sep 17, 2020
JavaScript

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

python docker service-worker ipfs memento warc web-archiving wayback memento-rfc

Updated Jun 30, 2021
Python

machawk1 / wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

python gui warc web-archiving pyinstaller wayback heritrix openwayback

Updated Jul 19, 2021
Roff

webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

python warc web-archiving web-archives pywb

Updated Nov 3, 2020
Python

bitextor / bitextor

Bitextor generates translation memories from multilingual websites

crawler dictionaries tokenizer machine-translation wget apertium neural-machine-translation warc tmx statistical-machine-translation corpus-generator httrack sentence-segmentation corpus-tools creepy corpus-processing hunalign parallel-corpora document-aligner bicleaner

Updated Jul 19, 2021
Python

cocrawler / cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

screenshot crawler concurrency async-python python3 aiohttp warc aiohttp-client pluggable-modules

Updated Jul 3, 2021
Python

machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"

chrome-extension warc web-archiving

Updated Jun 28, 2021
JavaScript

commoncrawl / news-crawl

News crawling with Storm-crawler - stores content as WARC

crawler news web-crawler apache-storm warc commoncrawl common-crawl

Updated Jan 28, 2021
Java

webrecorder / replayweb.page

Serverless Web Archive Replay directly in the browser

service-worker warc web-archiving wayback-machine web-archive replay-web-page web-replay

Updated Jul 16, 2021
JavaScript

helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

spark internet-archive warc web-archiving webarchive archivespark spark-framework

Updated May 6, 2021
Scala

N0taN3rd / wail

🐋 One-Click User Instigated Preservation

electron warc web-archiving high-fidelity-preservation browser-based-presrevation

Updated Feb 3, 2019
JavaScript

CGamesPlay / chronicler

Offline-first web browser

electron browser warc

Updated Jan 14, 2019
JavaScript

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

python warc web-archiving cdx web-archives commoncrawl cdx-api

Updated Jul 3, 2021
Python

N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js

warc web-archiving webarchive web-archives webarchiving warc-files chrome-remote-interface pupeteer

Updated Jun 4, 2021
JavaScript

ArchiveTeam / wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

crawler spider lua crawling archiving wget crawl zstd warc webarchiving archiveteam wget-lua wget-at

Updated May 4, 2021
C

archivesunleashed / warclight

A Rails engine supporting the discovery of web archives.

ruby rails rails-engine solr discovery blacklight warc webarchives webarchive-discovery

Updated Jul 19, 2021
Ruby

centic9 / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.

java mime-types warc cdx-files commoncrawl

Updated Jul 1, 2021
Java

PromyLOPh / crocoite

Web archiving using Google Chrome

devtools archiving chrome-browser warc

Updated Dec 30, 2019
Python

datatogether / warc

Golang WARC (Web ARChive) Library

golang package archiving warc iipc

Updated Aug 6, 2019
Go

pirate / internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

slideshow wget talks warc censorship web-archiving ethics internet-archiving archivebox

Updated Oct 19, 2020
JavaScript

hrbrmstr / warc

📇 Tools to Work with the Web Archive Ecosystem in R

r rstats warc warc-files r-cyber warc-ecosystem

Updated Aug 20, 2017
R

Mixnode / mixnode-warcreader-php

Read Web ARChive (WARC) files in PHP.

php warc webarchive

Updated Mar 10, 2017
PHP

webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Updated Jul 15, 2021
Python

jedireza / warc

⚙️ A Rust library for reading and writing WARC files

rust rust-library warc

Updated Jul 3, 2021
Rust

ArchiveTeam / WebArchiver

Decentralized web archiving

python crawler web decentralized archiving archiver warc webarchiving

Updated Aug 7, 2018
Python

antiufo / Shaman.Dokan.Warc

Mounts WARC files on Windows

fuse dokan scraping mount warc web-archive

Updated Apr 20, 2019
C#
