Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
cmd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
xio
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Span

Span started as a single tool to convert Crossref API data into a VuFind/SOLR format as used in finc. An intermediate representation for article metadata is used for normalizing various input formats. Go was choosen as the implementation language because it is easy to deploy and has concurrency support built into the language. A basic scatter-gather design allowed to process millions of records fast.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Installation

$ go get github.com/miku/span/cmd/...

Span has frequent releases, although not all versions will be packaged as deb or rpm.

Background

Initial import Tue Feb 3 19:11:08 2015, a single span command. In March 2015, span-import and span-export appeared. There were some rudimentary commands for dealing with holding files of various formats. In early 2016, a licensing tool was briefly named span-label before becoming span-tag. In Summer 2016, span-check, span-deduplicate, span-redact were added, later a first man-page followed. In Summer 2017, span-deduplicate was gone, the doi-based deduplication was split up between the blunt, but fast groupcover and the generic span-update-labels. A new span-oa-filter helped to mark open-access records. In Winter 2017, a span-freeze was added to allow for fixed configuration across dozens of files. The span-crossref-snapshot tool replaced a sequence of luigi tasks responsible for creating a snapshot of crossref data (the process has been summarized in a comment). In Summer 2018, three new tools were added: span-compare for generating index diffs for index update tickets, span-review for generating reports based on SOLR queries and span-webhookd for triggering index reviews and ticket updates through GitLab. During the development, new input and output formats have been added. The parallel processing of records has been streamlined with the help of a small library called parallel. Since Winter 2017, the zek struct generator takes care of the initial screening of sources serialized as XML - making the process of mapping new data sources easier.

Documentation

See: manual source.

Performance

Processing 150M JSON documents regularly and fast requires a bit of care. In the best case no complete processing of the data should take more than two hours or run slower than 20000 records/s. The most expensive part is the JSON serialization, but we keep JSON for now for the sake of readability. Experiments with faster JSON serializers and msgpack have been encouraging, a faster serialization should be the next measure to improve performance.

Most tools that work on lines will try to use as many workers as CPU cores. Except for span-tag - which needs to keep all holdings data in memory - all tools work well in a low-memory environment.

Integration

The span tools are used in various tasks in siskin (which contains all orchestration code). All span tools work fine standalone, and most will accept input from stdin as well, allowing for one-off things like:

$ metha-cat http://oai.web | span-import -i name | span-tag -c amsl | span-export | solrbulk

TODO

There is a open issue regarding more flexible license labelling. While this would be useful, it would be probably even more useful to separate content conversions from licensing issues altogether. There is lots of work done in prototypes, which explore how fast and how reliable we can rewrite documents in a production server.

Ideally, a cron job or trigger regularly checks and ensures compliance.

You can’t perform that action at this time.