parquet

Catchall for now for potential improvements to datastation/dsq.

SQL pre-processing
- Import only used fields (see #71)
- Do pre-filtering of data in SQLiteWriter: only insert rows that match the WHERE clause
- Support more input types using SQLiteWriter; this basically requires supporting expanded nested objects (see notes in #67)
- Maybe handle JSONL in parallel since newlines must
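
The first two ideas above (import only the used fields, and skip rows that cannot match the WHERE clause) can be sketched roughly as follows. This is a minimal Python illustration using the standard sqlite3 module, not dsq's actual SQLiteWriter (which is written in Go); the function name `load_filtered`, its parameters, and the table name `t` are all hypothetical.

```python
import sqlite3


def load_filtered(rows, used_fields, predicate, table="t"):
    """Load rows into an in-memory SQLite table, keeping only `used_fields`
    and only the rows for which `predicate(row)` is true.

    Hypothetical helper: identifiers are quoted naively, so field names
    containing double quotes are not supported in this sketch.
    """
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{f}"' for f in used_fields)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in used_fields)
    # The generator filters rows and projects fields lazily, so rejected
    # rows and unused fields never reach SQLite.
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([row.get(f) for f in used_fields] for row in rows if predicate(row)),
    )
    return conn
```

In a real implementation the predicate would have to be derived from the parsed WHERE clause, and only for conditions that can be evaluated safely before the data is in SQLite; the query would still run afterwards against the reduced table.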

It appears that all the dependencies required by Gaffer are available in Maven Central, which is the default repository used by Maven, although this may not have been the case in the past. When running builds, Maven occasionally tries to check repos.spark-packages.org if it can't find a package in Maven Central. This is often because of a mistake with the version.

It's unclear if this reposito

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, petastorm failed.
I figured out that petastorm relies on pyarrow, which actually supports Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c

Does the pre-built binary not support databases?

roapi -t "vocabs=sqlite:///data/vocabulary.sqlite"
[2022-05-31T06:48:11Z INFO  roapi::context] loading `uri(sqlite:///data/vocabulary.sqlite)` as table `vocabs`
Error: Database error: Enable 'database' feature flag to support this

Would you explain in the README how to enable it?

I'm new to Rust; after some searching I got it workin

Currently, there isn't a way to get the table properties in the SparkOrcWriter via the WriterFactory.

I'm submitting a

[x] bug report.

Current Behaviour:

After #249, running pytest tests/rdf_tests/test_rdf_basic.py -k test_rdf_runner -s produces a report file covering all the tests run.
Some tests return errors, for example:

{
    "Basic - Term 7": {
        "input": "basic/data-4.ttl",
        "query": "basic/term-7.rq",
        "error": "Expected {Sele

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example, see diffAvro here: https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in a schema; otherwise they receive an uninformative error regarding a null schema, add

We have 6 eslint rules that are still set to only warn, but we want to get them all up to error. They are:

@typescript-eslint/no-explicit-any -- this one will be a lot of work, and I'm not sure there is an easy path here, but we want to get there eventually
rilldata/rill-developer#331 @typescript-eslint/no-unused-vars -- this should be a simple es

As a user, I would like a command to aid in debugging Parquet files. For instance, I would like to obtain the following file stats in a single command:

compression algorithm
page type v1/v2?
row group size
author / created by
version
metadata
page size
total records / row count
any internal info that could help too
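
As a rough starting point for such a command, the sketch below locates the footer of a Parquet file using only the Python standard library. It relies on the documented file layout: the file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The function name `read_footer_info` is hypothetical, and the sketch assumes an unencrypted file. The stats listed above (compression, row groups, created-by, and so on) live inside the Thrift-encoded footer, so decoding them would additionally need a Thrift-aware Parquet library, which is not shown here.

```python
import struct

MAGIC = b"PAR1"


def read_footer_info(data: bytes) -> dict:
    """Return the size and position of the footer of a Parquet file
    whose full contents are held in `data` (hypothetical helper)."""
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # Footer length is a 4-byte little-endian integer before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_offset = len(data) - 8 - footer_len
    if footer_offset < 4:
        raise ValueError("footer length exceeds file size")
    return {
        "file_size": len(data),
        "footer_length": footer_len,
        "footer_offset": footer_offset,
    }
```

For a file on disk, something like `read_footer_info(open(path, "rb").read())` would work for small files; a real tool would seek to the end and read only the last few bytes plus the footer.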

Here are 259 public repositories matching this topic...

multiprocessio / dsq

apache / drill

gchq / Gaffer

apache / parquet-mr

uber / petastorm

roapi / roapi

quiltdata / quilt

apache / parquet-format

bigdatagenomics / adam

HariSekhon / DevOps-Python-tools

Cinchoo / ChoETL

Netflix / iceberg

ranaroussi / pystore

skale-me / skale

DerwenAI / kglab

apache / parquet-cpp

RandomFractals / vscode-data-preview

moshe / elasticsearch_loader

spotify / ratatool

elastacloud / parquet-dotnet

sksamuel / centurion

rilldata / rill-developer

mukunku / ParquetViewer

ironSource / parquetjs

scikit-hep / awkward-0.x

fraugster / parquet-go

Eugene-Mark / bigdata-file-viewer

mjakubowski84 / parquet4s

Chabane / bigdata-playground

cldellow / sqlite-parquet-vtable
