parquet

In many places in Gaffer, correct logging string formatting is used:
https://github.com/gchq/Gaffer/blob/fb63fc25be2f01d6476d1010780500cb0856b6c4/store-implementation/accumulo-store/src/main/java/uk/gov/gchq/gaffer/accumulostore/operation/hdfs/handler/AddElementsFromHdfsHandler.java#L254

However, in some places, string concatenation is incorrectly used instead:
https://github.com/gchq/Gaffer/
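
The difference between the two styles can be sketched as follows. Gaffer itself is Java, so this is only a Python analogue of the same idea using the standard `logging` module, not Gaffer's code; the `Expensive` class and both function names are hypothetical and exist purely for illustration.

```python
import logging


class Expensive:
    """Hypothetical value that counts how often it is turned into a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive-value"


logger = logging.getLogger("sketch.lazy_logging")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled for this logger


def log_parameterized(value):
    # Placeholder style: the logger checks the level first and only
    # formats the message (calling str(value)) if the record is emitted.
    logger.debug("Processing %s", value)


def log_concatenated(value):
    # Concatenation style: the string is built before the logger
    # gets a chance to check whether DEBUG is enabled.
    logger.debug("Processing " + str(value))
```

With DEBUG disabled, the parameterized call never formats its argument, while the concatenated call pays the formatting cost even though nothing is logged.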

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c

Not sure if this would be of interest, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add in options:
glob

pattern: "file_typev1*.parquet"

or regexp

pattern: "\wfile_type\wv1\w*.parquet"

It would allow selecting in uri's with different exte
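
As a rough illustration of what the two proposed pattern styles would select, here is a minimal Python sketch using only the standard library. It is not roapi code: the function names and file names are hypothetical, and the regular expression is a simplified variant written for this example rather than the one quoted above.

```python
import fnmatch
import re


def select_by_glob(names, pattern):
    # Shell-style matching: '*' matches any run of characters.
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def select_by_regex(names, pattern):
    # Regular-expression matching against the whole file name.
    rx = re.compile(pattern)
    return [n for n in names if rx.fullmatch(n)]


# Hypothetical directory listing under the table's uri.
names = [
    "file_typev1_a.parquet",
    "file_typev1_b.parquet",
    "file_typev2_a.parquet",
    "notes.txt",
]

glob_hits = select_by_glob(names, "file_typev1*.parquet")
regex_hits = select_by_regex(names, r"file_typev1\w*\.parquet")
```

Both calls pick the two `file_typev1` files and skip the rest; the regex form needs the dot escaped, which is the main practical difference between the two styles here.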

Currently, there isn't a way to get the table properties in the SparkOrcWriter via the WriterFactory.

I'm submitting a

[x] bug report.

Current Behaviour:

After #249
Trying to run tests with pytest tests/rdf_tests/test_rdf_basic.py -k test_rdf_runner -s, you get a report file with all the tests run.
Some tests return errors, for example:

{
    "Basic - Term 7": {
        "input": "basic/data-4.ttl",
        "query": "basic/term-7.rq",
        "error": "Expected {Sele

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in the schema; otherwise they receive a non-informative error regarding a null schema, add

As a user, I would like a command to aid in debugging Parquet files. For instance, I would like to obtain the following file stats in a single command:

compression algorithm
page type v1/v2?
row group size
author / created by
version
metadata
page size
total records / row count
any internal info that could help too
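
Most of the stats listed above live in the file's footer metadata, which is Thrift-encoded and needs a Parquet library to decode. As a minimal first check, assuming a file with a plaintext (unencrypted) footer, the sketch below only verifies the `PAR1` magic bytes at both ends of the file and reads the 4-byte little-endian footer length stored just before the trailing magic. The function name `footer_length` is hypothetical and is not part of any tool mentioned here.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def footer_length(path):
    """Return the size in bytes of the footer metadata block.

    Raises ValueError if the file does not look like a Parquet file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # jump to the end to learn the file size
        size = f.tell()
        if size < 12:  # leading magic + length field + trailing magic
            raise ValueError("file too small to be a Parquet file")
        f.seek(-8, 2)  # last 8 bytes: footer length + trailing magic
        tail = f.read(8)

    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")

    (length,) = struct.unpack("<I", tail[:4])
    if length > size - 12:
        raise ValueError("footer length exceeds file size")
    return length
```

A full debugging command would go on to decode the footer itself to report compression, row groups, row counts, and the "created by" string; that step is deliberately left out of this sketch.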

Here are 247 public repositories matching this topic...

multiprocessio / dsq

gchq / Gaffer

apache / drill

apache / parquet-mr

uber / petastorm

roapi / roapi

quiltdata / quilt

apache / parquet-format

bigdatagenomics / adam

HariSekhon / DevOps-Python-tools

Cinchoo / ChoETL

Netflix / iceberg

ranaroussi / pystore

skale-me / skale

DerwenAI / kglab

apache / parquet-cpp

RandomFractals / vscode-data-preview

moshe / elasticsearch_loader

elastacloud / parquet-dotnet

sksamuel / centurion

spotify / ratatool

mukunku / ParquetViewer

ironSource / parquetjs

scikit-hep / awkward-0.x

fraugster / parquet-go

Chabane / bigdata-playground

mjakubowski84 / parquet4s

cldellow / sqlite-parquet-vtable

Eugene-Mark / bigdata-file-viewer

RumbleDB / rumble
