pydata

We're trying to introduce Parquet into our team, and the largest blocker that we've seen is the dreaded "schemas are inconsistent" error message:

RuntimeError: Schemas are inconsistent, try using to_parquet(..., schema="infer"), or pass an explicit pyarrow schema. Such as to_parquet(..., schema={"column1": pa.string()})

This error message is super unhelpful: surely Dask knows what th

Is your feature request related to a problem? Please describe.
Our Python docstrings have various style violations when compared against standards like PEP 257. Not only does this impact readability (which may be subjective), it also reduces the effectiveness of tools like Sphinx or numpydoc that rely on specific formatting to parse docstrings.

The stumpy.snippets feature is now completed in #283 which follows this work:

We have a rough notebook t

tornado.IOLoop.run_sync is deprecated and must be removed from our code base.

The CLI scripts all call this, and replacing it with asyncio.run should be possible.
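A minimal sketch of the asyncio.run pattern, using only the standard library. The names `main` and `run_cli` are hypothetical stand-ins for a CLI script's body, not names from the project, and the sketch does not attempt to handle signals or close a tornado loop.

```python
import asyncio


async def main() -> int:
    """Hypothetical CLI entry point; stands in for the real script body."""
    # Placeholder for the actual asynchronous work of the script.
    await asyncio.sleep(0)
    return 0


def run_cli() -> int:
    # asyncio.run creates a fresh event loop, runs the coroutine to
    # completion, and closes that loop before returning the result.
    return asyncio.run(main())


exit_code = run_cli()
```

Because asyncio.run owns the loop it creates, any setup that expects a long-lived loop to exist beforehand would need to move inside the coroutine.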

Caveats

The way we handle signals needs to be adjusted
Once asyncio.run finishes we need to ensure the tornado loop is also closed
Behaviour of preload modules may be affected if they are using loops about whe

Background

This thread is born out of the discussion from #968, in an effort to make the documentation more beginner-friendly and easier to understand.
One of the subtasks mentioned in that thread was to go through the function docstrings and add a minimal working example to each of the public functions in pyjanitor.

Criteria reiterated here for the benefit of discussion:

It sh

Description

There are several directives that are not supported in this theme (at least, they do not have an effect in the built docs), but that are a part of the rST / Sphinx spec. We should add support for these directives. Here are a few known ones:

highlights
pull-quotes
epigraphs

Implementation

The way to accomplish this would be to:

See wha

Problem description

Reading a dataset with eager's read functionality raises a ValueError when providing columns.

Example code (ideally copy-pastable)

import pandas as pd

from tempfile import TemporaryDirectory
from functools import partial
from storefact import get_store_from_url

from kartothek.io.eager import store_dataframes_as_dataset, read_dataset_as_data

In trying to write tests for #189, I'm finding it very difficult to add columns to existing tests, as in some cases, like the all_types table, the table is defined in a separate file from the tests and multiple tests try to write to the same table.

Additionally, our test suite doesn't prove that the data that are uploaded are the same as the data downloaded for all types.

We should consider m

For association testing and PCA (at least), it may be useful to have a function that imputes dosages/allele counts. With floating point values (i.e. from bgen), this can be very simple for a user, e.g. ds.call_genotype_probability.fillna(ds.call_genotype_probability.mean(dim="samples")). With alternate allele counts having a sentinel integer, it is a little more complicated. The best way t
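For the sentinel-integer case, one possible approach is sketched below in plain Python. It is not the library's API: the `impute_mean` helper, the `-1` sentinel, the variants-by-samples layout, and the fallback to 0.0 for variants with no observed calls are all assumptions made for illustration.

```python
from statistics import fmean

MISSING = -1  # assumed sentinel marking a missing call


def impute_mean(counts, missing=MISSING):
    """Replace sentinel entries in each row with the mean of that row's observed values.

    Each row is one variant; each column is one sample.
    """
    imputed = []
    for row in counts:
        observed = [c for c in row if c != missing]
        # Assumption: fall back to 0.0 when a variant has no observed calls.
        fill = fmean(observed) if observed else 0.0
        imputed.append([float(c) if c != missing else fill for c in row])
    return imputed


allele_counts = [
    [0, 1, -1, 2],   # one missing call
    [-1, -1, 2, 2],  # two missing calls
]
result = impute_mean(allele_counts)
# result == [[0.0, 1.0, 1.0, 2.0], [2.0, 2.0, 2.0, 2.0]]
```

The key difference from the floating-point case is that the sentinel has to be masked out before the mean is computed, otherwise it would bias the fill value.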

As we write and update more docstrings, I think it would be helpful to specify what is expected and to do some checks in CI (and git pre-commit).

Like other libraries in the PyData ecosystem, I think we should rely heavily on the NumPy-style docstrings:

https://numpydoc.readthedocs.io/en/latest/format.html
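As an illustration of the layout described in that guide, here is a small sketch of a NumPy-style docstring. The `scale` function is hypothetical and exists only to show the section structure.

```python
def scale(values, factor=1.0):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, optional
        Multiplier applied to every element. Default is 1.0.

    Returns
    -------
    list of float
        The scaled values, in the original order.

    Examples
    --------
    >>> scale([1.0, 2.0], factor=2.0)
    [2.0, 4.0]
    """
    return [v * factor for v in values]
```

The underlined section headers are what tools such as numpydoc parse, so keeping their names and order consistent is what makes automated checks feasible.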

We can even use velin to help enforce this and identify common mistakes:

htt

Here are 90 public repositories matching this topic...

dask / dask

rapidsai / cudf

databricks / koalas

pydata / pandas-datareader

TDAmeritrade / stumpy

dask / distributed

pyjanitor-devs / pyjanitor

DataTau / datascience-anthology-pydata

pydata / pydata-sphinx-theme

JDASoftwareGroup / kartothek

JasonKessler / Scattertext-PyData

rasbt / pydata-chicago2016-ml-tutorial

heavyai / pymapd

sktime / sktime-tutorial-pydata-amsterdam-2020

WinVector / pyvtreat

data-apis / array-api

pystatgen / sgkit

dimgold / pycon_social_networkx

mattilyra / pydataberlin-2017

stringfestdata / advancing-into-analytics-book

data-apis / array-api-comparison

martinapugliese / tales-science-data

jseabold / pandas-selectable

makepath / mapshader

gcampanella / pydata-london-2018

python-graphblas / python-graphblas

stanleyjzheng / PyData-Pseudolabelling-Keynote

yinleon / pydata2017

pydataberlin / meetup-slides

bweigel / ml_at_awslambda_pydatabln2018
