#

pydata

Here are 87 public repositories matching this topic...

raybellwaves commented Sep 2, 2021

I'm hoping to get an idea of the in-memory size of a dask.dataframe once I call .compute() on it.

My current approach is:

import dask.dataframe as dd
from dask.utils import format_bytes

ddf = dd.demo.make_timeseries(
    start="2000-01-01",
    end="2000-01-02",
    dtypes={"x": float, "y": float, "id": int},
    freq="10ms",
    partition_freq="24h",
)

format_bytes(ddf.memory_u
NeroCorleone commented Aug 11, 2020

Problem description

Reading a dataset with eager's read functionality raises a ValueError when providing columns.

Example code (ideally copy-pastable)

import pandas as pd

from tempfile import TemporaryDirectory
from functools import partial
from storefact import get_store_from_url

from kartothek.io.eager import store_dataframes_as_dataset, read_dataset_as_data
randyzwitch commented Mar 28, 2019

In trying to write tests for #189, I'm finding it very difficult to add columns to existing tests. In some cases, such as the all_types table, the table is defined in a separate file from the tests, and multiple tests try to write to the same table.

Additionally, our test suite doesn't prove that the data that are uploaded are the same as the data downloaded for all types.

We should consider m
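
The two concerns above (tests sharing one table, and no proof that uploaded data equals downloaded data) can be illustrated with a round-trip check. This is only a sketch: it uses the standard-library `sqlite3` module as a stand-in database, and the `round_trip` helper, column names, and sample rows are hypothetical, not taken from the project's test suite.

```python
import sqlite3
import uuid


def round_trip(rows, columns):
    """Write rows to a fresh, uniquely named table and read them back."""
    # A unique table name per call means tests never share a table.
    table = f"t_{uuid.uuid4().hex}"
    con = sqlite3.connect(":memory:")
    try:
        col_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
        con.execute(f"CREATE TABLE {table} ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        # Order by insertion so the result can be compared to the input.
        return con.execute(f"SELECT * FROM {table} ORDER BY rowid").fetchall()
    finally:
        con.close()


# Hypothetical schema covering a few types; a real suite would list them all.
columns = [("i", "INTEGER"), ("f", "REAL"), ("s", "TEXT")]
rows = [(1, 1.5, "a"), (2, -0.25, "b")]
result = round_trip(rows, columns)
```

A test would then assert that `result == rows`. Because the schema is defined next to the test, adding a column only touches that one test.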

eric-czech commented Jun 15, 2021

For association testing and PCA (at least), it may be useful to have a function that imputes dosages/allele counts. With floating point values (i.e. from bgen), this can be very simple as a user, e.g. ds.call_genotype_probability.fillna(ds.call_genotype_probability.mean(dim="samples")). With alternate allele counts having a sentinel integer, it is a little more complicated. The best way t
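
The integer case mentioned above can be sketched with plain NumPy. This is not the approach the issue goes on to propose (the preview is cut off); it is a minimal mean-imputation example that assumes a `(variants, samples)` array of alternate allele counts with `-1` as the missing sentinel. The function name `impute_with_variant_mean` is hypothetical.

```python
import numpy as np


def impute_with_variant_mean(counts, missing=-1):
    """Replace sentinel values with the mean of the observed counts per variant.

    Assumes `counts` has shape (variants, samples). A variant with no
    observed calls stays NaN (NumPy also emits a RuntimeWarning for it).
    """
    counts = np.asarray(counts)
    mask = counts == missing
    # Convert to float so missing entries can be represented as NaN.
    observed = np.where(mask, np.nan, counts.astype(float))
    # Mean over the samples axis, ignoring NaN entries.
    means = np.nanmean(observed, axis=1, keepdims=True)
    return np.where(mask, means, observed)


counts = np.array([[0, 1, -1, 2], [2, -1, -1, 0]])
imputed = impute_with_variant_mean(counts)
```

The result is a float array, since a mean of integer counts is generally not an integer.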
