
[SPARK-33055][PYTHON][SQL] Add Python CalendarIntervalType #29935

Open · wants to merge 1 commit into base: master from zero323:SPARK-33055
Conversation

@zero323 (Contributor) commented Oct 2, 2020

What changes were proposed in this pull request?

This PR adds CalendarIntervalType to pyspark.sql.types.

Why are the changes needed?

Right now, queries resulting in interval types fail with a ValueError when _parse_datatype_json_string is invoked:

>>> spark.sql("SELECT current_date() - current_date()")                                                                                                                                  
Traceback (most recent call last):
...
ValueError: Could not parse datatype: interval

This is a serious loss of functionality.

Does this PR introduce any user-facing change?

If this PR is accepted, the example query will yield the expected result:

>>> spark.sql("SELECT current_date() - current_date()")
DataFrame[subtractdates(current_date(), current_date()): interval]

Please note that providing a mapping between the internal and external types is not a goal here.

How was this patch tested?

New unit test.
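For reference, the CI output below shows the shape of the new class. Here is a minimal, self-contained sketch of that singleton pattern — the names mirror pyspark.sql.types, but this is a standalone illustration under assumed simplifications, not the actual Spark source:

```python
class DataTypeSingleton(type):
    """Metaclass caching one instance per class, as PySpark does for
    parameter-less data types (standalone mock, not the Spark source)."""
    _instances = {}

    def __call__(cls):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__()
        return cls._instances[cls]


class DataType:
    """Minimal stand-in for pyspark.sql.types.DataType."""

    def simpleString(self):
        return self.typeName()

    @classmethod
    def typeName(cls):
        return cls.__name__[: -len("Type")].lower()


class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
    """Represents Spark's CalendarInterval; no Python value mapping."""

    @classmethod
    def typeName(cls):
        return "interval"
```

With this pattern, repeated constructor calls return the same cached instance, which is why parameter-less PySpark types can be compared with `is`.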

@SparkQA commented Oct 2, 2020

Test build #129354 has finished for PR 29935 at commit d9a6e44.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
@zero323 zero323 force-pushed the zero323:SPARK-33055 branch from d9a6e44 to 80fbf32 Oct 2, 2020
@SparkQA commented Oct 2, 2020

Test build #129357 has finished for PR 29935 at commit 80fbf32.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
@zero323 zero323 force-pushed the zero323:SPARK-33055 branch from 80fbf32 to 7d6ebd2 Oct 2, 2020
@SparkQA commented Oct 2, 2020

Test build #129359 has finished for PR 29935 at commit 7d6ebd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
@@ -186,6 +186,30 @@ def fromInternal(self, ts):
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)


class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
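The unchanged context line above converts Spark's internal timestamp representation (microseconds since the epoch) into a Python datetime. A quick standalone check of that arithmetic, using a made-up internal value:

```python
import datetime

# Spark stores timestamps internally as microseconds since the epoch.
ts = 1_601_600_000_123_456  # hypothetical internal value

# Same conversion as TimestampType.fromInternal in the diff context above:
dt = datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

# The sub-second part survives the floor-division / modulo split.
assert dt.microsecond == 123456
```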

@HyukjinKwon (Member) commented Oct 3, 2020

There have been a lot of discussions about exposing the interval type in other language APIs, but I lost track. @yaooqinn and @cloud-fan, are we going to make interval a properly exposed type? Or only support it in some contexts?

@zero323 (Author, Contributor) commented Oct 3, 2020

Yes, I glanced over a few threads and couldn't really figure out where it is going, hence the limited scope of this PR. However, if the type is supported in multiple contexts, the current behavior doesn't seem like an intended one.

@nchammas (Contributor) commented Jan 21, 2021

Doesn't @zero323's example from the PR description show that Spark already exposes this type?

spark.sql("SELECT current_date() - current_date()")

For the record, btw, Postgres supports an interval type and has done so since at least version 7.1, which was released in 2001. (I mention this since Postgres often comes up as a reference for whether Spark SQL should support a feature or not.)

@HyukjinKwon (Member) commented Jan 22, 2021

The problem is that it has been half-exposed so far. There have been many discussions about which contexts we should support it in; e.g. CalendarInterval is marked as Unstable.

@HyukjinKwon (Member) commented Jan 22, 2021

BTW, I personally agree with adding this, in particular due to @nchammas's point in #29935 (comment).

@zero323 zero323 marked this pull request as ready for review Oct 5, 2020
@zero323 zero323 force-pushed the zero323:SPARK-33055 branch from 7d6ebd2 to 6ca50c5 Nov 3, 2020
@SparkQA commented Nov 3, 2020

Test build #130565 has finished for PR 29935 at commit 6ca50c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
@zero323 zero323 force-pushed the zero323:SPARK-33055 branch from 6ca50c5 to 69438b4 Dec 1, 2020
@SparkQA commented Jan 7, 2021

Test build #133778 has finished for PR 29935 at commit 69438b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
@zero323 (Author, Contributor) commented Jan 16, 2021

FYI @MrPowers (in case you're going to work on a Python API for make_interval and encounter this).

@MrPowers (Contributor) commented Jan 17, 2021

@zero323 - Adding CalendarIntervalType to PySpark is a great idea.

Additional context for others: CalendarIntervalType is already in the Scala API and allows for some awesome functionality.

Here's the Spark 3.0.1 behavior with Scala:

import java.sql.Date
import org.apache.spark.sql.functions._
val df = Seq(
  (Date.valueOf("2021-01-23"), Date.valueOf("2021-01-21"))
).toDF("date1", "date2")
df.withColumn("new_datediff", $"date1" - $"date2").show()
//+----------+----------+------------+
//|     date1|     date2|new_datediff|
//+----------+----------+------------+
//|2021-01-23|2021-01-21|      2 days|
//+----------+----------+------------+

df.withColumn("new_datediff", $"date1" - $"date2").printSchema()
//root
// |-- date1: date (nullable = true)
// |-- date2: date (nullable = true)
// |-- new_datediff: interval (nullable = true)

Getting this functionality in PySpark would be a huge win.

Let me know if there is anything I can do to help you move this PR forward @zero323!

@nchammas (Contributor) left a comment

LGTM. Are we just waiting on a second look from @yaooqinn or @cloud-fan?

def toInternal(self, di):
    raise NotImplementedError(
        "Conversion from external Python types to interval not supported"
    )
Comment on lines +202 to +205

@nchammas (Contributor) commented Jan 21, 2021

I suppose in the future if we want to support conversion of Python's datetime.timedelta, it would happen here, right?

@zero323 (Author, Contributor) commented Jan 21, 2021

For full support we might need both Python and JVM components. If I recall correctly, timedelta has a razorvine mapping to its internal net.razorvine.pickle.objects.TimeDelta.

In the opposite direction we could, if I am not mistaken, start by making CalendarInterval bean-compatible, but there is a compatibility issue ‒ we'd have to map from Spark's months to Python's days.
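A standalone sketch of that mismatch (the numbers are hypothetical; timedelta simply has no months field, so any conversion must assume a fixed month length and lose information):

```python
import datetime

# Spark's CalendarInterval stores (months, days, microseconds) as
# separate fields, because a month has no fixed length.
months, days, microseconds = 1, 2, 3_500_000  # hypothetical interval

# datetime.timedelta has no months field, so a mapping must pick an
# assumed month length (30 days here, purely for illustration):
approx = datetime.timedelta(days=days + 30 * months, microseconds=microseconds)

# timedelta normalizes the result into days/seconds/microseconds,
# and the original months component is no longer recoverable.
print(approx)
```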

@zero323 (Author, Contributor) commented Jan 21, 2021

Let me know if there is anything I can do to help you move this PR forward @zero323!

Thanks @MrPowers! I reckon we mostly need some attention here.
