[SPARK-33055][PYTHON][SQL] Add Python CalendarIntervalType #29935
Conversation
Test build #129354 has finished for PR 29935 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129357 has finished for PR 29935 at commit
Kubernetes integration test starting
Test build #129359 has finished for PR 29935 at commit
Kubernetes integration test status success
Kubernetes integration test starting
@@ -186,6 +186,30 @@ def fromInternal(self, ts):
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

class CalendarIntervalType(DataType, metaclass=DataTypeSingleton):
HyukjinKwon
Oct 3, 2020
Member
There have been a lot of discussions about exposing the interval type in other language APIs, but I lost track. @yaooqinn and @cloud-fan, are we going to make interval a properly exposed type, or only support it in some contexts?
zero323
Oct 3, 2020
Author
Contributor
Yes, I glanced over a few threads and couldn't really figure out where it is going, hence the limited scope of this PR. However, if the type is supported in multiple contexts, the current behavior doesn't seem like an intended one.
nchammas
Jan 21, 2021
Contributor
Doesn't @zero323's example from the PR description show that Spark already exposes this type?
spark.sql("SELECT current_date() - current_date()")
For the record, btw, Postgres supports an interval type and has done so since at least version 7.1, which was released in 2001. (I mention this since Postgres often comes up as a reference for whether Spark SQL should support a feature or not.)
HyukjinKwon
Jan 22, 2021
Member
The problem is that it has been half-exposed so far. There have been many discussions about which contexts we should support, e.g. CalendarInterval is marked as Unstable.
HyukjinKwon
Jan 22, 2021
Member
BTW, I personally agree with adding this, in particular due to @nchammas's point in #29935 (comment).
Test build #130565 has finished for PR 29935 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #133778 has finished for PR 29935 at commit
FYI @MrPowers (in case you're going to work on the Python API for this).
@zero323 - Adding additional context for others: CalendarIntervalType is already in the Scala API and allows for some awesome functionality. Here's the Spark 3.0.1 behavior with Scala:

import java.sql.Date
import org.apache.spark.sql.functions._

val df = Seq(
  (Date.valueOf("2021-01-23"), Date.valueOf("2021-01-21"))
).toDF("date1", "date2")

df.withColumn("new_datediff", $"date1" - $"date2").show()
//+----------+----------+------------+
//|     date1|     date2|new_datediff|
//+----------+----------+------------+
//|2021-01-23|2021-01-21|      2 days|
//+----------+----------+------------+

df.withColumn("new_datediff", $"date1" - $"date2").printSchema()
//root
// |-- date1: date (nullable = true)
// |-- date2: date (nullable = true)
// |-- new_datediff: interval (nullable = true)

Getting this functionality in PySpark would be a huge win. Let me know if there is anything I can do to help you move this PR forward @zero323!
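For comparison (an aside, not from the thread): in plain Python, subtracting two dates already yields a datetime.timedelta, which is the external type later comments discuss as a possible counterpart to Spark's interval. A minimal stdlib-only illustration:

```python
import datetime

# Same dates as the Scala example above.
d1 = datetime.date(2021, 1, 23)
d2 = datetime.date(2021, 1, 21)

delta = d1 - d2  # date arithmetic in Python returns a timedelta
print(delta.days)            # 2
print(type(delta).__name__)  # timedelta
```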
LGTM. Are we just waiting on a second look from @yaooqinn or @cloud-fan?
def toInternal(self, di):
    raise NotImplementedError(
        "Conversion from external Python types to interval not supported"
    )
nchammas
Jan 21, 2021
Contributor
I suppose in the future, if we want to support conversion of Python's datetime.timedelta, it would happen here, right?
zero323
Jan 21, 2021
Author
Contributor
For full support we might need both Python and JVM components. If I recall correctly, timedelta has a razorvine mapping to its internal net.razorvine.pickle.objects.TimeDelta.
In the opposite direction we could, if I am not mistaken, start with making CalendarInterval bean-compatible, but there is a compatibility issue ‒ we'd have to map from Spark's months to Python's days.
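To illustrate the compatibility issue mentioned above, here is a hypothetical sketch (the function name and days_per_month choice are illustrative, not from the PR) of mapping Spark's (months, days, microseconds) interval representation onto Python's timedelta. Since timedelta has no month component, any fixed days-per-month choice is necessarily lossy:

```python
import datetime

def interval_to_timedelta(months, days, microseconds, days_per_month=30):
    """Hypothetical, lossy conversion: timedelta has no month field,
    so months are approximated with a fixed days_per_month."""
    return datetime.timedelta(
        days=months * days_per_month + days,
        microseconds=microseconds,
    )

print(interval_to_timedelta(1, 2, 0))  # 32 days, 0:00:00
```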
What changes were proposed in this pull request?

This PR adds CalendarIntervalType to pyspark.sql.types.

Why are the changes needed?

Right now, queries resulting in interval types fail with ValueError when _parse_datatype_json_string is invoked. This is a serious loss of function.

Does this PR introduce any user-facing change?

If this PR is accepted, the example query will yield the expected results. Please note that providing a mapping between internal and external types is not a goal here.

How was this patch tested?

New unit test.
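As background on the DataTypeSingleton metaclass that appears in the diff: it ensures each simple type has exactly one shared instance. A simplified, self-contained sketch of that pattern (the real class also subclasses pyspark.sql.types.DataType, which is omitted here):

```python
class DataTypeSingleton(type):
    """Metaclass that caches one instance per type class,
    mirroring the singleton pattern used by pyspark.sql.types."""
    _instances = {}

    def __call__(cls):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__()
        return cls._instances[cls]


class CalendarIntervalType(metaclass=DataTypeSingleton):
    """Simplified stand-in; the real type derives from DataType."""

    def simpleString(self):
        return "interval"


print(CalendarIntervalType() is CalendarIntervalType())  # True
```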