Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

beam.CombineValues on DataFlow runner causes ambiguous failure with python SDK #21432

Open
damccorm opened this issue Jun 4, 2022 · 0 comments
Open

Comments

@damccorm
Copy link
Collaborator

@damccorm damccorm commented Jun 4, 2022

 

The following beam pipeline works correctly using DirectRunner but fails with a very vague error when using DataflowRunner.


(    
pipeline    
| beam.io.ReadFromPubSub(input_topic, with_attributes=True)    
| beam.Map(pubsub_message_to_row)
   
| beam.WindowInto(beam.transforms.window.FixedWindows(5))    
| beam.GroupBy(<beam.Row col name>)
   
| beam.CombineValues(<instance of beam.CombineFn subclass>)    
| beam.Values()  
| beam.io.gcp.bigquery.WriteToBigQuery(
. . . )
)

Stacktrace:


Traceback (most recent call last):
  File "src/read_quality_pipeline/__init__.py", line 128, in <module>

   (
  File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py",
line 597, in __exit__
    self.result.wait_until_finish()
  File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py",
line 1633, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException:
Dataflow pipeline failed. State: FAILED, Error:
Error processing pipeline. 

Log output:


2022-02-01T16:54:43.645Z: JOB_MESSAGE_WARNING: Autoscaling is enabled for Dataflow Streaming Engine.
Workers will scale between 1 and 100 unless maxNumWorkers is specified.
2022-02-01T16:54:43.736Z: JOB_MESSAGE_DETAILED:
Autoscaling is enabled for job 2022-02-01_08_54_40-8791019287477103665. The number of workers will be
between 1 and 100.
2022-02-01T16:54:43.757Z: JOB_MESSAGE_DETAILED: Autoscaling was automatically enabled
for job 2022-02-01_08_54_40-8791019287477103665.
2022-02-01T16:54:44.624Z: JOB_MESSAGE_ERROR: Error
processing pipeline. 

With the CombineValues step removed this pipeline successfully starts in dataflow.

 

I thought this was an issue with Dataflow on the server side since the Dataflow API (v1b3.projects.locations.jobs.messages) is just returning the textPayload: "Error processing pipeline". But then I found the issue BEAM-12636 where a go SDK user has the same error message but seemingly as a result of bugs in the go SDK?

Imported from Jira BEAM-13795. Original Jira may contain additional context.
Reported by: Jake_Zuliani.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant