
Conversation

@dandy10 (Contributor) commented Oct 23, 2020

I would like to be able to configure S3-compatible IO so that alternative endpoints can be used. #10560 began moving in this direction but has been stalled for a while; the main feedback there was that pipeline options should be used for configuration.

Some open questions I have:

  • Is there a reason that a separate S3IO instance was created for every operation in the s3filesystem? If so, I can save a reference to the pipeline options in the constructor and pass that in instead of saving a single S3IO (a sketch follows this list).
  • Are there any naming considerations for the S3Options? Should they begin with a naming prefix to differentiate them from other options provided on the command line?
  • Does it make sense to pass tokens/keys in through pipeline options, or should they instead be pulled from the environment?
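
A minimal sketch of the alternative in the first question, purely illustrative (Beam FileSystem subclasses receive pipeline options in their constructor, but the exact names and wiring below are assumptions rather than this PR's code):

from apache_beam.io.aws import s3io
from apache_beam.io.filesystem import FileSystem


class S3FileSystem(FileSystem):
  def __init__(self, pipeline_options):
    super().__init__(pipeline_options)
    # Keep a reference to the options instead of a single pre-built S3IO.
    self._options = pipeline_options

  def _path_open(self, path, mode):
    # Each operation constructs its own S3IO, configured from the stored options.
    return s3io.S3IO(options=self._options).open(path, mode)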

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch): [table of Jenkins build-status badges omitted]

Pre-Commit Tests Status (on master branch): [table of Jenkins build-status badges omitted]

See .test-infra/jenkins/README for the trigger phrase, status, and link of every Jenkins job.

GitHub Actions Tests Status (on master branch): Build python source distribution and wheels · Python tests · Java tests

See CI.md for more information about GitHub Actions CI.

@dandy10 (Contributor, Author) commented Oct 23, 2020

@chamikaramj @aaltay @udim @pabloem @charlesccychen, apologies for tagging you all; you're listed on the nearest OWNERS files and I'm not sure who is most relevant.

@pabloem (Member) commented Oct 23, 2020

thanks. I can review this

@pabloem pabloem self-requested a review October 23, 2020 19:02
@pabloem (Member) commented Oct 23, 2020

Run Portable_Python PreCommit

Comment on lines 1341 to 1358
parser.add_argument(
    '--endpoint_url',
    default=None,
    help='The complete URL to use for the constructed s3 client.')
parser.add_argument(
    '--region_name',
    default=None,
    help='The name of the region associated with the client.')
parser.add_argument(
    '--api_version', default=None, help='The API version to use.')
parser.add_argument(
    '--verify',
    default=None,
    help='Whether or not to verify SSL certificates.')
parser.add_argument(
    '--use_ssl',
    default=True,
    help='Whether or not to use SSL. By default, SSL is used.')
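
For context, these flags line up with keyword arguments that boto3 itself accepts when constructing an S3 client. A rough sketch of that mapping, assuming plain boto3 rather than this PR's actual client code:

import boto3


def make_s3_client(endpoint_url=None, region_name=None, api_version=None,
                   verify=None, use_ssl=True, access_key_id=None,
                   secret_access_key=None, session_token=None):
  # Sketch only: forwards the proposed options to boto3; the PR's real client
  # construction may differ in names and defaults.
  return boto3.client(
      's3',
      endpoint_url=endpoint_url,  # e.g. a MinIO or other S3-compatible endpoint
      region_name=region_name,
      api_version=api_version,
      verify=verify,  # True/False or a path to a CA bundle
      use_ssl=use_ssl,
      aws_access_key_id=access_key_id,
      aws_secret_access_key=secret_access_key,
      aws_session_token=session_token)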
Member:
You are correct that it's desirable to use a sort of namespace prefix. Perhaps --aws_ or --aws_s3_. What do you think?

You must know more about S3 and AWS than I do. I wonder if aws_session_token, aws_secret_access_key, and aws_access_key_id in this context are specific to S3, or if they provide some sort of AWS-wide authentication?

If they're s3-specific, then maybe we should namespace them as aws_s3? Let me know what you think.

Member:

+1. Please do not use endpoint_url, region_name, api_version, verify, use_ssl and so on without a prefix.

Contributor Author:

Fair enough. I've moved them all to use the same s3 prefix, which I think makes sense given they are collected under the S3Options class. @pabloem I believe the access keys can also be used for other AWS services, although I've never actually used them that way. For now I think it makes sense to keep the more generic AWS options together with the S3 options, given that this is the only use case at the moment. If another AWS service is added in the future, it could make sense to split them up then.

Member:

Thanks! That makes sense to me. Can you fix the broken unit tests? Other than that, the change looks great (and it's very welcome, as we'd been needing it).

Contributor Author:

Will do. I tried to have a look at the failing tests and can't figure out which ones actually failed (the formatting is quite difficult to parse). Unfortunately I don't have access to Windows to run them locally. The same pattern of failures seems to be affecting #13187, which is just a comment change, so perhaps the failures are unrelated?

@pabloem pabloem self-requested a review October 24, 2020 18:47
@pabloem (Member) commented Oct 26, 2020

The pre-commit failures are from Dataflow jobs trying to start the SDK worker:
[screenshot of the worker startup error omitted]

@pabloem (Member) commented Oct 26, 2020

I am a bit confused by it. I can't repro it locally; I'll try a couple more things.

@pabloem (Member) commented Oct 26, 2020

Run Python Precommit

@dandy10 (Contributor, Author) commented Oct 27, 2020

I think it is probably due to incorrect processing of the boolean flag. I've changed to using an action on the argparser, and hopefully that should sort out the issue.
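
A small sketch of the underlying argparse pitfall and an action-based fix (the flag names here are illustrative, not necessarily what this PR settled on):

import argparse

parser = argparse.ArgumentParser()

# Pitfall: with a plain default, '--use_ssl=False' parses to the string 'False',
# which is truthy, so the option effectively stays enabled whatever the user passes.
parser.add_argument('--use_ssl_broken', default=True)

# Action-based fix: store_true yields a real boolean. Inverting the flag
# (disable rather than enable) preserves the default of using SSL.
parser.add_argument('--s3_disable_ssl', default=False, action='store_true')

args = parser.parse_args(['--use_ssl_broken=False', '--s3_disable_ssl'])
print(bool(args.use_ssl_broken))  # True -- the string 'False' is truthy
print(args.s3_disable_ssl)  # True -- a genuine boolean from the action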

@dandy10 (Contributor, Author) commented Oct 27, 2020

The two failures in the Windows test are PermissionError: [WinError 32] The process cannot access the file because it is being used by another process, so it seems the tests are not being isolated sufficiently.

@pabloem (Member) commented Oct 27, 2020

That's right; those are known flakes. Thanks @dandy10!
I'll merge once Flink PVR passes.

@pabloem (Member) commented Oct 27, 2020

Run Python_PVR_Flink PreCommit

@pabloem pabloem merged commit b35d4cc into apache:master Oct 27, 2020
@pabloem (Member) commented Oct 27, 2020

Thanks @dandy10! This is great!

@dandy10 dandy10 deleted the s3-config branch January 16, 2021 20:00
@ConverJens commented Jan 18, 2021

@dandy10 @pabloem
Great work with this PR!
I'm trying to get S3 (MinIO) to work for TFX, and I have it working everywhere except in the Beam components, where I get this strange error:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 742, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 867, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/iobase.py", line 1129, in process
    self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 196, in open_writer
    return FileBasedSinkWriter(self, writer_path)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 417, in __init__
    self.temp_handle = self.sink.open(temp_shard_path)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 138, in open
    return FileSystems.create(temp_path, self.mime_type, self.compression_type)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filesystems.py", line 229, in create
    return filesystem.create(path, mime_type, compression_type)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 171, in create
    return self._path_open(path, 'wb', mime_type, compression_type)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 151, in _path_open
    raw_file = s3io.S3IO(options=self._options).open(
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3io.py", line 63, in __init__
    raise ValueError('Must provide one of client or options')
ValueError: Must provide one of client or options

Do you have any idea what I'm doing wrong?

These are the Beam pipeline args that I'm supplying, and I know for sure that at least the multi-processing and number-of-workers arguments are applied:

'--direct_running_mode=multi_processing',
f'--direct_num_workers={NR_OF_CPUS}',
'--s3_endpoint_url=minio-service.kubeflow:9000',
f'--s3_access_key={ACCESS_KEY}',
f'--s3_secret_access_key={SECRET_ACCESS_KEY}',
'--s3_verify=False'

Help would be greatly appreciated!
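
Not a confirmed diagnosis, but one quick check is whether those flags actually make it into the PipelineOptions that the S3 filesystem sees. A sketch, assuming the s3_* flag names above (values are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions([
    '--s3_endpoint_url=http://minio-service.kubeflow:9000',
    '--s3_verify=False',
])
# If the s3_* options are recognised, they should appear in this dict. The
# traceback above is raised when S3IO is constructed without a client or
# options object, which suggests the options are not reaching the filesystem.
print(opts.get_all_options())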
