-
Notifications
You must be signed in to change notification settings - Fork 4.5k
BEAM-9094 Configure S3 client for IO to s3 compatible object stores #13180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@chamikaramj @aaltay @udim @pabloem @charlesccychen, apologies for tagging you all, you're on the nearest OWNERS files and not sure who is most relevant. |
|
thanks. I can review this |
|
Run Portable_Python PreCommit |
| parser.add_argument( | ||
| '--endpoint_url', | ||
| default=None, | ||
| help='The complete URL to use for the constructed s3 client.') | ||
| parser.add_argument( | ||
| '--region_name', | ||
| default=None, | ||
| help='The name of the region associated with the client.') | ||
| parser.add_argument( | ||
| '--api_version', default=None, help='The API version to use.') | ||
| parser.add_argument( | ||
| '--verify', | ||
| default=None, | ||
| help='Whether or not to verify SSL certificates.') | ||
| parser.add_argument( | ||
| '--use_ssl', | ||
| default=True, | ||
| help='Whether or not to use SSL. By default, SSL is used.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that it's desirable to use a sort of namespace prefix. Perhaps --aws_ or --aws_s3_. What do you think?
You must know more than me about s3 and AWS - I wonder if aws_session_token, aws_secret_access_key, aws_access_key_id in this context are specific for s3, or if they provide some sort of AWS-wide authentication?
If they're s3-specific, then maybe we should namespace them as aws_s3? Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Please do not use endpoint_url, region_name, api_version, verify, use_ssl and so on without a prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I've moved them all to use the same s3 prefix which I think makes sense given they are collected under the S3Options class. @pabloem I believe the access keys can also be used for other AWS services, although I've never actually used them. I think it makes sense to consolidate the more generic aws options with the s3 options together for now given that this is the only use case at the moment. If there is another AWS service added in the future it could make sense then to split them up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That makes sense to me. Can you fix the broken unittests? Other than that, the change looks great (and it's very welcome, as we'd been needing it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. I tried to have a look at the failing tests and can't figure out which ones have actually failed (the formatting is quite difficult to parse). Unfortunately I don't have access to windows to run locally. The same pattern of failures seems to be affecting #13187 which is just a comment change, so perhaps the failures are unrelated?
|
I am bit confused by it. I can't repro it locally - I'll try a couple more things. |
|
Run Python Precommit |
|
I think it is probably due to the incorrect processing of the boolean flag. I've changed to using an action on the argparser, and hopefully that should sort the issue. |
|
the two failures in the windows test are |
|
that's right - those are known flakes. Thanks @dandy10 ! |
|
Run Python_PVR_Flink PreCommit |
|
thanks @dandy10 ! this is great! |
|
@dandy10 @pabloem Do you have any idea what I'm doing wrong? These are the beam pipeline args that I'm supplying and I know for sure that at least the multi process and nr_of_workers arguments are applied: Help would be greatly appreciated! |

I would like to be able to configure s3 compatible IO in order to be able to use alternative endpoints. #10560 began moving in this direction but has been stalled for a while. The main comment there was that pipeline options should be used for configuration.
Some open questions that I have are:
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.