Fast fail metric #764

Open: wants to merge 8 commits into base: master

Conversation

rdji123 commented Nov 17, 2020
Contributor

What are you trying to accomplish with this PR?
Adding statsd data to get a better understanding of fast fail occurrences.

cc @Shopify/pipeline

rdji123 added 2 commits Nov 16, 2020
rdji123 requested a review from Shopify/krane as a code owner Nov 17, 2020
if progress_condition.present?
  StatsD.client.increment('kubectl.error', 1, tags: { context: context, namespace: namespace,
    progress_condition: deploy_failing_to_progress? })
end
Comment on lines 94 to 97

timothysmith0609 Nov 17, 2020
Contributor

I don't really like the idea of instrumenting the resource models themselves, and would prefer to move that to the actual task runner, where such instrumentation makes more sense. It's worth noting that we already capture timeout errors and publish them via StatsD (see https://shopify.datadoghq.com/dashboard/5kc-557-amd/krane--kubernetes-deploy-dash?from_ts=1605624831373&live=true&to_ts=1605628431373).

Alternatively, since we are constrained by the KubernetesResource interface, could we add a statsd_tag in Deployment#deploy_timed_out? when deploy_failing_to_progress? is true to indicate an actual progressing failure? 🤔 It would be hard to know if it's a fail-fast or an initial progressing failure, though.
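
For illustration, here is a minimal sketch of the two suggestions above: the resource only exposes an extra tag when `deploy_failing_to_progress?` is true, and the reporting happens on the runner side. Everything in it is an assumption made for the example and not Krane's actual code: `RecordingStatsD` is a stub standing in for a real StatsD client, `SketchDeployment` is a stripped-down model, and `report_timeout`, the `progress_failure` tag, and the `sketch.timeout` metric name are hypothetical.

```ruby
# Hypothetical stand-in for a StatsD client. It only records calls so the
# sketch runs without any gem installed.
class RecordingStatsD
  attr_reader :calls

  def initialize
    @calls = []
  end

  def increment(name, value = 1, tags: {})
    @calls << { name: name, value: value, tags: tags }
  end
end

# Simplified model for illustration; not Krane's Deployment class.
class SketchDeployment
  def initialize(context:, namespace:, failing_to_progress:)
    @context = context
    @namespace = namespace
    @failing_to_progress = failing_to_progress
  end

  def deploy_failing_to_progress?
    @failing_to_progress
  end

  # In this sketch a timeout is reported only for a progress failure.
  def deploy_timed_out?
    deploy_failing_to_progress?
  end

  # The model exposes tags but never talks to StatsD itself.
  def statsd_tags
    tags = { context: @context, namespace: @namespace }
    tags[:progress_failure] = true if deploy_failing_to_progress?
    tags
  end
end

# Hypothetical runner-side hook: the only place that emits the metric.
def report_timeout(resource, statsd)
  return unless resource.deploy_timed_out?

  statsd.increment('sketch.timeout', 1, tags: resource.statsd_tags)
end
```

As noted in the comment above, a tag like this would still not distinguish a fail-fast from an initial progressing failure; it only marks that the timeout came from the progress condition.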

rdji123 added 4 commits Nov 19, 2020
if progress_condition.present?
  StatsD.client.increment('kubectl.error', 1, tags: statsd_tags)
end
Comment on lines 94 to 96

timothysmith0609 Nov 19, 2020
Contributor

I'm confused, why are we incrementing kubectl.error?

rdji123 Nov 19, 2020
Author Contributor

I wanted to use an existing metric in Krane. Do you have any suggestions as to which metric we can use? We can also create a new metric for this.

timothysmith0609 Nov 19, 2020
Contributor

I don't think there's a metric we have that's quite suitable. Perhaps I'm ignorant, but is there some reason against using a bespoke metric?

rdji123 Nov 19, 2020
Author Contributor

No reason against it! I've updated the metric to a new one 👍

rdji123 added 2 commits Nov 19, 2020
@@ -91,6 +91,10 @@ def deploy_timed_out?
  return false if deploy_failed?
  return super if timeout_override

  if progress_condition.present?
    StatsD.client.increment('fail_fast', 1, tags: statsd_tags)

ayatsynych Nov 19, 2020

Might be a good idea to keep the kubectl prefix, so it's clear that this metric is specific to kubectl and is easier to discover.

3 participants