Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upGitHub is where the world builds software
Millions of developers and companies build, ship, and maintain their software on GitHub — the largest and most advanced development platform in the world.
Fast fail metric #764
Fast fail metric #764
Conversation
| if progress_condition.present? | ||
| StatsD.client.increment('kubectl.error', 1, tags: { context: context, namespace: namespace, | ||
| progress_condition: deploy_failing_to_progress? }) | ||
| end |
timothysmith0609
Nov 17, 2020
Contributor
I don't really like the idea of instrumenting the resource models, themselves, and would prefer if we could move that to the actual task runner, where such instrumentation makes more sense. It's worth noting we already capture timeout errors and publish them via StatsD (see https://shopify.datadoghq.com/dashboard/5kc-557-amd/krane--kubernetes-deploy-dash?from_ts=1605624831373&live=true&to_ts=1605628431373).
Alternatively, since we are constrained by the KubernetesResource interface, could we add a statsd_tag in Deployment#deploy_timed_out? when deploy_failing_to_progress? is true to indicate an actual progressing failure? 🤔 It would be hard to know if it's a fail-fast or an initial progressing failure, though.
I don't really like the idea of instrumenting the resource models, themselves, and would prefer if we could move that to the actual task runner, where such instrumentation makes more sense. It's worth noting we already capture timeout errors and publish them via StatsD (see https://shopify.datadoghq.com/dashboard/5kc-557-amd/krane--kubernetes-deploy-dash?from_ts=1605624831373&live=true&to_ts=1605628431373).
Alternatively, since we are constrained by the KubernetesResource interface, could we add a statsd_tag in Deployment#deploy_timed_out? when deploy_failing_to_progress? is true to indicate an actual progressing failure?
| if progress_condition.present? | ||
| StatsD.client.increment('kubectl.error', 1, tags: statsd_tags) | ||
| end |
timothysmith0609
Nov 19, 2020
Contributor
I'm confused, why are we incrementing kubectl.error?
I'm confused, why are we incrementing kubectl.error?
rdji123
Nov 19, 2020
Author
Contributor
I wanted to use an existing metric in Krane. Do you have any suggestions as to which metric we can use? We can also create a new metric for this
I wanted to use an existing metric in Krane. Do you have any suggestions as to which metric we can use? We can also create a new metric for this
timothysmith0609
Nov 19, 2020
Contributor
I don't think there's a metric we have that's quite suitable. Perhaps I'm ignorant, but is there some reason against using a bespoke metric?
I don't think there's a metric we have that's quite suitable. Perhaps I'm ignorant, but is there some reason against using a bespoke metric?
rdji123
Nov 19, 2020
Author
Contributor
No reason against it! I've updated the metric to a new one 👍
No reason against it! I've updated the metric to a new one
| @@ -91,6 +91,10 @@ def deploy_timed_out? | |||
| return false if deploy_failed? | |||
| return super if timeout_override | |||
|
|
|||
| if progress_condition.present? | |||
| StatsD.client.increment('fail_fast', 1, tags: statsd_tags) | |||
ayatsynych
Nov 19, 2020
might be a good idea to keep the kubectl prefix, so it's clear that this metric is specific to kubectl and it's easier to discover
might be a good idea to keep the kubectl prefix, so it's clear that this metric is specific to kubectl and it's easier to discover
What are you trying to accomplish with this PR?
Adding statsd data to get a better understanding of fast fail occurrences.
cc @Shopify/pipeline