Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of ranges leases are transferred over #65659

Open
joshimhoff opened this issue May 25, 2021 · 5 comments · May be fixed by #68488

Comments

@joshimhoff
Copy link
Collaborator

@joshimhoff joshimhoff commented May 25, 2021

Is your feature request related to a problem? Please describe.
Here are some example logs:

I210511 17:06:12.977887 1 cli/start.go:821 ⋮ initiating graceful shutdown of server
...
I210511 17:07:06.103135 18255896 server/drain.go:174 ⋮ [server drain process] drain remaining: 3
I210511 17:07:06.103174 18255896 server/drain.go:176 ⋮ [server drain process] drain details: range lease iterations: 3
I210511 17:07:07.360491 171 server/status/runtime.go:525 ⋮ [n1] runtime stats: 728 MiB RSS, 472 goroutines, 93 MiB/202 MiB/177 MiB GO alloc/idle/total, 350 MiB/452 MiB CGO alloc/total, 0.1 CGO/sec, 2.1/1.0 %(u/s)time, 0.0 %gc (0x), 358 KiB/1.0 MiB (r/w)net
I210511 17:07:07.978903 18255897 cli/start.go:831 ⋮ 26 running tasks
E210511 17:07:12.121121 18258961 server/server.go:1901 ⋮ [n1,client=‹10.0.2.129:53070›] serving SQL client conn: server is not accepting clients
I210511 17:07:12.978900 18255897 cli/start.go:831 ⋮ 26 running tasks

Above logs show 1m passing without graceful drain finishing (without all range lease transfers moving over), leading to some impact during update.

More detailed logs might help with debugging. It's not clear what is going wrong from the above logs.

Describe the solution you'd like

I expect it would help debugging if we logged more verbose info once we got down to N=10 or so range leases still held on the node. We could log:

  1. Range IDs, min & max keys, etc. of all ranges still leased by node.
  2. Range lease transfer attempt info. Log when a transfer attempt is made, log when error is returned if any, & log showing how long blocked on transfer attempt. (Tracing would help here too.)

Describe alternatives you've considered
We could do nothing, relying on operators using vmodule to get the additional verbosity when needed.

Additional context
On CC, graceful drain doesn't always finish before SIGKILL time. We want to fix this.

Epic: CRDB-8501

@knz
Copy link
Member

@knz knz commented May 25, 2021

NB: the logging you're requesting is already available via the vmodule setting.
Are you suggesting we should promote these log messages to be unconditional?

@knz knz added this to Incoming in KV Backlog via automation May 25, 2021
@knz knz changed the title Do more verbose logging of graceful drain range lease transfer, once the majority of ranges leases are transferred over kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of ranges leases are transferred over May 25, 2021
@joshimhoff
Copy link
Collaborator Author

@joshimhoff joshimhoff commented May 25, 2021

That is my suggestion but ONLY once we get down to N=10 (not sure what N should be) or so range leases still held on the node. Do you think that make sense? I think there is no reason to not be very verbose once you are down to a small number of ranges. What do you think about that?

@knz
Copy link
Member

@knz knz commented May 25, 2021

I'm not opposed to it.

@joshimhoff
Copy link
Collaborator Author

@joshimhoff joshimhoff commented May 25, 2021

Ack. I think how useful it would be depends on the scope of the issue we see on CC. I can get more info about scope without this change. I can post here when I have a sense of scope.

@cameronnunez
Copy link
Collaborator

@cameronnunez cameronnunez commented Jul 29, 2021

@tbg has suggested that we can use a verbose condition, since draining could stall/encounter errors at a point where the number of remaining leases is greater than N. We can set verbose to true when the number of remaining leases does not decrease (or worse increases) with graceful drain calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment