-
Notifications
You must be signed in to change notification settings - Fork 18.8k
libnet: Controller: cache networks & endpoints in-memory #49736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @@ -0,0 +1,11 @@ | |||
| package maputil | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI is failing because of that:
| package maputil | |
| //go:build go1.22 | |
| package maputil |
99867d9 to
40f7c61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Just some suggestions/nits ... and should it have a changelog comment, saying this might solve a long-standing issue with a "has active endpoints" on network deletion?
libnetwork/sandbox_store.go
Outdated
| } | ||
| } | ||
| } | ||
| c.cacheEndpoint(ep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like getEndpointFromStore will already have cached it, so this should be moved to the if err != nil block above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes! That's where it was originally 🙈
libnetwork/endpoint_store.go
Outdated
| // maintained by the Controller. | ||
| // | ||
| // This method is thread-safe. | ||
| func (c *Controller) deleteEndpoint(ep *Endpoint) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name might be a bit misleading - deleteEndpoint doesn't actually delete the endpoint, it forgets it ... the caller should already have deleted it from the Sandbox etc? Perhaps deleteStoredEndpoint or something like that?
At some point, it might be good to put the cache in a separate package to protect its map/mutex, then the call would be something like c.epStore.Delete(ep) would also work. But, as you noted offline, that'd be a bigger job because Endpoint/Network would create a circular dependency, or the cache would need to be implemented with a generic.
(upsertEndpoint is clearer, but ditto for deleteNetwork.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed upsertEndpoint into storeEndpoint, and deleteEndpoint into deleteStoredEndpoint (and followed the same pattern for networks).
|
|
||
| package maputil | ||
|
|
||
| func FilterValues[K comparable, V any](in map[K]V, fn func(V) bool) []V { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callers only want a count, the slice always gets discarded after a len() ... so maybe CountValues, and countEndpoints/countNetworks - and we can add findXYZ funcs if it turns out they're needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you could get rid of these warnings for findEndpoints/findNetworks, because there'd be no danger ...
// This method is thread-safe, but do not use it unless you're sure your code
// uses the returned endpoints in thread-safe way (see the comment on
// Controller.endpoints).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we can add findXYZ funcs if it turns out they're needed?
I didn't want to change ActiveEndpointsError in this PR to make clear it's focused on data persistence.
But once this one is merged, I'll open another PR to change it, and findNetworks / findEndpoints will be needed. So, to avoid needless code churn, I think it's fine to add these here.
e49ce34 to
baf40a2
Compare
The `Controller`'s store is used by: - `deleteFromStore` - `getEndpointFromStore` - `getEndpointsFromStore` - `updateToStore` - … and other methods that can't store / delete / retrieve an Endpoint Calls to `updateToStore` and `deleteFromStore` have been replaced with `upsertEndpoint` and `deleteEndpoint`. Both `getEndpointFromStore` and `getEndpointsFromStore` call `cacheEndpoint` to ensure endpoints loaded from the datastore are kept in-memory. Finally, `sandboxRestore` was instantiating `Endpoint` itself. These are cached too. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The `Controller`'s store is used by: - `deleteFromStore` - `getNetworks` - `getNetworksFromStore` - `updateToStore` - … and other methods that can't store / delete / retrieve a Network Calls to `updateToStore` and `deleteFromStore` have been replaced with `upsertNetwork` and `deleteNetwork`. Both `getNetworks` and `getNetworksFromStore` call `cacheNetwork` to ensure networks loaded from the datastore are kept in-memory. Finally, `sandboxRestore` was instantiating `Network` itself. These are cached too. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
baf40a2 to
1169763
Compare
endpointCnt is a refcounter used to track how many endpoints use a network, and how many networks references a config-only network. It's stored separately from the network. This is only used to determine if a network can be removed. This commit removes the `endpointCnt` struct and all its references. The refcounter is replaced by two lookups in the newly introduced `networks` and `endpoints` caches added to the `Controller`. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
1169763 to
51d7f95
Compare
| assert.Equal(t, len(found), 2) | ||
| assert.Equal(t, found[0], ep1) | ||
| assert.Equal(t, found[1], ep2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use assert.Check
| assert.Equal(t, len(found), 1) | ||
| assert.Equal(t, found[0], ep2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| "gotest.tools/v3/assert" | ||
| ) | ||
|
|
||
| func TestEndpointStore(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just hit some flakiness:
https://github.com/moby/moby/actions/runs/14308075196/job/40096446556?pr=49746#step:7:2040
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @akerouanton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, hit some flakiness in #49762 (comment) as well
networknamehas error has active endpoints #42119- What I did
EndpointandNetworkin libnetController.--config-fromnetwork, or endpoint is referencing a network before deleting it.endpointCntstruct.Libnetwork is afflicted by long-standing data consistency issues caused by its persistence layer creating new instances of
Network,EndpointandSandboxevery time an object is loaded from the datastore or its in-memory cache. See for instance the following code that ends with aCopyTo:moby/libnetwork/datastore/cache.go
Lines 134 to 149 in 2e92272
For instance, this led to #47186.
Fixing the data persistence layer itself is particularly hard as it touches virtually all code paths in libnetwork. We need to be especially cautious with mutex locking -- libnetwork makes heavy use of mutexes, but since two concurrent API calls referencing the same object result in to two different instances of that object living in memory, the concurrency model might be broken (unknowingly).
By maintaining a separate cache of all objects in libnet's
Controller, we'll be able to slowly iterate on all code paths, and verify each one through static analysis, without touching the data persistence layer.This PR specifically adds an
Endpointand aNetworkcache to theController. This allows us to delete the structendpointCntwhich was used to ref count networks (either when they were referenced byconfig-fromnetworks, or by endpoints) to prevent deleting networks in use.Instead,
(*Network).delete()looks for endpoints /config-fromnetworks that reference the network being deleted from the cache maintained in theController.- How I did it
See individual commits.
- How to verify it
Through unit and integration tests.
- Human readable description for the release notes
- Improve how network-endpoint relationships are tracked to reduce the chance of the "has active endpoints" error to be wrongfully returned.