Runtime Workers with state. by cheme · Pull Request #7687 · paritytech/substrate

cheme · 2020-12-07T18:39:47Z

This is a draft pr with some code around runtime workers.

The goal is to be able to access and later modify state from workers.

Design principles are:

can run inline (without using threads)
panic in worker need to panic parent worker unless the worker got 'dismiss' (or there was no call to 'join').
allow different execution mode (at this point we got only some read access but it already put in place the general
design: see failure on transaction drop for 'ReadAt' mode).

Please note, that if we do force worker to Always join (and panic on a worker that did not join), we could considerably simplify implementation:

could record proof with a single proof recorder.
no need for inline processing to run lazily.
no dismiss call and no need to even consider dismiss handling.

I think this could be a safe design principle but it also mean that we need to panic on drop from state-machine (not really difficult, just rather impacting and not really aligned with previous runtime worker pr \cc @NikVolf ).

Current modes of execution

This PR defines multiple state access mode, as a good way to start a discussion.
In practice only a few are really interesting.
Those are define in sp_exrternalities/src/lib.rs .

Note that in no case the child worker can see changes done by the parent worker during
execution, keeping thing deterministic that way (AtSpawn variant therefore can run incoherent code).

Stateless

The existing worker type, cannot access state, spawning a worker of a different
type is not doable.

ReadLastBlock

Read only access to previous block state, not usefull, but test the backend access
to state.

ReadAtSpawn

Read only access to the existing state when calling 'spawn'.

As all next worker state access mode:
Simply add the current state-machine changes in front of the state.
Involves change copy that depends on the number of running change (no share state).
Ensure that the transaction on join is the transaction at spawn or a transaction stacked
over it (see overlaychange/markers.rs).

ReadOptimistic

Will logs child worker read access and parent worker write access.
On join if there is conflict between both access no result is returned (and
user have to run worker without concurrency on its own).
see overlaychange/loggers.rs

So this should be default as we ensure everything did run over a coherent state (since change in parent
worker don't overlap with child accesses).

ReadDeclarative

Declare (single key or prefix) child worker read access on spawn.
If child worker access a variable out of the declaration we can panic or return None on join as for optimistic (in this case there is no early exit :().

see overlaychange/filters.rs

WriteAtSpawn

ReadWrite access to state as defined at spawn.
In case of conflict the changes from parent worker are overwritten by child worker changes on join.
So this is deterministic but can result in some incoherent changes.

WriteLightOptimistic

In this case we only forbid to write both in parent and child.
So it can still be incoherent regarding read access.

WriteLightDeclarative

We need to declare child write accesses.

WriteOptimistic

To be coherent, we check parent read against child write, parent write against child write and parent write against child read.

This is maybe the only usefull mode.

WriteDeclarative

We need to declare child read and write accesses.

TODO list

[x] Reuse worker instance (only locally to an execution), and add some capacity limit.
We simply added with multiple consumer and a single sender.
Note that we reuse existing worker ids to manage that queue. For this reason, now native also uses these ids
and cannot be call directly without externalities.
[x] Allow running without any thread creation (what happen when capacity limit is reach).
[x] Different worker capability (the current model got a caveat: nested call cannot spawn worker with higher capability).
The read mode over previous block state is more here to test and should be remove I think.
https://github.com/cheme/substrate/blob/cba0105459699e43813ca8d9b78b13b330e97ff5/primitives/tasks/src/lib.rs#L74
[x] Access storage from worker.
[x] Ability to use with cumulus (see substrate runtime test). The 'spawn' implementation do not create threads but
run the process lazilly by casting back the wasm fn pointer. Also note that we do not catch panic (this explains
that I don't consider silencing panic result from workers).
https://github.com/cheme/substrate/blob/cba0105459699e43813ca8d9b78b13b330e97ff5/test-utils/runtime/src/lib.rs#L1206
[x] Create draft pr.
[] Read again and update docs (some late design change may not be in the docs).
[x] Read only with access restricted to untouched values
(join returning none on concurrent access)
[x] Read only with access restricted to declared values/prefix
(panicking version, parent panic on write of such values, child panic on write of undeclared values)
[x] Write with access/write restricted to untouched values
[x] Write only with access/write restricted to declared values/prefix
[] Rewrite optimistic logger for performance
[] Allow to spawn additional worker type from main ones. Especially 'declared' from their 'optimistic' variant and the other way around (need to avoid runing optimistic at all in some case). see function 'guard_compatible_child_workers' for current support.
[] Try to find a better way to handle unsafe call to externality: the fact that 'spawn' is an externalities extension
that itself uses externalities is currently using an unsafe call. In practice it is only an issue if the code does modify
the extension (unregistering especially). So another idea would be to encapsulate it in an ext implementation
with locked access to register/unregister extension for a specific TypeId.
[] Restore tests from sp_tasks: the native spawn now need to query the the spawn extension from externalities (it
was less code). Core issue here is that the default implementation is at 'client' level and moving it in 'primitives'
requires moving some wasm trait from 'client common' which currently got error with many deps. I initially did it
but code got quickly hard to read (lot of no_std caveats).
[] Configurable capacity (number of worker per execution), currently we do not really limit worker (but mechanism are in place
to be able to).
[] write more test on capacity management, especially trying to deadlock (currently we reduce nested capacity by the depth of
nested calls, (seems sound, but at some point I did consider it unsound and can't remember why)).
[] execution strategy both is not really doing what I would like in 'test-utils': find a way to test native without switching to wasm on error.
[] more tests
[] consider allowing more externalities from worker: root should be doable but this would leak hash trait everywhere, at
this point it does not seems worth it, same issue with 'storage_hash'.
[] Concurrent read change overlay: at this point change overlay is cloned on spawn.
I will probably skip this point because the design I have in mind is similar to the transactional design,
that wasn't merge in a previous pr and this will got the same issue (a bit more since it is slightly more
complex due to the fact that the index will be two dimensions). Still very doable.
[] dyn_clonable should work with no_std
[] find a way to kill running threads with dismiss: this is a bit out of scope of this PR, but it is really not easy.
At this point dismiss is still useful to avoid running queued worker that did not start when a thread did panic.
Actually that may be useless in this case (panic does panic parent worker).
It is just usefull when we call it from runtime then (eg some result in parent thread tells me the worker result is
not needed anymore).
So could also consider dropping some complexity there (the dismiss handler). It is already behind 'abort-future' feature.

here).

associated type). Commit before switching to simply Clone.

here)

issues.

times.

externalities.

to externality (does panic). Also inline will be way more touchy to implement than expected, see comment on macro non send AsyncExt with enum handle.

(except spawn that will call directly the extension).

implement missing externalities function now that things are in place.

now marker need to get remove when joining or killing

Also consider spliting in two different types.

- writes access. - no backend containing overlay and primitive to directly access child worker externalities instead.

cheme · 2021-01-10T11:25:55Z

Looking at block processing, a possibly must have usage would be per extrinsic worker.
I mean from transaction pool, results of execution could contains strategy (access log for optimistic, and declaration for declarative),
and be use to order extrinsic optimistically to allow paralellism on extrinsic.

Allowing to build a block execution that with per extrinsic paralellism if possible.

But the current design is not compatible for it and would need to be extended.

Currently we only log or declare:

read access: not compatible with a parent or children write access.
write access: not compatible with a parent or children read access. Could be rename read_write.

Note that since write include read in this case, write accesses are also mutually exclusive.

In the extrinsic case, without too much analysis, there is two immediate issues:

the extrinsic number: a transient key use to access current extrinsic number.
the log list: a value containing a serialized ordered list of log event.

Both when use with the previous design requires write access and are use (or almost use) in all extrinsic,
making the usecase impossible.

So additional log (declaration) to allow that would be:

write only access: this access is imcompatible with any read access. But does no exclusive write access.
The write are done at 'join' like all writes, but since there is no read it is fine.
This should work for the extrinsic number, except for extrinsic that read these number.
In state-machine: 'set_storage' is inherently write only, but 'get_mut' would be read/write: probably need
a variant of readwrite that forbid read access (like writeonlyentry(&mut _)).
Local key: an alternative for extrinsic number would be to define key value that are scoped to a writer:
write at start and revert at the end. Looks like a better way to handle this case, but previous mode is
a sub mode of the one proposed for log and probably still relevant.
write only for append access: this define a new operation: 'append only', it work as write only access, but
also add the worker ids to its appended value and things get reordered on join to follow the workers ids order.
Simplier definition would be to use join order as append order.
write only for append access over a range of key: same as append in a single value but for an ordered range (maybe simply prefix) of key.
Locks read for the prefix and writes are done at join in join order.

For those cases I firstly thought of bufferizing then reordering writes depending on 'spawn' order but everything is easier when using 'join' order.
Still it puts a constraint when using api that seems less natural, so still an open question.
-> thinking twice 'call' seems more relevant in order to order freely 'join' (in practice 'join' relieve some constraint and should be call asap).

The question remains of keeping both 'write' and 'write only': both are redundant. But if we consider that 'write' usually
involve 'read' having a combination of 'write', 'read' and 'write_only' instead of 'write_only' and 'read' can make
sense (smaller declaration, not really smaller logs if we do not have separate logs). Still really implementation does not have
to follow accesslog or declaration definition so more compact encoding should be use.

TLDR; current semantic of write including read access seems too limiting, and write and read should probably be disjoint.
Also some write only append operation could be buffered and reordered depending on writer sequencing later.
Also a worker local key overwrite could be interesting (lock write access).

gavofyork · 2021-01-16T19:55:50Z

Please note, that if we do force worker to Always join

This is fine.

gavofyork · 2021-01-16T20:10:40Z

for the index and event logs, we would just have those be collected in-thread and passed out to the aggregator (the main thread) which would join and aggregate them manually. they wouldn't try to access storage for any of these globals directly.

cheme · 2021-01-16T22:32:53Z

Please note, that if we do force worker to Always join

This is fine.

this really sounds nice.

cheme · 2021-01-26T16:44:00Z

I am splitting this PR in multiple smaller one, first part is #7978 , closing this one.

cheme added 30 commits November 13, 2020 11:30

with_ext boundary issue

44f151d

variant of issue

825723d

using assert unwind safe (also could have use std mutex).

01406e2

use capacity to manage (not that recursive call is a bit hard to use

5b7a713

here).

Inline lazy execute when no capacity

1c01899

using function async_storage (trie storage being optionally clone to

0596f41

associated type). Commit before switching to simply Clone.

without associated type (already require Sized).

77cb0be

clone (would need derivative or substarte macor to avoid boilerplate

cfbf069

here)

checkpoint (dyn type still free)

2d017c1

actually first associated approach seems correct regarding lifetime

a719ed9

issues.

Getting somewhere, still the 'static bound on hasher is painfull.

9194150

Redesign of message loop, we possibly need a new externalities each

3fd84c0

times.

Way to get some backend, require to actual add access for it in

ac16cd9

externalities.

early kill needs to be call from externality to avoid reentrant access

e436091

to externality (does panic). Also inline will be way more touchy to implement than expected, see comment on macro non send AsyncExt with enum handle.

small fixes

c3ed6aa

Renaming

15a248b

first basic test

42ed546

minor fixes

73e2d42

fix test to register the extension and move native code to use sp_io

e3151b1

(except spawn that will call directly the extension).

small but very relevant fixes

e066faf

test for inline (actually inline ext very missing here)

54f1471

no_std stuff

79b39d1

test working for paralell, not inline (need to call with ext), next

62189d9

implement missing externalities function now that things are in place.

boilerplate read, put all that requires hash currently out of scope.

860acea

Inline call with externalities too.

7d1dbc0

fix extension registering issue.

b561627

now marker need to get remove when joining or killing

comment about fix to do for failing substrate-test-runtime test.

7d69e4a

Using worker result, next put it in task interface.

a958bd3

Also consider spliting in two different types.

Worker result before splitting.

676bfff

switch to simply returning option on join

3ea4770

cheme added 14 commits January 3, 2021 13:06

very costy extract delta version

887f71c

clearer struct

8192220

Drain implementation that makes more sense (less cloning).

133b9cb

few tx tests

4e03706

support missing end interval

564f0ca

issue with radix iter

afd6a1e

update dep to fix tests

5071648

restore contiguous logic and test

1f8ab17

Merge branch 'master' into threads_draft

4a0906b

testing start.

61b7cb0

tests and radix-tree fixes

0110d77

few doc changes

d2ad3ca

Merge branch 'master' into threads

88a879b

Squashed commit of the following:

f697e11

- writes access. - no backend containing overlay and primitive to directly access child worker externalities instead.

cheme added 5 commits January 15, 2021 09:10

Merge branch 'threads' into threads_draft

01b788e

No worker for basic externalities.

a468ae4

read only resolving on join.

659a2a9

Merge branch 'master' into threads_draft

3b1da9a

started renaming and added write only defs.

df10cf7

cheme added 4 commits January 18, 2021 17:46

missing renames

691aea3

Make AtJoin variants the default ones.

7a34d8c

remove pending changes

7208fd2

few renaming and move async backend to state machine

f390fce

cheme mentioned this pull request Jan 25, 2021

Runtime workers local pools. #7978

Closed

cheme closed this Jan 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runtime Workers with state.#7687

Runtime Workers with state.#7687
cheme wants to merge 138 commits intoparitytech:masterfrom
cheme:threads

cheme commented Dec 7, 2020 •

edited

Loading

Uh oh!

cheme commented Jan 10, 2021

Uh oh!

gavofyork commented Jan 16, 2021

Uh oh!

gavofyork commented Jan 16, 2021

Uh oh!

cheme commented Jan 16, 2021 •

edited

Loading

Uh oh!

cheme commented Jan 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cheme commented Dec 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current modes of execution

Stateless

ReadLastBlock

ReadAtSpawn

ReadOptimistic

ReadDeclarative

WriteAtSpawn

WriteLightOptimistic

WriteLightDeclarative

WriteOptimistic

WriteDeclarative

TODO list

Uh oh!

cheme commented Jan 10, 2021

Uh oh!

gavofyork commented Jan 16, 2021

Uh oh!

gavofyork commented Jan 16, 2021

Uh oh!

cheme commented Jan 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cheme commented Jan 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cheme commented Dec 7, 2020 •

edited

Loading

cheme commented Jan 16, 2021 •

edited

Loading