Affinity support in Thor
Description
From various pieces of testing, processor affinity can make a significant difference to the speed of multi-threaded systems.
We should consider having an option that binds a slave process to a particular socket if there are more slaves than sockets enabled on the system. (Possibly another, separate option to restrict to the cores within a socket if there is more than one process per socket.) If we could also use the correctly associated memory, that would provide further advantages.
The disadvantage is that sort (and other cpu-intensive operations) wouldn't be able to use all the cpus. In practice that is likely to lead to overcommitted cpus anyway.
The most efficient setup may be 2 (i.e. the number of sockets) slave processes per node, and multiple slaves per slave process.
Activity
Gavin Halliday November 25, 2015 at 2:26 PM
BTW my issue14522 branch contains my code so far. If we decide to move it to the init scripts then I'll hand over to Mark to implement.
I have added an extra parameter passed from the script - it made the code simpler, and meant it could be done much earlier.
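
A minimal sketch (not taken from the issue14522 branch) of the approach described here: the start-up script passes a cpu list as an extra parameter and the process applies it at the very top of main(), before any significant allocation or thread creation. The --affinity=<cpu-list> option name and the parsing below are hypothetical placeholders.

// Minimal sketch, Linux-specific: apply a cpu list such as "0-7,16-23"
// passed by the start-up script, as early as possible in main() so that
// later allocations and threads inherit the restricted mask.
#include <sched.h>      // cpu_set_t, CPU_SET, sched_setaffinity
#include <cstdio>
#include <cstring>
#include <string>
#include <sstream>

// Parse a Linux-style cpu list ("0-7,16-23") into a cpu_set_t.
static bool parseCpuList(const char *list, cpu_set_t &mask)
{
    CPU_ZERO(&mask);
    std::stringstream ss(list);
    std::string range;
    while (std::getline(ss, range, ','))
    {
        unsigned lo, hi;
        int matched = sscanf(range.c_str(), "%u-%u", &lo, &hi);
        if (matched == 1)
            hi = lo;                       // single cpu, e.g. "5"
        else if (matched != 2)
            return false;                  // malformed entry
        for (unsigned cpu = lo; cpu <= hi && cpu < CPU_SETSIZE; cpu++)
            CPU_SET(cpu, &mask);
    }
    return CPU_COUNT(&mask) != 0;
}

int main(int argc, char *argv[])
{
    // Hypothetical option name; the real parameter passed by the script may differ.
    for (int i = 1; i < argc; i++)
    {
        if (strncmp(argv[i], "--affinity=", 11) == 0)
        {
            cpu_set_t mask;
            if (!parseCpuList(argv[i] + 11, mask) ||
                sched_setaffinity(0, sizeof(mask), &mask) != 0)   // 0 = this process
            {
                perror("sched_setaffinity");
                return 1;
            }
        }
    }
    // ... normal process start-up continues here; threads created from now
    // on inherit the restricted cpu mask.
    return 0;
}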
Jacob Cobbett-Smith November 25, 2015 at 2:02 PM
I was worried about 3rd-party libs that allocate, spin off threads etc. on the wrong socket, e.g. JVM initialization.
Is that a concern?
If we are trying to keep everything cleanly separated, setting affinity via the scripts may be cleanest.
Gavin Halliday November 25, 2015 at 1:38 PM
I have this working, but want to double check the question I asked earlier....
If we set the process affinity after it has started running, any heap memory (or anything else) that has been allocated before that point may be associated with the wrong socket. Is this significant and does this mean that it would be better to implement it in the script, rather than once the process has started?
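
To illustrate the concern (purely as a sketch, not part of the Thor code): Linux normally places a page on the NUMA node of the thread that first touches it, so memory touched before the binding can remain remote. The example below, which assumes libnuma is available (link with -lnuma), touches an allocation, binds the current thread to a node afterwards, and then uses move_pages() in query mode to report where those pages actually live.

// Illustrative sketch: show which NUMA node the pages of an early allocation
// ended up on, after binding the thread to a node later on.
#include <numa.h>       // numa_available, numa_run_on_node, numa_set_localalloc
#include <numaif.h>     // move_pages
#include <unistd.h>     // sysconf
#include <cstdio>
#include <cstring>
#include <vector>

// Query (without moving) the NUMA node of each page in [base, base+bytes).
static void reportNodeOfPages(void *base, size_t bytes)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    size_t nPages = (bytes + pageSize - 1) / pageSize;
    std::vector<void *> pages(nPages);
    std::vector<int> status(nPages);
    for (size_t i = 0; i < nPages; i++)
        pages[i] = (char *)base + i * pageSize;
    // With a null 'nodes' argument, move_pages only queries page placement.
    if (move_pages(0, nPages, pages.data(), nullptr, status.data(), 0) == 0)
        for (size_t i = 0; i < nPages; i++)
            printf("page %zu -> node %d\n", i, status[i]);
}

int main()
{
    if (numa_available() < 0)
        return 1;                          // not a NUMA system

    // Memory allocated and touched before any binding: first touch decides its node.
    size_t sz = 4 * 1024 * 1024;
    char *early = new char[sz];
    memset(early, 0, sz);

    int node = numa_max_node();            // pick some node to bind to, as an example
    numa_run_on_node(node);                // restrict this thread's cpus to that node
    numa_set_localalloc();                 // future allocations prefer the local node

    reportNodeOfPages(early, sz);          // may well report a different node
    delete[] early;
    return 0;
}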
Gavin Halliday November 24, 2015 at 4:14 PM
This is the suggestion so far - add the following global options:
affinity=<cpu-list>
If specified, each slave in the cluster will have the same affinity. Takes precedence over autoNodeAffinity.
autoNodeAffinity=true/false
Default true. If there is more than one slave per machine and more than one NUMA node/socket, bind each slave to a subset of the NUMA nodes, based on the slave number. (Typically each slave will be bound to a single NUMA node.)
autoNodeAffinityNodes=<optional-list-of-nodes>
If specified, autoNodeAffinity works on the subset of the NUMA nodes in the list. This allows one Thor instance to run on nodes 0,1 and a different Thor instance to run on nodes 2,3 (on a 4-socket system). Not likely to be used for x86 yet, but might make sense for POWER8.
numaBindLocal=true/false
Default true. Only allow allocations from the node that the process is bound to.
Does this sound like a reasonable set of options?
(We could also have an autoCoreAffinity to bind to a subset of the cores on a node, but I don't currently see a requirement for that.)
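
A rough sketch (using libnuma, link with -lnuma) of how the autoNodeAffinity/numaBindLocal behaviour described above might be applied. The option names come from this list, but the node-selection rule (slave number modulo node count), the function shape and its parameters are assumptions rather than the actual implementation.

// Sketch only: bind a slave to a NUMA node derived from its slave number.
// slaveIndex (0-based within the machine) and slavesPerMachine are assumed
// to come from the Thor configuration.
#include <numa.h>
#include <cstdio>

static bool applyAutoNodeAffinity(unsigned slaveIndex, unsigned slavesPerMachine,
                                  bool numaBindLocal)
{
    if (numa_available() < 0)
        return false;                           // not a NUMA system - nothing to do
    int numNodes = numa_num_configured_nodes();
    if (numNodes <= 1 || slavesPerMachine <= 1)
        return false;                           // only applies with >1 slave and >1 node

    // Spread the slaves over the nodes; typically each slave lands on one node.
    int node = slaveIndex % numNodes;

    struct bitmask *nodeMask = numa_allocate_nodemask();
    numa_bitmask_setbit(nodeMask, node);
    numa_run_on_node_mask(nodeMask);            // restrict cpus to that node
    if (numaBindLocal)
        numa_set_membind(nodeMask);             // only allocate memory from that node
    numa_free_nodemask(nodeMask);

    printf("slave %u bound to NUMA node %d of %d\n", slaveIndex, node, numNodes);
    return true;
}

If a machine had more NUMA nodes than slaves, each slave could instead be given a subset of nodes (as autoNodeAffinityNodes suggests); the sketch sticks to the simple one-node-per-slave case.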
Mark Kelly November 23, 2015 at 2:29 PM
Measuring cache-to-cache transfer latency (ns)
Local Socket L2->L2 HIT latency 26.7
Local Socket L2->L2 HITM latency 30.4
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                          Reader NUMA socket
                              0        1
Writer NUMA socket   0        -      123.4
                     1      121.6      -

Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                          Reader NUMA socket
                              0        1
Writer NUMA socket   0        -       74.1
                     1       73.0      -
[Bring lines into L1/L2/L3 and then transfer control to another thread (which is either
running on another core on the same socket or on a different socket). This thread will read
the same data, which forces cache-to-cache transfers from the cache that already holds
these lines. We can measure both HIT (hitting clean lines) and HITM (hitting lines in
modified state) latencies by having the initial thread either just read the data into a
clean state or modify the data and keep it in M state.]
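
For context, a much-simplified sketch of the HITM-style measurement outlined in the note above: a writer thread dirties a cache line and a reader pinned to a chosen core times its accesses, forcing cache-to-cache transfers. The core numbers and iteration count are arbitrary assumptions, and the per-handoff figure includes spin overhead, so it only illustrates the local-socket vs remote-socket difference rather than reproducing the numbers above.

// Sketch of a HITM-style ping-pong: the writer keeps the line in M state,
// the reader's load forces a cache-to-cache transfer.  Compare readerCore on
// the writer's socket vs a remote socket to see the latency difference.
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static void pinToCore(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

alignas(64) static std::atomic<long> line(0);   // the contended cache line
alignas(64) static std::atomic<int> turn(0);    // 0 = writer's turn, 1 = reader's turn

int main()
{
    const int iters = 100000;
    const int writerCore = 0, readerCore = 1;   // assumptions: choose cores on the
                                                // same or on different sockets

    std::thread writer([&] {
        pinToCore(writerCore);
        for (int i = 0; i < iters; i++)
        {
            while (turn.load(std::memory_order_acquire) != 0) { }   // wait for our turn
            line.store(i, std::memory_order_relaxed);               // dirty the line (M state)
            turn.store(1, std::memory_order_release);
        }
    });

    pinToCore(readerCore);
    long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
    {
        while (turn.load(std::memory_order_acquire) != 1) { }       // wait for fresh data
        sink += line.load(std::memory_order_relaxed);               // forces the transfer
        turn.store(0, std::memory_order_release);
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    writer.join();

    // Rough per-handoff cost; includes the spin on 'turn' as well as the
    // transfer of 'line', so treat it as indicative only.
    printf("~%lld ns per handoff (sink=%ld)\n", (long long)(ns / iters), sink);
    return 0;
}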