| <HTML> | |
| <HEAD> | |
| <TITLE>Garbage collector scalability</TITLE> | |
| </HEAD> | |
| <BODY> | |
| <H1>Garbage collector scalability</h1> | |
| In its default configuration, the Boehm-Demers-Weiser garbage collector | |
| is not thread-safe. It can be made thread-safe for a number of environments | |
| by building the collector with the appropriate | |
| <TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation | |
| flag. This has primarily two effects: | |
| <OL> | |
| <LI> It causes the garbage collector to stop all other threads when | |
| it needs to see a consistent memory state. | |
| <LI> It causes the collector to acquire a lock around essentially all | |
| allocation and garbage collection activity. | |
| </ol> | |
| Since a single lock is used for all allocation-related activity, only one | |
| thread can be allocating or collecting at one point. This inherently | |
| limits performance of multi-threaded applications on multiprocessors. | |
| <P> | |
| On most platforms, the allocator/collector lock is implemented as a | |
| spin lock with exponential back-off. Longer wait times are implemented | |
| by yielding and/or sleeping. If a collection is in progress, the pure | |
| spinning stage is skipped. This has the advantage that uncontested and | |
| thus most uniprocessor lock acquisitions are very cheap. It has the | |
| disadvantage that the application may sleep for small periods of time | |
| even when there is work to be done. And threads may be unnecessarily | |
| woken up for short periods. Nonetheless, this scheme empirically | |
| outperforms native queue-based mutual exclusion implementations in most | |
| cases, sometimes drastically so. | |
| <H2>Options for enhanced scalability</h2> | |
| Version 6.0 of the collector adds two facilities to enhance collector | |
| scalability on multiprocessors. As of 6.0alpha1, these are supported | |
| only under Linux on X86 and IA64 processors, though ports to other | |
| otherwise supported Pthreads platforms should be straightforward. | |
| They are intended to be used together. | |
| <UL> | |
| <LI> | |
| Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to | |
| run the mark phase in parallel in multiple threads, and thus on multiple | |
| processors. The mark phase typically consumes the large majority of the | |
| collection time. Thus this largely parallelizes the garbage collector | |
| itself, though not the allocation process. Currently the marking is | |
| performed by the thread that triggered the collection, together with | |
| <I>N</i>-1 dedicated | |
| threads, where <I>N</i> is the number of processors detected by the collector. | |
| The dedicated threads are created once at initialization time. | |
| <P> | |
| A second effect of this flag is to switch to a more concurrent | |
| implementation of <TT>GC_malloc_many</tt>, so that free lists can be | |
| built, and memory can be cleared, by more than one thread concurrently. | |
| <LI> | |
| Building the collector with -DTHREAD_LOCAL_ALLOC adds support for thread | |
| local allocation. It does not, by itself, cause thread local allocation | |
| to be used. It simply allows the use of the interface in | |
| <TT>gc_local_alloc.h</tt>. | |
| <P> | |
| Memory returned from thread-local allocators is completely interchangeable | |
| with that returned by the standard allocators. It may be used by other | |
| threads. The only difference is that, if the thread allocates enough | |
| memory of a certain kind, it will build a thread-local free list for | |
| objects of that kind, and allocate from that. This greatly reduces | |
| locking. The thread-local free lists are refilled using | |
| <TT>GC_malloc_many</tt>. | |
| <P> | |
| An important side effect of this flag is to replace the default | |
| spin-then-sleep lock to be replace by a spin-then-queue based implementation. | |
| This <I>reduces performance</i> for the standard allocation functions, | |
| though it usually improves performance when thread-local allocation is | |
| used heavily, and thus the number of short-duration lock acquisitions | |
| is greatly reduced. | |
| </ul> | |
| <P> | |
| The easiest way to switch an application to thread-local allocation is to | |
| <OL> | |
| <LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, | |
| and then include the <TT>gc.h</tt> | |
| header in each client source file. | |
| <LI> Invoke <TT>GC_thr_init()</tt> before any allocation. | |
| <LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, | |
| and/or <TT>GC_GCJ_MALLOC</tt>. | |
| </ol> | |
| <H2>The Parallel Marking Algorithm</h2> | |
| We use an algorithm similar to | |
| <A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by | |
| Endo, Taura, and Yonezawa</a> at the University of Tokyo. | |
| However, the data structures and implementation are different, | |
| and represent a smaller change to the original collector source, | |
| probably at the expense of extreme scalability. Some of | |
| the refinements they suggest, <I>e.g.</i> splitting large | |
| objects, were also incorporated into out approach. | |
| <P> | |
| The global mark stack is transformed into a global work queue. | |
| Unlike the usual case, it never shrinks during a mark phase. | |
| The mark threads remove objects from the queue by copying them to a | |
| local mark stack and changing the global descriptor to zero, indicating | |
| that there is no more work to be done for this entry. | |
| This removal | |
| is done with no synchronization. Thus it is possible for more than | |
| one worker to remove the same entry, resulting in some work duplication. | |
| <P> | |
| The global work queue grows only if a marker thread decides to | |
| return some of its local mark stack to the global one. This | |
| is done if the global queue appears to be running low, or if | |
| the local stack is in danger of overflowing. It does require | |
| synchronization, but should be relatively rare. | |
| <P> | |
| The sequential marking code is reused to process local mark stacks. | |
| Hence the amount of additional code required for parallel marking | |
| is minimal. | |
| <P> | |
| It should be possible to use generational collection in the presence of the | |
| parallel collector, by calling <TT>GC_enable_incremental()</tt>. | |
| This does not result in fully incremental collection, since parallel mark | |
| phases cannot currently be interrupted, and doing so may be too | |
| expensive. | |
| <P> | |
| Gcj-style mark descriptors do not currently mix with the combination | |
| of local allocation and incremental collection. They should work correctly | |
| with one or the other, but not both. | |
| <P> | |
| The number of marker threads is set on startup to the number of | |
| available processors (or to the value of the <TT>GC_NPROCS</tt> | |
| environment variable). If only a single processor is detected, | |
| parallel marking is disabled. | |
| <P> | |
| Note that setting GC_NPROCS to 1 also causes some lock acquisitions inside | |
| the collector to immediately yield the processor instead of busy waiting | |
| first. In the case of a multiprocessor and a client with multiple | |
| simultaneously runnable threads, this may have disastrous performance | |
| consequences (e.g. a factor of 10 slowdown). | |
| <H2>Performance</h2> | |
| We conducted some simple experiments with a version of | |
| <A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to | |
| run multiple concurrent client threads in the same address space. | |
| Each client thread does the same work as the original benchmark, but they share | |
| a heap. | |
| This benchmark involves very little work outside of memory allocation. | |
| This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine | |
| under Linux 2.2.12. | |
| <P> | |
| Running with a thread-unsafe collector, the benchmark ran in 9 | |
| seconds. With the simple thread-safe collector, | |
| built with <TT>-DLINUX_THREADS</tt>, the execution time | |
| increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. | |
| (The times for the <TT>malloc</tt>/i<TT>free</tt> version | |
| with glibc <TT>malloc</tt> | |
| are 10.51 (standard library, pthreads not linked), | |
| 20.90 (one thread, pthreads linked), | |
| and 24.55 seconds respectively. The benchmark favors a | |
| garbage collector, since most objects are small.) | |
| <P> | |
| The following table gives execution times for the collector built | |
| with parallel marking and thread-local allocation support | |
| (<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested | |
| the client using either one or two marker threads, and running | |
| one or two client threads. Note that the client uses thread local | |
| allocation exclusively. With -DTHREAD_LOCAL_ALLOC the collector | |
| switches to a locking strategy that is better tuned to less frequent | |
| lock acquisition. The standard allocation primitives thus peform | |
| slightly worse than without -DTHREAD_LOCAL_ALLOC, and should be | |
| avoided in time-critical code. | |
| <P> | |
| (The results using <TT>pthread_mutex_lock</tt> | |
| directly for allocation locking would have been worse still, at | |
| least for older versions of linuxthreads. | |
| With THREAD_LOCAL_ALLOC, we first repeatedly try to acquire the | |
| lock with pthread_mutex_try_lock(), busy_waiting between attempts. | |
| After a fixed number of attempts, we use pthread_mutex_lock().) | |
| <P> | |
| These measurements do not use incremental collection, nor was prefetching | |
| enabled in the marker. We used the C version of the benchmark. | |
| All measurements are in elapsed seconds on an unloaded machine. | |
| <P> | |
| <TABLE BORDER ALIGN="CENTER"> | |
| <TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th> | |
| <TH>2 marker threads (secs.)</th></tr> | |
| <TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td> | |
| <TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td> | |
| </table> | |
| <PP> | |
| The execution time for the single threaded case is slightly worse than with | |
| simple locking. However, even the single-threaded benchmark runs faster than | |
| even the thread-unsafe version if a second processor is available. | |
| The execution time for two clients with thread local allocation time is | |
| only 1.4 times the sequential execution time for a single thread in a | |
| thread-unsafe environment, even though it involves twice the client work. | |
| That represents close to a | |
| factor of 2 improvement over the 2 client case with the old collector. | |
| The old collector clearly | |
| still suffered from some contention overhead, in spite of the fact that the | |
| locking scheme had been fairly well tuned. | |
| <P> | |
| Full linear speedup (i.e. the same execution time for 1 client on one | |
| processor as 2 clients on 2 processors) | |
| is probably not achievable on this kind of | |
| hardware even with such a small number of processors, | |
| since the memory system is | |
| a major constraint for the garbage collector, | |
| the processors usually share a single memory bus, and thus | |
| the aggregate memory bandwidth does not increase in | |
| proportion to the number of processors. | |
| <P> | |
| These results are likely to be very sensitive to both hardware and OS | |
| issues. Preliminary experiments with an older Pentium Pro machine running | |
| an older kernel were far less encouraging. | |
| </body> | |
| </html> | |