Paper: A fork() in the road
A review of Baumann et.als paper on limitations of the fork system call present in UNIX operating systems.
This is a paper that only affects a small number of people: those doing OS research and those that directly call fork
from their programs. Still, I find it interesting since it points to limitations in the UNIX design. When I learned to program, UNIX was seen as the gold standard. I remember reading an LKML post from Linus Torvalds where he meant that OS development was about the refinement of existing ideas, not so much about inventing new concepts. But people still do OS research, and the paper describes how having to support fork
can be a burden.
The main criticism in the paper is about how fork
affects many other parts of the system. It was originally a very simple idea: copy the address space of the parent and let the child redirect file descriptors, reduce permissions or, alter the namespace of the child. But fork
today impacts all other operating system abstractions with which it was once orthogonal. Below are some practical examples of things mentioned in the paper.
Buffered I/O and Fork
What do you think this program will print?
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
int main(void) {
printf("Hello World! ");
pid_t pid = fork();
return 0;
}
It prints the output twice, Hello World! Hello World!
, since the stdio buffers are flushed both in the parent and in the child. So in this case fork
doesn’t compose, since the calling code needs to explicitly flush I/O before calling fork
.
Threads and Fork
This program starts a thread that locks a mutex for 1ms. It then forks a child which tries to lock the same mutex. What do you think it will print and why?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/wait.h>
static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;
static void* worker(void* arg) {
pthread_detach(pthread_self());
for (;;) {
pthread_mutex_lock(&thread_mutex);
usleep(1000);
pthread_mutex_unlock(&thread_mutex);
}
return NULL;
}
int main() {
pthread_t thread;
pthread_create(&thread, NULL, worker, 0);
usleep(1000);
pid_t pid = fork();
if (pid == 0) {
pthread_mutex_lock(&thread_mutex);
const char buf[] = "lock acquired\n";
write(fileno(stderr), buf, strlen(buf));
pthread_mutex_unlock(&thread_mutex);
exit(0);
}
wait3(NULL, WNOHANG, NULL);
return 0;
}
On my system, the subprocess always deadlocks before printing the lock acquired
message. A child created by fork
has only a single thread (a copy of the calling thread). So there’s a good chance that the mutex will be in the locked state when the child starts executing. One common case where this happens is if a thread doing memory allocation and holding a heap lock, while fork
is called.
Large address spaces and Fork
How long do you think it takes to fork if the parent has 10K VMAs? Here’s a test program from Chromes Issue 819228: Consider using posix_spawn() on Linux.
/* Allocates several memory ranges and reports how long it takes to fork().
*
* clang slow_fork.c -o slow_fork && ./slow_fork 100 10
*
* Almost no error checking anywhere.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>
#define PAGE_SIZE 4096
#define SIZE 100 * PAGE_SIZE
void* mmap_anonymous(size_t size) {
return mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
}
void** create_vmas(int count) {
void** result = (void**) calloc(count, sizeof(void*));
if (!result)
return NULL;
for (int i = 0; i < count; i++) {
result[i] = mmap_anonymous(SIZE);
if (!result[i])
return NULL;
}
return result;
}
void remove_vmas(void** arrays, int count) {
for (int i = 0; i < count; i++) {
munmap(arrays[i], SIZE);
}
free(arrays);
}
void memset_all(void** arrays, int count) {
for (int i = 0; i < count; i++) {
memset(arrays[i], 0, SIZE);
}
}
struct timespec now() {
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
return now;
}
long interval_ns(struct timespec tick, struct timespec tock) {
return (tock.tv_sec - tick.tv_sec) * 1000000000L
+ (tock.tv_nsec - tick.tv_nsec);
}
int main(int argc, char** argv) {
int vmas, iterations;
void** areas;
struct timespec tick, tock;
if (argc != 3) {
printf("Usage: %s <nb_vmas> <iterations>\n", argv[0]);
return 1;
}
vmas = atoi(argv[1]);
iterations = atoi(argv[2]);
for (int i = 0; i < iterations; i++) {
areas = create_vmas(vmas);
if (!areas)
return 1;
memset_all(areas, vmas);
tick = now();
if (fork()) {
// parent.
tock = now();
long ns = interval_ns(tick, tock);
printf("%.2fus to fork().\n", (double) ns / 1e3);
} else {
_exit(0);
}
remove_vmas(areas, vmas);
}
return 0;
}
On my system, here are the run-times:
$ ./fork 1 1
149.65us to fork().
$ ./fork 10
89.63us to fork().
$ ./fork 100 1
576.18us to fork().
$ ./fork 1000 1
2661.38us to fork().
$ ./fork 10000 1
35420.67us to fork().
So 35ms to fork
with 10K VMAs. If I replace the fork
call with vfork
it takes only 115us! The speed difference is because fork
needs to copy the page table entries to the new address space.
To get around these limitations some systems mark memory-mapped regions as not to be forked. Others like Chrome creates a zygote process that has as little memory allocated as possible to ensure fast process creation times.
Alternatives to Fork
You can use vfork
or posix_spawn
instead. The vfork
function creates a new process but does not copy the parent’s page tables. Instead, it starts the child in the same address space as its parent and starts it before the parent. The only thing that can be done from there is calling èxec. The
posix_spawn API is much more complicated than just calling
fork
, but as we’ve seen the simplicity comes with a great burden on modern systems.