powerlevel10k/docs/listdir.md

# Fast directory listing

In order to find untracked files in a git repository, [gitstatusd](../README.md) needs to list the
contents of every directory. gitstatusd does it 27% faster than a reasonable implementation that a
seasoned C/C++ practitioner might write. This document explains the optimizations that went into it.
As directory listing is a common operation, many other projects can benefit from applying these
optimizations.

## v1

Given a path to a directory, `ListDir()` must produce the list of files in that directory. Moreover,
the list must be sorted lexicographically to enable fast comparison with Git index.

The following C++ implementation gets the job done. For simplicity, it returns an empty list on
error.

```c++
vector<string> ListDir(const char* dirname) {
  vector<string> entries;
  if (DIR* dir = opendir(dirname)) {
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) entries.push_back(ent->d_name);
    }
    if (errno) entries.clear();
    sort(entries.begin(), entries.end());
    closedir(dir);
  }
  return entries;
}
```

Every directory has entries `"."` and `".."`, which we aren't interested in. We filter them out with
a helper function `Dots()`.

```c++
bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }
```

To check how fast `ListDir()` performs, we can run it many times on a typical directory. One million
runs on a directory with 32 files with 16-character names takes 12.7 seconds.

## v2

Experienced C++ practitioners will scoff at our implementation of `ListDir()`. If it's meant to be
efficient, returning `vector<string>` is an unaffordable convenience. To avoid heap allocations we
can use a simple arena that will allow us to reuse memory between different `ListDir()` calls.

(Changed and added lines are marked with comments.)

```c++
void ListDir(const char* dirname, string& arena, vector<char*>& entries) {  // +
  entries.clear();                                                          // +
  if (DIR* dir = opendir(dirname)) {
    arena.clear();                                                          // +
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) {
        entries.push_back(reinterpret_cast<char*>(arena.size()));           // +
        arena.append(ent->d_name, strlen(ent->d_name) + 1);                 // +
      }
    }
    if (errno) entries.clear();
    for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];      // +
    sort(entries.begin(), entries.end(),                                   // +
         [](const char* a, const char* b) { return strcmp(a, b) < 0; });   // +
    closedir(dir);
  }
}
```

To make performance comparison easier, we can normalize them relative to the baseline. v1 will get
performance score of 100. A twice-as-fast alternative will be 200.

| version |     optimization           |     score |
|---------|----------------------------|----------:|
| v1      | baseline                   |     100.0 |
| **v2**  | **avoid heap allocations** | **112.7** |

Avoiding heap allocations makes `ListDir()` 12.7% faster. Not bad. As an added bonus, those casts
will fend off the occasional frontend developer who accidentally wanders into the codebase.

## v3

`opendir()` is an expensive call whose performance is linear in the number of subdirectories in the
path because it needs to perform a lookup for every one of them. We can replace it with `openat()`,
which takes a file descriptor to the parent directory and a name of the subdirectory. Just a single
lookup, less CPU time. This optimization assumes that callers already have a descriptor to the
parent directory, which is indeed the case for gitstatusd, and is often the case in other
applications that traverse filesystem.

```c++
void ListDir(int parent_fd, const char* dirname, string& arena, vector<char*>& entries) {   // +
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);  // +
  if (dir_fd < 0) return;                                                                   // +
  if (DIR* dir = fdopendir(dir_fd)) {
    arena.clear();
    while (struct dirent* ent = (errno = 0, readdir(dir))) {
      if (!Dots(ent->d_name)) {
        entries.push_back(reinterpret_cast<char*>(arena.size()));
        arena.append(ent->d_name, strlen(ent->d_name) + 1);
      }
    }
    if (errno) entries.clear();
    for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];
    sort(entries.begin(), entries.end(),
         [](const char* a, const char* b) { return strcmp(a, b) < 0; });
    closedir(dir);
  } else {                                                                                  // +
    close(dir_fd);                                                                          // +
  }                                                                                         // +
}
```

This is worth about 3.5% in speed.

| version | optimization                         |     score |
|---------|--------------------------------------|----------:|
| v1      | baseline                             |     100.0 |
| v2      | avoid heap allocations               |     112.7 |
| **v3**  | **open directories with `openat()`** | **116.2** |

## v4

Copying file names to the arena isn't free but it doesn't seem like we can avoid it. Poking around
we can see that the POSIX API we are using is implemented on Linux on top of `getdents64` system
call. Its documentation isn't very encouraging:

```text
These are not the interfaces you are interested in.  Look at
readdir(3) for the POSIX-conforming C library interface.  This page
documents the bare kernel system call interfaces.

Note: There are no glibc wrappers for these system calls.
```

Hmm... The API looks like something we can take advantage of, so let's try it anyway.

First, we'll need a simple `Arena` class that can allocate 8KB blocks of memory.

```c++
class Arena {
 public:
  enum { kBlockSize = 8 << 10 };

  char* Alloc() {
    if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0);
    return blocks_[cur_++].data();
  }

  void Clear() { cur_ = 0; }

 private:
  size_t cur_ = 0;
  vector<string> blocks_;
};
```

Next, we need to define `struct dirent64_t` ourselves because there is no wrapper for the system
call we are about to use.

```c++
struct dirent64_t {
  ino64_t d_ino;
  off64_t d_off;
  unsigned short d_reclen;
  unsigned char d_type;
  char d_name[];
};
```

Finally we can get to the implementation of `ListDir()`.

```c++
void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) {                       // +
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dir_fd < 0) return;
  arena.Clear();                                                                          // +
  while (true) {                                                                          // +
    char* buf = arena.Alloc();                                                            // +
    int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize);                      // +
    if (n <= 0) {                                                                         // +
      if (n) entries.clear();                                                             // +
      break;                                                                              // +
    }                                                                                     // +
    for (int pos = 0; pos < n;) {                                                         // +
      auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);                               // +
      if (!Dots(ent->d_name)) entries.push_back(ent->d_name);                             // +
      pos += ent->d_reclen;                                                               // +
    }                                                                                     // +
  }                                                                                       // +
  sort(entries.begin(), entries.end(),
       [](const char* a, const char* b) { return strcmp(a, b) < 0; });
  close(dir_fd);
}
```

How are we doing with this one?

| version | optimization                     |     score |
|---------|----------------------------------|----------:|
| v1      | baseline                         |     100.0 |
| v2      | avoid heap allocations           |     112.7 |
| v3      | open directories with `openat()` |     116.2 |
| **v4**  | **call `getdents64()` directly** | **137.8** |

Solid 20% speedup. Worth the trouble. Unfortunately, we now have just one `reinterpret_cast` instead
of two, and it's not nearly as scary-looking. Hopefully with the next iteration we can get back some
of that evil vibe of low-level code.

As a bonus, every element in `entries` has `d_type` at offset -1. This can be useful to the callers
that need to distinguish between regular files and directories (gitstatusd, in fact, needs this).
Note how `ListDir()` implements this feature at zero cost, as a lucky accident of `dirent64_t`
memory layout.

## v5

The CPU profile of `ListDir()` reveals that almost all userspace CPU time is spent in `strcmp()`.
Digging into the source code of `std::sort()` we can see that it uses Insertion Sort for short
collections. Our 32-element vector falls under the threshold. Insertion Sort makes `O(N^2)`
comparisons, hence a lot of CPU time in `strcmp()`. Switching to `qsort()` or
[Timsort](https://en.wikipedia.org/wiki/Timsort) is of no use as all good sorting algorithms fall
back to Insertion Sort.

If we cannot make fewer comparisons, perhaps we can make each of them faster? `strcmp()` compares
characters one at a time. It cannot read ahead as it can be illegal to touch memory past the first
null byte. But _we_ know that it's safe to read a few extra bytes past the end of `d_name` for every
entry except the last in the buffer. And since we own the buffer, we can overallocate it so that
reading past the end of the last entry is also safe.

Combining these ideas with the fact that file names on Linux are at most 255 bytes long, we can
invoke `getdents64()` like this:

```c++
int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);
```

And then compare entries like this:

```c++
[](const char* a, const char* b) { return memcmp(a, b, 255) < 0; }
```

This version doesn't give any speedup compared to the previous but it opens an avenue for another
optimization. The pointers we pass to `memcmp()` aren't aligned. To be more specific, their
numerical values are `N * 8 + 3` for some `N`. When given such a pointer, `memcmp()` will check the
first 5 bytes one by one, and only then switch to comparing 8 bytes at a time. If we can handle the
first 5 bytes ourselves, we can pass aligned memory to `memcmp()` and take full advantage of its
vectorized loop.

Here's the implementation:

```c++
uint64_t Read64(const void* p) {                                                          // +
  uint64_t x;                                                                             // +
  memcpy(&x, p, sizeof(x));                                                               // +
  return x;                                                                               // +
}                                                                                         // +

void ByteSwap64(void* p) {                                                                // +
  uint64_t x = __builtin_bswap64(Read64(p));                                              // +
  memcpy(p, &x, sizeof(x));                                                               // +
}                                                                                         // +

void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) {
  entries.clear();
  int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dir_fd < 0) return;
  arena.Clear();
  while (true) {
    char* buf = arena.Alloc();
    int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);                // +
    if (n <= 0) {
      if (n) entries.clear();
      break;
    }
    for (int pos = 0; pos < n;) {
      auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);
      if (!Dots(ent->d_name)) {
        ByteSwap64(ent->d_name);                                                          // +
        entries.push_back(ent->d_name);
      }
      pos += ent->d_reclen;
    }
  }
  sort(entries.begin(), entries.end(), [](const char* a, const char* b) {
    uint64_t x = Read64(a);                                                               // +
    uint64_t y = Read64(b);                                                               // +
    return x < y || (x == y && a != b && memcmp(a + 5, b + 5, 256) < 0);                  // +
  });
  for (char* p : entries) ByteSwap64(p);                                                  // +
  close(dir_fd);
}
```

This is for Little Endian architecture. Big Endian doesn't need `ByteSwap64()`, so it'll be a bit
faster.

| version | optimization                     |     score |
|---------|----------------------------------|----------:|
| v1      | baseline                         |     100.0 |
| v2      | avoid heap allocations           |     112.7 |
| v3      | open directories with `openat()` |     116.2 |
| v4      | call `getdents64()` directly     |     137.8 |
| **v5**  | **hand-optimize `strcmp()`**     | **143.3** |

Fast and respectably arcane.

## Conclusion

Through a series of incremental improvements we've sped up directory listing by 43.3% compared to a
naive implementation (v1) and 27.2% compared to a reasonable implementation that a seasoned C/C++
practitioner might write (v2).

However, these numbers are based on an artificial benchmark while the real judge is always the real
code. Our goal was to speed up gitstatusd. Benchmark was just a tool. Thankfully, the different
versions of `ListDir()` have the same comparative performance within gitstatusd as in the benchmark.
In truth, the directory chosen for the benchmark wasn't arbitrary. It was picked by sampling
gitstatusd when it runs on [chromium](https://github.com/chromium/chromium) git repository.

The final version of `ListDir()` spends 97% of its CPU time in the kernel. If we assume that it
makes the minimum possible number of system calls and these calls are optimal (true to the best
of my knowledge), it puts the upper bound on possible future performance improvements at just 3%.
There is almost nothing left in `ListDir()` to optimize.

![ListDir() CPU profile](
  https://raw.githubusercontent.com/romkatv/gitstatus/1ac366952366d89980b3f3484f270b4fa5ae4293/cpu-profile-listdir.png)

(The CPU profile was created with [gperftools](https://github.com/gperftools/gperftools) and
rendered with [pprof](https://github.com/google/pprof)).