Evolution of a binary search implementation for line-sorted text files

by Péter Szabó 〈pts@fazekas.hu〉 at Sun Dec 1 15:31:06 CET 2013

Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix. This article explains the design and implementation details of a piece of software which can do that for you. If you are not interested in the details, you can just get the software and use it. Use the C implementation for maximum speed. There is also a Python implementation which is easier to understand, reuse as a library and extend (but it's slower: it does more lseek(2) and read(2) operations on the file, and it also uses a bit more CPU).

Usage instructions are not included here; see the pts-line-bisect project page, which shows the README.

A typical real-world use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or time-based measurement records). These files are text files with variable-length lines, with the log timestamp at the beginning of each line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that; it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It's also of high performance and low overhead, partially because it is written in C++. So bsearch is a practical tool with lots of useful features.

In this article we take a purer approach though, and assume that the lines of the input text file are lexicographically sorted, and we don't do any decoding. We also assume that lines are separated by '\n' (= 10) bytes. These assumptions work for ASCII, UTF-8 and ISO-8859-1 (Latin-1) text. They also work for numbers and dates if they are zero-padded (or space-padded), right-aligned, and rendered at constant width.

If each line has the same length, then a simple binary search can be used to find the interesting range of lines in O(s · log n) time (where n is the number of lines and s is the length of each line). Actually there are two caveats: two binary searches are needed (one for the start of the interval and one for the end), and implementing a binary search can be tricky because of corner cases and possible off-by-one errors. We'll address these below.
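For the fixed-length case the byte offset of record i is simply i · s, so the list-based algorithm carries over directly. A minimal sketch (the function name bisect_left_fixed and the example records are mine, purely for illustration):

```python
import io

def bisect_left_fixed(f, x, s):
  """Smallest record index where record x could be inserted; each record
  in f is exactly s bytes long (including the trailing newline)."""
  f.seek(0, 2)  # Seek to EOF to get the file size.
  lo, hi = 0, f.tell() // s
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid * s)  # Records have fixed size, so seeking is trivial.
    if x <= f.read(s):
      hi = mid
    else:
      lo = mid + 1
  return lo

f = io.StringIO('aa\nbb\ncc\ndd\n')  # 4 records of 3 bytes each.
assert bisect_left_fixed(f, 'bb\n', 3) == 1
assert bisect_left_fixed(f, 'bz\n', 3) == 2  # Insertion point between bb and cc.
```

Variable-length lines break exactly this seek arithmetic, which is what the rest of the article is about.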

The outline of this article:

  1. a quick recap on binary search in a list;
  2. how to change it so it would work on a text file rather than a list;
  3. the I/O challenge: how to reduce the amount of disk I/O (lseek(2) and read(2));
  4. summary and notes.

Understanding bisection and binary search

Here is a Python implementation (also available in the built-in bisect module) for finding the start of the interval in a sorted list:

def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x.  So if x already appears in the list, a.insert(i, x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the
  slice of a to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo

To avoid the corner case bugs, pay attention to <= versus <, and the +1 in mid + 1. bisect_right finds the end of the interval. Its implementation is almost the same, you just have to change the <= to <. Here is how you can use bisect_left and bisect_right to find and remove a range from a list:

>>> a = [10, 20, 20, 20, 30, 40, 50, 50, 60]
>>> from bisect import bisect_left, bisect_right
>>> start, end = bisect_left(a, 20), bisect_right(a, 50)
>>> start, end
(1, 8)
>>> del a[start : end]
>>> a
[10, 60]

The choice of the meaning of the arguments and of the return value of bisect_left and bisect_right is smart, because it avoids most of the off-by-one errors and the related confusion. C++ has std::lower_bound and std::upper_bound (respectively) with the same design, but using iterators instead of indexes. It is straightforward to port bisect_left and bisect_right to C, using either indexes or pointers.

If you need binary search to find a specific element of a sorted list, you can use bisect_left and check if the element at the index returned is indeed there. You need the check, because bisect_left returns a meaningful value (i.e. an insertion offset) even if the element you are looking for is missing. As an alternative, you can use bisect_right, subtract 1 from the index, and see if it's there. Don't forget to do a range check before indexing.
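For example, an exact-match lookup built on bisect_left can look like this (the helper name find is just for illustration):

```python
from bisect import bisect_left

def find(a, x):
  """Return the index of the leftmost x in sorted list a, or -1 if missing."""
  i = bisect_left(a, x)
  if i < len(a) and a[i] == x:  # Range check before indexing.
    return i
  return -1

a = [10, 20, 20, 30]
assert find(a, 20) == 1   # Leftmost 20.
assert find(a, 25) == -1  # bisect_left returns 3, but a[3] != 25.
```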

Binary search in a text file with variable line length

The basic idea is to do the bisection (index halving) on byte positions, but compare whole lines. This sounds useless at first, because if we seek to the middle of the file, we might end up in the middle of a line, and a partial line is useless for comparison. There are two natural fixes for that: F1. after the seek, find the beginning of the line; F2. after the seek, find the end of the line. They both look feasible, but we are a bit afraid of introducing corner case bugs.

Let's choose F2, because F1 needs scanning backwards in a file, and the implementation of that sounds ugly. A naïve implementation of F2:

def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo

Three obvious bugs: B1. it will never find the line at the beginning of the file; B2. if x should be inserted at the end of the file, the bisection makes the wrong decision, because the 2nd f.readline() returns an empty string at EOF; B3. the return value doesn't contain the offset of the actual line, but an offset inside the previous line. Surprisingly, there are no other bugs (see the proof later). Luckily, it's easy to fix these bugs. Interestingly, the first 10 Google search results for binary search text file show many pieces of code (Java, Python, C#) using the same design (i.e. calling readline twice), and most of them contain both bugs B1 and B2, and neither the authors nor the commenters bothered fixing them.

Here is how to fix bug B1: If mid is 0, then skip the first readline, otherwise seek to mid - 1 (rather than mid), and do both readlines. It's easy to see that we need both of these changes, e.g. if the file starts with 2 \ns, only with both changes will the program find both empty lines (the first one at offset 0, the second one at offset 1).

Here is how to fix bug B2: If the second readline has returned an empty string (because of EOF), force the comparison to succeed (EOF counts as larger than any line we search for, meaning that the position we're looking for is before EOF), thus hi = mid would be executed, moving backwards, away from EOF.

Here is how to fix bug B3: Remember the file offset (using f.tell) just before the 2nd readline, and return that.

These are easy fixes, but they make the code a bit less elegant and a bit longer. The following solution also contains a few shortcuts to speed up execution in corner cases.

def bisect_left(f, x):
  """Return the smallest offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset is
  at the end of f, then don't forget to append a '\\n' before appending x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
  Returns:
    Smallest possible byte offset in f where to insert line x.
  """
  x = x.rstrip('\n')
  if not x: return 0  # Shortcut. Don't do it for bisect_right.
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0: return 0  # Shortcut.
  lo, hi, mid = 0, size - 1, 1
  while lo < hi:
    mid = (lo + hi) >> 1
    if mid > 0:
      f.seek(mid - 1)  # Just to figure out where our line starts.
      f.readline()  # Ignore previous line, find our line.
      midf = f.tell()
    else:
      midf = 0
      f.seek(midf)
    line = f.readline()  # We read at f.tell() == midf.
    # EOF (`not line') is always larger than any line we search for.
    if not line or x <= line.rstrip('\n'):  # `<' for bisect_right.
      hi = mid
    else:
      lo = mid + 1
  if mid == lo: return midf  # Shortcut.
  if lo <= 0: return 0
  f.seek(lo - 1)
  f.readline()
  return f.tell()

See the comments in the code above for how it can be changed to bisect_right (only one shortcut has to be removed, plus the usual change from <= to < in the line comparison).
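Here is a quick standalone sanity check of the fixed version (the function body is repeated, minus its docstring, so the snippet runs on its own; the example file and offsets are just an illustration):

```python
import io

def bisect_left(f, x):
  # Smallest byte offset where line x could be inserted into sorted file f.
  x = x.rstrip('\n')
  if not x: return 0  # Shortcut.
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0: return 0  # Shortcut.
  lo, hi, mid = 0, size - 1, 1
  while lo < hi:
    mid = (lo + hi) >> 1
    if mid > 0:
      f.seek(mid - 1)  # Just to figure out where our line starts.
      f.readline()  # Ignore previous line, find our line.
      midf = f.tell()
    else:
      midf = 0
      f.seek(midf)
    line = f.readline()  # We read at f.tell() == midf.
    # EOF (`not line') is always larger than any line we search for.
    if not line or x <= line.rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  if mid == lo: return midf  # Shortcut.
  if lo <= 0: return 0
  f.seek(lo - 1)
  f.readline()
  return f.tell()

f = io.StringIO('ab\nfoo\nworld\nzip\n')
assert bisect_left(f, 'foo') == 3    # 'foo' starts at offset 3.
assert bisect_left(f, 'fool') == 7   # Insertion point before 'world'.
assert bisect_left(f, 'zzz') == 17   # Would be appended at EOF.
```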

Please note that we've dropped the arguments lo and hi, because the internal lo and hi variables are not meaningful to the caller, since they don't fall on line boundaries. It would be possible to add back support by manually scanning forwards or backwards for a \n in the beginning, but I omitted that here, because I couldn't find a compelling use case for such an ugly piece of code.

It looks like we've completely solved the problem. But in fact we haven't proven the correctness (i.e. that the functions return exactly those line offsets which their docstring describes), and we haven't yet optimized for I/O speed.

For the proof of correctness we can reduce this algorithm to the bisect_left for lists, and use the fact that that one is correct. For example, let's consider the input file:

ab
foo
world
zip

And then see that at which mid offset which line is read:

  mid  midf  line
    0     0  ab
    1     3  foo
    2     3  foo
    3     3  foo
    4     7  world
    5     7  world
    6     7  world
    7     7  world
    8    13  zip
    9    13  zip
   10    13  zip
   11    13  zip
   12    13  zip
   13    13  zip
   14    17  (EOF)
   15    17  (EOF)
   16    17  (EOF)
   17    17  (EOF)

Note that each line (except for the first) appears as many times as there are bytes in the previous line (including its trailing \n). It will turn out that the amount of duplication doesn't matter as long as no lines are removed.

The key observation: If we duplicate some strings in a sorted list, it still remains sorted, thus bisect_left and bisect_right give correct results (with the inflated indexes), no matter how many times the individual lines were duplicated. So what the offset-based bisect_left above does is emulate a longer, inflated list of lines, and do a binary search on it. The result is an offset within the inflated list, but, because we've fixed B3, before returning, bisect_left converts the result to the offset the user is interested in (i.e. converting lo from a mid to a midf offset). To conclude the proof, we need to check that the correct translation happens at EOF as well; and it indeed does (because we've fixed B2).
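This reduction can be made concrete in a few lines of Python: build the inflated list explicitly for the example file, check that it is still sorted, and check that searching it and then translating the index back to a byte offset gives the expected result (the helper variables are mine, purely for illustration):

```python
import bisect

data = 'ab\nfoo\nworld\nzip\n'
lines = data.split('\n')[:-1]  # ['ab', 'foo', 'world', 'zip']

# Line start offsets plus the EOF offset: [0, 3, 7, 13, 17].
starts, pos = [], 0
for line in lines:
  starts.append(pos)
  pos += len(line) + 1
starts.append(pos)

# Inflated list: for each byte offset mid, the line the file-based search
# would read there. midf is the smallest line start >= mid; '\xff' stands
# in for EOF, which compares larger than any line we search for.
inflated = []
for mid in range(len(data)):
  midf = starts[bisect.bisect_left(starts, mid)]
  inflated.append(lines[starts.index(midf)] if midf < pos else '\xff')

assert sorted(inflated) == inflated  # Duplication keeps the list sorted.
# Searching the inflated list and translating back gives the byte offset:
i = bisect.bisect_left(inflated, 'world')
assert starts[bisect.bisect_left(starts, i)] == 7  # 'world' starts at 7.
```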

The full, slow implementation in Python

See below the full implementation of bisect_right as well, and also a new convenience function bisect_interval, which calls bisect_left for the interval start and bisect_right for the interval end. The seeking and line reading was abstracted away to make it possible to introduce caches and speedups later:

def _read_and_compare(cache, ofs, f, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with
  the line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term ``middle''
  includes the offset of the trailing '\\n'.)

  If the line used starts at EOF, then tester won't be
  called, and True is used instead.

  Args:
    cache: Currently ignored.
      TODO: Speed it up by adding caching of previous test results.
    ofs: The offset in f to read from. If ofs is in the middle of a line, then
      the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f,
    and dummy is an implementation detail that can be ignored.
  """
  if ofs > 0:
    f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    fofs = f.tell()
  else:
    fofs = 0
    f.seek(fofs)
  line = f.readline()  # We read at f.tell() == fofs.
  # EOF (`not line') is always larger than any line we search for.
  return fofs, (not line or tester(line.rstrip('\n'))), ()


def bisect_way(f, x, is_left):
  """Return an offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset is
  at the end of f, then don't forget to append a '\\n' before appending x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    is_left: If true, emulate bisect_left. See the return value for more info.
  Returns:
    Byte offset in f where to insert line x. If is_left is true (i.e.
    bisect_left), then the smallest possible offset is returned, otherwise
    (i.e. bisect_right) the largest possible offset is returned.
  """
  x = x.rstrip('\n')
  if is_left and not x:  # Shortcut.
    return 0
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0:  # Shortcut.
    return 0
  if is_left:
    tester = x.__le__  # x <= y.
  else:
    tester = x.__lt__  # x < y.
  lo, hi, mid, cache = 0, size - 1, 1, []
  while lo < hi:
    mid = (lo + hi) >> 1
    midf, g, _ = _read_and_compare(cache, mid, f, tester)
    if g:
      hi = mid
    else:
      lo = mid + 1
  if mid != lo:
    midf = _read_and_compare(cache, lo, f, tester)[0]
  return midf


def bisect_right(f, x):
  """Return the largest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_right, but operates on lines rather than elements.
  Convenience function which just calls bisect_way(..., is_left=False).
  """
  return bisect_way(f, x, False)


def bisect_left(f, x):
  """Return the smallest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_left, but operates on lines rather than elements.
  Convenience function which just calls bisect_way(..., is_left=True).
  """
  return bisect_way(f, x, True)


def bisect_interval(f, x, y=None, is_open=False):
  """Return (start, end) offset pair for lines between x and y.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: First line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    y: Last line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present. If None, x is used.
    is_open: If true, then the returned interval consists of lines
      x <= line < y. Otherwise it consists of lines x <= line <= y.
  Returns:
    Return (start, end) offset pair containing lines between x and y (see
    arg is_open on whether x and y are inclusive) in sorted file f. These
    offsets contain the lines: start <= ofs < end. Trailing
    '\\n's are included in the interval (except at EOF if there was none).
  """
  x = x.rstrip('\n')
  if y is None:
    y = x
  else:
    y = y.rstrip('\n')
  end = bisect_way(f, y, is_open)
  if is_open and x == y:
    return end, end
  else:
    # TODO: Speed up ignoring everything after offset `end'.
    return bisect_way(f, x, True), end

All this is a long piece of code (partially because of the docstrings), but it doesn't contain any new ideas or any surprises.
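As a usage sketch, here are the functions above boiled down to a minimal runnable form (docstrings and the still-unused cache argument dropped), together with an example that extracts a closed range of lines:

```python
import io

def _read_and_compare(ofs, f, tester):
  if ofs > 0:
    f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    fofs = f.tell()
  else:
    fofs = 0
    f.seek(0)
  line = f.readline()  # We read at f.tell() == fofs.
  # EOF (`not line') is always larger than any line we search for.
  return fofs, (not line or tester(line.rstrip('\n')))

def bisect_way(f, x, is_left):
  x = x.rstrip('\n')
  if is_left and not x: return 0  # Shortcut.
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0: return 0  # Shortcut.
  tester = x.__le__ if is_left else x.__lt__
  lo, hi, mid = 0, size - 1, 1
  while lo < hi:
    mid = (lo + hi) >> 1
    midf, g = _read_and_compare(mid, f, tester)
    if g: hi = mid
    else: lo = mid + 1
  if mid != lo:
    midf = _read_and_compare(lo, f, tester)[0]
  return midf

def bisect_interval(f, x, y=None, is_open=False):
  x = x.rstrip('\n')
  y = x if y is None else y.rstrip('\n')
  end = bisect_way(f, y, is_open)
  if is_open and x == y: return end, end
  return bisect_way(f, x, True), end

f = io.StringIO('ab\nfoo\nworld\nzip\n')
start, end = bisect_interval(f, 'foo', 'world')  # Closed interval.
f.seek(start)
assert f.read(end - start) == 'foo\nworld\n'
start, end = bisect_interval(f, 'bar')  # 'bar' is not in the file.
assert start == end == 3  # Empty interval at the insertion point.
```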

The 2nd call to bisect_way in bisect_interval above seems to be doing unnecessary work to find the interval start offset, because we already know that it won't be larger than end. If it were possible to restrict the search to a prefix of the file, limited at a specific offset, bisect_interval would be faster. Let's introduce a size argument for that in each function. Only _read_and_compare changes significantly, the others just pass the size around:

def _read_and_compare(cache, ofs, f, size, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with
  the line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term ``middle''
  includes the offset of the trailing '\\n'.)

  Bytes of f after offset `size' will be ignored. If a line spans over
  offset `size', it gets read fully (by f.readline()), and then truncated.

  If the line used starts at EOF (or at `size'), then tester won't be
  called, and True is used instead.

  Args:
    cache: Currently ignored.
      TODO: Speed it up by adding caching of previous test results.
    ofs: The offset in f to read from. If ofs is in the middle of a line, then
      the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f,
    and dummy is an implementation detail that can be ignored.
  """
  assert 0 <= ofs <= size
  if ofs:
    f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    fofs = min(size, f.tell())
  else:
    fofs = 0
    f.seek(0)
  assert 0 <= ofs <= fofs <= size
  g = True  # EOF is always larger than any line we search for.
  if fofs < size:
    line = f.readline()  # We read at f.tell() == fofs.
    if f.tell() > size:
      line = line[:size - fofs]
    if line:
      g = tester(line.rstrip('\n'))
  return fofs, g, ()


def bisect_way(f, x, is_left, size=None):
  """Return an offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset is
  at the end of f, then don't forget to append a '\\n' before appending x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    is_left: If true, emulate bisect_left. See the return value for more info.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored. If None, then no limit.
  Returns:
    Byte offset in f where to insert line x. If is_left is true (i.e.
    bisect_left), then the smallest possible offset is returned, otherwise
    (i.e. bisect_right) the largest possible offset is returned.
  """
  x = x.rstrip('\n')
  if is_left and not x:  # Shortcut.
    return 0
  if size is None:
    f.seek(0, 2)  # Seek to EOF.
    size = f.tell()
  if size <= 0:  # Shortcut.
    return 0
  if is_left:
    tester = x.__le__  # x <= y.
  else:
    tester = x.__lt__  # x < y.
  lo, hi, mid, cache = 0, size - 1, 1, []
  while lo < hi:
    mid = (lo + hi) >> 1
    midf, g, _ = _read_and_compare(cache, mid, f, size, tester)
    if g:
      hi = mid
    else:
      lo = mid + 1
  if mid != lo:
    midf = _read_and_compare(cache, lo, f, size, tester)[0]
  return midf


def bisect_right(f, x, size=None):
  """Return the largest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_right, but operates on lines rather than elements.
  Convenience function which just calls bisect_way(..., is_left=False).
  """
  return bisect_way(f, x, False, size)


def bisect_left(f, x, size=None):
  """Return the smallest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_left, but operates on lines rather than elements.
  Convenience function which just calls bisect_way(..., is_left=True).
  """
  return bisect_way(f, x, True, size)


def bisect_interval(f, x, y=None, is_open=False, size=None):
  """Return (start, end) offset pair for lines between x and y.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: First line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    y: Last line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present. If None, x is used.
    is_open: If true, then the returned interval consists of lines
      x <= line < y. Otherwise it consists of lines x <= line <= y.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored. If None, then no limit.
  Returns:
    Return (start, end) offset pair containing lines between x and y (see
    arg is_open on whether x and y are inclusive) in sorted file f, before offset
    `size'. These offsets contain the lines: start <= ofs < end. Trailing
    '\\n's are included in the interval (except at EOF if there was none).
  """
  x = x.rstrip('\n')
  if y is None:
    y = x
  else:
    y = y.rstrip('\n')
  end = bisect_way(f, y, is_open, size)
  if is_open and x == y:
    return end, end
  else:
    return bisect_way(f, x, True, end), end

Please note that the 2nd call to bisect_way in bisect_interval above passes end instead of size; this is how the optimization is implemented. Most of the work happened in the helper functions; the final spot needed only a single argument change.

Reducing the number of I/O operations

In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection step. Most of the software and hardware components would just be waiting for the hard disk to move the reading head. (Except, of course, that seek times are negligible when non-spinning storage such as an SSD or a memory card is used.)

An out-of-the-box solution would be adding the data to a more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or on disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.

The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces the possible input size by a factor larger than 2, while in a binary search each step reduces the input size by a factor of 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tuning of the constants in B-trees it's usual to have only 2 or 3 disk seeks for each search even for 100GB of data, while a binary search in such a large file with 50 bytes per record would need about 31 disk seeks. Taking 10ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.
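The seek counts above can be double-checked with a few lines of arithmetic:

```python
import math

records = 100 * 10**9 // 50      # 100 GB of data, 50 bytes per record.
seeks = math.ceil(math.log2(records))
assert records == 2 * 10**9
assert seeks == 31                      # Binary search: ~31 seeks per lookup.
assert round(seeks * 0.010, 2) == 0.31  # At 10 ms per seek: 0.31 second.
```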

In searches, the disk seek time usually dominates disk read times and other data processing times (such as comparing strings in memory or copying data from one memory buffer to another). So if we want an in-the-box solution that improves the speed of our text file binary search, we should do as few seeks as possible. As it is now, the code does an f.seek(...) call for each bisection step, and the design of the binary search doesn't let us do fewer, so it seems that there is no room for improvement. But there is: we can change the code so that not every f.seek(...) call requires a seek in the kernel (i.e. an lseek(2) system call).

Before making such a change, let's remember that most I/O libraries provide input buffering by default. (These include the Python file object, the Ruby file object, the files opened in Perl, the C <stdio.h> functions and the C++ std::istream class hierarchy. In Java one has to create a BufferedInputStream or a BufferedReader manually to get read buffering.) That is, each read operation reads at least 8KB of data (the size is configurable) from the kernel, and subsequent read operations read from the read buffer (i.e. the data already read but not consumed) instead of asking the kernel again. The purpose of this is to get rid of the system call latency, not to reduce the number of disk seeks. Because of the need for context switching and data copies between the kernel and userspace, a system call latency (delay) occurs whenever the program calls the kernel. Even without this buffering in the program, the number of disk seeks would be equally small, because the kernel also does buffering (of at least 4KB), so if there is an lseek(2) system call followed by a read(2) system call to read data from elsewhere, and the new location is close enough to the old location, then the kernel can find and return the data from its buffers, bypassing the hard disk altogether. (On non-Unix systems the names of the system calls are different, but exactly the same happens, as viewed from our high-level perspective.)

We don't have to make any code changes to take advantage of this kernel-level buffering, because the kernel does it for us by default. For example, the last 12 f.seek(...) calls in bisect_left operate on the same 4096-byte region (plus the length of the last line), which spans one or two 4KB blocks on disk, and since these blocks have been read recently (within the same bisect_left invocation), the kernel returns them from its buffers. Thus the last 11 f.seek(...) calls and their corresponding f.readline() calls don't need a disk seek or a disk read (unless the last line of the block is very long), and thus they will be very fast. Again, we didn't have to make any effort in our code to get this speed benefit.

There is another benefit though for which we do have to improve our code: getting rid of the system call latency. If the data is in the kernel buffers already (or it is being read from a fast SSD with at least 50000 read IOPS (I/O operations per second)), then the bottleneck is the system call overhead. (In interpreted programming languages such as Python there can also be a CPU bottleneck because of the slow interpreter. We should also optimize for code length because of that.)

So our goal now is to reduce the number of system calls. In Python, f.tell() doesn't generate a system call (except for the very first time it's called on f), so we have to reduce the number of times f.readline() and f.seek(...) are called. There is a simple first step on that road: before each f.seek(...), call f.tell() to figure out where we are, and omit the seek call if we happen to be at the right position already.

The implementation of this optimization is straightforward: it just adds two ifs to _read_and_compare:

def _read_and_compare(cache, ofs, f, size, tester):
  """..."""
  assert 0 <= ofs <= size
  if ofs:
    if f.tell() != ofs - 1:  # Avoid lseek(2) call if not needed.
      f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    # Calling f.tell() is cheap, because Python remembers the lseek(2) retval.
    fofs = min(size, f.tell())
  else:
    fofs = 0
    if size and f.tell():  # Avoid lseek(2) call if not needed.
      f.seek(0)
  assert 0 <= ofs <= fofs <= size
  g = True  # EOF is always larger than any line we search for.
  if fofs < size:
    line = f.readline()  # We read at f.tell() == fofs.
    if f.tell() > size:
      line = line[:size - fofs]
    if line:
      g = tester(line.rstrip('\n'))
  return fofs, g, ()

A generic idea for reducing the number of system calls is to add a cache to the program, and fetch results from the cache instead of the kernel whenever possible. Binary search usually doesn't benefit from caching, because it never needs the same piece of data again, but our implementation does: as soon as the remaining input is very short (just a few lines), it reads the same lines over and over again, with the same midf, but with a different mid. To illustrate this further, let's imagine an input file which contains a single 64-byte line, and the caller is looking for a line larger than that. Since it's a binary search and 2^6 = 64, bisect_left will need 6 (or one more) attempts to find it. Each time it will use a different mid, but the same midf = 64. Remember that midf is the offset we get after reading and ignoring a partial line after the offset mid.

To get rid of these last few f.readline() calls, we could add a cache that maps midf values to their corresponding line strings. In the example, the entry (midf = 64, line = EOF) would be added to the cache, and then it would be used for the rest of the calls, thus eliminating 6 f.readline() calls. (If the caller was looking for the empty string, this cache wouldn't help.) In fact, we don't have to record the whole line in the cache, it's enough to remember the comparison result (g), i.e. whether it was smaller than x, the line the caller is looking for. That's because the binary search does only such comparisons with the file lines, it doesn't use them for anything else.

So this cache can get rid of some of the 2nd f.readline() calls in _read_and_compare. But what about the 1st calls, i.e. those which figure out midf from mid? In the _read_and_compare implementation, fofs corresponds to midf and ofs corresponds to mid, so let's use this naming. Luckily, we can eliminate some of the 1st calls by caching ofs in addition to fofs. Let's suppose we have an entry (ofs = 42, fofs = 55, g = False) in the cache, and _read_and_compare(ofs=48) is called. We can use the cached result in this cache entry, because the cache entry describes a line which ends at offset 55 (where the next line starts), and offset 42 is somewhere in the middle of that line. Since 42 ≤ 48 ≤ 55, this implies that offset 48 is also somewhere in the middle of the same line, so we can use the cached result, g = False. The value of g is not essential here, it would work the same way with g = True.

Here is an implementation of such a cache (don't worry if you don't get it on first reading; it will be explained afterwards):

def _read_and_compare(cache, ofs, f, size, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with
  the line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term ``middle''
  includes the offset of the trailing '\\n'.)

  Bytes of f after offset `size' will be ignored. If a line spans over
  offset `size', it gets read fully (by f.readline()), and then truncated.

  If the line used starts at EOF (or at `size'), then tester won't be
  called, and True is used instead.

  A cache of previous offsets and test results is read and updated. The size of
  the cache is bounded (it contains at most 4 offsets and 2 test results).

  Args:
    cache: A cache containing previous offsets and test results. It's a list
      of lists. An empty cache is the empty list, so initialize it to []
      before the first call. len(cache) is 0, 1, or 2. Each entry in cache is
      of the form [fofs, g, ofs].
    ofs: The offset in f to read from. If ofs is in the middle of a line, then
      the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f,
    and dummy is an implementation detail that can be ignored.
  """
  assert len(cache) <= 2
  assert 0 <= ofs <= size
  if cache and cache[0][2] <= ofs <= cache[0][0]:
    cache.reverse()  # Move cache[0] to the end since we've just fetched it.
  elif len(cache) > 1 and cache[-1][2] <= ofs <= cache[-1][0]:
    pass  # We've found cache[-1] (same index as cache[1]).
  else:
    if ofs:
      if f.tell() != ofs - 1:  # Avoid lseek(2) call if not needed.
        f.seek(ofs - 1)  # Just to figure out where our line starts.
      f.readline()  # Ignore previous line, find our line.
      # Calling f.tell() is cheap, because Python remembers the lseek(2) retval.
      fofs = min(size, f.tell())
    else:
      fofs = 0
    assert 0 <= ofs <= fofs <= size
    if cache and cache[0][0] == fofs:
      cache.reverse()  # Move cache[0] to the end since we've just fetched it.
      if cache[-1][2] > ofs:
        cache[-1][2] = ofs
    elif len(cache) > 1 and cache[-1][0] == fofs:
      if cache[-1][2] > ofs:
        cache[-1][2] = ofs
    else:
      g = True  # EOF is always larger than any line we search for.
      if fofs < size:
        if not fofs and f.tell():
          f.seek(0)
        line = f.readline()  # We read at f.tell() == fofs.
        if f.tell() > size:
          line = line[:size - fofs]
        if line:
          g = tester(line.rstrip('\n'))
      if len(cache) > 1:  # Don't keep more than 2 items in the cache.
        del cache[0]
      cache.append([fofs, g, ofs])
  return cache[-1]  # Return the most recent item of the cache.

Fortunately, we could make this improvement without changing any other functions. This was not a coincidence, an oracle was guiding our code and interface design. (Another popular explanation is that the real final code was written earlier than the code evolution steps in this article.)

Even without understanding the full logic of the implementation above, we can easily observe that the size of the cache never increases above 2 entries. (Just before it is about to increase, we remove an entry.)

Another observation (albeit a less obvious one) is that this is an LRU cache, i.e. it discards the least recently used entry on overflow. Thus it keeps the 2 most recently used entries. The cache.reverse() calls keep the cache in usage order: whenever we use entry 0, we call cache.reverse() to swap it with entry 1. (If there is no other entry besides entry 0, cache.reverse() has no effect. How lucky.) del cache[0] is used for removal, which removes the LRU entry, and cache.append(...) is used for adding, which appends to the end, making the newly added entry the most recently used.
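The reverse/del/append discipline can be isolated into a tiny sketch (use is a made-up helper; the real code stores [fofs, g, ofs] entries rather than plain strings):

```python
def use(cache, item):
    """Keep at most 2 items in cache, most recently used last,
    with the same reverse/del/append moves as _read_and_compare."""
    if cache and cache[0] == item:
        cache.reverse()          # hit on index 0: swap it to the MRU slot
    elif len(cache) > 1 and cache[-1] == item:
        pass                     # hit on index -1: already MRU
    else:
        if len(cache) > 1:
            del cache[0]         # overflow: evict the LRU entry
        cache.append(item)       # the new item becomes MRU
    return cache[-1]

cache = []
use(cache, 'a'); use(cache, 'b')
assert cache == ['a', 'b']       # 'b' is MRU
use(cache, 'a')                  # hit on index 0, swapped to the end
assert cache == ['b', 'a']
use(cache, 'c')                  # overflow: 'b' (the LRU entry) is evicted
assert cache == ['a', 'c']
```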

Having understood these properties, the other aspects of the implementation are easy to follow. First we try to fetch from the cache by ofs. If that fails, we call f.readline() to compute fofs. Then we try to match a cache entry by fofs, and if there is a match, we decrease the ofs of that existing cache entry. If there was no match by fofs either, we read the line, compare it, and add the result as a new cache entry.

One question remains: wouldn't we benefit from more than 2 cache entries? If we raised the cache size limit to 3 or 4 instead, code execution speed wouldn't degrade much, and we could potentially get rid of more f.readline() and f.seek() calls. The obvious disadvantage is a small increase in complexity. Most of the time we don't need more than 2 entries, because the binary search operates on lines far from each other, except, of course, near the end, when it has already hit the correct line. In that case 2 cache entries (one for the good line and one for the next or previous line, from which we're trying to distinguish it) are enough. Even with very short (less than 8 bytes) lines, in 13% of the cases the cache eliminated all 3 calls, and in an additional 20% of the cases only the 2nd f.readline() call was eliminated. Measuring and explaining the additional, potentially small, speed benefits of cache sizes 3 and 4 is left as an exercise for you.
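Hit rates like these can be measured by wrapping the file object in a counter before passing it to the search; a minimal sketch (CountingFile is a made-up name for illustration):

```python
import io

class CountingFile(object):
    """File-like wrapper that counts readline() and seek() calls,
    exposing the same tell/seek/readline interface _read_and_compare
    needs."""
    def __init__(self, f):
        self.f = f
        self.readline_calls = self.seek_calls = 0
    def readline(self):
        self.readline_calls += 1
        return self.f.readline()
    def seek(self, ofs):
        self.seek_calls += 1
        self.f.seek(ofs)
    def tell(self):
        return self.f.tell()

cf = CountingFile(io.StringIO('apple\nbanana\ncherry\n'))
cf.seek(0)
cf.readline()
cf.readline()
assert cf.readline_calls == 2 and cf.seek_calls == 1
```

Comparing the counters with the cache enabled and disabled (for the same queries) gives the per-call savings directly.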

Implementation and performance notes for the C code

In addition to the Python code, this text file binary search algorithm has been implemented in C as well. The C implementation is mostly a direct transliteration of the Python one, for additional reductions in CPU and memory usage, and for increased portability to memory-constrained systems (such as Linux-based routers) where a Python interpreter is too heavy-weight. It is also the preferred benchmark subject: if anyone claims to have written a faster text file binary search implementation, its speed should be compared to this C implementation's.

Differences in the C and Python implementations:

Of course, both implementations support long files (i.e. files longer than 2 GB, whose size doesn't fit in 31 bits) if the underlying operating system and filesystem support it. Python file objects support long files by default, and in the C code _FILE_OFFSET_BITS is #defined to 64, which is respected by glibc on Linux (and possibly by other systems as well).

Future work

You can name it your work or somebody else's work, at your discretion.

Leave your comment about this article on my blog. Constructive criticism is also welcome. Constructively going into raptures over it is even more welcome.

Have fun binary searching, but don't forget to sort your files first!