Let's suppose you have a sorted text file, in which each line is lexicographically larger than the previous one, and you want to find a specific range of lines, or all lines with a specific prefix. This article explains the design and implementation of a piece of software which can do just that. If you are not interested in the details, you can just get the software and use it. Use the C implementation for maximum speed. There is also a Python implementation which is easier to understand, reuse as a library and extend (but it's slower: it does more lseek(2) and read(2) operations on the file, and it also uses a bit more CPU).
Usage instructions are not included here; see the pts-line-bisect project page, which shows the README, for those.
A typical real-world use case for such a binary search tool is retrieving lines corresponding to a particular time range in log files (or time-based measurement records). These files are text files with variable-length lines, with the log timestamp at the beginning of each line, and they are generated in increasing timestamp order. Unfortunately the lines are not lexicographically sorted, so the timestamp has to be decoded first for the comparison. The bsearch tool does that; it also supports parsing arbitrary, user-specifiable datetime formats, and it can binary search in gzip(1)ped files as well (by building an index). It's also high-performance and low-overhead, partially because it is written in C++. So bsearch is a practical tool with lots of useful features.
In this article we take a purer approach though, and assume that the lines of the input text file are lexicographically sorted, and we don't do any decoding. We also assume that lines are separated by '\n' (= 10) bytes. These assumptions work for ASCII, UTF-8 and ISO-8859-1 (Latin-1) text. They also work for numbers and dates if they are zero-padded (or space-padded), right-aligned, and rendered at constant width.
If each line has the same length, then a simple binary search can be used to find the interesting range of lines in O(s · log n) time (where n is the number of lines and s is the length of each line). Actually there are two caveats: two binary searches are needed (one for the start of the interval and one for the end), and implementing a binary search can be tricky because of corner cases and possible off-by-one errors. We'll address these below.
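As a warm-up, the fixed-length case can be sketched directly (a minimal sketch; the function name, the byte-mode I/O and the fixed-width record layout are assumptions for illustration, not part of the tool described below):

```python
import os

def bisect_fixed_width(path, x, width):
    """Return the index of the first fixed-width record >= x.

    Each record in the file at `path' is exactly `width' bytes,
    including its trailing '\n'. This is plain bisect_left, with
    list indexing replaced by seek-and-read.
    """
    with open(path, 'rb') as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size // width
        while lo < hi:
            mid = (lo + hi) >> 1
            f.seek(mid * width)  # Record boundaries are known in advance.
            line = f.read(width).rstrip(b'\n')
            if x <= line:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

Two such searches (this one and a bisect_right variant with < instead of <=) delimit the interesting range of records.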
Here is a Python implementation (also available in the built-in bisect module) for finding the start of the interval in a sorted list:
def bisect_left(a, x, lo=0, hi=None):
  """Return the index where to insert item x in list a, assuming a is sorted.

  The return value i is such that all e in a[:i] have e < x, and all e in
  a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
  insert just before the leftmost x already there.

  Optional args lo (default 0) and hi (default len(a)) bound the slice of a
  to be searched.
  """
  if lo < 0:
    raise ValueError('lo must be non-negative')
  if hi is None:
    hi = len(a)
  while lo < hi:
    mid = (lo + hi) >> 1
    if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
      hi = mid
    else:
      lo = mid + 1
  return lo
To avoid the corner case bugs, pay attention to <= versus <, and the +1 in mid + 1.
bisect_right finds the end of the interval. Its implementation is almost the same, you just have to change a <= to <. Here is how you can use bisect_left and bisect_right to find and remove a range from a list:
>>> a = [10, 20, 20, 20, 30, 40, 50, 50, 60]
>>> from bisect import bisect_left, bisect_right
>>> start, end = bisect_left(a, 20), bisect_right(a, 50)
>>> start, end
(1, 8)
>>> del a[start : end]
>>> a
[10, 60]
The choice of the meaning of the arguments and the return value of bisect_left and bisect_right is smart, because it avoids most of the off-by-one errors and most of the related confusion. C++ has std::lower_bound and std::upper_bound (respectively) with the same design, but it uses iterators instead of indexes. It is straightforward to port bisect_left and bisect_right to C, using either indexes or pointers.
If you need binary search to find a specific element of a sorted list, you can use bisect_left and check if the element at the index returned is indeed there. You need the check, because bisect_left returns a meaningful value (i.e. an insertion offset) even if the element you are looking for is missing. As an alternative, you can use bisect_right, subtract 1 from the index, and see if it's there. Don't forget to do a range check before indexing.
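For example, an exact-membership check on top of bisect_left (a hypothetical helper, sketched here for illustration):

```python
from bisect import bisect_left

def contains(a, x):
    """Return whether sorted list a contains x.

    bisect_left returns an insertion offset even when x is missing,
    so the index must be range-checked before comparing.
    """
    i = bisect_left(a, x)
    return i < len(a) and a[i] == x
```

contains([10, 20, 30], 20) is True, while contains([10, 20, 30], 25) is False, even though bisect_left returns a valid insertion offset in both cases.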
The basic idea is to do the bisection (index halving) on the byte positions, but compare whole lines. This sounds useless at first, because if we seek to the middle of the file, we might end up in the middle of a line, and a partial line is useless for comparison. There are two natural fixes for that: F1. after the seek, find the beginning of the line; F2. after the seek, find the end of the line. They both look feasible, but we are a bit afraid of introducing corner case bugs.
Let's choose F2, because F1 needs scanning backwards in a file, and the implementation of that sounds ugly. A naïve implementation of F2:
def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo
Three obvious bugs: B1. that it will never find the line in the beginning
of the file; B2. if x should be inserted to the end of the file, the
bisection makes the wrong decision, because the 2nd
f.readline()
returns an empty string on EOF; B3. the return value is not the offset of the actual line, but an offset inside the previous line.
Surprisingly, there
are no other bugs (see the proof later). Luckily, it's easy to fix these bugs.
Interestingly, the first 10 Google search results for
binary
search text file show many pieces of code (Java, Python, C#) using the same
design (i.e. calling readline twice), and most of them contain both
bugs B1 and B2,
and neither the authors nor the commenters bothered fixing them.
Here is how to fix bug B1: If mid is 0, then skip the first
readline, otherwise seek to mid - 1 (rather than mid),
and do both readlines. It's easy to see that we need both of these
changes, e.g. if the file starts with 2 \n
s, only with both
changes will the program find both empty lines (the first one at offset 0,
the second one at offset 1).
Here is how to fix bug B2: If the second readline has returned an
empty string (because of EOF), force the condition to be false (meaning that
the position we're looking for is before EOF), thus
hi = mid
would be executed, moving backwards, away from EOF.
Here is how to fix bug B3: Remember the file offset (using
f.tell
) just before the 2nd readline, and return that.
These are easy fixes, but they make the code a bit less elegant and a bit longer. The following solution also contains a few shortcuts to speed up execution in corner cases.
def bisect_left(f, x):
  """Return the smallest offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset
  is at the end of f, then don't forget to append a '\\n' before appending
  x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
  Returns:
    Smallest possible byte offset in f where to insert line x.
  """
  x = x.rstrip('\n')
  if not x:  # Shortcut. Don't do it for bisect_right.
    return 0
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0:  # Shortcut.
    return 0
  lo, hi, mid = 0, size - 1, 1
  while lo < hi:
    mid = (lo + hi) >> 1
    if mid > 0:
      f.seek(mid - 1)  # Just to figure out where our line starts.
      f.readline()  # Ignore previous line, find our line.
      midf = f.tell()
    else:
      midf = 0
    f.seek(midf)
    line = f.readline()  # We read at f.tell() == midf.
    # EOF (`not line') is always larger than any line we search for.
    if not line or x <= line.rstrip('\n'):  # `<' for bisect_right.
      hi = mid
    else:
      lo = mid + 1
  if mid == lo:  # Shortcut.
    return midf
  if lo <= 0:
    return 0
  f.seek(lo - 1)
  f.readline()
  return f.tell()
See the comments in the code above on how it can be changed to bisect_right (only one shortcut has to be removed, plus the usual change from <= to < in the line comparison).
Please note that we've dropped the arguments lo and hi, because the internal lo and hi variables are not meaningful to the caller: they don't start at a line boundary. It would be possible to add back support using manual forward or backward scanning for a \n in the beginning, but I omitted that here, because I couldn't find a compelling use case for such an ugly piece of code.
It looks like we've completely solved the problem. But in fact we haven't proven the correctness (i.e. that the functions return exactly those line offsets which their docstring describes), and we haven't yet optimized for I/O speed.
For the proof of correctness we can reduce this algorithm to the bisect_left for lists, and use the fact that that one is correct. For example, let's consider the input file:
ab
foo
world
zip
And then see that at which mid offset which line is read:
mid | midf | line
----+------+------
  0 |    0 | ab
  1 |    3 | foo
  2 |    3 | foo
  3 |    3 | foo
  4 |    7 | world
  5 |    7 | world
  6 |    7 | world
  7 |    7 | world
  8 |   13 | zip
  9 |   13 | zip
 10 |   13 | zip
 11 |   13 | zip
 12 |   13 | zip
 13 |   13 | zip
 14 |   17 | (EOF)
 15 |   17 | (EOF)
 16 |   17 | (EOF)
 17 |   17 | (EOF)
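The table above can be regenerated with a short script (a sketch using io.StringIO in place of a real file; midf_and_line is a hypothetical helper that mirrors the first half of the bisect_left loop):

```python
import io

def midf_and_line(f, mid):
    """Map a byte offset mid to the start offset (midf) and the
    contents of the line that bisect_left would compare against."""
    if mid > 0:
        f.seek(mid - 1)
        f.readline()  # Ignore previous line, find our line.
        midf = f.tell()
    else:
        midf = 0
    f.seek(midf)
    return midf, f.readline().rstrip('\n')

f = io.StringIO('ab\nfoo\nworld\nzip\n')  # 17 bytes.
for mid in range(18):
    midf, line = midf_and_line(f, mid)
    print('%3d | %4d | %s' % (mid, midf, line or '(EOF)'))
```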
Notice that each line (except for the first) is repeated as many times as the length of the previous line (including the trailing \n). It will turn out that the amount of duplication doesn't matter as long as no lines are removed.
The key observation: If we duplicate some strings in a sorted list, it still remains sorted, thus bisect_left and bisect_right give correct results (with the inflated indexes), no matter how many times the individual lines were duplicated. So what the offset-based bisect_left above does is that it emulates a longer, inflated list of lines, and does a binary search on it. The result is an offset within the inflated list, but, because we've fixed B3, before returning it, bisect_left converts the result to the offset the user is interested in (i.e. converting lo from a mid offset to a midf offset). To conclude the proof, we need to check that the correct translation also happens at EOF; and it indeed does (because we've fixed B2).
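The observation is easy to check with the list-based bisect_left: below, inflated repeats each line as many times as it appears in the mid column of the table above (a small self-contained sketch):

```python
from bisect import bisect_left

a = ['ab', 'foo', 'world', 'zip']
# Repeat each line as many times as it appears in the table above.
inflated = ['ab'] + ['foo'] * 3 + ['world'] * 4 + ['zip'] * 6

# inflated is still sorted, so bisect_left stays correct; its result
# just points into the run of duplicates corresponding to a's result.
for x in ('a', 'foo', 'fz', 'zip', 'zz'):
    i, j = bisect_left(a, x), bisect_left(inflated, x)
    assert (i == len(a)) == (j == len(inflated))  # EOF cases agree.
    if i < len(a):
        # Both searches land on the same line of the original file.
        assert a[i] == inflated[j]
```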
See below the full implementation of bisect_right as well, and also a new convenience function bisect_interval, which calls bisect_left for the interval start and bisect_right for the interval end. The seeking and line reading were abstracted away to make it possible to introduce caches and speedups later:
def _read_and_compare(cache, ofs, f, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with the
  line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term
  ``middle'' includes the offset of the trailing '\\n'.)

  If the line used starts at EOF, then tester won't be called, and True is
  used instead.

  Args:
    cache: Currently ignored. TODO: Speed it up by adding caching of
      previous test results.
    ofs: The offset in f to read from. If ofs is in the middle of a line,
      then the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f, and
    dummy is an implementation detail that can be ignored.
  """
  if ofs > 0:
    f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    fofs = f.tell()
  else:
    fofs = 0
  f.seek(fofs)
  line = f.readline()  # We read at f.tell() == fofs.
  # EOF (`not line') is always larger than any line we search for.
  return fofs, (not line or tester(line.rstrip('\n'))), ()


def bisect_way(f, x, is_left):
  """Return an offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset
  is at the end of f, then don't forget to append a '\\n' before appending
  x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    is_left: If true, emulate bisect_left. See the return value for more
      info.
  Returns:
    Byte offset in f where to insert line x. If is_left is true (i.e.
    bisect_left), then the smallest possible offset is returned, otherwise
    (i.e. bisect_right) the largest possible offset is returned.
  """
  x = x.rstrip('\n')
  if is_left and not x:  # Shortcut.
    return 0
  f.seek(0, 2)  # Seek to EOF.
  size = f.tell()
  if size <= 0:  # Shortcut.
    return 0
  if is_left:
    tester = x.__le__  # x <= y.
  else:
    tester = x.__lt__  # x < y.
  lo, hi, mid, cache = 0, size - 1, 1, []
  while lo < hi:
    mid = (lo + hi) >> 1
    midf, g, _ = _read_and_compare(cache, mid, f, tester)
    if g:
      hi = mid
    else:
      lo = mid + 1
  if mid != lo:
    midf = _read_and_compare(cache, lo, f, tester)[0]
  return midf


def bisect_right(f, x):
  """Return the largest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_right, but operates on lines rather than
  elements. Convenience function which just calls
  bisect_way(..., is_left=False).
  """
  return bisect_way(f, x, False)


def bisect_left(f, x):
  """Return the smallest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_left, but operates on lines rather than
  elements. Convenience function which just calls
  bisect_way(..., is_left=True).
  """
  return bisect_way(f, x, True)


def bisect_interval(f, x, y=None, is_open=False):
  """Return (start, end) offset pair for lines between x and y.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: First line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    y: Last line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present. If None, x is used.
    is_open: If true, then the returned interval consists of lines
      x <= line < y. Otherwise it consists of lines x <= line <= y.
  Returns:
    (start, end) offset pair containing lines between x and y (see arg
    is_open whether x and y are inclusive) in sorted file f. These offsets
    contain the lines: start <= ofs < end. Trailing '\\n's are included in
    the interval (except at EOF if there was none).
  """
  x = x.rstrip('\n')
  if y is None:
    y = x
  else:
    y = y.rstrip('\n')
  end = bisect_way(f, y, is_open)
  if is_open and x == y:
    return end, end
  else:
    # TODO: Speed up by ignoring everything after offset `end'.
    return bisect_way(f, x, True), end
All this is a long piece of code (partially because of the docstrings), but it doesn't contain any new ideas or any surprises.
The 2nd call to bisect_way in bisect_interval above seems to be doing unnecessary work to find the interval start offset, because we already know that it won't be larger than end. If it were possible to restrict the search to a prefix of the file, limiting it at a specific offset, bisect_interval would be faster. Let's introduce a size argument for that to each function. Only _read_and_compare changes significantly, the others just pass the size around:
def _read_and_compare(cache, ofs, f, size, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with the
  line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term
  ``middle'' includes the offset of the trailing '\\n'.)

  Bytes of f after offset `size' will be ignored. If a line spans over
  offset `size', it gets read fully (by f.readline()), and then truncated.

  If the line used starts at EOF (or at `size'), then tester won't be
  called, and True is used instead.

  Args:
    cache: Currently ignored. TODO: Speed it up by adding caching of
      previous test results.
    ofs: The offset in f to read from. If ofs is in the middle of a line,
      then the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f, and
    dummy is an implementation detail that can be ignored.
  """
  assert 0 <= ofs <= size
  if ofs:
    f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    fofs = min(size, f.tell())
  else:
    fofs = 0
    f.seek(0)
  assert 0 <= ofs <= fofs <= size
  g = True  # EOF is always larger than any line we search for.
  if fofs < size:
    line = f.readline()  # We read at f.tell() == fofs.
    if f.tell() > size:
      line = line[:size - fofs]
    if line:
      g = tester(line.rstrip('\n'))
  return fofs, g, ()


def bisect_way(f, x, is_left, size=None):
  """Return an offset where to insert line x into sorted file f.

  Bisection (binary search) on newline-separated, sorted file lines.
  If you use sort(1) to sort the file, run it as `LC_ALL=C sort' to make it
  lexicographically sorted, ignoring locale.

  If there is no trailing newline at the end of f, and the returned offset
  is at the end of f, then don't forget to append a '\\n' before appending
  x.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: Line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    is_left: If true, emulate bisect_left. See the return value for more
      info.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored. If None, then no limit.
  Returns:
    Byte offset in f where to insert line x. If is_left is true (i.e.
    bisect_left), then the smallest possible offset is returned, otherwise
    (i.e. bisect_right) the largest possible offset is returned.
  """
  x = x.rstrip('\n')
  if is_left and not x:  # Shortcut.
    return 0
  if size is None:
    f.seek(0, 2)  # Seek to EOF.
    size = f.tell()
  if size <= 0:  # Shortcut.
    return 0
  if is_left:
    tester = x.__le__  # x <= y.
  else:
    tester = x.__lt__  # x < y.
  lo, hi, mid, cache = 0, size - 1, 1, []
  while lo < hi:
    mid = (lo + hi) >> 1
    midf, g, _ = _read_and_compare(cache, mid, f, size, tester)
    if g:
      hi = mid
    else:
      lo = mid + 1
  if mid != lo:
    midf = _read_and_compare(cache, lo, f, size, tester)[0]
  return midf


def bisect_right(f, x, size=None):
  """Return the largest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_right, but operates on lines rather than
  elements. Convenience function which just calls
  bisect_way(..., is_left=False).
  """
  return bisect_way(f, x, False, size)


def bisect_left(f, x, size=None):
  """Return the smallest offset where to insert line x into sorted file f.

  Similar to bisect.bisect_left, but operates on lines rather than
  elements. Convenience function which just calls
  bisect_way(..., is_left=True).
  """
  return bisect_way(f, x, True, size)


def bisect_interval(f, x, y=None, is_open=False, size=None):
  """Return (start, end) offset pair for lines between x and y.

  Args:
    f: Seekable file object or file-like object to search in. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    x: First line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present.
    y: Last line to search for. Must not contain '\\n', except for maybe a
      trailing one, which will be ignored if present. If None, x is used.
    is_open: If true, then the returned interval consists of lines
      x <= line < y. Otherwise it consists of lines x <= line <= y.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored. If None, then no limit.
  Returns:
    (start, end) offset pair containing lines between x and y (see arg
    is_open whether x and y are inclusive) in sorted file f, before offset
    `size'. These offsets contain the lines: start <= ofs < end. Trailing
    '\\n's are included in the interval (except at EOF if there was none).
  """
  x = x.rstrip('\n')
  if y is None:
    y = x
  else:
    y = y.rstrip('\n')
  end = bisect_way(f, y, is_open, size)
  if is_open and x == y:
    return end, end
  else:
    return bisect_way(f, x, True, end), end
Please note that the 2nd call to bisect_way in bisect_interval above passes end instead of size; this is how the optimization is implemented: much of the work went into the dependencies, and only a single call argument changes in the final location.
In an external (file-based) binary search the slowest operation is the disk seek needed for each bisection. Most of the software and hardware components would be waiting for the hard disk to move the reading head. (Except, of course, that seek times are negligible when non-spinning storage hardware such as SSD or memory card is used.)
An out-of-the-box solution would be adding the data to a more disk-efficient key-value store. There are several programs providing such stores. Most of them are based on a B-tree, B*-tree or B+-tree data structure if sorted iteration and range searches have to be supported, or on disk-based hashtables otherwise. Some of the high-performance single-machine key-value stores: cdb (read-only), Tokyo Cabinet, Kyoto Cabinet, LevelDB; see more in the NoSQL software list.
The fundamental speed difference between a B-tree search and a binary search in a sorted list stems from the fact that B-trees have a branching factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a B-tree search reduces the possible input size by a factor larger than 2, while in a binary search each step reduces the input size by a factor of 2 only (i.e. we keep either the bottom half or the top half). So both kinds of searches are logarithmic, but the base of the logarithm is different, and this causes a constant factor difference in the disk seek count. By careful tuning of the constants in B-trees it's usual to have only 2 or 3 disk seeks for each search even for 100GB of data, while a binary search in such a large file with 50 bytes per record would need 31 disk seeks. Taking 10ms as the seek time (see more info about typical hard disk seek times), a typical B-tree search takes 0.03 second, and a typical binary search takes 0.31 second.
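The seek counts above can be reproduced with a bit of arithmetic (assuming a branching factor of 1000 for the B-tree; real stores also cache the top levels of the tree, which brings the B-tree down to the 2 or 3 seeks quoted above):

```python
import math

n_records = 100 * 10**9 // 50   # 100 GB of 50-byte records.
seek_time = 0.010               # 10 ms per disk seek.

# Binary search halves the input per seek: base-2 logarithm.
binary_seeks = math.ceil(math.log2(n_records))
# A B-tree with branching factor 1000 divides it by 1000 per seek.
btree_seeks = math.ceil(math.log(n_records, 1000))

print(binary_seeks, binary_seeks * seek_time)  # 31 seeks, ~0.31 s.
print(btree_seeks, btree_seeks * seek_time)    # 4 seeks without caching.
```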
In searches the disk seek time usually subsumes disk read times and other data processing times (such as comparing the strings in memory or copying data from one memory buffer to another). So if we want an in-the-box solution to improve the speed of our text file binary search, we should do as few seeks as possible. As it is now, the code does an f.seek(...) call for each bisection, and the design of the binary search doesn't let us do fewer, so it seems that there is no room for improvement. But there is: we can change the code so that not every f.seek(...) call results in a seek in the kernel (i.e. an lseek(2) system call).
Before making such a change, let's remember that most I/O libraries provide input buffering by default. (These include the Python file object, the Ruby file object, the files opened in Perl, the C <stdio.h> functions and the C++ std::istream class hierarchy. In Java one has to create a BufferedInputStream or a BufferedReader manually to get read buffering.) That is, each read operation reads at least 8KB of data (or configurable) from the kernel, and subsequent read operations read from the read buffer (i.e. the data already read but not consumed) instead of asking the kernel again. The purpose of this is to get rid of the system call latency, and not to reduce the number of disk seeks. Because of the need for context switching and between-kernel-and-userspace data copies, a system call latency (delay) occurs whenever the program calls the kernel. Even without this buffering in the program, the number of disk seeks would be equally small, because the kernel also does buffering (of at least 4KB), so if there is an lseek(2) system call followed by a read(2) system call to read data from elsewhere, and the new location is close enough to the old location, then the kernel can find and return the data from its buffers, bypassing the hard disk altogether. (On non-Unix systems the names of the system calls are different, but exactly the same happens, as viewed from our high-level perspective.)
We don't have to make any code changes to take advantage of this
kernel-level buffering, because the kernel does it for us by default. For
example, the last 12 f.seek(...)
calls in bisect_left
operate on the same 4096-byte region (plus the length of the last line),
which spans one or two 4KB blocks on disk, and since these blocks have been read recently (right in the same bisect_left invocation), the kernel returns them from its buffers. Thus the last 11 f.seek(...)
calls and their corresponding f.readline()
calls don't need a
disk seek or a disk read (unless the last line of the block is very long),
and thus they will be very fast. Again, we didn't have to make any effort in
our code to get this speed benefit.
There is another benefit though for which we have to improve our code. This is getting rid of the system call latency. If the data is in the kernel buffers already (or it is being read from a fast SSD with at least 50000 read IOPS (I/O operations per second)), then the bottleneck is the system call overhead. (In interpreted programming languages such as Python there can also be a CPU bottleneck because of the slow interpreter. We should also optimize for code length because of that.)
So our goal now is to reduce the number of system calls. In Python,
f.tell()
doesn't generate a system call (except for the very
first time it's called on f
), so we have to reduce the number of
times f.readline()
and f.seek(...)
is called. There
is a simple first step on that road: before each f.seek(...)
,
call f.tell()
to figure out where we are, and omit the seek call
if we happen to be at the right position already.
The implementation of this optimization is straightforward, it just adds
two if
s to _read_and_compare
:
def _read_and_compare(cache, ofs, f, size, tester):
  """..."""
  assert 0 <= ofs <= size
  if ofs:
    if f.tell() != ofs - 1:  # Avoid lseek(2) call if not needed.
      f.seek(ofs - 1)  # Just to figure out where our line starts.
    f.readline()  # Ignore previous line, find our line.
    # Calling f.tell() is cheap, because Python remembers the lseek(2)
    # retval.
    fofs = min(size, f.tell())
  else:
    fofs = 0
    if size and f.tell():  # Avoid lseek(2) call if not needed.
      f.seek(0)
  assert 0 <= ofs <= fofs <= size
  g = True  # EOF is always larger than any line we search for.
  if fofs < size:
    line = f.readline()  # We read at f.tell() == fofs.
    if f.tell() > size:
      line = line[:size - fofs]
    if line:
      g = tester(line.rstrip('\n'))
  return fofs, g, ()
A generic idea to reduce the number of system calls is to add a cache to the program, and fetch results from the cache instead of the kernel whenever possible. Binary search usually doesn't benefit from caching, because it never needs the same piece of data again, but our implementation does: as soon as the remaining input is very short (just a few lines), it reads the same lines over and over again, with the same midf, but with a different mid. To illustrate this, let's imagine an input file which contains a single 64-byte line, and the caller is looking for a line larger than that. Since it's a binary search and 2⁶ = 64, bisect_left will need 6 (or one more) attempts to find it. Each time it will use a different mid, but the same midf = 64. Remember that midf is the offset we get after reading and ignoring a partial line after the offset mid.
To get rid of these last few f.readline() calls, we could add a cache that maps midf values to their corresponding line strings. In the example, the entry (midf = 64, line = EOF) would be added to the cache, and then it would be used for the rest of the calls, thus eliminating 6 f.readline() calls. (If the caller was looking for the empty string, this cache wouldn't help.) In fact, we don't have to record the whole line in the cache, it's enough to remember the comparison result (g) against x, the line the caller is looking for. That's because the binary search does only such comparisons with the file lines, it doesn't use them for anything else.
So this cache can get rid of some of the 2nd f.readline()
calls
in _read_and_compare
. But what about the 1st calls, i.e. those
which figure out midf from mid?
In the _read_and_compare implementation, fofs corresponds to midf and ofs corresponds to mid, so let's use this naming. Luckily, we can eliminate some of the 1st calls by caching ofs in addition to fofs. Let's suppose we have an entry (ofs = 42, fofs = 55, g = False) in the cache, and _read_and_compare(ofs=48) is called. We can use the cached result in this cache entry, because the cache entry describes a line whose end is at offset 55, and offset 42 is somewhere in the middle of that line. Since 42 ≤ 48 ≤ 55, offset 48 is also somewhere in the middle of the same line, so we can use the cached result, g = False. The value of g is not essential here, it would work the same way with g = True.
Here is an implementation of such a cache (don't worry if you don't get it on the first reading, it will be explained afterwards):
def _read_and_compare(cache, ofs, f, size, tester):
  """Read a line from f at ofs, and test it.

  Finds out where the line starts, reads the line, calls tester with the
  line with newlines stripped, and returns the results.

  If ofs is in the middle of a line in f, then the following line will be
  used, otherwise the line starting at ofs will be used. (The term
  ``middle'' includes the offset of the trailing '\\n'.)

  Bytes of f after offset `size' will be ignored. If a line spans over
  offset `size', it gets read fully (by f.readline()), and then truncated.

  If the line used starts at EOF (or at `size'), then tester won't be
  called, and True is used instead.

  A cache of previous offsets and test results is read and updated. The
  size of the cache is bounded (it contains at most 4 offsets and 2 test
  results).

  Args:
    cache: A cache containing previous offsets and test results. It's a
      list of lists. An empty cache is the empty list, so initialize it to
      [] before the first call. len(cache) is 0, 1, or 2. Each entry in
      cache is of the form [fofs, g, ofs].
    ofs: The offset in f to read from. If ofs is in the middle of a line,
      then the following line will be used.
    f: Seekable file object or file-like object to read from. The methods
      f.tell(), f.seek(ofs_arg) and f.readline() will be used.
    size: Size limit for reading. Bytes in f after offset `size' will be
      ignored.
    tester: Single-argument callable which will be called for the line, with
      the trailing '\\n' stripped. If the line used is at EOF, then tester
      won't be called and True will be used as the result.
  Returns:
    List or tuple of the form [fofs, g, dummy], where g is the test result
    (or True at EOF), fofs is the start offset of the line used in f, and
    dummy is an implementation detail that can be ignored.
  """
  assert len(cache) <= 2
  assert 0 <= ofs <= size
  if cache and cache[0][2] <= ofs <= cache[0][0]:
    cache.reverse()  # Move cache[0] to the end since we've just fetched it.
  elif len(cache) > 1 and cache[-1][2] <= ofs <= cache[-1][0]:
    pass  # We've found cache[-1] (same index as cache[1]).
  else:
    if ofs:
      if f.tell() != ofs - 1:  # Avoid lseek(2) call if not needed.
        f.seek(ofs - 1)  # Just to figure out where our line starts.
      f.readline()  # Ignore previous line, find our line.
      # Calling f.tell() is cheap, because Python remembers the lseek(2)
      # retval.
      fofs = min(size, f.tell())
    else:
      fofs = 0
    assert 0 <= ofs <= fofs <= size
    if cache and cache[0][0] == fofs:
      cache.reverse()  # Move cache[0] to the end since we've just fetched it.
      if cache[-1][2] > ofs:
        cache[-1][2] = ofs
    elif len(cache) > 1 and cache[-1][0] == fofs:
      if cache[-1][2] > ofs:
        cache[-1][2] = ofs
    else:
      g = True  # EOF is always larger than any line we search for.
      if fofs < size:
        if not fofs and f.tell():
          f.seek(0)
        line = f.readline()  # We read at f.tell() == fofs.
        if f.tell() > size:
          line = line[:size - fofs]
        if line:
          g = tester(line.rstrip('\n'))
      if len(cache) > 1:  # Don't keep more than 2 items in the cache.
        del cache[0]
      cache.append([fofs, g, ofs])
  return cache[-1]  # Return the most recent item of the cache.
Fortunately, we could make this improvement without changing any other functions. This was not a coincidence: an oracle was guiding our code and interface design. (Another popular explanation is that the real final code was written earlier than the code evolution steps in this article.)
Even without understanding the full logic of the implementation above, we can easily observe that the size of the cache never increases above 2 entries. (Just before it is about to increase, we remove an entry.)
Another, less obvious observation is that this is an
LRU
cache, i.e. it discards the least recently used entry on overflow. Thus
it keeps the 2 most recently used entries. The cache.reverse()
calls keep the cache in usage order: whenever we use entry 0, we call
cache.reverse()
to swap it with entry 1. (If there is no other
entry besides entry 0, cache.reverse()
has no effect. How lucky.)
del cache[0]
is used for removal, which removes the LRU entry,
and cache.append(...)
is used for adding, which appends to the
end, making the newly added entry the most recently used.
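The 2-entry LRU discipline can be seen in isolation in the following sketch. (The helper names cache_add and cache_touch are made up for this illustration; the real code inlines these operations.)

```python
def cache_touch(cache):
  # Mark entry 0 as just used: swap it to the end, the MRU position.
  # With 0 or 1 entries, reverse() has no effect.
  cache.reverse()

def cache_add(cache, entry):
  # Evict the least recently used entry (entry 0) on overflow, then
  # append the new entry at the end, making it the most recently used.
  if len(cache) > 1:  # Don't keep more than 2 items in the cache.
    del cache[0]
  cache.append(entry)

cache = []
cache_add(cache, 'A')    # cache == ['A']
cache_add(cache, 'B')    # cache == ['A', 'B'], 'A' is the LRU
cache_touch(cache)       # 'A' was just used: cache == ['B', 'A']
cache_add(cache, 'C')    # 'B' (the LRU) is evicted: cache == ['A', 'C']
print(cache)  # ['A', 'C']
```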
Having understood these properties, the other aspects of the implementation are easy to follow. First we try to fetch from the cache by ofs. If that fails, we call f.readline() to compute fofs. Then, if we match a cache entry by fofs, we try to decrease the ofs of that existing cache entry. If there was no match by fofs, we read the line, compare it, and add the result as a new cache entry.
One question remains: wouldn't we benefit from more than 2 cache entries?
If we limited the cache size to 3 or 4 instead, code execution speed
wouldn't degrade much, and we could potentially get rid of more
f.readline()
and f.seek()
calls. The obvious
disadvantage is a small increase in complexity. Most of the time we don't need more than 2 entries, because the binary search probes lines far away from each other, except, of course, near the end of the search, when it has already hit the correct line (though typically not at its beginning). In that case 2 cache entries (one for the correct line and one for the next or previous line, from which we're trying to distinguish it) are enough. Even with very short (less than 8 bytes) lines, in 13% of the cases the cache eliminated all 3 calls, and in an additional 20% of the cases only the 2nd f.readline() call was eliminated. Measuring and explaining the additional, potentially small, speed benefits of cache sizes 3 and 4 is left as an exercise to you.
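As a starting point for that exercise, one could wrap the file object in a small call counter. (CountingFile is a made-up helper for this sketch, not part of pts-line-bisect; passing it in place of f to _read_and_compare would reveal how many calls a larger cache eliminates.)

```python
import io

class CountingFile(object):
  """Wraps a seekable file-like object and counts seek/readline calls."""

  def __init__(self, f):
    self.f, self.seek_count, self.readline_count = f, 0, 0

  def tell(self):
    return self.f.tell()

  def seek(self, ofs):
    self.seek_count += 1
    self.f.seek(ofs)

  def readline(self):
    self.readline_count += 1
    return self.f.readline()

cf = CountingFile(io.StringIO('bar\nbaz\nfoo\n'))
cf.seek(4)       # Jump to the start of the 2nd line.
line = cf.readline()  # Reads 'baz\n'.
print(cf.seek_count, cf.readline_count)  # 1 1
```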
In addition to the Python code, this text file binary search algorithm has been implemented in C as well. The C implementation is mostly a direct transliteration of the Python one, for additional reductions in CPU and memory usage, and for increased portability to memory-constrained systems (such as Linux-based routers) where a Python interpreter is too heavy-weight. It is also the preferred benchmark subject: if anyone claims to have written a faster text file binary search implementation, its speed should be compared to this C implementation's.
Differences in the C and Python implementations:
To get the lines bar and foo, and everything in between, one can call bisect_interval(f, "bar", "foo~"), where ~ is a byte with a character code larger than any other byte in the file. If there is no such byte (because the file contains all possible bytes), the string comparison can be special-cased to return true at the end of x. The C implementation does this.
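The same trick can be illustrated with Python's standard bisect module on an in-memory sorted list. (This sketch is an illustration only, not the file-based bisect_interval of the project; it assumes no line contains a byte larger than '~', ASCII 126.)

```python
import bisect

# Sorted lines, as if read from the file. None of them contains a byte
# with a character code larger than '~'.
lines = ['apple', 'bar', 'barn', 'baz', 'foo', 'foobar', 'zebra']
lo = bisect.bisect_left(lines, 'bar')   # Index of the first line >= 'bar'.
hi = bisect.bisect_left(lines, 'foo~')  # 'foo~' sorts above every foo* line.
print(lines[lo:hi])  # ['bar', 'barn', 'baz', 'foo', 'foobar']
```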
Of course, both implementations support long files (i.e. files longer than 2 GiB, whose size doesn't fit into 31 bits) if the underlying operating system and filesystem support them. Python file objects support long files by default, and in the C implementation _FILE_OFFSET_BITS is #defined to 64, which is respected by glibc on Linux (and possibly by other systems as well).
You can name it your work or somebody else's work, at your discretion.
Leave your comment about this article on my blog. Constructive criticism is also welcome. Constructively going into raptures over it is even more welcome.