Currently, Suneido has a database size limit of roughly 1 gb. A few years ago this seemed huge and hardly worth thinking about as a limit. But data and drives continue to grow, and this limit is starting to look like it could be a problem. Already we have customers with databases approaching 500 mb. My thinking was that we would overcome this limit by moving to 64 bit CPUs and operating systems. And that’s still an option. But 32 bit CPUs and operating systems are going to be around for a while longer, and not everyone is going to want to move to 64 bit systems. It would be nice to come up with a way to raise the 1 gb limit on 32 bit systems.
The current 1 gb size limit is a result of accessing the database as a memory mapped file and mapping the entire file into memory. 32 bit addresses result in a 4 gb address space. Windows normally takes half of this, leaving applications with a 2 gb address space. But in order to rebuild or compact a database you need room for two databases, resulting in the 1 gb limit. (On Windows 95/98/ME the 2 gb is shared by all applications, but on NT/XP/2000/2003 each application gets its own 2 gb.)
The big advantage of mapping the entire database file into memory is that it simplifies the code a lot. You don’t have to worry about reading or writing files or caching. All this is handled automatically by the hardware/operating system virtual memory system. The usual way to get around the address space limitation is to only map portions of the file into memory. But how do you manage this so you still get the advantages of memory mapped access and don’t end up mapping records in and out the same way you’d read and write them with conventional file access?
The Insight
My key insight was that the current garbage collection system could be used to determine which portions of the memory mapping are no longer in use, the same way it determines which portions of memory are no longer in use – by determining when they are no longer referenced. But this idea, although key, is only one piece of the solution.
Currently, to convert a database file offset to a memory address, you simply add the base address of the memory map. This is simple and fast. But if we’re mapping sections of the file into different memory areas this has to be handled differently. If we manage the file as fixed-size “chunks” then we can have an array of memory addresses, one per chunk (or NULL if the chunk hasn’t been mapped into memory). Then to convert a file offset to a memory address we simply index into this array. If the value is NULL, we map that chunk into memory and fill in its address in the table.
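As a rough illustration (not Suneido’s actual code), the translation might look something like this. The names CHUNK_SIZE, chunks, os_map_chunk, and adr are invented, and os_map_chunk stands in for whatever MapViewOfFile/mmap call does the real work:

    #include <cstddef>
    #include <vector>

    const size_t CHUNK_SIZE = 64 * 1024;        // chunk size is a tuning parameter (see Chunk Size below)

    std::vector<char*> chunks;                  // one entry per file chunk, sized to the file; null = not yet mapped

    char* os_map_chunk(size_t chunk);           // hypothetical wrapper that maps one aligned file chunk

    char* adr(size_t offset)                    // convert a database file offset to a memory address
        {
        size_t c = offset / CHUNK_SIZE;         // which chunk the offset falls in
        if (chunks[c] == nullptr)
            chunks[c] = os_map_chunk(c);        // map the chunk on first use
        return chunks[c] + offset % CHUNK_SIZE; // address within that chunk
        }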
Large Records
But what if the record we’re accessing is larger than the chunk size? To keep the rest of the code simple, we want the entire record mapped to a contiguous region of memory. (So we can’t just map multiple chunks independently.) My first thought was that we could make the chunk size large (e.g. 4 mb) and simply say that records can’t exceed this size. Apart from this being an ugly solution, it doesn’t work since even a small record could straddle a chunk boundary. (You can’t just start the chunk at the start of the record because the chunks need to be aligned to allow simple offset to address translation.)
The answer is to map enough chunks at once to cover the entire record. This is easy enough, but it requires passing in the record size to the offset-to-address translation.
Another complication is that we may already have mapped one of the chunks in the range required by the current record. In this case we can simply create a new mapping and store it in the translation array, overwriting the previous, smaller mapping(s). The overlapping mappings will be at different memory addresses, and the garbage collection doesn’t rely on the translation array, so this shouldn’t cause any problems.
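A hedged sketch of how the size-aware translation might look, extending the one above. os_map_range is an assumed helper that maps a run of consecutive file chunks into one contiguous block of address space:

    #include <cstddef>
    #include <vector>

    const size_t CHUNK_SIZE = 64 * 1024;

    std::vector<char*> chunks;                  // translation array, as above

    char* os_map_range(size_t first_chunk, size_t nchunks);  // hypothetical: maps nchunks consecutive
                                                             // file chunks into one contiguous block

    char* adr(size_t offset, size_t size)       // size = size of the record being accessed
        {
        size_t first = offset / CHUNK_SIZE;
        size_t last = (offset + size - 1) / CHUNK_SIZE;      // the record may straddle chunk boundaries
        bool contiguous = chunks[first] != nullptr;
        for (size_t c = first + 1; c <= last && contiguous; ++c)
            if (chunks[c] != chunks[first] + (c - first) * CHUNK_SIZE)
                contiguous = false;
        if (! contiguous)
            {
            // map the whole span in one piece, overwriting any previous,
            // smaller mappings for these chunks in the translation array;
            // the old mappings remain valid until the garbage collector
            // finds them unreferenced
            char* base = os_map_range(first, last - first + 1);
            for (size_t c = first; c <= last; ++c)
                chunks[c] = base + (c - first) * CHUNK_SIZE;
            }
        return chunks[first] + offset % CHUNK_SIZE;
        }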
Garbage Collection
So now we have a way to get portions of the database file mapped into memory. But the original problem is that the database file may be larger than the address space. So eventually, we need to un-map some chunks so we can map in other ones. That brings us back to the idea that started this proposal – to use the existing garbage collection to find unreferenced chunks.
Suneido uses a conservative garbage collector. It scans memory looking for values that look like pointers to memory, assumes they are valid references, and recycles any memory that is unreferenced. We can modify this scanning to also look for pointers to chunks. Any chunks that are not referenced can be un-mapped so the address space can be reused for other chunks.
The first requirement is for a way to identify pointers to chunks. The garbage collector scans a lot of pointers so this has to be fast. The memory heap is in a contiguous region of memory, so a simple range check suffices. But since we’re mapping chunks independently, we don’t know the range. One solution would be to maintain a minimum and maximum address as we map chunks. The problem with this is that we couldn’t assume the entire range was being used for chunks – something else might have allocated within the range. One approach might be to keep a bitmap (with one bit per chunk of address space) recording which parts of the range are actually being used for chunks.
Suneido’s garbage collector uses a modified mark-sweep algorithm with a bitmap for the marks. A secondary bitmap is used to track the starts of the blocks. We’ll need similar bitmaps to mark referenced chunks and to track the starts of groups of chunks mapped together.
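Something along these lines could hook into the scan. All the names (chunk_base, chunk_space, referenced, starts, note_possible_chunk_pointer) are invented for illustration, and the sizes are just the figures discussed under Chunk Size below:

    #include <bitset>
    #include <cstddef>

    const size_t CHUNK_SIZE = 64 * 1024;
    const size_t ADDR_SPACE = 256 * 1024 * 1024;    // address space budget for chunk mappings
    const size_t NCHUNKS = ADDR_SPACE / CHUNK_SIZE;

    char* chunk_base;                               // lowest address reserved for chunk mappings

    std::bitset<NCHUNKS> chunk_space;               // which parts of that range really hold our chunks
    std::bitset<NCHUNKS> referenced;                // set during a GC scan for chunks still pointed to
    std::bitset<NCHUNKS> starts;                    // first chunk of each (possibly multi-chunk) mapping,
                                                    // consulted when unmapping so a mapping stays whole

    // called by the garbage collector's scan for each word that looks like a pointer
    void note_possible_chunk_pointer(void* p)
        {
        char* q = static_cast<char*>(p);
        if (q < chunk_base || q >= chunk_base + ADDR_SPACE)
            return;                                 // outside our range - a quick reject
        size_t i = (q - chunk_base) / CHUNK_SIZE;
        if (! chunk_space[i])
            return;                                 // something else allocated this part of the range
        referenced[i] = true;                       // keep this chunk (and its whole mapping) mapped
        }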
Least Recently Used
The next question is when to unmap chunks. With memory, the garbage collector recycles unreferenced blocks as soon as it finds them. But with the database, it makes more sense to unmap chunks as late as possible, in case they are needed again. So we don’t unmap until we run out of address space (probably when we reach a preset limit like 500 mb). But even then, we only need to unmap as much as we need. Ideally, we’d do this on a least recently used (LRU) basis. But how do we keep track of what’s been least recently used? The garbage collection process simply tells us which chunks are no longer referenced. The usual way to implement LRU is to keep a list of items. When an item is used, it’s moved to the end of the list; items are re-used from the other end. However, there’s no easy way to keep a list of the unreferenced chunks in such a way that we can easily find and move them when they are re-referenced.
Clock Algorithm
Instead, we can use a “clock” algorithm that approximates LRU. Think of the address space as organized around a circular clock face. We keep a “recently used” bitmap; a chunk’s bit is set whenever the chunk is referenced. The hand of the clock is a pointer to the next memory address to be examined. If the next chunk has been recently used, we clear its recently used bit and skip it. If the chunk has not been recently used, then we unmap it. We continue advancing until we’ve unmapped sufficient space for the current request. (And don’t forget we have to skip over any portions that something else may have allocated within our range.)
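A minimal sketch of the sweep, with invented names, and with error handling (for instance, the case where nothing at all is evictable) omitted. Only chunks the GC scan found unreferenced are treated as candidates:

    #include <bitset>
    #include <cstddef>

    const size_t CHUNK_SIZE = 64 * 1024;
    const size_t ADDR_SPACE = 256 * 1024 * 1024;
    const size_t NCHUNKS = ADDR_SPACE / CHUNK_SIZE;

    std::bitset<NCHUNKS> chunk_space;               // parts of the range that are ours
    std::bitset<NCHUNKS> referenced;                // filled in by the GC scan (see above)
    std::bitset<NCHUNKS> recently_used;             // a chunk's bit is set each time adr() touches it
    size_t hand = 0;                                // the clock hand: next chunk to examine

    void os_unmap_chunk(size_t chunk);              // hypothetical UnmapViewOfFile/munmap wrapper

    void evict(size_t chunks_needed)                // free roughly-least-recently-used chunks
        {
        size_t freed = 0;
        while (freed < chunks_needed)
            {
            size_t i = hand;
            hand = (hand + 1) % NCHUNKS;            // advance the hand, wrapping around the "clock"
            if (! chunk_space[i])
                continue;                           // skip address space something else allocated
            if (referenced[i])
                continue;                           // the GC still sees pointers into this chunk
            if (recently_used[i])
                recently_used[i] = false;           // give it a second chance and move on
            else
                {
                os_unmap_chunk(i);                  // not recently used - reclaim its address space
                chunk_space[i] = false;
                ++freed;
                }
            }
        }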
Unmapping
If we only ever dealt with single chunks, we could simply unmap one chunk and then map the new one. But because of the need for multiple chunks mapped together, unmapping becomes more complex. In order to map a number of chunks together, we need to unmap either another group of chunks at least as big, or enough adjacent chunks to make a contiguous region of that size. The best approach may be to use another bitmap. We can work through the chunks in (approximately) least recently used order, marking the bitmap, until we get a big enough region. Then we can make a second pass, unmapping the chunks that are part of that region. It may be worthwhile to optimize the common case of a single chunk by simply taking the first suitable single chunk and using it.
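One possible shape for this, again with invented names. eviction_order is assumed to be the candidate chunks in roughly LRU order (e.g. as produced by the clock sweep), and the second pass here walks the chosen region rather than the candidate list, which amounts to the same thing:

    #include <bitset>
    #include <cstddef>
    #include <vector>

    const size_t NCHUNKS = 4096;                    // address space budget / chunk size, as above

    std::bitset<NCHUNKS> evictable;                 // the extra bitmap: candidates marked so far
    std::bitset<NCHUNKS> chunk_space;               // which chunks of address space are currently ours

    void os_unmap_chunk(size_t chunk);              // hypothetical UnmapViewOfFile/munmap wrapper

    // free a run of `needed` adjacent chunks, returning the first chunk of the run
    size_t make_room(const std::vector<size_t>& eviction_order, size_t needed)
        {
        for (size_t i : eviction_order)             // roughly least recently used first
            {
            evictable[i] = true;                    // first pass: mark the candidate in the bitmap
            size_t lo = i, hi = i;                  // see how big a marked run this creates
            while (lo > 0 && evictable[lo - 1])
                --lo;
            while (hi + 1 < NCHUNKS && evictable[hi + 1])
                ++hi;
            if (hi - lo + 1 >= needed)
                {
                for (size_t c = lo; c < lo + needed; ++c)
                    if (chunk_space[c])
                        {
                        os_unmap_chunk(c);          // second pass: unmap only the chosen region
                        chunk_space[c] = false;
                        }
                return lo;
                }
            }
        return NCHUNKS;                             // no big enough run - caller must handle this
        }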
Chunk Size
What is a reasonable chunk size? The lower limit would be one virtual memory page, commonly 4 kb. However, the smaller the chunk size, the larger our translation array and bitmaps will be (for given database and memory sizes). For example, for a 1 gb database with 4 kb chunks, the translation array will need 256 k entries. Each entry is a 4 byte (32 bit) address so the array will be 1 mb. Doubling the chunk size will halve the array size; 16 kb chunks will result in 64 k entries, i.e. a 256 kb array. The bitmaps are smaller. If we limit the address space to 256 mb (there’s probably no point in using more address space than physical memory) then a bitmap will be 256 mb / 4 kb = 64 k bits / 8 bits per byte = 8 kb. As with the translation array, doubling the chunk size will halve the bitmap size.
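For reference, the same arithmetic written out as constants (illustrative names, 32 bit pointers assumed):

    #include <cstddef>

    const size_t DB_SIZE    = 1024u * 1024 * 1024;          // 1 gb database
    const size_t ADDR_SPACE =  256u * 1024 * 1024;          // 256 mb of address space for chunks
    const size_t CHUNK_SIZE =    4u * 1024;                 // 4 kb chunks

    const size_t table_entries = DB_SIZE / CHUNK_SIZE;      // 256 k entries
    const size_t table_bytes   = table_entries * 4;         // 1 mb translation array (4 byte pointers)
    const size_t bitmap_bytes  = ADDR_SPACE / CHUNK_SIZE / 8;  // 8 kb per bitmap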
4 gb or More?
Wait a minute, we still only have 32 bit file offsets, so we’re still limited to a 4 gb database size. Is all this work and complexity justified for this moderate increase in the limit? We could change the database format to use bigger offsets (e.g. 48 or 64 bits) but that would be a major change and would take up more database space. Luckily, there’s a simpler solution. If we align all the database records to, e.g. 8 byte boundaries, then the least significant three bits of offsets will always be zero. So we can shift the offsets three bits right, giving us three extra bits to use to increase the maximum to 32 gb, which should be sufficient for the near future. We can always increase the alignment to increase the maximum, e.g. to 16 bytes for a maximum of 64 gb. The tradeoff is additional overhead. On average, N byte alignment will result in N / 2 bytes wasted per record. The percentage of overhead will depend on the average record size. For example, with 8 byte alignment, resulting in 4 bytes average overhead, if our average record size is 128 bytes, the overhead will be 4 / 128 or about 3 %. The bigger the average record size, the less the percentage of overhead.
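A sketch of the offset translation this implies, with invented names; the real change would of course touch everywhere offsets are read and written:

    #include <cstdint>

    const int SHIFT = 3;                            // log2 of the 8 byte alignment

    // a stored (32 bit) offset is the real file offset shifted right by SHIFT,
    // so 32 bits can address up to 4 gb << SHIFT = 32 gb of database file

    uint32_t store_offset(uint64_t file_offset)     // file_offset must be 8 byte aligned
        {
        return static_cast<uint32_t>(file_offset >> SHIFT);
        }

    uint64_t file_offset(uint32_t stored)
        {
        return static_cast<uint64_t>(stored) << SHIFT;
        }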