Week 4 Readings
IIR Ch 4
The two block write algorithms are interesting. BSBI is elegant and simple, while SPIMI eliminates an unnecessary pass through the data to sort terms. MapReduce is fast and highly scalable. One aspect not touched on here is what to do if the master fails. It is, after all, built from the same cheap parts as each parser and inverter. My guess is that there are several layers of redundancy, where each backup master only monitors the performance of the cluster's output to disk, ready to kick in if needed.
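The SPIMI idea can be sketched in a few lines. This is my own toy version of one block's inversion, not the book's pseudocode: terms are accumulated into a hash dictionary as they stream by, so no separate sort of term-docID pairs is needed, and sorting happens only once when the block is written out.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """One SPIMI block: build an in-memory inverted index from a
    stream of (term, doc_id) pairs, sorting terms only at the end."""
    dictionary = defaultdict(list)  # term -> postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]        # new terms get an empty list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)        # skip duplicate docIDs
    # the single sort happens here, when the block is flushed
    return sorted(dictionary.items())

block = [("cat", 1), ("dog", 1), ("cat", 2), ("ant", 3)]
print(spimi_invert(block))  # [('ant', [3]), ('cat', [1, 2]), ('dog', [1])]
```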
Of the two options presented for dynamic indexing, neither seems like an ideal solution. Rebuilding the entire index from scratch means that there will be queries where a better answer could have been found, but the user was unfortunate enough to ask before the rebuild completed. As a fix for this problem, the book offers keeping the main index on disk alongside an auxiliary, temporary index in memory. However, this requires each query to hit both indexes, merge the results, and check them against a deletion list, also in memory. That makes for slower query processing, and that is a cardinal sin!
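The extra work per query is easy to see in a sketch. This is a simplified model of the scheme as I understand it (the dict-based indexes and set-based deletion list are stand-ins, not the book's structures):

```python
def search(term, main_index, aux_index, deleted_docs):
    """Query both the on-disk main index and the in-memory auxiliary
    index, merge the postings, then filter out deleted documents."""
    postings = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(postings - deleted_docs)

main = {"apple": [1, 2, 5]}   # large index on disk
aux = {"apple": [7]}          # newly added document, in memory
deleted = {2}                 # document 2 was removed
print(search("apple", main, aux, deleted))  # [1, 5, 7]
```

Every query pays for two lookups plus the merge and the deletion check, which is exactly the slowdown complained about above.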
IIR Ch 5
The benefit of compression is not in saving disk space. Disks are cheap and always getting cheaper. Compression means that more can be cached in memory, meaning fewer disk seeks and faster return of results for those queries that are cached. Similarly, each transfer from disk is so much smaller than an uncompressed result that the time savings are enormous.
It's interesting to think of casefolding and stemming as compression techniques, because they were first presented as measures to assist in recall. But since each also means that one more distinct term will not be stored in the dictionary, this is a compression of the index as a whole.
Heaps' law states that the larger a collection grows, the larger the term dictionary grows, another indicator that compression is necessary.
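The law itself is a simple power law, M = k * T^b, where M is the vocabulary size and T the number of tokens. A quick sketch, using the parameter values IIR reports fitting to Reuters-RCV1 (k = 44, b = 0.49) — those constants are collection-specific, not universal:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.49):
    """Heaps' law: estimated vocabulary size M = k * T^b for a
    collection of T tokens. k=44, b=0.49 are the RCV1 fit from IIR."""
    return k * num_tokens ** b

# a million-token collection already implies a dictionary of ~38,000 terms
print(round(heaps_vocab_size(1_000_000)))
```

Since b < 1 the dictionary grows sublinearly, but it never stops growing, which is the point the chapter makes about compression.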
The dictionary can be compressed significantly through a variety of techniques. My favorite of these is front coding. It reminds me of JPEG compression, and it makes a lot of sense to me. Blocking is not a big improvement over simply storing all terms as one long string, but in a very large dictionary or a very small memory, every bit helps.
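Front coding is simple enough to demonstrate. A minimal sketch of the idea (my own encoding format, not the book's exact byte layout): within a block, store the first term fully, then each later term as a shared-prefix length plus the remaining suffix.

```python
def front_code(terms):
    """Front-code a sorted block of terms: first term in full, then
    (shared-prefix length, suffix) for each subsequent term."""
    encoded = [(0, terms[0])]
    for prev, term in zip(terms, terms[1:]):
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1                      # length of the common prefix
        encoded.append((n, term[n:]))   # store only what differs
    return encoded

print(front_code(["automata", "automate", "automatic", "automation"]))
# [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```

Because sorted dictionary terms share long prefixes, most entries shrink to a small count plus a short suffix.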
Variable byte encoding and bit-level encoding are more difficult than this book lets on. I do not have a firm enough grasp of binary math to follow the way that encoding gaps this way makes things faster.
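Working through variable byte encoding in code helped me a little. Each byte carries 7 payload bits, and the high bit is set only on the last byte of a number, so small gaps (the common case) cost just one byte. This sketch reproduces the book's worked example of encoding 214577:

```python
def vb_encode(gap):
    """Variable byte encoding of one gap: 7 payload bits per byte,
    with the high bit set only on the final (least significant) byte."""
    out = []
    while True:
        out.insert(0, gap % 128)   # prepend the low 7 bits
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                 # continuation flag on the last byte
    return bytes(out)

print(list(vb_encode(5)))       # [133]  (5 + 128, a single byte)
print(list(vb_encode(214577)))  # [13, 12, 177], matching the IIR example
```

The speedup argument, as I understand it, is that smaller encoded gaps mean more postings per disk read and per cache line, which outweighs the decode cost.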