Week 11 Muddiest Point
Is there any more recent data on the language spoken by Internet users and the language of pages? How do you even collect this data?
Images can be seen as a universal language (well, almost). Can tagging webpages with images be a good way to get at the topic of a page, thereby bypassing the semantic issues and the need for things like Darwish's probabilistic analysis? Has anyone tried this yet?
Monday, March 31, 2014
Wednesday, March 26, 2014
Week 11 Readings
IES Ch 14
Parallel query processing is conceptually simple: add more machines that accept queries, and give each a full copy of the index. Query throughput then scales roughly linearly with the number of machines. Of course, if the index is too large to fit on one machine this is not possible, so intra-query parallelism is more common: each machine holds only a partition of the index, and the terms of a query are routed to the machines that hold them.
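The term-partitioned scheme above can be sketched in a few lines. This is a toy single-process model, not anything from the chapter: the shard count, hashing scheme, and function names are all my own assumptions.

```python
# A minimal sketch of term-partitioned query routing: each "shard"
# (machine) holds the postings for the terms hashed to it, and a
# query's terms are dispatched to the shards that own them.

from collections import defaultdict

NUM_SHARDS = 4

def shard_for(term: str) -> int:
    """Assign a term to a shard by hashing."""
    return hash(term) % NUM_SHARDS

def build_shards(index: dict[str, list[int]]) -> list[dict[str, list[int]]]:
    """Split a full term -> postings index across NUM_SHARDS machines."""
    shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
    for term, postings in index.items():
        shards[shard_for(term)][term] = postings
    return shards

def query(shards, terms):
    """Route each query term to its shard and intersect the postings."""
    results = None
    for term in terms:
        postings = set(shards[shard_for(term)].get(term, []))
        results = postings if results is None else results & postings
    return results or set()
```

A real engine would also have to merge partial scores across shards rather than just intersect posting lists, which is where the routing scheme starts to matter for latency.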
Replication also seems like a natural way to introduce fault tolerance. If each query can be served by any of several replicas, the chance that every one of them fails at once is very small; when one replica fails, another can answer the query with little loss of time. This redundancy also makes it easy to simply replace failed machines as needed.
This reading also presents a fantastic buildup of MapReduce, introducing the framework in its simplest form before working up to the more advanced problems encountered in a large search engine.
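To make the MapReduce pattern concrete, here is a toy single-process sketch applied to inverted-index construction (the classic search-engine use case); the function and variable names are my own, not from the text.

```python
# Minimal MapReduce sketch: map emits (term, doc_id) pairs, the
# "shuffle" groups pairs by term, and reduce collapses each group
# into a sorted postings list.

from itertools import groupby

def map_phase(doc_id: int, text: str):
    """Map: emit a (term, doc_id) pair for each token in the document."""
    for term in text.lower().split():
        yield (term, doc_id)

def reduce_phase(term: str, doc_ids):
    """Reduce: collapse all pairs for one term into a sorted postings list."""
    return (term, sorted(set(doc_ids)))

def map_reduce(docs: dict[int, str]) -> dict[str, list[int]]:
    # Map over every document.
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    # Shuffle: sort and group the intermediate pairs by key (term).
    pairs.sort(key=lambda kv: kv[0])
    # Reduce each group into its postings list.
    return dict(
        reduce_phase(term, (d for _, d in group))
        for term, group in groupby(pairs, key=lambda kv: kv[0])
    )
```

The real framework distributes the map and reduce calls across machines and handles the shuffle over the network, but the programmer's contract is just these two functions.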
Monday, March 24, 2014
Week 10 Muddiest Point
How can we trust user generated anchor text and hyperlinks, when <meta> tags are so corrupted by the same users? If I have a spam page and try lots of SEO techniques, should my outlinks still contribute to PageRank or a similar link analysis?
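For reference, the PageRank computation mentioned above can be sketched as a simple power iteration. This is my own toy implementation under the usual textbook assumptions (damping factor, dangling pages spread their rank evenly), not a spam-aware variant.

```python
# Minimal PageRank power iteration: each page's rank flows along its
# outlinks each round, with damping factor d modeling a random jump.

def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

The trust question above is exactly about the `links` input: a spammer controls the outlinks of their own pages, which is why real link analysis has to discount or ignore suspect edges rather than feed them in raw.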
Is there any other user-generated content that can help us improve search? Are users willing to provide this information, and how can we incentivize the process? Have any search engines tried using Mechanical Turk workers, or paying users a very small amount, for relevance judgments?