Week 13 Readings
IIR Ch. 13
Text classification is a core problem in IR. When a new document arrives, which existing documents are similar to it, and which group of documents should it belong to? The classification algorithms presented here closely resemble the algorithms we use for computing document similarity. Conditional independence assumptions also appear in feature selection for classification models, because modeling the full dependence among terms, and constantly re-estimating it as new data arrives, is too complex. I like that the chapter introduces several statistical measures, such as chi-square, because these are hard problems with many possible solutions we should be exposed to.
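As a quick sanity check on how chi-square feature selection works, here is a minimal sketch of the statistic for one (term, class) pair, using counts roughly matching IIR's export/poultry worked example (the function name and variable names are my own, not from the book):

```python
# Chi-square score for one (term, class) pair, from the four document counts.
# n11: docs in class containing term; n10: docs with term outside class;
# n01: docs in class without term; n00: docs with neither.
def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

# Counts in the neighborhood of IIR's export/poultry example (score ~ 284):
score = chi_square(49, 141, 27652, 774106)
```

A higher score means the term's occurrence is more strongly dependent on the class, so the top-scoring terms are kept as features.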
IIR Ch. 14
The contiguity hypothesis is challenging for me to accept. Even in the simplest example we see, with the topics Kenya, UK, and China, the hypothesis implies that each document falls into exactly one class. But a document about trade relations between the UK and China fits into both, so a new class must be formed. That class can then be subdivided again and again, until each class contains only one document! Choosing the right level of specificity is critical here. kNN with Voronoi tessellation also seems very complex, and I am not sure how to tessellate the vector space; from the reading, it seems the tessellation has one cell per training document rather than k cells, and the number of documents is much larger than k.
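To convince myself how kNN works in practice, here is a tiny sketch over toy term-count vectors (all documents and labels are made up for illustration). Note that no tessellation is ever computed explicitly: the Voronoi cells are implicit in the nearest-neighbor rule, and a mixed-topic document still receives exactly one label, which is the contiguity issue above.

```python
import math
from collections import Counter

def cosine(u, v):
    # u, v: term -> count dicts
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(doc, training, k=3):
    # training: list of (term_count_dict, label); majority vote of k nearest
    neighbors = sorted(training, key=lambda dl: cosine(doc, dl[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [
    ({"kenya": 3, "trade": 1}, "Kenya"),
    ({"uk": 4, "london": 2}, "UK"),
    ({"china": 5, "beijing": 1}, "China"),
    ({"uk": 2, "china": 2, "trade": 3}, "UK"),  # mixed-topic doc still has ONE label
]
label = knn_classify({"uk": 1, "china": 1, "trade": 2}, training, k=1)
```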
IS2140 - Readings and Muddiest Points
Friday, April 11, 2014
Monday, April 7, 2014
Week 12 Muddiest Point
It seems that whatever takes the most processing power ends up being the best way to do anything in IR. Does creating a personalized search system require the most processing? Does this model of adaptive IR also perform better than modifying queries with personalized terms or re-ranking returned documents with a user model?
Friday, April 4, 2014
Week 12 Readings
Gauch et al.
I am super interested in this, especially explicit user profiling; I wonder how systems can get users to want to give this information. The interchangeability of implicit and explicit feedback in these studies is interesting. One would think explicit feedback would provide more powerful information, but perhaps this is not true.
There is a lot of research into getting at the semantic concepts behind a word. This is a really difficult research issue because it is not static for a user. Even if you can figure out which sense of "bank" they mean right now, there is no guarantee that this is the only way they use the word. And even that first part is difficult. The project that explored selected webpages and compared the results to WordNet concepts was pretty cool.
Bloedorn et al.'s suggestion to use a hierarchy rings true; in my own searching of PubMed, the broader term higher up the hierarchy sometimes turns out to be what I was actually aiming for. There are many different ideas here for how to gather user feedback and build a profile of user interests from it.
Pazzani & Billsus
Many of the mathematical techniques for content recommendation are algorithms we have already learned in general retrieval models: Rocchio feedback, naive Bayes, the multivariate Bernoulli model, and so on. I like the idea that sometimes the explicit feedback a user gives is as simple as clicking a giant thumbs-up or thumbs-down button. That's really easy, and if the resulting change in results is meaningful, that's enough incentive to do it, at least for me. Still, telling the difference between a funny joke and an unfunny one seems impossible with any of these techniques.
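To make the connection to Rocchio concrete, here is a minimal sketch of how thumbs-up/thumbs-down clicks could update a Rocchio-style user profile vector (the function name, weights, and example terms are all hypothetical, not from the paper):

```python
# Rocchio-style profile update driven by thumbs-up/down feedback.
# profile and doc are term -> weight dicts; beta/gamma play the same role
# as the positive/negative feedback weights in standard Rocchio.
def update_profile(profile, doc, liked, beta=0.75, gamma=0.25):
    sign = beta if liked else -gamma
    for term, weight in doc.items():
        profile[term] = profile.get(term, 0.0) + sign * weight
    return profile

profile = {}
update_profile(profile, {"python": 2.0, "ir": 1.0}, liked=True)
update_profile(profile, {"golf": 3.0}, liked=False)
# profile now leans toward "python"/"ir" and away from "golf"
```

Ranking new items by similarity to this profile vector would then surface content like what the user approved.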
Ahn et al.
This paper presents a method for helping users in exploratory search by creating a model of their current task and integrating it into the result set. I like the idea of highlighting task terms in snippets. It does bring to mind that users may not be used to this and will not freely enter queries that stray from the task terms. We are trained to mold our queries into what we know Google likes, and submitting a Google search without my actual search terms feels crazy. In TaskSieve, though, this is actually a valid way to search.
Monday, March 31, 2014
Week 11 Muddiest Point
Is there any more recent data on the language spoken by Internet users and the language of pages? How do you even collect this data?
Images can be seen as a universal language (well, almost). Can tagging webpages with images be a good way to get at the topic of a page, thereby bypassing the semantic issues and the need for things like Darwish's probabilistic analysis? Has anyone tried this yet?
Wednesday, March 26, 2014
Week 11 Readings
IES Ch 14
Parallel query processing is simple in its basic form: increase the number of machines accepting queries and give each one a full copy of the index. Query throughput then scales roughly linearly with the number of machines. Of course, if the index is too large to be stored on one system, this is not possible. Intra-query parallelism is thus more common, where each machine holds only part of the index and the portion of a query involving terms in that sub-index is directed to that machine.
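The partitioned case can be sketched in a few lines: each "machine" scores its own shard of the collection, and the partial result lists are merged into a final top-k. This is an illustrative toy (the shard layout, doc IDs, and overlap-count scoring are my own, not the book's), but it shows where the parallelism lives.

```python
import heapq
from itertools import chain

def score_shard(shard, query):
    # shard: {doc_id: term -> count}; score = simple term-overlap count
    return [(sum(doc.get(t, 0) for t in query), doc_id)
            for doc_id, doc in shard.items()]

def parallel_query(shards, query, top_k=2):
    # Each call to score_shard would run on a separate machine in parallel;
    # only the small partial lists travel back to be merged.
    partials = [score_shard(s, query) for s in shards]
    return heapq.nlargest(top_k, chain.from_iterable(partials))

shards = [
    {"d1": {"map": 2, "reduce": 1}, "d2": {"index": 3}},
    {"d3": {"map": 1}, "d4": {"shard": 2, "map": 1, "reduce": 2}},
]
results = parallel_query(shards, ["map", "reduce"])
```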
Replication seems like a very easy way to introduce fault tolerance. With each index copy replicated across several machines, the chance that every single replica fails at once is really small. If one machine fails, the next in line can take over the query with little loss of time. This redundancy also makes it easy to simply replace failed machines as needed.
This reading also presents a fantastic buildup to MapReduce, introducing the idea in its simplest form before working up to the more advanced problems encountered in a large search engine.
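That simplest form fits in a few lines. Here is an in-process word-count sketch of the map/shuffle/reduce pattern (the real framework distributes the phases across machines and handles failures; the function names here are just illustrative):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Emit a (key, value) pair per word.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:      # shuffle: group values by key
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["map reduce map", "reduce shuffle"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# counts == {'map': 2, 'reduce': 2, 'shuffle': 1}
```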
Monday, March 24, 2014
Week 10 Muddiest Point
How can we trust user generated anchor text and hyperlinks, when <meta> tags are so corrupted by the same users? If I have a spam page and try lots of SEO techniques, should my outlinks still contribute to PageRank or a similar link analysis?
Is there any other user-generated content that can help us improve search? Are users willing to provide this information? How can we incentivize the process? Have any search engines tried using Mechanical Turk workers, or paying users a very small amount, for relevance judgments?
Friday, February 28, 2014
Week 9 Readings
MIR Chapter 10
The user interface should be as unobtrusive as possible and should bring the user clarity in their information-seeking behavior. There are more nuanced metrics on which to rank systems than recall and precision alone: a user who has to spend a lot of time with a system may rate it poorly, and some users may not care about recall at all for their information need.
"An empty screen or a blank entry form does not provide clues to help a user decide how to start the search process" - wow this isn't a negative at all!
I like the Venn diagram visualization of Boolean queries; it seems very useful. I also like that this reading presents old methods that have been tried in search interfaces. Before Google changed every search display to be just one way, so many ideas were attempted, and this is an important reference point to make sure we don't waste our time on ideas that have already been tried and failed.
SUI Chapter 1
The user interface of search has hardly changed since 1997. Novice users are not naturally inclined to pose keyword queries; they want to ask a natural language question instead. Interface design needs to be targeted to a certain user group! The interface should then be designed, tested, and redesigned to find a balance that meets that group's needs. This can be a challenge because some principles of interface design pull against each other (keep it consistent, but keep it simple), and because UI design is not an exact science.
SUI Chapter 11
Concordance visualizations are very interesting! SeeSoft and TextArc look very different from each other, but are both really nice.