CS 572
Information Retrieval and Web Search Engines

Summer Semester, 2011
Location: ZHS163
Time: TTh 3:30pm-5:20pm
Class number: 30220R




Instructor


Overview

The plethora of information and content available on the public Internet brought about a boom in web search engines in the late 1990s and early 2000s. Companies such as Altavista, Excite, and Yahoo were the players, and the game had an ambitious objective. For all the unknown web pages that existed in cyberspace, a search engine had to: (1) locate them in some fashion (through links, through guessing, etc.); (2) obtain the content from those pages; and (3) make that content available to users who enter a few search terms into an input box on a web page.
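The three steps above can be illustrated with a minimal Python sketch. It is not taken from the course materials: the in-memory `TOY_WEB` dictionary, the example URLs, and the function names `crawl`, `build_index`, and `search` are all hypothetical, and a real engine would fetch pages over HTTP, parse HTML, and rank its results.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "a.example/": ("web search engines crawl pages", ["a.example/b", "c.example/"]),
    "a.example/b": ("crawl pages by following links", ["a.example/"]),
    "c.example/": ("ranking orders search results", []),
    "d.example/": ("unreachable page about ranking", []),
}


def crawl(seed, web):
    """Steps 1 and 2: discover pages by following links and collect their text."""
    seen = {seed}
    queue = deque([seed])
    fetched = {}
    while queue:
        url = queue.popleft()
        if url not in web:  # dead link: nothing to fetch
            continue
        text, links = web[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(pages):
    """Step 3, part 1: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Step 3, part 2: return the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


pages = crawl("a.example/", TOY_WEB)
index = build_index(pages)
print(search(index, "crawl pages"))  # ['a.example/', 'a.example/b']
print(search(index, "ranking"))      # ['c.example/']
```

Note that `d.example/` never appears in the results: no crawled page links to it, which is why step (1) mentions strategies other than link following.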

As the web grew (into the billions of pages circa 2000), search engines had to become smarter to deal with the data management problems that come with collecting such an archive: storage, availability, processing, failover, and redundancy were only some of the challenges that had to be addressed.

This course offers the student a complete treatment of web search engines and their foundations, principles, and elements, including those described above. The class is centered on paper reading assignments and an individual presentation that tests comprehension of the course reading material. A class project will require the student to apply the search engine techniques learned during the course (e.g., ranking, crawling, content analysis and detection, query models), together with programming and implementation effort using modern open source search technologies from Apache, to design and implement a component of a real-world search engine.

In addition to foundations and practical experience with search engines, the class will introduce the student to the state of the art in search engine research, future trends, and the state of the practice. Students are expected to attend class regularly, participate (as directed) in all class discussions, and, most importantly, have fun!


Academic Integrity

Students must work independently on all individual assignments; collaborating on individual assignments is considered cheating and will be penalized accordingly. All USC students are responsible for reading and following the USC Student Conduct Code, which prohibits plagiarism. Some examples of behavior that is not allowed are: copying all or part of someone else's work (by hand or by looking at others' files, either secretly or if shown), and submitting it as your own; giving another student in the class a copy of your assignment solution; consulting with another student during an exam; and copying text from published literature without proper attribution. If you have questions about what is allowed, please discuss it with the instructor.

Students who violate University standards of academic integrity are subject to disciplinary sanctions, including failure in the course and suspension from the University. Since dishonesty in any form harms the individual, other students, and the University, policies on academic integrity have been and will be strictly enforced.


Textbook and Readings

Textbook:

Supplemental Readings:


Assignments and Examinations

Name | Description | Weight
Research Paper Presentation | You will thoroughly read and examine one of the course research papers, and then present it in class within a 25-minute time frame (20 minutes for the talk + 5 for questions). | 40%
Course Project | An individual assignment in which you will build on the search engine topics covered in the course (information retrieval, ranking, content detection and analysis, indexing, processing, etc.) and make a contribution to one of the existing Apache search technologies (Nutch, Lucene, Solr, Tika, OODT, etc.). | 40%
Participation | You are required to attend class, to participate in class discussions, and to ask questions of the presenters each week. | 20%

Project Submission Guidelines

Please refer to this document for guidelines on submitting your project.



Schedule (subject to change; check regularly)

Week

Lecture Topic

Readings

Assignments and Deadlines

1

  • J. Cho, N. Shivakumar, H. Garcia-Molina. Finding replicated web collections. ACM SIGMOD Record, Vol. 29, No. 2, pp. 355-366, 2000.
  • S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Hypersearching the Web. Scientific American, June 1999.
  • Tika in Action - Chapter 3
 

2

  • Patrick O'Leary - Guest Lecture

3

 
  • Presentation Selection due
    (June 2, 2011)

4

 
 

5

 
 

6

   
  • Project Proposal due
    (June 23, 2011)

7

   
   

8

  • No Lecture!
   
  • Project Status Reports
  • Project Mid-term Reports due
    (Saturday, July 9, 2011, 11:59:59 PM)

9

  • No Lecture this Week! (work on your projects)
   
   

10

   
 
  • Final Projects Due
    Tuesday, July 26, 2011
    11:59:59 PM