CS 572
Information Retrieval and Web Search Engines

Spring Semester, 2015
Location: THH 201
Time: TTh 5:00pm-6:20pm
Class number: 30370D

The plethora of information and content available on the public Internet brought about a boom in web search engines in the early 2000s. Companies such as Altavista, Excite, and Yahoo were the players, and the game had an ambitious objective. For all the unknown web pages that existed in cyberspace: (1) locate them in some fashion (through links, through guessing, etc.); (2) obtain the content from those pages; and (3) make that content available to users who enter a few search terms into an input box on a web page.

As the web grew (into the billions of pages circa 2000), search engines had to become smarter to deal with the data management problems induced by collecting such an archive: storage, availability, processing, failover, and redundancy were only some of the challenges that had to be addressed.

This course will afford the student a complete treatment of web search engines, their foundations, principles, and elements, including those described above. The class is centered around paper reading assignments and an individual presentation that will test comprehension and understanding of the course reading material. A class project will require the student to apply the search engine techniques learned during the course (e.g., ranking, crawling, content analysis and detection, query models), coupled with programming and implementation effort using modern open source search technologies from Apache, to design and implement a component of a real-world search engine.

In addition to foundations and practical experience with search engines, the class will also introduce the student to the state of the art in search engine research, future trends, and the state of the practice. Students are expected to attend class regularly, participate (as directed) in all class discussions, and, most importantly, have fun!

Academic Integrity

Statement on Academic Conduct and Support Systems

Academic Conduct

Plagiarism - presenting someone else's ideas as your own, either verbatim or recast in your own words - is a serious academic offense with serious consequences. Please familiarize yourself with the discussion of plagiarism in SCampus in Section 11, Behavior Violating University Standards https://scampus.usc.edu/1100-behavior-violating-university-standards-and-appropriate-sanctions/. Other forms of academic dishonesty are equally unacceptable. See additional information in SCampus and university policies on scientific misconduct, http://policy.usc.edu/scientific-misconduct/. Discrimination, sexual assault, and harassment are not tolerated by the university. You are encouraged to report any incidents to the Office of Equity and Diversity http://equity.usc.edu/ or to the Department of Public Safety http://capsnet.usc.edu/department/department-public-safety/online-forms/contact-us. This is important for the safety of the whole USC community. Another member of the university community - such as a friend, classmate, advisor, or faculty member - can help initiate the report, or can initiate the report on behalf of another person. The Center for Women and Men http://www.usc.edu/student-affairs/cwm/ provides 24/7 confidential support, and the sexual assault resource center webpage sarc@usc.edu describes reporting options and other resources.

Support Systems

A number of USC's schools provide support for students who need help with scholarly writing. Check with your advisor or program staff to find out more. Students whose primary language is not English should check with the American Language Institute http://dornsife.usc.edu/ali, which sponsors courses and workshops specifically for international graduate students. The Office of Disability Services and Programs http://sait.usc.edu/academicsupport/centerprograms/dsp/home_index.html provides certification for students with disabilities and helps arrange the relevant accommodations. If an officially declared emergency makes travel to campus infeasible, USC Emergency Information http://emergency.usc.edu/ will provide safety and other updates, including ways in which instruction will be continued by means of Blackboard, teleconferencing, and other technology.

Statement on Diversity

The diversity of the participants in this course is a valuable source of ideas, problem solving strategies, and engineering creativity. We encourage and support the efforts of all of our students to contribute freely and enthusiastically. We are members of an academic community where it is our shared responsibility to cultivate a climate where all students and individuals are valued and where both they and their ideas are treated with respect, regardless of their differences, visible or invisible.


Textbook and Readings


Supplemental Readings:

Assignments and Examinations





An exam testing your understanding of the lecture materials, including crawling, ranking, indexing, deduplication, querying, etc.



Assignments where you will build on the search engine topics in the course (information retrieval, ranking, content detection and analysis, indexing, processing, etc.) and make a contribution to one of the existing Apache search technologies (Nutch, Lucene, Solr, Tika, OODT, etc.).


Project Submission Guidelines

Please refer to this document for guidelines on submitting your assignments.

Schedule (subject to change; check regularly)


Lecture Topic


Assignments and Deadlines


  • No Class - Professor is at DARPA XDATA / Memex Meetings
  • Tika in Action, Chapter 1


  • Tika in Action, Chapter 2
  • Tika in Action, Chapter 3




  • Tika in Action, Chapter 6
  • Assignment 1 Discussion
  • Tika in Action, Chapters 7, 8




  • Content Detection and Analysis cont.


  • Class Discussion


  • Discussion of Assignment #2




Spring Recess (March 16-21, 2015)


  • Work on Assignments
  • Discussion of Assignment #2


  • No Class - Continue working on Assignment #2


  • Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. "D³ data-driven documents." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2301-2309.


  • Exam