CS 599
Content Detection and Analysis for Big Data

Spring Semester, 2016
Location: THH 101
Time: TTh 5:00pm-6:50pm
Class number: 30058D

Course Producers



This course is designed as an advanced course in data analytics and big data. It introduces students to the area of content detection and analysis, which involves understanding digital file formats, detecting them, and extracting data from them. Emphasis areas include document type detection; parsing and extraction; metadata understanding and analysis; language identification from files; and file formats and their representation. The class also has a specific focus on content detection and analysis from large data sets. Datasets used in the course are publicly collected by the instructor or his collaborators involved in national Big Data initiatives including DARPA, NASA and other projects.

The course is designed to be accessible to students with experience programming in Java and in Python at an intermediate level. The first half of the course focuses on Java, using the Tika framework as the core technology for instruction. The instructor is the co-inventor of Tika and has deep experience in the technology and in search engine technology from Apache. The second half of the course introduces students to the use of Python programming for content detection and analysis using Tika, ElasticSearch™, Solr, Nutch and Apache Hadoop™.

The course will be a combination of lecture, in-class discussion, readings, group-based assignments and a final exam.

The objective of this course is to train students to be able to understand file formats, their representation, and how to automatically extract information from large datasets of files. Specifically, students successfully completing this course will achieve three main objectives:

  1. Develop sufficient proficiency in the Tika framework to write software capable of automatically identifying files and extracting information from them, including their text, metadata, and language.
  2. Develop sufficient proficiency in Information Retrieval and Data Extraction techniques with Large Data sets collected from the Web and other places (Intranet, Science Data Sets, Public Data Sets).
  3. Develop sufficient proficiency in Java and Python to write and execute software that is “File Aware” and that automatically extracts text and metadata from large data sets.
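As a rough illustration of what "File Aware" software means, the sketch below guesses a file's type from its leading bytes. It is a simplified stand-in written for this page, not Tika's detection logic: the signature table, the names `detect_type` and `detect_file`, and the 512-byte prefix length are all assumptions chosen for the example.

```python
import codecs

# Illustrative only: a tiny magic-byte lookup, NOT Tika's detection logic.
# The signature table below is a small hand-picked assumption.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_type(data: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file (hypothetical helper)."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    # No known signature: treat NUL bytes as a sign of binary content.
    if b"\x00" in data:
        return "application/octet-stream"
    # Otherwise call it text if the bytes are valid UTF-8. The incremental
    # decoder with final=False tolerates a multi-byte character that was
    # cut off at the end of the prefix.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(data, final=False)
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def detect_file(path: str, prefix_length: int = 512) -> str:
    """Read only a short prefix of the file at `path` and guess its type."""
    with open(path, "rb") as handle:
        return detect_type(handle.read(prefix_length))
```

For example, `detect_type(b"%PDF-1.7")` returns `"application/pdf"`. A production detector would also draw on file names, container structure, and a far larger signature registry.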

The primary teaching methods will be discussion, case studies, and lectures. Students are expected to pursue directed self-learning outside of class, which includes, among other things, a considerable amount of literature review. In addition, the class will directly leverage open source software and the partnerships of the instructor, who sits on the Board of Directors at the Apache Software Foundation. Projects associated with the course may make direct contributions to Apache Licensed (“ALv2”) open source software projects at the student’s discretion. Leadership training in open source is provided and encouraged, and students leave with experience in open source that makes them more marketable to companies and institutions looking to hire in content detection and analysis, Big Data, and Data Science.

In addition to foundations and practical experience with content detection and analysis, the class will introduce students to the state of the art in content detection research, future trends, and the state of the practice. Students are expected to attend class regularly, participate (as directed) in all class discussions, and, most importantly, have fun!

Academic Integrity

Statement on Academic Conduct and Support Systems

Academic Conduct

Plagiarism - presenting someone else's ideas as your own, either verbatim or recast in your own words - is a serious academic offense with serious consequences. Please familiarize yourself with the discussion of plagiarism in SCampus in Section 11, Behavior Violating University Standards https://scampus.usc.edu/1100-behavior-violating-university-standards-and-appropriate-sanctions/. Other forms of academic dishonesty are equally unacceptable. See additional information in SCampus and university policies on scientific misconduct, http://policy.usc.edu/scientific-misconduct/.

Discrimination, sexual assault, and harassment are not tolerated by the university. You are encouraged to report any incidents to the Office of Equity and Diversity http://equity.usc.edu/ or to the Department of Public Safety http://capsnet.usc.edu/department/department-public-safety/online-forms/contact-us. This is important for the safety of the whole USC community. Another member of the university community - such as a friend, classmate, advisor, or faculty member - can help initiate the report, or can initiate the report on behalf of another person. The Center for Women and Men http://www.usc.edu/student-affairs/cwm/ provides 24/7 confidential support, and the sexual assault resource center webpage sarc@usc.edu describes reporting options and other resources.

Support Systems

A number of USC's schools provide support for students who need help with scholarly writing. Check with your advisor or program staff to find out more. Students whose primary language is not English should check with the American Language Institute http://dornsife.usc.edu/ali, which sponsors courses and workshops specifically for international graduate students. The Office of Disability Services and Programs http://sait.usc.edu/academicsupport/centerprograms/dsp/home_index.html provides certification for students with disabilities and helps arrange the relevant accommodations. If an officially declared emergency makes travel to campus infeasible, USC Emergency Information http://emergency.usc.edu/ will provide safety and other updates, including ways in which instruction will be continued by means of Blackboard, teleconferencing, and other technology.

Statement on Diversity

The diversity of the participants in this course is a valuable source of ideas, problem solving strategies, and engineering creativity. We encourage and support the efforts of all of our students to contribute freely and enthusiastically. We are members of an academic community where it is our shared responsibility to cultivate a climate where all students and individuals are valued and where both they and their ideas are treated with respect, regardless of their differences, visible or invisible.


Textbook and Readings


Assignments and Examinations





An exam testing your understanding of the lecture materials, including digital file formats, their detection, parsing and extraction, metadata, language identification and translation, etc.



Assignments in which you will build on the content detection and analysis topics in the course and make a contribution to one of the existing Apache content detection and IR technologies (Nutch, Lucene, Solr, Tika, OODT, etc.).


Project Submission Guidelines

Submission guidelines will be specified in each assignment.

Schedule (subject to change; check regularly)


Lecture Topic


Assignments and Deadlines




  • Tika in Action, Chapter 2


  • Team Formation Discussion


  • Tika in Action, Chapter 4
  • Assignment 1 Discussion


  • Tika in Action, Chapter 5


  • Tika in Action, Chapter 6
  • T. Gowda and C. Mattmann. Clustering Web Pages Based on Structure and Style. Submitted to the 2016 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies, June 12-17, 2016.




  • Assignment #2 Discussion
  • Assignment #1 Due (Friday, March 4, 2016 12pm PT)
  • Assignment #2




No Class - Spring Recess (March 14 - 20, 2016)


  • Assignment #2 Review




  • No Class - Professor on Travel
  • No Assigned Reading - Finish your Assignment #2
  • Assignment #2 Due (April 5, 2016 12pm PT)
  • Assignment #3 Discussion


  • Tika in Action, Chapter 13
  • No Class - Professor on Travel
  • Exam (April 28, 2016)
  • Assignment #3 Due (May 3, 2016 12pm PT)