Research Statement

My research focuses on the design, implementation and evolution of data-intensive systems: large-scale, distributed systems that integrate information from disparate resources. As data volumes approach petabyte scales, data-intensive systems are becoming a focus of research in the distributed systems and supercomputing communities, yet curiously not in the software engineering community. In particular, the lack of design- and implementation-level reuse in data-intensive software has led to the wasteful practice of re-implementing data systems for each new data collection task at research institutions, universities and large corporations throughout the world, rather than leveraging existing frameworks, code and tools. This practice is a side effect of several problems that highlight the importance of sound software engineering principles. These problems are:

1. Design patterns and architectures have not been effectively captured (and, more importantly, formalized) in the data-intensive systems domain – The practice of software architecture (an element of the software engineering lifecycle) provides formal methodologies and tools such as Architecture Description Languages and Architecture Modification Languages, a specification of the primitive building blocks of software system design, and design heuristics guiding families of software system architectures, codified in the form of architectural styles. I postulate that if the patterns and architectures of data-intensive systems are effectively captured and formalized with these tools, software engineers and designers will have the notations and tools they need to address this problem.

2. Existing middleware technologies for data-intensive systems are difficult to deploy, maintain and evolve – First and foremost, few middleware technologies currently exist and are deployed to support data-intensive systems. Those that have been developed take the form of software middleware: distributed software systems responsible for marshalling data, communicating between software components and standardizing component implementation technologies. Two such middleware technologies that I am exploring in my research are the Globus Toolkit and its Data Grid components, and the OODT middleware:

a. The Globus Toolkit is a middleware implementation infrastructure supporting the construction of virtual organizations – distributed organizational entities sharing computing resources, data, metadata, security infrastructure and the like. Although originally focused on supporting large-scale, distributed scientific computation, Globus has recently emerged as a leading contributor to solving information integration problems across science domains. This contribution is largely the result of the formulation of the Data Grid, a set of components implemented using the Globus middleware infrastructure that support the retrieval, distribution, replication and identification of data and metadata.

b. The OODT middleware is both an implementation infrastructure and a software architectural style for designing and implementing data-intensive software systems. OODT is implemented as a set of software components which use Distributed Object Middlewares (DOMs) such as RMI, CORBA and Web Services to communicate and exchange data across the data systems exposed by the OODT middleware. OODT software components provide methodologies and schemas for describing data resources, abstracting away the interface to repositories containing the data resources, and mediating queries between source repository schemas and a global schema describing the integrated, data-intensive software system.

Neither middleware infrastructure, however, was created with sufficient support for carrying changes made at the architectural level through to the implementation level. Such support comes from explicit architecture-level constructs reified in the middleware code, and it is of particular interest here because current data-intensive middleware lacks both this design formalization and this capability. Further, this deficiency has curtailed the widespread adoption of both middleware infrastructures by a broader community, as the assumption that most scientists can manage the complexity of each is particularly limiting. This view is supported by Kruchten’s notion that few software engineers are indeed capable software architects.

3. Data-intensive systems have not been implemented, deployed or evaluated in mobile, embedded environments – As the world moves towards pervasive environments, mobile phones, on-demand computing, Pocket PCs and the like are becoming the platform of choice, and in some cases the platform of necessity. These platforms also bring unique properties (read: difficulties) which software engineers must address, including limited bandwidth, intermittent connectivity and constrained memory resources. Of particular interest is that many of these environments call for applications which are highly data-intensive in nature. Imagine a scenario in which scientists deployed on Mars must exchange high-resolution imaging data with a team of operational rovers positioned 2500 meters ahead of them, whose primary task is to assess the topography of the land the scientists must traverse. The rovers describe imaging data using a particular set of data attributes (or metadata), which must be presented to the scientists in a unified view. Each scientist, using a mobile iPAQ-like computing device, also describes image data according to his or her particular area of expertise. For example, an astrobiologist may describe an image returned by a rover using data attributes describing the micro-organisms present in the image, whereas a planetary geologist may describe the same image using data attributes describing the different rocks and land formations present in it. These data integration issues must be addressed in order for the scientists to correctly identify the hazards, goals, targets and the like that they will encounter along the route the rovers have scouted for them.
An additional focus of my research is to examine how data-intensive systems can be deployed in mobile, embedded environments, and further, how we can use architecture-based design principles to guide their implementation and evolution to give scientists and mobile embedded devices (such as rovers) in the postulated scenario the ability to successfully complete their missions.
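To make the data integration problem in this scenario concrete, the following is a minimal sketch of mediating per-source metadata into a unified view. It is an illustration under stated assumptions, not OODT code or an actual mission data model: the source names, attribute names and the `GLOBAL_SCHEMA_MAPPINGS` table are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from each source's local attribute names to a shared
# global schema. None of these names come from OODT or any real mission.
GLOBAL_SCHEMA_MAPPINGS = {
    "rover": {"img_id": "image_id", "res_cm": "resolution_cm"},
    "astrobiologist": {"image": "image_id", "organisms": "microorganisms"},
    "geologist": {
        "image": "image_id",
        "rocks": "rock_types",
        "formations": "land_formations",
    },
}


def to_global(source, record):
    """Rename a source's local attributes to global-schema attributes.

    Attributes with no mapping for this source are dropped.
    """
    mapping = GLOBAL_SCHEMA_MAPPINGS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def unified_view(descriptions):
    """Merge (source, record) pairs into one combined record per image.

    Raises KeyError if a record does not identify the image it describes.
    """
    view = defaultdict(dict)
    for source, record in descriptions:
        translated = to_global(source, record)
        image_id = translated.pop("image_id")
        view[image_id].update(translated)
    return dict(view)
```

Given a rover record and the two scientists' descriptions of the same image, `unified_view` returns a single record keyed by image identifier that carries the resolution, micro-organism and rock attributes together. A real mediator would also need to handle conflicting values and attributes with no global counterpart, which this sketch ignores.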

My current research is focused on addressing these three issues; specifically, I am applying software architectural methodologies, notations and implementation infrastructures to data-intensive systems. For more information, see the OODT Group at JPL and USC's Software Architecture Research Group.

You can view my Curriculum Vitae if you’re interested in more information about my work.