Research Statement

My research focuses on the design, implementation and evolution of data-intensive systems. Data-intensive systems are large-scale, distributed systems that integrate information from disparate resources. As data volumes approach the petabyte scale, data-intensive systems are becoming a focus area of research in the distributed systems and supercomputing communities, yet curiously not in the software engineering community. In particular, the lack of design-level and implementation-level reuse in data-intensive software has led to the distasteful practice of re-implementing data systems to accompany each data collection task at research institutions, universities and large corporations throughout the world, rather than leveraging existing frameworks, code and tools. This practice is a side effect of several problems that highlight the importance of sound software engineering principles. These problems are:
1. Design patterns and architectures have not been effectively captured (and, more importantly, formalized) in the data-intensive systems domain - The practice of software architecture (an element of the software engineering lifecycle) provides formal methodologies and tools such as Architecture Description Languages, Architecture Modification Languages, a specification of the primitive building blocks of software systems design, and design heuristics guiding a family of software systems architectures, which are codified in the form of architectural styles. I postulate that if the design patterns and architectures of data-intensive systems are effectively captured and formalized in this way, software engineers and designers will be equipped with the tools and notations necessary to address this important problem.
2. Existing middleware technologies for data-intensive systems are difficult to deploy, maintain and evolve – First and foremost, few middleware technologies currently exist and are deployed to support data-intensive systems. Where they have been developed, these technologies take the form of software middlewares: distributed software systems responsible for marshalling data, communicating between software components and standardizing component implementation technologies. Two such middlewares that I am exploring in my research are the Globus Toolkit/Data Grid Components and the OODT middleware:
a. The Globus Toolkit is a middleware implementation infrastructure supporting the construction of virtual organizations: distributed organizational entities that share computing resources, data, metadata, security infrastructure and the like. Although originally focused on supporting very large-scale, distributed scientific computation, Globus has recently emerged as a leading contributor to solving information integration problems across science domains. This contribution is largely the result of the formulation of the Data Grid, a set of components implemented using the Globus middleware infrastructure that support the retrieval, distribution, replication and identification of data and metadata.
b. The OODT middleware is both an implementation infrastructure and a software architectural style for designing and implementing data-intensive software systems. OODT is implemented as a set of software components that use Distributed Object Middlewares (DOMs) such as RMI, CORBA and Web Services to communicate and exchange data across the data systems exposed by the OODT middleware. OODT software components provide methodologies and schemas for describing data resources, abstracting away the interface to the repositories containing those resources, and mediating queries between source repository schemas and a global schema describing the integrated, data-intensive software system.
Neither middleware infrastructure, however, was created with sufficient support for carrying changes made at the architectural level through to changes at the implementation level. Such support comes from explicit architecture-level constructs reified in the middleware code, and it is of particular interest for current data-intensive middlewares because they lack this design formalization and capability. Further, this deficiency has curtailed the widespread adoption of both middlewares across a broader community, as the assumption that most scientists can manage the complexity of each middleware is particularly limiting. This is consistent with Kruchten's observation that few software engineers are indeed capable software architects.
3. Data-intensive systems have not been implemented, deployed, or evaluated in mobile, embedded environments – As the world moves towards pervasive computing, mobile phones, on-demand computing, Pocket PCs and the like are becoming the platform of choice and, in some cases, the platform of necessity. These platforms bring unique properties (read: difficulties) that software engineers must address, including limited bandwidth, connectivity and memory resources. Notably, many of these environments also call for applications that are highly data-intensive in nature.

Imagine a scenario in which scientists deployed on Mars must exchange high-resolution imaging data with a team of operational rovers positioned 2500 meters ahead of them, whose primary task is to assess the terrain that the scientists must cross. The rovers describe their imaging data using a particular set of data attributes (or metadata), which must be presented to the scientists in a unified view. Each scientist, using a mobile iPAQ-like computing device, also describes image data according to his or her own area of expertise. For example, an astrobiologist may describe an image returned by a rover using data attributes for the micro-organisms present in the image, whereas a planetary geologist may describe the same image using data attributes for the rocks and land formations it contains. These data integration issues must be addressed for the scientists to correctly identify the hazards, goals, targets and the like that the rovers have scouted for them along their route.
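The metadata integration problem in the scenario above can be illustrated with a minimal sketch. It is not OODT's or Globus's API: every name here (`GLOBAL_ATTRIBUTES`, `SourceMapping`, `unified_view`, and all attribute names) is a hypothetical chosen for illustration. Each source (a rover, an astrobiologist, a geologist) is assumed to describe images with its own attribute names, and a per-source mapping translates them onto one assumed global schema.

```python
from dataclasses import dataclass

# Hypothetical global schema; the attribute names are illustrative assumptions.
GLOBAL_ATTRIBUTES = {"image_id", "terrain", "organisms"}


@dataclass
class SourceMapping:
    """Maps one source's local attribute names onto the global schema."""
    source: str
    local_to_global: dict

    def __post_init__(self):
        # Reject mappings that target attributes outside the global schema.
        unknown = set(self.local_to_global.values()) - GLOBAL_ATTRIBUTES
        if unknown:
            raise ValueError(f"{self.source}: unknown global attributes {unknown}")

    def to_global(self, record):
        # Local attributes without a mapping are dropped from the result.
        return {
            self.local_to_global[name]: value
            for name, value in record.items()
            if name in self.local_to_global
        }


def unified_view(records_by_source, mappings):
    """Merge per-source descriptions into one record per image.

    Each global attribute holds a list of (source, value) pairs, so
    differing descriptions of the same image are kept side by side.
    """
    view = {}
    for source, records in records_by_source.items():
        mapping = mappings[source]
        for record in records:
            translated = mapping.to_global(record)
            if "image_id" not in translated:
                continue  # cannot be correlated with other sources
            image_id = translated["image_id"]
            merged = view.setdefault(image_id, {"image_id": image_id})
            for name, value in translated.items():
                if name != "image_id":
                    merged.setdefault(name, []).append((source, value))
    return view


if __name__ == "__main__":
    mappings = {
        "rover": SourceMapping("rover", {"img": "image_id", "slope_class": "terrain"}),
        "geologist": SourceMapping("geologist", {"image": "image_id", "rock_type": "terrain"}),
        "astrobiologist": SourceMapping("astrobiologist", {"image": "image_id", "microbes": "organisms"}),
    }
    records = {
        "rover": [{"img": "r-001", "slope_class": "steep"}],
        "geologist": [{"image": "r-001", "rock_type": "basalt"}],
        "astrobiologist": [{"image": "r-001", "microbes": "none observed"}],
    }
    print(unified_view(records, mappings))
```

Keeping the source name alongside each value is one possible design choice: it lets a scientist see who described an image in which terms, rather than silently resolving conflicting descriptions.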
An additional focus of my research is to examine how data-intensive systems can be deployed in mobile, embedded environments and, further, how architecture-based design principles can guide their implementation and evolution, so that the scientists and mobile embedded devices (such as rovers) in the scenario above can successfully complete their missions.

My current research addresses these three issues; namely, I am applying software architectural methodologies, notations and implementation infrastructures to data-intensive systems. For more information, see the OODT Group at JPL and USC's Software Architecture Research Group. You can also view my Curriculum Vitae if you are interested in more information about my work.