Two weeks ago, the NSF announced a plan to link computers in four major research centers with a comprehensive infrastructure called the TeraGrid (see article #100736 "$53M DISTRIBUTED TERASCALE FACILITY TO LAUNCH W/ INTEL ARCH" from the August 10, 2001 edition of HPCwire). The project will create the world's first multi-site computer facility, the Distributed Terascale Facility (DTF). NCSA director Dan Reed agreed to answer some questions for HPCwire concerning the purpose and promise of the DTF.

HPCwire: How long has the DTF project been in development? How and by whom was the plan developed?

REED: The DTF and the TeraGrid build on the information infrastructure PACI was created to develop and deploy (e.g., Grid software, scalable commodity clusters, software tools, visualization and data management, and community application codes). The DTF TeraGrid plan was developed jointly by people from NCSA, SDSC, Argonne, and Caltech who have been involved in the NSF PACI program since 1997. As such, it is a natural outgrowth of the Grid computing vision we have been developing over the past four years.

HPCwire: What specific Grand Challenge questions is the DTF being created to address?

REED: The DTF does not target a fixed set of applications. Rather, the size and scope of the DTF will enable scientists and engineers to address a broad range of compute-intensive and data-intensive problems based on national peer review and resource allocation. However, there are many exemplars of expected use.

For example, the MIMD Lattice Computation (MILC) collaboration is a multi-institutional group that studies lattice QCD. Worldwide, MILC uses more than two million processor hours per year. The MILC collaboration both tests QCD theory and helps interpret experiments on high-energy accelerators. At present, the MILC code's fastest measured single-processor performance is on NCSA's Itanium Linux cluster.

Parallel molecular dynamics codes like NAMD are designed for high-performance simulation of large biomolecular systems. Such codes can predict structure and binding energies, determine optimal transition paths, and examine free energies of transitions.

Other scientific areas that will benefit from use of the DTF systems and the TeraGrid include:

  • The study of cosmological dark matter using Tree-Particle-Mesh (TPM) N-body codes

  • Higher resolution, more timely weather forecasts. For example, the Weather Research and Forecast (WRF) model will advance weather prediction, making it possible to predict weather patterns more accurately on a 1-kilometer scale.

  • Biomolecular electrostatics: The DTF will provide the resources for detailed investigation of the assembly and function of microtubule and ribosomal complexes using new "parallel focusing" algorithms for fast elucidation of biomolecular electrostatics on parallel systems.

Also, the DTF will enable a new class of data-intensive applications that couple data collection through scientific instruments with data analysis to create new knowledge and digital libraries. Targeted data-intensive applications will include the LIGO gravity wave experiments, the proposed National Virtual Observatory (NVO), the ATLAS and CMS LHC detectors, and other NSF Major Research Equipment (MRE) projects such as NEES.

HPCwire: What projects will constitute NCSA's prime focus? Which industrial partners will be cooperating? What will be the most concrete long-term benefits?

REED: NCSA and its Alliance partners have been leaders in Grid software and cluster computing systems. The Itanium Processor Family Linux clusters at the core of the DTF are based on ideas and experiences with NCSA's large-scale IA-32 and Itanium Linux clusters. The DTF's Grid software and tools build on ideas and infrastructure developed by Argonne and USC-ISI.

Intel and IBM are close collaborators on microprocessors and compilers, clusters, and Grid software. Qwest is partnering with the DTF consortium on wide-area networking. NCSA's industrial partners will also continue to work with NCSA on new technologies and their applications.

HPCwire: Is the DTF itself significantly scalable? To what extent? Are there currently plans to add centers to the DTF?

REED: We believe the DTF will be the backbone for a national Grid of interconnected facilities. Just as the early ARPAnet and NSFnet anchored the Internet, the DTF TeraGrid will anchor creation of a national and international Grid of shared data archives, computing facilities, and scientific instruments.

Concretely, the DTF will provide a resource that is scalable from the desktop, all the way to the 13.6 total teraflops that will constitute the DTF clusters. This means that researchers will be able to easily port their work from their own PCs or small clusters to our large systems.

HPCwire: In terms of both the computing systems being integrated and the optical network itself, how much existing hardware and technology is being used and how much is being built from the ground up?

REED: The 13.6 TF DTF computing system will include 11.6 teraflops of computing purchased through the NSF DTF agreement and two 1-teraflop Linux clusters already on the floor at NCSA. The NSF Cooperative Agreement with NCSA paid for the latter clusters, and they will be integrated into the DTF system. We expect to add more cluster capability to the NCSA system in the coming years.
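
As a quick check of the arithmetic in this answer, the minimal Python sketch below adds up the capacity figures Reed cites. The constant and function names are illustrative assumptions and are not part of any DTF or TeraGrid software.

```python
# Capacity figures quoted in the interview, in teraflops (TF).
PURCHASED_DTF_TF = 11.6                 # bought through the NSF DTF agreement
EXISTING_NCSA_CLUSTERS_TF = [1.0, 1.0]  # two 1-teraflop Linux clusters at NCSA


def total_dtf_teraflops(purchased_tf, existing_clusters_tf):
    """Return combined capacity: new purchase plus clusters already installed."""
    return purchased_tf + sum(existing_clusters_tf)


total = total_dtf_teraflops(PURCHASED_DTF_TF, EXISTING_NCSA_CLUSTERS_TF)
print(f"Total DTF capacity: {total:.1f} TF")
```

The sum matches the 13.6 TF total quoted for the DTF clusters.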

The DTF network will be built by Qwest in cooperation with the four DTF partners. It will connect to Abilene, to international networks via STAR TAP, and to the Illinois and California research communities via CalRen-2 and I-WIRE.

HPCwire: The TeraGrid will use Linux across Abilene, STAR TAP, & CalRen-2. What principal measures will be implemented to maintain security throughout such a heterogeneous open-source environment?

REED: We will leverage the Globus public key Grid Security Infrastructure (GSI) for integrated TeraGrid security. This incorporates PACI-operated security infrastructure, including Certificate Authorities (CAs), certificate repositories for portal users, and revocation mechanisms, as well as GSI-enabled interfaces to DTF resources, client applications, and libraries. We will also build on the Globus Community Authorization Service (CAS) for community-based access control to manage access to data, compute, network, and other resources.

HPCwire: Judging by the news releases, strategic administration of the TeraGrid is as distributed as its resources. How will critical operational policy directions be determined, and what is your role in that process?

REED: We will establish a TeraGrid Operations Center (TOC) that will leverage elements of the operations centers at NCSA, SDSC, Argonne, and Caltech. The TOC will establish a set of policies that guide the TeraGrid's operation, usage, and technology transfer. Operationally, TOC staff will provide 24x7 and online support for the TeraGrid, deploying automated monitoring tools for verifying TeraGrid performance and coordinating distributed hardware and software upgrades.

All of the principals (Berman, Foster, Messina, Stevens, and Reed) will work collaboratively as a team to establish coordinated policies. I will serve as the TeraGrid Chief Architect, charged with providing advice and guidance on technical directions related to clusters, networks, and technologies, and on new opportunities for the DTF and its evolution.

HPCwire: Ruzena Bajcsy, NSF assistant director for Computer and Information Science and Engineering, has stated that "the DTF can lead the way toward a ubiquitous 'Cyber-Infrastructure'..." Do you agree that this project is the first step toward the development of such an infrastructure? What is the next step? Please describe your vision of a "ubiquitous Cyber-Infrastructure."

REED: Yes. The DTF TeraGrid is the first step in developing and deploying a comprehensive computational, data management, and networking infrastructure of unprecedented scale and capability. This is the idea of the TeraGrid--a cyberinfrastructure that integrates distributed scientific instruments, terascale and petascale computing facilities, multiple petabyte data archives, and gigabit (and soon terabit) networks--all widely accessible by scientists and engineers.

The development of such an infrastructure is critical to sustain U.S. competitiveness and to enable new advances in science and engineering. New scientific instruments and high-resolution mobile sensors are flooding us with new data, ranging from full sky surveys in astronomy to ecological and environmental data to genetic sequences. The TeraGrid is the blueprint for the infrastructure that will allow us to glean insights from this data torrent. Terabytes of data from individual experiments and petabytes from research collaborations will soon be the norm. Simply put, breakthrough science and engineering is critically dependent on a first-class computational and data management infrastructure.

In the long run, the TeraGrid vision will help to transform how we work and our notions of "research" and "computing." We will move away from "island universes" to a ubiquitous fabric where applications execute without explicit reference to place.

As an example, imagine an earthquake engineering system that integrates "teleobservation" and "teleoperation" enabling researchers to control experimental tools--seismographs, cameras, or robots--at remote sites, and provide real-time remote access to data generated by those tools. Combined with video and audio feeds, large-scale computing facilities for integrated simulation, data archives, high-performance networks, and structural models, researchers will be able to improve the seismic design of buildings, bridges, utilities, and other infrastructure.

Many such examples exist of how our understanding of our natural world will be enhanced and accelerated through the use of an integrated infrastructure. Similarly compelling examples exist in fields as diverse as biology and genomics, neuroscience, aircraft design, high-energy physics and astrophysics, and intelligent, mobile environments for IT research.

HPCwire: Does the DTF, in fact, constitute a de facto push by the NSF toward virtual unification of SDSC and NCSA?

REED: NCSA, SDSC, and their two partnerships, the Alliance and NPACI, each contribute unique and complementary skills and technologies to the collaborative development of the DTF TeraGrid. Concurrently, each will continue to separately develop and deploy new computing infrastructure as part of their ongoing PACI missions.

HPCwire: How would you characterize your leadership of NCSA? How does it differ from that of your predecessor, Larry Smarr? What are your greatest challenges at this time, and how are you dealing with them?

REED: NCSA and the Alliance are about enabling breakthrough science and engineering via advanced computing infrastructure. That is a long and rich tradition that both Larry and I believe in passionately. NCSA's role is not only to support today's computational science research but also to "invent the future" by developing those technologies that will make today's scientific dreams tomorrow's reality. The TeraGrid is THE NEXT MAJOR STEP along that path, one that leads NCSA and the Alliance to petaflops, terabit networks, hundreds of petabytes and ubiquitous mobile sensors. We're continuing to invent the revolution that will transform science and engineering research.