Advancing Water Resources Research and Management
Symposium on Water Resources and the World Wide Web
Seattle, Washington, December 5-9, 1999
A database of environmental documents about an Urban Estuary, with a WWW-based, geographic interface
Kirk R. Barrett (1), Richard Holowczak (2) and Francisco J. Artigas (1)
Contents
Abstract
Keywords
Author Contact Information
Introduction
Background
Objective
Relationship to other efforts
Implementation
Database structure and development
GIS coverage development
Application Architecture
Text Query Interface
Geographic Query Interface
Discussion
Alternative approaches
Future issues
Closure
Acknowledgements
References
Figures
Abstract
The Hackensack Meadowlands Development Commission (HMDC) oversees the Hackensack Meadowlands District, a heavily impacted but ecologically valuable estuary in northeastern New Jersey. The HMDC has hundreds of papers and reports by government agencies, consultants, professors and students on environmental issues within the estuary, collected over many years. Researchers, regulators, consultants and the general public are interested in these documents, but, formerly, there was no systematic way to look up a document. This paper describes the creation of a database of metadata on these documents and a WWW-based geographic interface to the database.
HMDC and the Meadowlands Environmental Research Institute (MERI, whose mission includes creating advanced information management tools to aid HMDC and MERI researchers) designed the database. Besides document title, author(s), and date, the database contains other valuable metadata about each document, such as biota, water bodies, chemicals, media, land use type, habitat type, meteorology and hydrology, along with scans of the abstract and a figure. To capture geographic information, geo-registered coordinates of sampling/analysis locations were entered via HMDC's geographic information system.
MERI built a WWW-based interface to the database, integrating web server, database server and GIS server technologies. Through a map interface, a user can obtain a list of documents that studied a particular area. Conversely, through a text interface, a user can obtain a list of documents that report on, for example, a land use or cover type or a specific water body, with sampling/analysis locations displayed on a map.
Keywords: digital libraries, WWW, GIS, metadata
Author Contact Information
1 - Kirk R. Barrett and Francisco J. Artigas
Meadowlands Environmental Research Institute
Center for Information Management, Integration and Connectivity (CIMIC)
Rutgers University
180 University Avenue, Room 202, Newark, New Jersey 07102
Telephone 973-353-5026 Fax 973-353-5003
Email: meri@cimic.rutgers.edu
2 - Richard Holowczak
Department of Computer Information Systems, Baruch College,
City University of New York
17 Lexington Ave.
New York, NY 10010
E-Mail: richard_holowczak@baruch.cuny.edu
Introduction
Background
The Hackensack Meadowlands District encompasses a large part of the estuary fed by the Hackensack River in northeastern New Jersey. The 82 sq. km (32 sq. mile) District is regulated by the Hackensack Meadowlands Development Commission (HMDC), a state agency created in 1968. The Meadowlands are heavily affected by past industrial waste discharge and solid waste disposal. There are about 520 ha (1300 ac) of landfills [HMDC, 1999].
In addition to the River, the area is crisscrossed with numerous tidal creeks and human-made channels, with extensive tidal mud flats. Sediments are contaminated in many areas [USACE, 1995; Berman and Bartha, 1986; Konsevick, 1993]. About 3400 hectares (8400 acres) of wetlands remain in the district, many of them ecologically degraded and/or on contaminated soils. The HMDC has an active wetland restoration/enhancement program, affecting several hundred hectares. With large tracts of open space near the heart of metropolitan New York City, the Meadowlands are under intense development pressure and are carefully watched by environmental advocates.
As such, the Meadowlands are environmentally complex and ecologically, socially, economically and politically important. Accordingly, hundreds of environmental studies have been performed in the Meadowlands over the last 30 years, ranging from university theses to development permit applications to site remediation reports to internal HMDC reports. These documents are housed at HMDC offices; most of them are "gray literature", having never been widely published. A host of researchers, regulators, consultants and the general public is interested in these documents, but, formerly, there was no systematic way to look up a document, nor to learn about its contents.
The Meadowlands Environmental Research Institute (MERI) was established in 1998 by HMDC and Rutgers University's Center for Information Management, Integration and Connectivity (CIMIC). MERI's mission includes creating advanced information management tools to aid HMDC and MERI researchers.
Objective
The objective of the project described herein was to create a database about these documents, going well beyond a conventional electronic catalog (e. g., limited to author, title, date, subject) to include much valuable metadata about each document. Such data includes the locations, animals, plants and chemicals studied. Furthermore, we wanted to store the specific locations that were sampled or analyzed in each document. Future researchers will want to know what studies have been previously done on a particular area, or, conversely, what locations have been studied for a particular topic (e. g., mercury in sediments). This information will facilitate, for example, comparative studies of change over time and comparisons of one location to another.
Accessing this geographic data called for a geographic interface, so the database could be queried by location. In addition, we wanted the interface and database to be available through the World Wide Web, so it would be accessible to a large user base.
Relationship to other efforts
From a database professional's perspective, a database of metadata about documents is not novel. However, few comprehensive metadatabases exist in the environmental field. For example, the USEPA National Publications Catalog (http://www.epa.gov/ncepihom/catalog.html) allows searching only by title and subject. USGS Toxic Substances Hydrology Program On-line Reports (http://toxics.usgs.gov/cgi-bin/dwm/bib_search.cgi) is similar, but it also allows a search by site from a pick list. USGS Selected Water-Resources Abstracts (http://water.usgs.gov/swra/index.html) are searchable for any text, and by "hydrologic unit" or state.
Some commercial products also include searchable abstracts or full-text, but this is not equivalent to a metadata description. Others do provide some metadata. For example, Aquatic Sciences and Fisheries Abstracts includes an "Environmental Regime" attribute (http://www.libraries.rutgers.edu/fldguide/espmdb.htm). This allows a user to query for articles that studied brackish systems, for example.
The geographic query interface to an electronic catalog of documents is not a conceptual breakthrough, but it does appear to be a fairly novel approach. Few such interfaces seem to exist at present in the digital library field, and those that were or are being developed are major, large budget projects.
The Alexandria digital library project (http://alexandria.sdc.ucsb.edu/) provides facilities to zoom, pan, and dynamically define a search area on a map. Another notable application is the California Dams database, implemented by the Berkeley Digital Library Project (http://elib.cs.berkeley.edu/). This application allows the user to zoom, pan, and query individual dams by clicking on a symbol on a map.
There are numerous geographic interfaces to environmental data, notably, the USGS's interface to its network of stream gauging stations (http://water.usgs.gov/realtime.html). They use the clickable image map approach to select a particular station. Since the number of stations is static, this approach is appropriate. If the number of stations were dynamic (like the number of documents in a database), this approach would require continuous regeneration of the image map.
Implementation
Database structure and development
A conventional electronic catalogue would include only standard information such as title, author, and publication date. To be more useful, we decided to include additional metadata on each document, specifying a document's contents in terms of study areas, specific sampling locations, biota, water bodies, chemicals, media, land use type and habitat type.
Guidelines and standards for environmental metadata were reviewed [Michener, et al., 1997], including those of the National Biological Information Infrastructure (http://www.nbii.gov/standards/metadata.html). Since these were developed for actual numeric datasets and not for a database about documents, they were not entirely relevant. Still, they provided a useful reference in deciding what attributes to use to describe a document.
The attribute list was created by water resource and environmental professionals, aimed at what users would like to know about the documents in the database. The goal was to provide the "who, what, when, where and why" about a document, so a user could know what locations and subjects were addressed in a document. A report of the attributes of a typical record in the document database is presented in Figure 1.
A single-table database structure was initially developed by water resources professionals with limited formal database training, with the goal of quick structure development and simplicity of data entry. It was understood that any problems with the database structure could be addressed later by database experts. Microsoft Access was used to create the database because it was available to the database technician.
The database, which presently contains 250 records, was populated in three months by a half-time undergraduate technician. The technician collected paper copies of the documents in the offices of HMDC staff. Each document was assigned a unique document number, formed by the year of the document's publication and the numeric sequence in which documents from that year were processed (e. g., 1988-001 was the first 1988 document processed; 1988-002 was the second, and so on). She perused each document to determine its content relative to the database fields, and entered the data accordingly.
In addition to this data entry, she selected the most significant page of text (the abstract if available; otherwise the table of contents, conclusions or similar) and scanned the page, using optical character recognition software to save the results as a text file. Saving in text format (instead of a graphic format, the native format from scanning) allows the file to be searchable for text strings. She also scanned the most significant figure (the site/sampling map, if present, keeping in the spirit of metadata) if applicable, saving in a graphic format. The abstract and figure are in separate files, with the file names keyed to the document number, thereby associating them with their source document (i. e., the abstract from document 1988-001 was named "1988-001ab.txt").
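The numbering and file-naming conventions described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the actual numbering was done by hand by the technician.

```python
from collections import defaultdict


def make_numberer():
    """Return a function that assigns document numbers of the form YYYY-NNN."""
    counters = defaultdict(int)  # documents processed so far, per publication year

    def next_number(year: int) -> str:
        counters[year] += 1
        return f"{year}-{counters[year]:03d}"

    return next_number


def abstract_filename(doc_number: str) -> str:
    """File name of the scanned abstract, keyed to the document number."""
    return f"{doc_number}ab.txt"


next_number = make_numberer()
first = next_number(1988)
second = next_number(1988)
print(first, second, abstract_filename(first))
# prints: 1988-001 1988-002 1988-001ab.txt
```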
The technician also captured the specific sampling or analysis locations mentioned in the document, if any (about two-thirds of the documents had none; instead, they discussed locations in terms of more general place names, such as the name of a creek, a wetland or a property). First, she photocopied a sampling location map from the document, or transcribed the locations onto another map. Then, she took these maps to HMDC's geographic information system (GIS) center and entered the locations onto a map of the District (geo-registered to the New Jersey State Plane projection) using ARC/VIEW software. At each on-screen location corresponding to a paper map location, she used the mouse to enter a "point" feature, along with the document number associated with that point, creating a GIS coverage.
Quality control was provided by an environmental manager at HMDC, who was the most familiar with the collection of documents, and the MERI project manager. They periodically reviewed printed reports of the database and flagged errors for correction by the technician.
Although it afforded quick data entry, the single table design would not support the type of flexible queries required by end users. We designed a normalized database schema (Figure 2) with a flexible structure that allows a variable number of authors, study areas/locations, land types and study parameters to be associated with each document. The normalized schema was first implemented in MS Access and a series of SQL (structured query language) queries were used to populate it using data from the original single table.
Then, the entire schema was exported to an Oracle database to facilitate a web-based interface. The Oracle database management system (DBMS) is an "industrial strength" system capable of serving a large number of users via the web. The remainder of the application was built around the Oracle database.
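The move from a single table to a normalized structure can be illustrated with a minimal sketch. It uses Python's built-in SQLite module rather than MS Access or Oracle, covers only authors, and assumes the flat table held multiple authors as a semicolon-separated string. The table names, column names and sample rows are all hypothetical and are not taken from Figure 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flat_document (          -- stand-in for the original single table
    doc_number TEXT PRIMARY KEY,
    title      TEXT,
    authors    TEXT                   -- assumed: semicolon-separated names
);
CREATE TABLE document (doc_number TEXT PRIMARY KEY, title TEXT);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE document_author (
    doc_number TEXT REFERENCES document(doc_number),
    author_id  INTEGER REFERENCES author(author_id),
    PRIMARY KEY (doc_number, author_id)
);
""")
conn.executemany(
    "INSERT INTO flat_document VALUES (?, ?, ?)",
    [
        ("1988-001", "Example report A", "Smith, J.; Lee, K."),  # made-up rows
        ("1988-002", "Example report B", "Lee, K."),
    ],
)

# Single-valued attributes copy across with one SQL statement.
conn.execute("INSERT INTO document SELECT doc_number, title FROM flat_document")

# Multi-valued attributes become one linking row per (document, author) pair.
rows = conn.execute("SELECT doc_number, authors FROM flat_document").fetchall()
for doc_number, authors in rows:
    for name in (part.strip() for part in authors.split(";")):
        conn.execute("INSERT OR IGNORE INTO author (name) VALUES (?)", (name,))
        (author_id,) = conn.execute(
            "SELECT author_id FROM author WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO document_author VALUES (?, ?)", (doc_number, author_id)
        )

# A variable number of authors per document can now be queried with a join.
docs_by_lee = [
    row[0]
    for row in conn.execute(
        """SELECT d.doc_number FROM document d
           JOIN document_author da ON da.doc_number = d.doc_number
           JOIN author a ON a.author_id = da.author_id
           WHERE a.name = ? ORDER BY d.doc_number""",
        ("Lee, K.",),
    )
]
print(docs_by_lee)
```

The same pattern would apply to study areas, land types and study parameters, each with its own lookup table and linking table.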
GIS coverage development
Looking ahead to the geographic query interface to the database, we decided on a grid-based query approach because of its relatively straightforward implementation, while maintaining utility. A regular grid with cell size 580m x 580m (297 cells total) was overlaid onto a base map of the Meadowlands, creating a polygon coverage. Each grid cell was assigned a number (cell_id).
An intersection operation between the coverage with document numbers and sampling locations and the grid/cell_id coverage created a new table relating document number and cell_id.
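For a regular grid, the effect of this intersection can be approximated with simple arithmetic, as in the sketch below. It is not the GIS operation that was actually run. The grid origin, the number of columns, the row-major numbering of cells and the sample coordinates are all assumptions made for illustration; only the 580 m cell size comes from the text.

```python
CELL_SIZE_M = 580.0  # cell size stated in the text


def cell_id(x, y, x_min, y_min, n_cols, cell_size=CELL_SIZE_M):
    """Return the 1-based, row-major id of the grid cell containing (x, y)."""
    col = int((x - x_min) // cell_size)
    row = int((y - y_min) // cell_size)
    if col < 0 or col >= n_cols or row < 0:
        raise ValueError("point lies outside the grid")
    return row * n_cols + col + 1


def build_doc_cell_table(points, x_min, y_min, n_cols):
    """Map (doc_number, x, y) points to sorted, unique (doc_number, cell_id) pairs."""
    pairs = {(doc, cell_id(x, y, x_min, y_min, n_cols)) for doc, x, y in points}
    return sorted(pairs)


# Hypothetical sampling locations on a toy grid that is 3 cells wide.
points = [
    ("1988-001", 100.0, 100.0),  # column 0, row 0
    ("1988-001", 200.0, 300.0),  # same cell as the point above
    ("1988-002", 700.0, 100.0),  # column 1, row 0
    ("1988-002", 100.0, 600.0),  # column 0, row 1
]
doc_cell = build_doc_cell_table(points, x_min=0.0, y_min=0.0, n_cols=3)
print(doc_cell)
# prints: [('1988-001', 1), ('1988-002', 2), ('1988-002', 4)]
```

Two points from the same document that fall in the same cell produce a single row, which is what a cell-based query needs.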
Application Architecture
The application architecture is shown in Figure 3. The application is implemented as a distributed system that integrates the document metadata stored in a relational (Oracle) database, GIS data (the map of the Meadowlands district, sampling locations and grid) stored in ESRI ARC/INFO formats and conventional HTML web pages that provide a convenient single point of entry into the application. Version 8 of the Oracle DBMS was used with the WebDB web-interface to access the database from a web browser.
There are two query interfaces to the database: text and geographic.
Text Query Interface
The text interface uses a conventional HTML form on a web page to allow the user to formulate a query. The query conditions are accepted by WebDB and passed to stored procedures (written in Oracle’s proprietary PL/SQL language) in the Oracle DBMS. The procedures parse the incoming request, query the database and format the query results as HTML tables with embedded hyperlinks. The HTML is then sent back to the web browser for display.
The user can provide example keywords that match either the title or a description of the document. A portion of or all of an author’s name can be supplied. A date range covering the document’s publication date can be specified. Any combination of specific study areas and land types can be selected. The results of the query can be sorted by publication date, document number or document title. Figure 4 shows the text query screen.
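The way optional search conditions combine into one query can be sketched as follows. The actual application used PL/SQL stored procedures. This sketch uses Python with SQLite as a stand-in, a simplified single table, and hypothetical column names and sample rows, and it leaves out the study area and land type conditions.

```python
import sqlite3

# Allowed sort keys; user input is mapped through this table, never interpolated.
SORT_COLUMNS = {"date": "pub_year", "number": "doc_number", "title": "title"}


def build_query(keyword=None, author=None, year_from=None, year_to=None,
                sort_by="number"):
    """Return (sql, params) for whichever search conditions were supplied."""
    clauses, params = [], []
    if keyword:
        clauses.append("(title LIKE ? OR description LIKE ?)")
        params += [f"%{keyword}%"] * 2
    if author:
        clauses.append("authors LIKE ?")  # partial names are allowed
        params.append(f"%{author}%")
    if year_from is not None:
        clauses.append("pub_year >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("pub_year <= ?")
        params.append(year_to)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    order = SORT_COLUMNS[sort_by]
    return f"SELECT doc_number FROM document{where} ORDER BY {order}", params


def search(conn, **conditions):
    sql, params = build_query(**conditions)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (doc_number TEXT, title TEXT, description TEXT,"
    " authors TEXT, pub_year INTEGER)"
)
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?, ?, ?)",
    [  # made-up records for illustration only
        ("1988-001", "Metals in creek sediments", "Sediment survey", "Smith, J.", 1988),
        ("1993-001", "Bird census", "Heavy metal exposure in birds", "Lee, K.", 1993),
        ("1995-001", "Landfill leachate", "Hydrology study", "Smith, J.", 1995),
    ],
)

metal_docs = search(conn, keyword="metal")
recent_metal_docs = search(conn, keyword="metal", year_from=1990)
smith_by_title = search(conn, author="Smith", sort_by="title")
print(metal_docs, recent_metal_docs, smith_by_title)
```

Binding the user's values as parameters and restricting the sort column to a fixed set keeps arbitrary form input out of the SQL text.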
Once the results of a query are displayed, the user may click on several types of hyperlinks to view related documents. Figure 5 shows the results from a query on keyword "Metal" restricted to the Sawmill Creek and Berry’s Creek.
Clicking on an author’s name will retrieve all documents written (or co-authored) by that author. Clicking on a study area will retrieve all documents that studied that area. Clicking on a Land Type will retrieve all documents that studied similar land types. Finally, on the left hand side, there are three links: Show Map, Show Abstract and Show Figure. Clicking on Show Abstract or Show Figure displays the respective abstract or figure. Clicking on the Show Map link invokes the geographic interface described below to display the sampling locations (if any) that are referenced in the document.
Geographic Query Interface
A noteworthy feature of the application is its geographic interface that allows a user to obtain a list of documents that sampled a particular location, and to display the sampling locations for a particular document. This interface uses Arc/Info GIS products: a base map of the Meadowlands and the coordinates of sampling locations.
In the geographic interface, the user is presented with a map of the Meadowlands district (Figure 6). Superimposed on the map are the grid cells and sampling location indicator symbols. Each symbol indicates that a document sampled or analyzed at that location.
The interface has two modes of operation: zoom/pan and select. In the zoom/pan mode, a user can zoom in the map to look at a smaller region more closely. In the select mode, the user selects a particular cell for a query.
Access to the ARC/INFO GIS coverages is handled through the ESRI MapObjects server. MapObjects integrates with a host’s web server and accepts requests from a web browser. MapObjects routines parse the request and perform an appropriate action. In the zoom/pan mode, the results of the zoom/pan action in MapObjects are converted to a GIF image that is then sent back for display by the web browser.
In select mode, a MapObjects routine reads the number of the cell (cell_id) that the user has clicked. The cell_id is passed to an Oracle-stored procedure that incorporates it into an SQL query. The procedure searches the table with document numbers and cell_ids for records that contain the cell_id in question, returning the list of relevant documents, which are displayed as in the text interface query.
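The database side of the select mode amounts to a lookup on the table relating document numbers to cell_ids. The sketch below shows that lookup in Python with SQLite rather than as an Oracle stored procedure. The table names, column names and sample rows are hypothetical.

```python
import sqlite3


def documents_in_cell(conn, cell_id):
    """Return (doc_number, title) for every document with a location in the cell."""
    sql = """SELECT DISTINCT d.doc_number, d.title
             FROM doc_cell dc
             JOIN document d ON d.doc_number = dc.doc_number
             WHERE dc.cell_id = ?
             ORDER BY d.doc_number"""
    return conn.execute(sql, (cell_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (doc_number TEXT PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE doc_cell (doc_number TEXT, cell_id INTEGER)")
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    [  # made-up records for illustration only
        ("1988-001", "Example report A"),
        ("1988-002", "Example report B"),
        ("1993-001", "Example report C"),
    ],
)
conn.executemany(
    "INSERT INTO doc_cell VALUES (?, ?)",
    [("1988-001", 1), ("1988-002", 2), ("1988-002", 4), ("1993-001", 2)],
)

clicked = documents_in_cell(conn, 2)
print(clicked)
# prints: [('1988-002', 'Example report B'), ('1993-001', 'Example report C')]
```

A cell with no sampling locations simply returns an empty list.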
Discussion
As mentioned above, the grid-based approach was selected primarily for its straightforward implementation. Since the Meadowlands District is relatively small (82 sq. km or 32 sq. mile), a single grid resolution was sufficient. A medium-sized grid avoids retrieving many geographically irrelevant documents (which would happen with a large grid), while also avoiding the tedious and error-prone task of selecting small cells.
In this database, the numbers of documents (around 250) and locations (219) are manageable. If these numbers were very large, then a single-resolution grid approach might not be practical: a query on a single cell could return an unmanageably large number of documents.
If the grid-based query approach were to be applied at a larger scale (say thousands of sq. km) or if the number of documents or sampling locations was very large, then multiple grid resolutions would probably be necessary, so a user could retrieve documents associated with a large area (e. g., New Jersey) and a small one (e. g., a wetland in the Meadowlands).
A more general approach would be to allow the user to define a search area by clicking-and-dragging a circle, square, or, most generally, a polygon with user-specified vertices. This interface would have been much more difficult to implement.
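The query behind such a user-drawn search area can be sketched with a standard ray-casting point-in-polygon test, shown below. This was not implemented in the application and the sketch covers only the spatial test, not the map interaction. The function names and coordinates are hypothetical.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    vertices is a list of (x, y) tuples in order around the polygon.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:  # crossing lies to the right of the point
                inside = not inside
    return inside


def documents_in_polygon(points, vertices):
    """Return sorted document numbers with at least one location inside."""
    return sorted({doc for doc, x, y in points if point_in_polygon(x, y, vertices)})


# Hypothetical sampling locations and a triangular search area.
points = [
    ("1988-001", 100.0, 100.0),
    ("1988-002", 900.0, 900.0),
    ("1993-001", 600.0, 300.0),
]
search_area = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
found = documents_in_polygon(points, search_area)
print(found)
# prints: ['1988-001', '1993-001']
```

Unlike the grid approach, this test has to run against the stored coordinates at query time instead of against a precomputed table of cell_ids.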
An alternative (or complementary) approach to either of the above is to use the boundaries of features (i. e., creeks, wetlands, landfills, property names, etc.) as a search area. In this approach, a user would select (click on) a feature, and the documents with sampling locations within that feature would be returned. Although this approach is more intuitive than the grid-based approach, it presents numerous difficulties. First, it requires the boundaries of each feature, some of which are likely to be poorly defined (e. g., a wetland boundary, a creek with many small tributaries), not readily available, or overlapping. Also, it would limit the ability to query overlapping features or subareas within a feature.
The feature-based approach would help address the biggest problem with the application right now, that is, that two-thirds of the documents do not list specific sampling locations and are therefore not accessible through the geographic interface. However, finding these documents requires a "query by feature", not a "query by location". That is, the query would have to search through the metadatabase for a match with the selected feature, not search based on the sampling locations.
Unfortunately, users will probably have trouble appreciating the difference between a feature-based query and a location-based query. Making all documents accessible through a location-based query would require adding specific study location(s) to documents that do not have any. Using the centroids of the features a document studied is an obvious approach for doing so. However, for documents that study large or irregularly shaped areas, this may not be sufficient.
A hybrid approach would be to provide a list of features, each with its own list of associated grid cells. A user could select one of these features on the map, which would automatically select the associated grid cells.
Another approach could have avoided the MapObjects programming task altogether. This approach would use a clickable image map, with each location indicator symbol hot-linked to a document record. Creating this image map would be extremely tedious, although it would be possible to automate. This approach would be difficult to update because it would require recreating the image map each time a new document with new location coordinates is added. Also, it would not be possible to return all document records for documents with overlaid locations. And, with today's browser technology, it would not be possible to zoom into the map.
Future issues
In the future, it is desirable for all documents to provide sampling/analysis locations whenever appropriate. Furthermore, since the accuracy of the sampling locations in the database is limited by the accuracy in the source document (which is presently highly variable and not easily assessed), the locations should be geo-registered. With today's low-cost global positioning systems, this is not a highly burdensome request.
Ideally, the entire text of a document would be accessible through the web interface. However, this would presently require laborious page scanning. In the future, if documents were submitted electronically, it would facilitate making full text available.
Placing the catalog on the web has raised the possibility of a large number of requests for full copies of documents. This has not been the case in the past, so there is no staff or procedure set up to handle a large number of requests. Currently, the database is under password protection until such a procedure can be developed and implemented. The password is available to HMDC staff, and can be requested and granted on a case-by-case basis to others.
Closure
The development of a database about HMDC's environmental documents, complete with valuable metadata, abstracts, figures, and sampling locations, has already proven valuable to MERI researchers. The geographic interface with a grid cell-based query approach was relatively straightforward to implement and works effectively.
Such a database of documents and interface would undoubtedly be useful for other organizations. It is probably only a matter of time before more organizations implement this approach. We hope our efforts will inspire other organizations to embark on similar efforts.
Acknowledgements
We would like to acknowledge the efforts of Ms. Lily Konsevick, Mr. Mustafa Kilic, Ms. Soon Ae Chun, and Mr. Edward Konsevick. We thank HMDC for their financial support.
References
1. HMDC. 1999. Land Use Update 1999. Geographical Information Systems Unit, Hackensack Meadowlands Development Commission, Lyndhurst, New Jersey.
2. US Army Corps of Engineers (USACE). 1995. Hackensack Meadowlands Draft Environmental Impact Statement/Special Area Management Plan. US Army Corps of Engineers.
3. Berman, M. and R. Bartha. 1986. Control of the methylation process in a mercury-polluted aquatic sediment. Environmental Pollution, Series B. v 11, pp. 41-53.
4. Konsevick, E. 1993. Accumulation of Chromium in Blue Crabs (Callinectes sapidus) from the Hackensack River, Hudson County, New Jersey: Final Report. Hackensack Meadowlands Development Commission. 104 pp.
5. Michener, W.K., J.W. Brunt, J.J. Helly, T.B. Kirchner and S.G. Stafford. 1997. Non-geospatial metadata for the ecological sciences. Ecological Applications 7(1):330-342.
Figures
Figure 1: Report of a typical record
Figure 2: Normalized database schema
Figure 3: Application architecture
Figure 5: Result of a text query
Figure 6: Meadowlands district with indicators of sampling location (yellow squares) and grid cells
Copyright © 1999 American Water Resources Association