New Approaches to Sharing Mathematics and Science Databases

 

Len Simutis, Director

Eisenhower National Clearinghouse

For Mathematics and Science Education

The Ohio State University

USA

email: len@enc.org

 

Abstract: Prior to the emergence of the World Wide Web, descriptions of mathematics and science education resources were almost always stored in one of two formats: as structured records in databases with defined fields to describe the resources, or in structured lists, such as those used for digests or bibliographies. In both instances, there was a defined pattern or structure to the descriptions that allowed for retrieval of the information using traditional search schemes for databases, or served as a formatting aid to visually scan information in list form. When the World Wide Web emerged, it appeared that structured databases had been superseded by the combination of browsing from a set of briefly annotated links and searching the universe of web resources from indexes built from millions of web pages. For the user, however, searching the web has been frustrating and unrewarding, particularly as the number and range of web resources have grown. Fortunately, new database tools and descriptors have been developed to make web search engines more effective. This paper describes these new tools, including metatagging schema adapted for use in mathematics and science education. The paper also describes emerging collaborative efforts for searches across distributed web resources to help deal with qualitative issues concerning the accuracy and appropriateness of materials for particular educational audiences and uses.

 

Searching the web: the current situation

The current situation for searching the World Wide Web for specific resources is not a happy one. Each person has a favorite example of the frustrations of searching for relevant resources on the web. Six months ago, mine was to pick any of the major search engines and try to find resources to help a middle school teacher deal more effectively with equity issues in teaching mathematics. The teacher enters "middle school mathematics equity" in the search engine window, and discovers that the list displayed includes links to pages on home mortgages near the top of the list. Recently, the situation has improved with many search engines, so that among the 25 million hits returned from such a search, fewer than 10% have to do with mortgages or finance, and they no longer appear near the top of the list. No doubt the search engines are weighting the results based on commonality of terms in the search, with educational terms given precedence over investment or finance in this instance. Still, no one can seriously consider reviewing more than the first 100 or so results without getting tired or dismayed by the range of resources identified. Following even these 100 links will consume considerable time, and will not necessarily identify materials of immediate interest.

But what if the teacher wished to be more specific, such as 7th grade geometry concepts dealing with female gender equity issues? With a more specific search, the results are even less on target: the first page of the result list includes returns on women's sports equity and faculty pay issues, along with help wanted ads in Nebraska and a Virginia court case. Many search engines now display a hierarchical browse structure, with active links, above the list of web sites in the result set. An example of these hierarchical browse structures is depicted in Figure 1. The purpose of the browse structure is to help the user locate resources which may more specifically address the search terms. By inspection, however, one will see that there are a variety of paths to follow, but none which overlap the three key areas--grade level (7th), subject matter (geometry) and issue (gender equity). Many of the topics are unrelated to the search, such as Social sciences > Women's studies, and Business > Gender diversity training. Of particular interest is to note how the

____________________________________________________________________________________

Directory Topics Jump to > Web Search Results - Search Box

Sports > Sports > Gender equity in sports

Women > Education & careers > Gender equity in education

Cultures & lifestyles >

--Gender equity --Women's political issues

Education > K-12 > K-12 schools > Middle schools > Middle schools in North America

Science & nature > Mathematics > Geometry

Social sciences > Women's studies

Education problems & issues > Equity issues

Family > Kids > At school > Math > Geometry for kids

Business > At work > Training > Gender diversity training

_____________________________________________________________________________________

Figure 1: browse structure returned with search for "7th grade geometry gender equity."

 

subject matter--geometry--has been embedded in the browse structure Science & nature > Mathematics > Geometry. How are these browse structures created? In one of two ways--brute force classification by the cyber equivalent of library catalogers who assign attributes classifying various sites by subject or keyword, and/or machine analysis of text to derive subject or topical attributes. In essence, additional subject or keyword terms are added to the terms indexed for the site to assure that the site will be retrieved in the browse structure. Why must this be done? Because a web site may have materials of interest concerning the topic of gender equity, but the terms may not appear explicitly at the site, and so would not show up in the list of terms gathered by the search engine indexing schemes. The search retrieval software is thus augmented to search first the terms created by the search engine's catalogers, then the indexed pages themselves.
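The two-stage retrieval described above can be illustrated with a minimal sketch. The site identifiers, the term sets, and the `search` function are all hypothetical and are not drawn from any actual search engine; the sketch assumes that a query matches a site only when every query term is present.

```python
from typing import Dict, List, Set

# Hypothetical data: terms assigned by catalogers, and words found on the pages.
CATALOG_TERMS: Dict[str, Set[str]] = {
    "site-a": {"gender", "equity", "mathematics"},
    "site-b": {"geometry"},
}
PAGE_WORDS: Dict[str, Set[str]] = {
    "site-a": {"girls", "classroom", "mathematics"},
    "site-b": {"geometry", "triangles", "equity"},
    "site-c": {"mortgage", "equity", "rates"},
}


def search(query: str) -> List[str]:
    """Return site ids: cataloger-term matches first, then page-text matches."""
    terms = set(query.lower().split())
    # Stage 1: sites whose cataloger-assigned terms cover the whole query.
    by_catalog = [s for s in sorted(CATALOG_TERMS) if terms <= CATALOG_TERMS[s]]
    # Stage 2: sites whose own page text covers the query, skipping stage-1 hits.
    by_text = [
        s for s in sorted(PAGE_WORDS)
        if terms <= PAGE_WORDS[s] and s not in by_catalog
    ]
    return by_catalog + by_text
```

In this sketch, `search("gender equity")` retrieves `site-a` even though neither word appears in its page text, which is the situation the cataloger-assigned terms are meant to address.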

The search engine developers--from Yahoo to Lycos to Inktomi--have now spun into diversified for-profit ventures with a variety of information products and services. Netscape and Yahoo, for example, display the hierarchical browse structures for major search categories on their home pages. While education is one of the launch points, it is increasingly surrounded by searches for commercial goods and services whose vendors can generate advertising revenue (and pay fees to rise higher on the retrieval lists from searches). One can expect that more time will be spent classifying commercial sites than educational sites given the increased opportunities for revenue--a logical, though not very beneficial, decision for educators and the students and parents they serve. Ideally, it is in the best interest of those who produce the information to provide the subjects and keywords which most beneficially describe a company's or organization's web resources, and not rely on others to do so. As will be seen later, software tools are being developed to accomplish this desirable objective. But first, a look at structured databases, and whether they have similar limitations from the user's perspective.

Structured databases from a user's perspective

The most familiar front-end to structured information databases is now the online library catalog. In this environment, one can search bibliographic records by author, title or subject using preformatted search screens provided by the software vendor. Since there are a variety of vendors, each with software that emphasizes particular retrieval and display features, the user often has to learn different software to use multiple systems. Librarians have recognized this dilemma for some time, and have developed the Z39.50 protocol to allow a single search front-end to retrieve bibliographic records from multiple databases. Thus, if a library has Z39.50 compatible software, one could search multiple distributed catalogs with a single search by title or author. The Z39.50 protocol addresses many important issues for the user searching bibliographic databases, but it falls short of dealing with the complexities of full-text documents in electronic format on the web precisely because bibliographic records are highly structured--basically they take the same form and are displayed in the same format (author, title, publisher, etc.)--while electronic documents are considerably less structured. There is no consistent format, for example, for curriculum materials or lesson plans in electronic format comparable to the formats consistently applied to bibliographic records.

What is the current situation for searching for curriculum materials, lesson plans or other educational resources within structured databases? Using the "7th grade geometry gender equity" example, the user must currently conduct searches using unique search and display interfaces developed by various information providers. So, for example, a user could visit the Eisenhower National Clearinghouse and search using the Resource Finder provided there, or search the AskERIC database using a variety of search and display interfaces, or visit the Math Forum at Swarthmore and search their database. Each of these databases has similar, though not overlapping, content in most instances, but the way in which the information is searched and displayed varies considerably. Users would like a single result set that satisfies the search criteria, rather than having to visit (at least) three different sites to gather appropriate and useful information. It would be helpful if the Z39.50 protocol could be of assistance in this situation, but unfortunately, the protocol is not easily extensible in its current form to deal with the extended records that make up databases such as ENC and the Math Forum. ENC, for example, includes the entire table of contents for materials, as well as information on audience, standards, and grade level, none of which are included in the Z39.50 protocol. The Math Forum, AskERIC, ENC and dozens of other databases devoted to curriculum materials or other educational resources go well beyond traditional bibliographic formats in order to provide more immediate and useful information for teachers and others.
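The idea of a single result set drawn from several differently structured databases can be sketched as follows. This is only an illustration under stated assumptions: the two record layouts, the field names, and the sample records are hypothetical and do not reflect the actual formats of ENC, AskERIC, or the Math Forum.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Adapter = Callable[[Dict[str, str]], Record]


def from_source_a(raw: Dict[str, str]) -> Record:
    """Map the (hypothetical) field names of source A onto a shared layout."""
    return {"title": raw["Title"], "subject": raw["Subject"].lower(),
            "grade": raw["GradeLevel"], "source": "A"}


def from_source_b(raw: Dict[str, str]) -> Record:
    """Map the (hypothetical) field names of source B onto the same layout."""
    return {"title": raw["name"], "subject": raw["topic"].lower(),
            "grade": raw["grade"], "source": "B"}


# Hypothetical sample records held by two separate providers.
SOURCE_A = [{"Title": "Polygons for All", "Subject": "Geometry", "GradeLevel": "7"}]
SOURCE_B = [
    {"name": "Angles and Equity", "topic": "geometry", "grade": "7"},
    {"name": "Fractions Fun", "topic": "arithmetic", "grade": "4"},
]


def merged_search(subject: str, grade: str,
                  sources: List[Tuple[Adapter, List[Dict[str, str]]]]) -> List[Record]:
    """Query every source and return one result set, sorted by title."""
    hits: List[Record] = []
    for adapter, raw_records in sources:
        for raw in raw_records:
            record = adapter(raw)
            if record["subject"] == subject and record["grade"] == grade:
                hits.append(record)
    return sorted(hits, key=lambda r: r["title"])
```

A call such as `merged_search("geometry", "7", [(from_source_a, SOURCE_A), (from_source_b, SOURCE_B)])` returns matching records from both providers in one list. The per-source adapter functions stand in for the agreement on shared fields that the providers would need to reach.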

Often, a deterrent to the use of structured data accessible via the web is the presentation of a separate search interface, particularly one which attempts to take advantage of the ability to make complex searches from a rich, large database. For most web users, a search engine is a one-line interface used to type in a few keywords. It is not one that involves pull-down menus, browse lists of subjects or keywords, or the ability to construct Boolean searches. Compare, for example, the main search page at Yahoo, www.yahoo.com, along with the associated browse list structure, with the advanced search page for the ENC Resource Finder, watt.enc.org/main2.html. The former is a familiar front-end to a search page, either for web-wide searches or for searches of individual sites--see, for example, ENC's site search page: enc.org/rf/nf_index.htm#site. This simple format invites a brief listing of terms of interest, but leads, as discussed earlier, to a large number of returned hits, many of which may be irrelevant or inappropriate. The latter front-end to the ENC database of curriculum resources provides sophisticated features and increased likelihood of retrieving relevant resources, but it is daunting to many web users. One alternative interface for those who use the web is to create a browse structure for retrieving records from a structured database--see, for example, watt.enc.org/cgi-bin/tree0.pl--but the tradeoff is reduced functionality for constructing precise searches.

New developments to improve search capabilities

We know enough firsthand about the shortcomings of current approaches to searching the web--albeit with acknowledgment for the tremendous increase in information discovery capability over the last five years. What does the future hold for addressing some of the problems identified above in searching web pages and structured databases to address the increasingly sophisticated, yet time-sensitive needs of education users? One of the first developments to address these problems has been in the works for several years, and can be described under the general heading of metadata. Metadata is analogous to data definitions for fielded data sets. Thus, author, title, document type, media are all carefully defined, and those who adopt the metadata agree to follow the data definitions. For the domain of instructional and other educational resources, the Dublin Core metadata standards are particularly useful and appropriate. The Dublin Core grew out of decades of experience by librarians developing and using the MARC format for cataloging library resources. The Dublin Core was developed under the leadership of the Online Computer Library Center in Dublin, Ohio. The Dublin Core essentially creates a framework for developing a catalog of electronic resources. Information concerning the Dublin Core can be found at purl.org/DC. Using extensions to the Dublin Core, metadata tags can be created in web documents and databases so that standard descriptors such as subject, grade level, instructional methods, and standards can be employed in a consistent fashion.
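As an illustration of how such metadata tags might appear in a web document, the following minimal sketch renders Dublin Core elements as HTML `meta` tags using the common `DC.` name prefix. The function name `dc_meta_tags` and the sample values are hypothetical; the element names are the fifteen elements of the Dublin Core set. Extension fields such as grade level are deliberately left out, since their names depend on the schema a community adopts.

```python
from html import escape
from typing import Dict, List

# The fifteen element names of the Dublin Core metadata element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dc_meta_tags(fields: Dict[str, str]) -> List[str]:
    """Render Dublin Core fields as HTML meta tags, one tag per field."""
    tags = []
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        # Escape the value so quotes or angle brackets cannot break the tag.
        content = escape(value, quote=True)
        tags.append(f'<meta name="DC.{name}" content="{content}">')
    return tags
```

For example, `dc_meta_tags({"title": "Area of Triangles", "subject": "geometry"})` produces two tags that could be placed in the head of an HTML page.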

The second technical development is the next generation of web document markup language, called XML. XML is designed to replace HTML with a considerable number of new features and enhancements. Like the Dublin Core, XML is based upon years of previous research and application in the development and use of SGML, originally developed by IBM. Detailed information on the XML standard can be found at www.w3.org/XML. XML uses Document Type Definitions (DTDs) derived from SGML to allow web documents to be consistently encoded with metadata. Thus, a teacher who has created a lesson plan for a science class would use XML to encode metadata describing subject, grade level, equipment required, method of assessment, duration of the lesson, etc. Once the metadata is encoded, a search engine could be programmed to read the encoded data fields and assign higher weights or specific categories when conducting searches. So, a metatag such as grade level would include the numeric entries for the appropriate grade or grades, and the search engine would then be able to return meaningful results even if the text of the lesson plan itself did not refer to grade level.
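A minimal sketch of the grade-level example follows. The element names (`lesson`, `title`, `gradeLevel`, `body`) and the sample lessons are hypothetical and do not come from any published DTD; the sketch only assumes that grade level is encoded in its own field rather than in the running text.

```python
import xml.etree.ElementTree as ET

# Hypothetical lesson plans; the element names are illustrative only.
LESSONS_XML = """
<lessons>
  <lesson>
    <title>Measuring Angles</title>
    <subject>geometry</subject>
    <gradeLevel>7</gradeLevel>
    <body>Students use protractors to measure angles in polygons.</body>
  </lesson>
  <lesson>
    <title>Counting Coins</title>
    <subject>arithmetic</subject>
    <gradeLevel>2</gradeLevel>
    <body>Students in grade 7 classrooms sometimes mentor this activity.</body>
  </lesson>
</lessons>
"""


def titles_for_grade(xml_text: str, grade: int) -> list:
    """Return lesson titles whose gradeLevel field equals the given grade."""
    root = ET.fromstring(xml_text)
    return [
        lesson.findtext("title")
        for lesson in root.findall("lesson")
        # Match on the encoded field only; the body text is never consulted.
        if lesson.findtext("gradeLevel") == str(grade)
    ]
```

Here the first lesson is found for grade 7 although its text never mentions a grade, while the second lesson is not returned even though the phrase "grade 7" appears in its body.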

The third technical development to improve search capabilities is RDF, or Resource Description Framework. RDF is basically an agreed-upon schema for describing electronic resources that can be used by search engines and other programs to parse the metadata included in XML electronic documents as described and adopted by particular communities of users. RDF makes it possible for a search engine to know that the metatag "grade" may have specific meanings for different audiences that make use of metadata. For example, "grade" in an educators' schema will be different from "grade" used by civil engineers or diamond merchants. RDF allows communities to develop agreed-upon meaning and syntax for metadata that is appropriate for the intended users. Metadata, XML and RDF will work together to provide the document preparation and processing environment to return more meaningful results from search engines.
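The way a schema can separate the two meanings of "grade" may be sketched with plain XML namespaces, one of the mechanisms RDF builds upon. This is not a full RDF example: the namespace URIs, the record layout, and the function name are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs standing in for two communities' schemas.
EDU = "http://example.org/schema/education"
DIAMOND = "http://example.org/schema/diamonds"

DOC = f"""
<records xmlns:edu="{EDU}" xmlns:dia="{DIAMOND}">
  <record id="lesson-1"><edu:grade>7</edu:grade></record>
  <record id="stone-1"><dia:grade>VS1</dia:grade></record>
</records>
"""


def grades_in_schema(xml_text: str, schema_uri: str) -> dict:
    """Return {record id: grade} for 'grade' tags belonging to one schema."""
    root = ET.fromstring(xml_text)
    found = {}
    for record in root.findall("record"):
        # ElementTree addresses namespaced tags as "{uri}localname".
        value = record.findtext(f"{{{schema_uri}}}grade")
        if value is not None:
            found[record.get("id")] = value
    return found
```

Because each "grade" tag is qualified by its schema, a search restricted to the educators' schema returns only the lesson record and never the diamond record.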

Although XML and RDF are not yet implemented in commercial browsers or search engines, efforts are underway to build metadata structures that will take full advantage of these capabilities as they become available. One project which incorporates many of the desired characteristics is GEM, the Gateway to Educational Materials, funded by the U.S. Department of Education and led by Syracuse University. GEM takes elements developed in the Dublin Core and provides extensions to more fully describe educational resources. Vocabulary for the various metadata fields is derived from ERIC descriptors, the Eisenhower National Clearinghouse (ENC), and the work of GEM taskforces to develop consistent document descriptions. Organizations that contribute resources to the GEM virtual collection agree to adopt GEM metadata protocols. In the case of ENC, for example, its entire collection of over 11,000 detailed descriptions of math and science curriculum resources was exported from its current database with GEM metadata. Others will create new GEM descriptions using GEM cataloging software or by direct inclusion into tag fields in HTML files. Currently the GEM Consortium is composed of 40 organizations that have contributed approximately 3,000 resource descriptions to the GEM collection. Growth of such a consortium requires additional institutional commitment of educational resources, and the time and training necessary to employ the GEM metadata schema, but momentum is certainly in favor of continued growth, given the long-term advantages for educational users. A similar effort is underway with the sponsorship of EDUCAUSE for the development of metadata for use with college and university resources. Information about this metadata schema, called the Instructional Management System (IMS), is available at www.imsproject.org.

In order to take advantage of the benefits of metadata, it is not essential that all metadata elements in a particular schema be used. For example, a particular developer of educational materials may choose to include only a subset of metadata tags; as long as this subset is encoded following the established schema, searches will be enhanced by its inclusion. Similarly, a developer may choose to use two or more schema in a particular document--one for educational materials, and perhaps one for commercial distribution and income flows. This is particularly likely to occur if commercial developers choose to use the IMS schema, since many of its features are tied to for-fee transactions. As various RDF schema are developed, it is essential that they be developed from a common "core" of appropriate metadata for the discipline or field, just as the Dublin Core has emerged for educational and research resources in electronic format. It is obvious that the various schema must be developed in concert, and that communities of interest must engage seriously and openly in developing metadata that will be useful and accurate for the resources created by and for the community.

One of the major impediments to the use of metadata is arriving at agreements concerning the vocabulary to be used consistently in the metadata fields. Some disagreements are at the nuisance level: is it "computer-based" instruction, or "computer-assisted" instruction? Others are far more critical, particularly in the development of subject matter hierarchies that describe subject domains. Some terms may be tied to passing educational trends; others may be the basis of fundamental debate in a discipline. All require consistent cataloging and periodic analysis to assure currency and accuracy. Just as authors prepare articles for journals using prescribed formats, it is likely that over time, the vocabulary and procedures for embedding metadata in electronic resources will become a byproduct of document creation. But until that time arrives, a fair investment in time and training will be required for documents already on the web as well as those to be added in the future. This cataloging can be done by third parties with the necessary subject matter knowledge and expertise to describe the electronic documents to meet the metadata standards. In addition, organizations like GEM and IMS are already far along in the development of cataloging tools to make this process more consistent and less costly. Organizations with large document collections may also choose to employ metadata so that documents for internal as well as external use are more effectively located. As more and more student projects and teaching materials are produced in electronic formats, metadata can also be a useful way of archiving these resources for use on school intranets.
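The nuisance-level vocabulary problem mentioned above, where variant phrasings describe the same thing, is commonly handled by mapping variants onto one preferred term before cataloging or searching. The sketch below assumes a hypothetical mapping table; the choice of "computer-assisted instruction" as the preferred form is arbitrary and is not taken from ERIC, GEM, or any other controlled vocabulary.

```python
# Hypothetical mapping from variant phrasings to one preferred descriptor.
PREFERRED = {
    "computer-based instruction": "computer-assisted instruction",
    "computer based instruction": "computer-assisted instruction",
    "cai": "computer-assisted instruction",
}


def normalize_term(term: str) -> str:
    """Lower-case, collapse whitespace, then map variants to the preferred term."""
    key = " ".join(term.lower().split())
    # Terms with no entry in the table pass through in cleaned-up form.
    return PREFERRED.get(key, key)
```

Applying `normalize_term` both when records are cataloged and when queries are entered means that a search for either phrasing reaches the same records. The harder questions of subject hierarchies and contested terms are not addressed by a lookup table of this kind.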

In effect, metadata is an effort to transform the web from being a very large collection of unstructured "flat" files and isolated database records that can only be searched with great difficulty, into a structured database environment with superior search and retrieval capabilities. It also is an effort to increase the likelihood that trusted and useful electronic resources will be made available by content providers and developers so that less time is spent wading through perhaps interesting but unrelated resources to find the few that are of interest. We are, for better or worse, still in the Paleolithic stage of the web, struggling to be hunters and gatherers of information as best we can, following paths defined by others, breaking twigs along the way, or at least creating bookmarks so that we won't be lost when we return. Metadata is a logical and necessary next step to increase our search and retrieval capabilities as the scale and scope of electronic information resources continue to grow quickly. The use of metadata and associated schema will make it possible for search engines to make more intelligent use of keywords entered into search screens, so that through analysis of these terms in the context of specific schema, the likelihood of retrieving relevant resources will be increased.