Lexical Scanner Design Issues

General Approach

At the highest conceptual level, the Growth House search engine is a metacontent manager providing an intelligent gateway to multiple content sources and backend databases on the net. Currently the search engine processes user search requests and returns URLs of relevant resources. Because the major biomedical databases on the net generally have well-defined search vocabularies and interfaces, it is feasible for the Growth House search engine to translate poorly structured user search requests into well-structured ones and then give the user the option either to request a more targeted search of the Growth House database itself or to pass the request in well-structured form to another biomedical database for remote execution.

The main function of the parsing program will be to examine short, natural-language English search strings and determine whether certain recognized strings occur in the input. If so, the user will be given feedback on what was found, along with a search result based on the original input. The user may then re-execute the search using the refinement suggestions made by the lexical scanner.

If the parsing program detects strings which suggest that a search at a particular biomedical database would be productive, it will generate return page HTML based on the specific Database Interface (DBI) protocol published by the data source. Because there are many possible places to route searches, a first step is to evaluate which of the databases are most useful for palliative care and to analyze the vocabulary systems which they employ. For example, if the program finds possible MeSH headings related to the search, they will be shown to the user with an option to pass the search directly to NLM for execution through an interface such as the Internet Grateful Med Request Scanner.
This is possible due to the well-defined interface which NLM provides to its various databases.

Some lexical scanning is already in place in the current production version of the Growth House search engine. The present lexical scanning features mainly look for structurally malformed search strings. Existing lexical scans include detection of inappropriate use of control characters such as + and -, as well as automatic removal of certain characters and substrings which could be used by malicious users to attempt to subvert system security controls.

The program will make use of a database of semantic relations defining associations between strings found in user input and known terms. The success of the project depends heavily on being able to use domain-specific terms. Experiments in natural language processing generally show that it is difficult to perform complex language comprehension tasks unless the language domain can be limited in some way. In this case we have the benefit of access to extensive computer logs of actual user searches, showing common English words and phrases which the database will need to know. These logs are collected automatically by the Growth House search engine and do not contain any identifying information about the user who did the search. Growth House server logs additionally show the original search strings used on major search engines such as Infoseek, AltaVista, HotBot, and others which resulted in the user following a link from that search engine to Growth House. Regular content analysis of these logs is currently done by Growth House to optimize our internal database and will permit us to construct a semantic database optimized for comprehension of questions about end-of-life care.

Because Growth House has relationships with palliative care specialists internationally, we will also test some semantic relations to perform multilingual translations.
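As a rough illustration of the semantic relations idea, the sketch below maps strings found in a search request to known terms. It is written in Python for brevity rather than the Perl planned for the project, and everything in it is an assumption: the `SEMANTIC_RELATIONS` table, its entries, and the `scan` function are hypothetical stand-ins, not the Growth House database or its vocabulary.

```python
import re

# Hypothetical semantic relations: surface string -> known term.
# These entries are illustrative only; a real table would be built
# from content analysis of the search logs.
SEMANTIC_RELATIONS = {
    "palliative care": "Palliative Care",
    "hospice": "Hospice Care",
    "grief": "Grief",
    "bereavement": "Bereavement",
}


def scan(query):
    """Return (surface string, known term) pairs found in the query."""
    # Lowercase and collapse whitespace so matching is insensitive to both.
    normalized = re.sub(r"\s+", " ", query.lower()).strip()
    found = []
    # Check longer phrases first so they are reported before single words.
    for surface in sorted(SEMANTIC_RELATIONS, key=len, reverse=True):
        # Word boundaries keep "grief" from matching inside "griefs".
        if re.search(r"\b" + re.escape(surface) + r"\b", normalized):
            found.append((surface, SEMANTIC_RELATIONS[surface]))
    return found
```

Calling `scan("Hospice and palliative  care for grief")` would report the three recognized strings with their associated terms, which could then be offered to the user as refinement suggestions. A plain dictionary is used here only to keep the sketch small; any keyed store would serve the same purpose.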
Such substitutions can be accomplished fairly easily for languages which can be represented using an ASCII Latin character set. Examples of simple semantic relations would include recognition of British versus American usage, or translation of terms between languages like French and English to the extent that the character sets are similar. Search results will be presented in English.

Overcoming The Stateless Web Paradigm

Because the web operates on a stateless model, effective iterative searching requires some method to store improvement strategies between search requests. The most common methods currently seen on the web for achieving this are:
Delivery Platform

Current thinking is to develop the application in Perl using standard CGI techniques to permit maximum integration with the existing Growth House search engine. An alternative would be to move toward a standalone SQL solution with a different back-end database for the semantic relations. Hardware and telecommunications will be outsourced and are scalable as needed. The much larger delivery environment used by Internet Grateful Med is described here.

Examples Of Design Issues
Malicious users sometimes attempt to circumvent system security by passing certain strings to CGI programs in the hope that the strings will let them execute system-level commands on the host server. To prevent this, all incoming strings are first passed through a security filter which examines them and removes potentially hostile substrings. This preliminary security filtration takes place before the lexical scanning that is done for search optimization. Security filtration is performed by the main search engine program itself as part of its extraction of control strings from incoming web requests, not by the lexical scanning subroutine.
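The ordering described above can be sketched as follows. This is a minimal Python illustration, not the production Perl code, and its details are assumptions: the text does not say which substrings the real filter removes, so the sketch substitutes a simple character allowlist, and `security_filter`, `lexical_scan`, and `handle_request` are hypothetical names. The lexical scan here stands in for the structural checks on + and - mentioned earlier.

```python
import re

# Assumed allowlist: letters, digits, whitespace, and the + and - operators.
# The real filter's rules are not given in the text.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s+\-]")


def security_filter(raw):
    """Replace every character outside the allowlist, then tidy whitespace."""
    cleaned = _DISALLOWED.sub(" ", raw)
    return re.sub(r"\s+", " ", cleaned).strip()


def lexical_scan(query):
    """Flag + or - operators that are not attached to a search term."""
    warnings = []
    for token in query.split():
        # A token made up only of operator characters has nothing to act on.
        if token.strip("+-") == "":
            warnings.append("operator %r has no search term" % token)
    return warnings


def handle_request(raw):
    """Security filtration runs first; lexical scanning sees only clean text."""
    safe = security_filter(raw)
    return safe, lexical_scan(safe)
```

Keeping the two steps in separate functions mirrors the division of responsibility in the text: the main program cleans the input, and the lexical scanning subroutine never receives unfiltered strings.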