Modular Bibliometrics information system with proprietary software

Gilberto Sotolongo-Aguilar *, Carlos Suárez-Balseiro **, Maria V. Guzmán-Sánchez * (contacts)

Contents:

Introduction
Background and purposes
Problems with home-made applications and advantages of proprietary software
Methods
Results and Discussion
Procedure
Some Comments
Results
Conclusions
Bibliography

INTRODUCTION

Bibliometric information systems are the workbench of research on Science and Technology (S&T) indicators. As an important part of this field, such systems require a flexible design, both to obtain accurate, customized indicators and to incorporate new features resulting from the latest developments. This paper describes an open and flexible bibliometric information system. The system offers a simple modular design, connectivity for desktop work and low cost. It is useful for practical work as well as for education and training. Its limitations are mainly due to available memory; the maximum volume that can be processed is one million documents.

BACKGROUND AND PURPOSE

Today, finding one's way through the huge volumes of information generated by so many different means and media has become a challenge for information professionals. The birth and development of new disciplines such as "data mining" and "knowledge discovery" (Fayyad et al., 1996; Adriaans and Zantinge, 1996; Cabena et al., 1997; Dhar and Stein, 1997; Swanson and Smalheiser, 1997), specifically oriented towards the search for and interpretation of new knowledge through research into information production and consumption processes (IPCP), clearly signals the increasing importance of everything related to the quantitative and qualitative analysis of huge data corpora.

With this in mind, bibliometric techniques, which aim at the study, classification and assessment of the production and consumption of scientific information by means of quantitative methods and statistical treatment of data, become one of the fundamental tools available to information professionals in their search for indicators. They allow a "critical appraisal" of scientific research, as well as of the interaction among researchers, institutions and knowledge areas.

This has led to increased efforts to systematize and standardize the methods and tools used in bibliometrics. For Glanzel (1996), bibliometrics is a complex discipline: although it is classified among the social sciences, it is closely conditioned by the pure and technological sciences. For this reason, any methodological characterization requires, on the one hand, well-documented methods and data processing, a clear description of the sources and an exact definition of indicators and, on the other hand, an effective selection and integration of the applied technologies. Ravichandra Rao (1996) asserts that there is no single method that can be applied to every study based on bibliometric techniques, but rather different procedures for different problems. Grivel, Polanco and Kaplan (1997) emphasize what they call the "informatic infrastructure" upon which bibliometrics can develop its full potential. For these authors, bibliometrics should have not only a methodology characterized by an adequate mathematical representation but also an effective "informatic architecture". The work of Katz and Hicks (1997), Small (1998), Sotolongo-Aguilar, Guzmán-Sánchez and García-Díaz (1998) and others points in the same direction. They present projects aimed at integrating different informatic tools by means of publicly available proprietary software, building up platforms that fit the needs of different approaches to bibliometric research. Probably one of the most exciting product-projects is DATAVIEW from CRRM, which could be classified as software with "very high integration".

In this paper we report on one component of ongoing research devoted to defining and assessing the different stages of a procedure for studying the production and consumption of information by means of bibliometric techniques. The procedure is supported by the integration of different widely used, easy-to-use software packages, with the following aims:

To consolidate an informatic infrastructure for carrying out studies by means of bibliometric techniques.
To elaborate a scientifically based proposal for generalizing methods, and other elements to be considered, in the study of information production and consumption processes by means of bibliometric techniques.

PROBLEMS WITH HOME-MADE APPLICATIONS. ADVANTAGES OF PROPRIETARY SOFTWARE

In bibliometric research, everybody has experienced the need to build in-house applications. The problem arises when results must be generalized. In-house applications are rarely well documented, which makes them difficult for others to use. As a result, only the members of the team are able to replicate the use of such an application, and standardization stalls.

Proprietary software, on the other hand, is well documented, and the validation of its techniques is straightforward. In addition, many teams of developers are continuously improving the performance of such software.

METHODS

The system consists of six modules based on proprietary software. Each of them performs defined functions. The modules are the following:

(1) Bibliographic Searches
(2) File Conversion & Handling
(3) Bibliographic Reference Management
(4) Basic Statistics
(5) Basic Bibliometric Analysis
(6) Advanced Statistics

Modules interact according to the following procedure:

Bibliographic searches are conducted online or on CD-ROM. The resulting files are downloaded and converted by module (2), File Conversion & Handling. The converted files are the input to module (3), Bibliographic Reference Management, where the database is standardized. The fields under study, or a combination of them, are exported and saved as text files. Those files are then processed in module (4), Basic Statistics, where all the basic statistics can be computed from frequency analysis using built-in functions. The input for module (5), Basic Bibliometric Analysis, is prepared in module (3), as is the input for module (6), Advanced Statistics. Different scenarios can be implemented by varying the elements inside each module.
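The frequency analysis at the heart of module (4) can be illustrated with a minimal Python sketch. It is not part of the system described here, which relies on proprietary software; the function name frequency_table and the sample export (one field value per line, one line per record) are assumptions made only for illustration.

```python
from collections import Counter


def frequency_table(lines):
    """Count how often each non-empty value occurs, most frequent first."""
    values = [line.strip() for line in lines if line.strip()]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical export of a journal-title field: one value per record.
exported = [
    "Scientometrics",
    "Vaccine",
    "Scientometrics",
    "",
    "Lancet",
    "Vaccine",
    "Scientometrics",
]

table = frequency_table(exported)
for value, count in table:
    print(f"{value}\t{count}")
```

The tab-separated output has the same shape as a two-column frequency table (value, number of records), which is the kind of table the later modules take as input.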

RESULTS & DISCUSSION

SCENARIO

One possible software scenario, module by module, is the following:

Bibliographic Searches depend on the research topic; in biomedicine, for example, SPIRS and WINSPIRS (both from SilverPlatter), PubMed, Internet Grateful Med or The Query E-mail Retrieval System from NLM, etc.
File Conversion & Handling is based on BiblioLink II (PBS, Inc.), which converts virtually any file downloaded from a CD-ROM or online service; this program is the natural partner of Pro-Cite. As a complement, MS WORD is used for text processing, including some file modifications.
Bibliographic Reference Management is handled by Pro-Cite (DOS, Windows 95) from PBS, Inc., which is very good at managing bibliographic references and allows standardization of data.
Basic Statistics is performed by MS EXCEL and its complement xlStat. This makes it possible to profit from all the built-in features of the program, including its graphing and function features.
Basic Bibliometric Analysis is performed by means of The Bibliometric ToolBox, freeware offered by Terrence Brooks; this software performs all the basic bibliometric analyses (Bradford, Lotka, Zipf) and also offers a set of complementary value-added bibliometric data.
Finally, Advanced Statistics is based on STATISTICA (StatSoft, Inc.), used mainly for correlations as well as for cluster and multidimensional analysis.

PROCEDURE

The scenario described above operates according to the following procedure. Bibliographic searches are conducted online or on CD-ROM. The resulting files are downloaded and processed by BiblioLink II, which converts them according to a selected configuration that depends on the host and the fields to be studied. The converted file is already in Pro-Cite format, so one can switch directly to the bibliographic reference management features of Pro-Cite. Here the database is standardized. Many different treatments can take place, including the building of Authority Lists from the contents of different fields, including an Authority List of all the words in any field or in the whole database. The fields under study, or a combination of them, are exported and saved as text files. EXCEL then imports those files. All the basic statistics can be computed from frequency analysis, aided by the Pivot Table feature of EXCEL and complemented by the built-in Analysis Functions available in the Tools menu. Frequency tables obtained from the Pivot Table are copied and pasted into a Word file; the table is then converted to text, saving the data with paragraph marks as separators. The resulting text file is the input for basic bibliometric processing by The Bibliometric ToolBox. All the analyses performed by The Bibliometric ToolBox can be recorded in a text file. EXCEL can also be used to build the matrices that serve as input for Cluster Analysis, Factor Analysis and Multidimensional Scaling. Those matrices are exported as EXCEL sheets, imported by STATISTICA and finally processed.
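To make the matrix-building step more concrete, the following Python sketch shows one way a co-occurrence matrix could be derived from a multi-valued field. It is only an illustration under stated assumptions: the procedure above builds these matrices in EXCEL, and the function name cooccurrence_matrix, the semicolon separator and the sample records are hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(records, separator=";"):
    """Build a symmetric term-by-term co-occurrence matrix.

    Each record is one string holding several terms (for example
    descriptors or authors) joined by `separator`; this export format
    is an assumption made for the example.
    """
    term_sets = [
        {term.strip() for term in record.split(separator) if term.strip()}
        for record in records
    ]
    terms = sorted(set().union(*term_sets))
    index = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in terms]
    for term_set in term_sets:
        # Diagonal cells hold the number of records containing the term.
        for term in term_set:
            matrix[index[term]][index[term]] += 1
        # Off-diagonal cells count records in which both terms appear.
        for a, b in combinations(sorted(term_set), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return terms, matrix


# Hypothetical descriptor field, one record per string.
records = [
    "vaccines; meningitis; cuba",
    "vaccines; meningitis",
    "cuba; bibliometrics",
]

terms, matrix = cooccurrence_matrix(records)
for term, row in zip(terms, matrix):
    print(term, row)
```

A matrix of this kind, saved as a spreadsheet, is the sort of input that cluster analysis, factor analysis and multidimensional scaling routines expect.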

SOME COMMENTS

With regard to bibliographic searches in biomedicine, we have been using The Query E-mail Retrieval System from NLM, an e-mail retrieval engine that works very well. As for bibliographic reference management software, we have used ProCite extensively, from version 2.02 (MS-DOS) up to the latest available version, 4.01 (for Windows). The advantage of the latter is that it integrates its companion file-conversion software, BiblioLink II. It also works very smoothly. Other reference management software has been tested, e.g. EndNote and Reference Manager, including their latest versions. All of them, with their advantages and disadvantages, fit this model. Up to nearly 40 reference management packages on the market could be eligible for these tasks. Statistical packages are another important component. EXCEL is undoubtedly widely used and copes very well with many bibliometric tasks. A very good complement, as already mentioned, is xlStat, with many useful features for cluster analysis and multidimensional scaling.

Although it has not yet been tested by this team, DATAVIEW seems to fit very well into the platform presented here. Its "very high integration", combined for example with reference management software, could give very good results.

RESULTS

This system platform guarantees comprehensive traceability of all data, from the first data downloaded to the last chart obtained. At the same time, consistent results are attained because all the steps described above are reproducible. The bibliographic data in the database can also be used to build bibliographies.

The bibliometric output of the system includes, among others, the following indicators:

All bibliometric distributions (Bradford, Lotka and Zipf, with different alternatives);
Frequency tables of all the fields processed, including graphs;
Activity Indices according to different criteria;
Co-occurrence matrices of all the multitext fields processed and the corresponding Cluster Analysis, Factor Analysis and Multidimensional Scaling, including bibliometric maps.
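As an illustration of one of the distributions listed above, Lotka's law states that the number of authors y who have produced x papers follows y = C / x^n, with n close to 2 in many data sets. The Python sketch below estimates C and n by ordinary least squares on log-transformed data. It is a simplified example, not the procedure used by The Bibliometric ToolBox, whose estimation method may differ; the function name fit_lotka and the sample counts are hypothetical.

```python
import math


def fit_lotka(author_counts):
    """Estimate C and n in y = C / x**n by least squares on log-log data.

    `author_counts` maps x (papers per author) to y (number of authors
    with exactly x papers). At least two distinct x values are needed.
    """
    xs = [math.log(x) for x in author_counts]
    ys = [math.log(y) for y in author_counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    # log y = log C - n log x, so C = exp(intercept) and n = -slope.
    return math.exp(intercept), -slope


# Hypothetical author productivity table (papers -> number of authors).
observed = {1: 144, 2: 36, 3: 16, 4: 9}

c, n = fit_lotka(observed)
print(f"C = {c:.1f}, n = {n:.2f}")
```

The sample counts were chosen to follow an inverse-square law exactly, so the fit recovers n = 2; real author productivity tables deviate from the model and are usually checked with a goodness-of-fit test.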

CONCLUSIONS

The benefits of pursuing the development outlined in this paper are twofold. On the one hand, by integrating public domain software in a flexible modular design, comprehensive automated processing and data representation stages of research can be achieved, in contrast to the cumbersome tasks required by other means. On the other hand, this platform rests on widely used software that is regularly updated and upgraded, in contrast with ad hoc software, which becomes outdated very rapidly.

The bibliometric information system described here has proved to be a flexible, upgradable and useful working platform. Improvements are foreseen. Participation in testing this platform is welcome, as are new ideas for incorporating modules or improving the existing ones.

BIBLIOGRAPHY

Adriaans, P., Zantinge, D. Data Mining. Addison-Wesley, 1996.

Cabena, P. et al. Discovering Data Mining: From Concept to Implementation. Prentice Hall, 1997.

Dhar, V., Stein, R. Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, 1997.

Fayyad, U.M. et al. Advances in Knowledge Discovery and Data Mining. Massachusetts Institute of Technology, 1996.

Glanzel, W. The need for standards in bibliometric research and technology. Scientometrics 35(2): 167-176, 1996.

Grivel, L., Polanco, X., Kaplan, A. A computer system for big scientometrics at the age of the world wide web. Scientometrics 40(3): 493-506, 1997.

Katz, J.S., Hicks, D. Desktop Scientometrics. Scientometrics 38(1): 141-153, 1997.

Ravichandra Rao, I.K. Methodological and conceptual questions of bibliometric standards. Scientometrics 35(2): 265-270, 1996.

Small, H. A general framework for creating large-scale maps of science in two or three dimensions: the SCIVIZ system. Scientometrics 41(1-2): 125-133, 1998.

Sotolongo-Aguilar, G., Guzmán-Sánchez, M.V., García-Díaz, I. Bibliometric Information System for Desktop Research. 5th International Conference on Science and Technology Indicators: Use of S&T Indicators for Science Policy and Decision-Making, 4-6 June 1998, Hinxton, Cambridge, England.

Swanson, D.R., Smalheiser, N.R. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 91: 183-203, 1997.

The authors:

* The Finlay Institute; Address: Calle 212 #3112, e/31 y 37, Lisa, Habana, CUBA; Mailing Address: POBox 16017, Cod. 11600 Habana, CUBA. Phone: 53-7-336212, 53-7-212280 (work); 53-7-215639 (home); Fax: 53-7-336075, 53-7-336754; E-mail: finlayci@infomed.sld.cu

** Universidad de La Habana, Facultad de Comunicación; E-mail: csbgv@bib.uc3m.es