SUNCAT is based on the bibliographic serials records of its Contributing Libraries and the associated holdings. To create a clean search result, the records for the same title are put into one set and, at the point of display, the holdings for all records in the set are shown attached to the best bibliographic record (the preferred record). This means that the user, having located a journal in the database, is shown a coherent display of all the holdings information from the Contributing Libraries.
Building a union catalogue therefore involves establishing which of the submitted records from Contributing Libraries should be matched as they represent the same entity. The outcome is the creation of a large number of sets, with each set containing all the matched bibliographic records submitted by the Contributing Libraries, along with the associated holdings information. Where there is no match of a record with any other submitted record, the set will contain one record.
Updating a union catalogue involves running a complex set of processes. The key steps are the following:
1. Identifying, for each incoming record, the records already in the database that it might match. Possible matches are copied into the ‘Candidate Pool’. The selection of candidates uses 3 direct indexes: LCCN, ISSN/ISBN and a keyword index (a normalised index).
2. Running of processes to ascertain with which records (if any) in the candidate pool the incoming record matches. This involves running a matching algorithm which compares key elements (title, ISSN etc.) in the incoming record with each record in the candidate pool and assigning numerical values depending on whether or not there is a match between the data elements. The values for each element are totalled and, when the pre-determined threshold value is reached, a match is declared. The matched records are held in the ‘Matched documents buffer’. More information on the matching algorithm is provided here.
3. Recording the details of the matched records held in the ‘Matched documents buffer’ on the incoming record and on the record(s) already in the database. Each record in the database has an associated table (z120) for this purpose, which contains information such as the number of records with which it has matched and the system numbers of those matched records.
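As a rough illustration of steps 2 and 3, the sketch below scores an incoming record against each record in a candidate pool and records the matches on both sides. It is not the Aleph implementation: the weights, the threshold, the `Record` class, the `normalise` helper and the `matched_with` field (a stand-in for the z120 table) are all assumptions made for the example, and it only adds points for agreeing elements, whereas the real algorithm may also penalise disagreements.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical weights and threshold; the real values are not given in this post.
WEIGHTS = {"issn": 50, "lccn": 50, "title": 40}
MATCH_THRESHOLD = 80


@dataclass
class Record:
    system_number: str
    title: str
    issn: Optional[str] = None
    lccn: Optional[str] = None
    # Stand-in for the z120 table: system numbers of matched records.
    matched_with: set = field(default_factory=set)


def normalise(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(cleaned.split())


def score(incoming, candidate):
    """Total the values of the key elements that agree."""
    total = 0
    if incoming.issn and incoming.issn == candidate.issn:
        total += WEIGHTS["issn"]
    if incoming.lccn and incoming.lccn == candidate.lccn:
        total += WEIGHTS["lccn"]
    if normalise(incoming.title) == normalise(candidate.title):
        total += WEIGHTS["title"]
    return total


def match(incoming, candidate_pool):
    """Return the matched documents buffer and record matches on both sides."""
    buffer = [c for c in candidate_pool if score(incoming, c) >= MATCH_THRESHOLD]
    for candidate in buffer:
        incoming.matched_with.add(candidate.system_number)
        candidate.matched_with.add(incoming.system_number)
    return buffer
```

With these made-up weights, an incoming record that shares both ISSN and normalised title with a candidate scores 90 and is declared a match, while a record that agrees on title alone scores 40 and is not.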
EDINA had requested some improvements to the processes used to carry out the above and these improvements were included in the Aleph version 20 upgrade carried out in late 2013. The specific improvements are:
· New Candidates Pool Size Parameter. The allowed values are now between 100 and 500. This has been initially set at 250, meaning that if 250 or fewer records are identified, the process moves on to the matching stage. If more than 250 candidates are identified, the software deems the incoming record not to match and it therefore becomes a single-record set.
· Candidate Pool Common titles. A new routine has been added to help reduce the number of candidate titles. All candidate titles are compared with a list of common titles and if there is a match the candidate title is excluded from the candidate pool.
· Matched Documents Buffer. The size of this has been increased from 100 to 500. This expansion is necessary as the number of SUNCAT Contributing Libraries increases.
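The first two improvements can be sketched as a single filtering step. This is only an illustration under stated assumptions: the contents of `COMMON_TITLES`, the `normalise_title` helper and the function name are hypothetical, and the order of operations (common-title filter first, then the size check) is a guess rather than something stated above. Only the limit of 250 comes from the text.

```python
CANDIDATE_POOL_LIMIT = 250  # initial setting described above (allowed range 100 to 500)

# Hypothetical list of common titles; the real list is not given in this post.
COMMON_TITLES = {"bulletin", "annual report", "newsletter"}


def normalise_title(title):
    """Lower-case and collapse whitespace so titles compare consistently."""
    return " ".join(title.lower().split())


def build_candidate_pool(candidates, common_titles=COMMON_TITLES,
                         limit=CANDIDATE_POOL_LIMIT):
    """Filter (system_number, title) candidates.

    Returns the candidate pool, or None when the pool exceeds the limit,
    in which case the incoming record is treated as a single-record set.
    """
    # Assumed order: drop common titles first, then apply the size check.
    pool = [c for c in candidates
            if normalise_title(c[1]) not in common_titles]
    if len(pool) > limit:
        return None
    return pool
```

A caller would pass any non-empty pool on to the matching stage and treat `None` (or an empty pool) as "no match".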
As part of the ongoing SUNCAT software redevelopment, work has already begun on the design of a new, improved matching algorithm. More information on the approach being taken will be given in a future blog posting.