The new SUNCAT interface is available at http://suncat.ac.uk/ – this became the main SUNCAT interface in March 2014 but the old interface is still available. This new interface is built on a different platform and will therefore exhibit some differences in behaviour. We have discussed some of these in previous technical blog posts, and updated help and support documentation will also clarify the changes. In this post we will give some attention to how search results are ranked according to their relevance to the search terms.
One feature of the Solr search server, which we use to query the data, is that the results returned by a search include a relevancy score, or rank.
“Relevancy is the quality of results returned from a query, encompassing both what documents are found, and their relative ranking (the order that they are returned to the user).”
The scores are normalised to fall on a scale between 0 and 1, but you don’t need to worry about the scores as we don’t show them to you – we just use them to inform the resulting ordering that we show you. You can read more about relevancy scoring at http://wiki.apache.org/solr/SolrRelevancyFAQ.
By default the returned results are listed in order of relevance, with the most relevant first. This is what is reflected in the position column. Note that while the other sortable columns can be ordered in ascending or descending order, we do not allow the position column to be ordered in ascending order (i.e. from least relevant to most relevant). If you click on the position column header, the results will be ordered in descending order of relevancy.
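As a rough illustration of the ordering described above, the sketch below takes some raw scores, scales them to the range 0 to 1 and lists the results most relevant first. The record titles, the raw score values and the divide-by-the-top-score scaling are all assumptions made for illustration; this post does not say how SUNCAT actually normalises its scores.

```python
def rank_results(raw_scores):
    """Return (position, title, normalised_score) tuples, most relevant first.

    raw_scores maps a record title to a raw relevancy score (hypothetical values).
    Scaling by the top score is an assumption, not SUNCAT's documented method.
    """
    if not raw_scores:
        return []
    top = max(raw_scores.values())
    # Sort by score, highest first, so position 1 is the most relevant record.
    ordered = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (position, title, (score / top if top else 0.0))
        for position, (title, score) in enumerate(ordered, start=1)
    ]


sample = {"Record A": 2.0, "Record B": 8.0, "Record C": 4.0}
for position, title, score in rank_results(sample):
    print(position, title, round(score, 2))
```

The position is simply the index of each record after sorting, which is why it only makes sense in one direction.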
We have defined relevancy so that things like punctuation and capitalisation don’t affect a result’s score.
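One common way to achieve this is to normalise both the search terms and the indexed text before comparing them. The sketch below is a minimal illustration of that idea and not SUNCAT's actual analysis chain; the `normalise` function is a hypothetical helper.

```python
import string


def normalise(text):
    """Lower-case the text, turn ASCII punctuation into spaces and collapse whitespace."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.lower().translate(table).split())


# Both forms reduce to the same string, so they would be scored alike.
print(normalise("Journal of Hellenic Studies."))
print(normalise("journal of hellenic studies"))
```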
Boosting allows us to modify scores, so that matches on a particular search index (field) can be given more weight than matches on others.
We can boost the importance of a particular search field, of particular documents when we put them into Solr, or of a particular clause within a query used to search the data. SUNCAT currently performs a variety of boosting:
- Boost a result (significantly) if the search term matches exactly.
- Boost a result where the search terms occur close together (within 3 words of each other).
- When searches are made on the Title Keywords field, results are boosted if the search terms occur in the 245 MARC field (Title Statement), particularly any of the sub-fields $a (Title), $b (Remainder of title), $n (Number of part/section of a work) or $p (Name of part/section of a work).
So for example, searching for “Journal hellenic studies” in the Title Keywords field would produce results including Journal of Hellenic Studies as expected, and also Archaeological reports (which has “Journal of Hellenic Studies” in the Added Title field). However the former would appear higher up in the results because the search terms occur in the main title in the 245 field.
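For readers curious how boosts of this kind can be expressed, the sketch below builds a set of request parameters for Solr's Extended DisMax (eDisMax) query parser, where `qf` lists the fields to search with per-field boosts, `pf` boosts records in which the terms occur together as a phrase, and `ps` sets how far apart the terms may be for that phrase boost. The field names (`title_245`, `title_added`, `id`) and the boost values are hypothetical; this post does not publish SUNCAT's actual Solr configuration, and SUNCAT may not use eDisMax at all.

```python
from urllib.parse import urlencode


def build_title_keyword_params(terms):
    """Build illustrative Solr parameters for a Title Keywords search."""
    return {
        "q": terms,
        "defType": "edismax",
        # qf: fields to search; hypothetical names, with the 245 title weighted higher.
        "qf": "title_245^10 title_added^2",
        # pf: extra boost when the terms occur close together in the main title.
        "pf": "title_245^20",
        # ps: phrase slop used with pf, roughly "within 3 words of each other".
        "ps": "3",
        # fl: return the relevancy score alongside each record identifier.
        "fl": "id,score",
    }


params = build_title_keyword_params("Journal hellenic studies")
print(urlencode(params))
```

With weights like these, a record matching in the main title would outscore one matching only in an added title, which is the behaviour described in the Journal of Hellenic Studies example.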
Here are some sample searches for British trees; first, a search for records with any of the words “British” and “trees”.
There are more than 36,000 results. From around the 300th result onwards you will see many results which have been returned because they contain the word “British” but have little to do with trees. This is possibly not what you were interested in, but much like using a search engine, you can ignore the results at the end, because the most relevant ones are shown to you first. You aren’t forced to make a more precise search, though you can if necessary.
If we search for records with all (both!) of the words “British” and “trees”, we will get fewer results.
There are only two results that include both words. The Basic search feature uses this interpretation by default, searching for all the specified terms.
You could also search for the quoted phrase “British trees”, but this returns nothing because the exact phrase does not occur in any record.
Another aspect which affects the scoring of results is whether a search term has been stemmed. For example, when you enter the word “British”, it will be stemmed so that Solr will look for variations on it, such as “Brit” and “Britain”. Matches on the variations will have less influence over the score than precise matches to “British”.
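A toy version of this weighting is sketched below. The stem table, the two weights and the `term_score` function are all hypothetical, chosen only to mirror the “British” example above; real stemmers work from language rules, and SUNCAT's actual weights are not given in this post.

```python
# Hypothetical stem table mapping word variants to a shared stem.
STEMS = {"british": "brit", "britain": "brit", "brit": "brit"}

EXACT_WEIGHT = 1.0  # assumed weight for a precise match
STEM_WEIGHT = 0.5   # assumed (lower) weight for a match on a stemmed variant


def stem(word):
    """Return the stem for a word, or the lower-cased word if it has no entry."""
    word = word.lower()
    return STEMS.get(word, word)


def term_score(query_term, record_text):
    """Score one query term against a record: exact beats stemmed beats nothing."""
    query = query_term.lower()
    best = 0.0
    for word in record_text.lower().split():
        if word == query:
            return EXACT_WEIGHT
        if stem(word) == stem(query):
            best = STEM_WEIGHT
    return best


print(term_score("British", "British trees"))
print(term_score("British", "Trees of Britain"))
print(term_score("British", "Tropical trees"))
```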
It can be hard to unravel exactly what causes a particular record to get a higher score than another, because of the variety of factors and weightings that go into its calculation. The relevancy can be affected by the exactness of word matches, by their frequency, by how similar the words in the record are to the words in the search term, by how close together they are, by which fields they appear in, and by a variety of other factors that can be brought to bear on the scoring algorithm.
In deciding what aspects of the results should be considered most important, it is necessary to make trade-offs. The challenge is to make the results as intuitively sensible as possible, but it is not always possible to infer and reflect the exact intentions of the user – and sometimes particular combinations of boosting and searching on particular fields may give apparently counter-intuitive positioning to some results. Search algorithms are inherently heuristic and are an attempt to provide meaningful results to a simple query. In general, the more accurate and complete the underlying MARC records, the better the resulting scoring will be, much like trying to raise a website’s profile in a search engine.
The Advanced search feature provides more options, and more control over how search terms are interpreted, so that you can really pin down what you are searching for – but the basic search should in most cases provide a quick and effective doorway to the wealth of information in SUNCAT!