Earlier today Imperial College submitted its annual report on compliance with the Research Councils’ open access policy to RCUK. The RCUK OA policy envisages a five-year journey, at the end of which, in 2018, 100% of RCUK-funded scholarly papers should be available as open access. To support the transition to open access, RCUK have set up a block grant that makes funds available to institutions to cover the cost of article processing charges (APCs) and other OA-related expenses. Funds are awarded in relation to RCUK research funding for institutions, and Imperial College has the second largest allocation, just behind Cambridge and followed by UCL. The annual reports to RCUK give an overview of institutional spend and compliance.
The headline figures for the 2015/2016 College report are:
£1,051,130 block grant spend from April 2015 to March 2016
89% overall compliance, split into 31% via the gold and 58% via the green route
570 article processing charges paid at an average cost of ~£1,800
The top five publishers are: Elsevier, Wiley, Nature, ACS and OUP
Like every year when discussing the RCUK report figures I think it is important to highlight that compliance rates between universities cannot meaningfully be compared without understanding the data sources and methods used. Just to give one example: the College could also have reported 81% green and 8% gold from the same data.
Why do I caution against directly comparing the numbers? For starters, research-intensive universities find it difficult to establish what 100% is. With hundreds, or in the case of Imperial College many thousands, of papers published every year we rely on academics to tell us manually who the funder is for each paper. Even though the College has made much progress improving its processes and data over the past few years, we have to acknowledge that data collected through such a process will never be complete or fully accurate. For the College report we decided, like in previous years, to base our analysis on outputs we know to have been RCUK-funded. For this year the size of the sample was 1,923 papers (compared to 1,326 in 2014). With a different sample the numbers would have been different, and other universities may have taken a different approach to analysing the data.
Sadly, it is currently not easy to establish whether an output was made available open access. Publishers do not usually add licensing information to metadata, and searching for manuscripts deposited in external repositories is possible but not necessarily accurate. The process we used for analysis was:
Cross-reference the sample with the journal list from the Directory of Open Access Journals; class every article published in a full OA journal as compliant ‘gold’.
Take the remaining articles and cross-reference with the list of articles for which the College Library has paid an APC; class all those articles as compliant ‘gold’.
Take the remaining articles and cross-reference with the outputs from ResearchFish that show a CC BY license; class all those articles as compliant ‘gold’.
Take the remaining articles and cross-reference with the list of outputs deposited in the College repository Spiral; class all those articles as compliant ‘green’.
Take the remaining articles and cross-reference with the list of outputs that have a Europe PubMed Central ID; class all those articles as compliant ‘green’.
As in previous years we also put the remaining outputs through Cottage Labs’ Lantern tool, but this showed no additional open access outputs. The main reason for that, I suspect, is the high compliance via the green route: some 81% of outputs in the sample had been deposited to the College repository Spiral or to Europe PMC. As the College prefers green over hybrid gold it would have been in line with our policy to report them as green, but as RCUK prefers gold OA we decided to report all outputs known to be gold as such, like in previous years.
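For readers who want to try this on their own data, the cascade can be sketched in a few lines of Python. This is an illustration only, not the script used for the College report: the flag names (`in_doaj`, `apc_paid` and so on) and the four toy records are assumptions made up for the example. It also shows why the same data can produce a different gold/green split depending on which route is checked first.

```python
from collections import Counter

# Checks in the order described above: (flag on the record, route reported).
# The flag names are hypothetical; real values would come from DOAJ, the
# Library's APC records, ResearchFish, Spiral and Europe PMC.
GOLD_FIRST = [
    ("in_doaj", "gold"),
    ("apc_paid", "gold"),
    ("cc_by_in_researchfish", "gold"),
    ("in_spiral", "green"),
    ("has_epmc_id", "green"),
]
# Alternative order: report anything deposited as green before checking gold.
GREEN_FIRST = GOLD_FIRST[3:] + GOLD_FIRST[:3]


def classify(record, checks):
    """Return the route of the first check the record passes, else 'unknown'."""
    for flag, route in checks:
        if record.get(flag):
            return route
    return "unknown"


def summarise(records, checks):
    """Percentage of records per route, rounded to whole percent."""
    counts = Counter(classify(r, checks) for r in records)
    return {route: round(100 * n / len(records)) for route, n in counts.items()}


# Toy sample of four papers (made up, not College data).
sample = [
    {"in_doaj": True, "in_spiral": True},
    {"apc_paid": True, "has_epmc_id": True},
    {"in_spiral": True},
    {},
]

gold_first = summarise(sample, GOLD_FIRST)    # gold 50, green 25, unknown 25
green_first = summarise(sample, GREEN_FIRST)  # green 75, unknown 25
```

Overall compliance is the same in both runs; only the split between the routes changes, which is the effect behind the two sets of figures quoted earlier.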
I could write more about reporting issues around open access, but as I have done that on a few other occasions I refer those who haven’t suffered enough to my previous posts.
One other caveat should be raised for those planning to analyse the APC spend in comparison with previous years: The APC article level data is based on APCs paid during the reporting period. This differs from the APC data reported in the previous period which was based on APC applications published. There are, therefore, a small number of records duplicated from the previous year. These have been identified in the notes column.
In what is hopefully not going to become a long series I am today dealing with the joys of compliance reporting in the context of HEFCE’s Policy for open access in the post-2014 Research Excellence Framework (REF). The policy requires that conference papers and journal articles that will be submitted to the next REF – a research assessment through which funding is allocated to UK universities – be deposited in a repository within three months of acceptance for publication. Outputs that are published as open access (“gold OA”) are also eligible, and during the first year of the policy the deposit deadline has been extended to three months from publication. The policy comes into force on 1 April, and considering the importance of the REF the UK higher education sector is now pondering the question: how compliant are we?
As far as Imperial College is concerned, I can give two answers: ‘100%’ and ‘we don’t know’.
‘100%’ is the correct answer as until 1 April all College outputs remain eligible for the next REF. While correct, the answer is not very helpful when trying to assess the risks of non-compliance and for understanding where to focus communications activities. Therefore we have recently gone through a number crunching exercise to work out how compliant we would be if the policy had been in force since May last year. In May 2015 we made a new workflow available to academics, allowing them to deposit outputs ‘on acceptance’. The same workflow allows academics to apply for article processing charges for open access, should they wish to.
You would imagine that with ten months of data we would be able to give an answer to the question for ‘trial’ compliance, but we cannot, at least not reliably. In order to assess compliance we need to know the type of output, date of acceptance (to work out if the output falls under the policy), the date of deposit and the date of publication (to calculate if the output was deposited within three months). Additionally it would help to know whether the output has been made open access through the publisher (gold/immediate open access).
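To make the data dependency concrete, here is a minimal sketch of the check in Python. It is not how Symplectic Elements or any other system calculates compliance; the function names, the status labels and the `first_year` switch are my own assumptions, and it ignores the gold OA route and the ISSN rule for conference proceedings. The point is that a missing date leaves the answer at "unknown" rather than "compliant" or "late".

```python
import calendar
from datetime import date
from typing import Optional

IN_SCOPE_TYPES = {"journal article", "conference paper"}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def check_compliance(
    output_type: str,
    accepted: Optional[date],
    deposited: Optional[date],
    published: Optional[date],
    policy_start: date,
    first_year: bool = True,
) -> str:
    """Return 'compliant', 'late', 'not deposited', 'out of scope' or 'unknown'."""
    if output_type not in IN_SCOPE_TYPES:
        return "out of scope"
    if accepted is None:
        return "unknown"  # cannot tell whether the policy applies at all
    if accepted < policy_start:
        return "out of scope"
    if deposited is None:
        return "not deposited"
    # In the first year the three months count from publication,
    # afterwards from acceptance.
    reference = published if first_year else accepted
    if reference is None:
        return "unknown"
    return "compliant" if deposited <= add_months(reference, 3) else "late"


# Hypothetical 'trial' run, as if the policy had started in May 2015.
trial_start = date(2015, 5, 1)
status = check_compliance(
    "journal article",
    accepted=date(2015, 6, 1),
    deposited=date(2015, 8, 15),
    published=None,  # publication date not yet received from the publisher
    policy_start=trial_start,
)
```

In this example the output was deposited well within three months of acceptance, yet `status` is "unknown" because the first-year rule needs a publication date that has not arrived.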
Below are eight issues that prevent us from calculating compliance:
Publisher data feeds do not provide the date of acceptance
Publishers do not usually include the date of acceptance in their data feeds, therefore we have to rely on authors manually entering the correct date on deposit. Corresponding authors would usually be alerted to acceptance, but co-authors will not always find out about acceptance, or there may be a substantial delay.
Deposit systems do not always require date of acceptance
The issue above is made worse by not all deposit systems requiring academics to enter the date of acceptance. In Symplectic Elements, the system used by Imperial, the date is mandatory only in the ‘on acceptance’ workflow; when authors deposit an output that is already registered in the system as published there is currently no requirement to add the date, which results in the output being listed as non-compliant even if it was deposited in time. Some subject repositories do not even include fields for date of acceptance.
Difficulties with establishing the status of conference proceedings
Policy requirements only apply to conference proceedings with an ISSN. Because of the complexities with the publishing of conference proceedings we often cannot establish whether an output falls under the policy, or at least there is a delay (and possible additional manual effort).
Delays in receiving the date of publication
It takes a while for publication metadata to make it from publishers’ into institutional systems. During this time (weeks, sometimes months) outputs cannot be classed as compliant.
Publisher data feeds do not always provide the date of publication
This may come as a surprise to some, but a significant number of metadata records do not state the full date of publication. The year is usually included, but metadata records for 18% of 2015 College outputs did not specify year or month. This percentage will be much higher for other universities as the STEM journals (in which most College outputs are published) tend to have better metadata than arts, humanities and social sciences journals.
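A rough way to measure this in your own data is to classify each publication date by how precise it is. The sketch below assumes dates arrive as ISO-style strings ("2015", "2015-03" or "2015-03-14"); that format, the function names and the toy list are assumptions for illustration, and real feeds will need more forgiving parsing.

```python
import re

# Year, optionally followed by -MM and -DD (ISO-style; an assumption).
_DATE_PATTERN = re.compile(r"\d{4}(-\d{2})?(-\d{2})?")


def date_precision(raw):
    """Classify a date string as 'day', 'month', 'year', 'missing' or 'unparsed'."""
    text = (raw or "").strip()
    if not text:
        return "missing"
    match = _DATE_PATTERN.fullmatch(text)
    if match is None:
        return "unparsed"
    if match.group(2) is not None:
        return "day"
    if match.group(1) is not None:
        return "month"
    return "year"


def share_incomplete(raw_dates):
    """Fraction of records without a full (day-level) publication date."""
    incomplete = sum(1 for raw in raw_dates if date_precision(raw) != "day")
    return incomplete / len(raw_dates)


# Toy values, not College data.
toy_dates = ["2015-03-14", "2015-03", "2015", None]
incomplete_share = share_incomplete(toy_dates)  # 3 of 4 lack a full date
```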
Publisher data feeds usually do not provide the ‘first online’ date
Technically, even where a full publication date is provided the information may not be sufficient to establish compliance. To get around the problem that publishers define publication dates differently, HEFCE’s policy states that outputs have to be deposited within three months of when the output was first published online. This information is not usually included in our data feeds.
Publisher data feeds do not usually provide licence information
Last year, Library Services at Imperial College processed some 1,000 article processing charges (APCs) for open access. We know that these outputs would meet the policy requirements. However, when the corresponding author is not based at Imperial College – last year around 55% of papers had external co-authors – we have no record of whether they requested that the output be made open access by a publisher. For full open access journals we can work this out by cross-referencing the Directory of Open Access Journals. However, for ‘hybrid’ journals, where open access is an (often expensive) option, we cannot track this as publisher metadata does not usually include licence information.
We cannot reliably track deposits in external repositories
Considering the effort that universities, across the UK in particular, have put into raising awareness of open access, there is a chance that outputs co-authored with academics in other institutions have been deposited in their institutional repository. Sadly, we cannot reliably track this due to issues with the metadata. If all authors and repositories used the ORCID identifier it would be easier, but even then institutional repositories would have to track the ORCID iD of all authors involved in a paper, not just those based at their university. If we had DOIs for all outputs in the repositories it would be much easier to identify external deposits.
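To illustrate why DOIs would help: once both sides record a DOI, finding external deposits becomes a simple set intersection. The sketch below is an assumption-laden toy; the function names and the example DOIs are made up, and it says nothing about how the external records would actually be harvested.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw):
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    doi = raw.strip().lower()  # DOIs are case-insensitive
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def external_deposits(our_dois, external_dois):
    """Return the DOIs of our outputs that also appear in external repositories."""
    ours = {normalise_doi(d) for d in our_dois if d}
    theirs = {normalise_doi(d) for d in external_dois if d}
    return ours & theirs


# Made-up DOIs for illustration only.
ours = ["10.1234/ABC.1", "doi:10.1234/abc.2", None]
elsewhere = ["https://doi.org/10.1234/abc.1", "10.9999/xyz.7"]
found = external_deposits(ours, elsewhere)  # {"10.1234/abc.1"}
```

Outputs without a DOI (the `None` above) simply drop out of the comparison, which is exactly the gap described in this section.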
Considering the issues above, reliably establishing ‘compliance’ is at this stage a largely manual effort that would take too much staff time for an institution that annually publishes some 10,000 articles and conference proceedings – certainly while the policy is not yet in force. Even come April I would rate such an activity as perhaps not the best use of public money. Arguably, publisher metadata should include at least the (correct) date of publication and also the licence, although I cannot see a reason not to include the date of acceptance. If we had that, reporting would be much easier. If we had DOIs for all outputs (delivered close to acceptance) it would be even easier as we could track deposits in external repositories reliably.
Therefore I call on all publishers: if you want to help your authors to meet funder requirements, improve your metadata. This should be in everyone’s interest.
What we can report on with confidence is the number of deposits (excluding theses) to our repository Spiral during 2015: 5,511. Please note: 2015 is the year of deposit, not necessarily year of publication.
Earlier today Imperial College London submitted its open access compliance report to RCUK. Like most UK universities, the College is in receipt of an annual open access block grant from RCUK. The funds are made available to support universities in meeting the requirements of the RCUK open access policy, in particular meeting the cost of article processing charges (APC) to make articles open access through the publisher. RCUK allocate funds in relation to research effort and Imperial College receives the second largest grant – £1,353,480 for 2014/15 (Cambridge is #1 with £1,355,073). The report, based on a template developed by Jisc, details how the money has been spent and provides headline compliance figures. It has been put together by the College Library and the Research Office, with support from ICT.
The headline figure is an estimated 31% compliance via the gold and 38% compliance via the green route; we also provide details on APCs for 350 open access articles processed by the College Library. However, before you delve further into the spreadsheet or start comparing these figures to other universities I would like to draw your attention to some of the inherent issues with these reports and figures.
First of all you may notice that the numbers do not seem to add up. We report an APC spend of £597,029 and yet the 350 APCs add up to £679,721.08. The reason for this apparent mismatch is that the first figure is for the period from April 2014 to March 2015, as requested in the spreadsheet, whereas the APCs are reported to RCUK until August 2015.
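The effect of the two reporting windows is easy to reproduce. The following sketch uses three made-up APC records; the record layout, the use of a payment date and the assumption that both windows start in April 2014 are mine, not taken from the RCUK template.

```python
from datetime import date
from decimal import Decimal


def spend_in_period(apcs, start, end):
    """Sum the cost of APCs dated between start and end (inclusive)."""
    return sum((cost for paid, cost in apcs if start <= paid <= end), Decimal("0"))


# (date, cost in GBP): toy values, not College data.
apcs = [
    (date(2014, 6, 10), Decimal("1500.00")),
    (date(2015, 2, 3), Decimal("2100.50")),
    (date(2015, 7, 21), Decimal("1800.00")),  # falls after March 2015
]

# Window requested for the spend figure: April 2014 to March 2015.
financial_year = spend_in_period(apcs, date(2014, 4, 1), date(2015, 3, 31))
# Window covered by the article-level list: up to August 2015.
full_list = spend_in_period(apcs, date(2014, 4, 1), date(2015, 8, 31))
```

Here `financial_year` is 3600.50 and `full_list` is 5400.50: the same records produce two totals, just as in our report.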
Secondly, the number of APCs does not equal 31% of the outputs we report on. This is because some of the articles originating from RCUK funding have been paid for by other institutions, usually because the principal investigator was based there and not at Imperial College.
Most importantly though I would caution against directly comparing compliance figures between universities – unless you know exactly how they have been calculated. The biggest challenge, especially for large research-intensive universities, is establishing what 100% is: how many outputs are related to RCUK funding? Currently there is no reliable way to derive funder information from article metadata, even where authors report the funders to the publisher. RCUK-funded authors are asked to report outputs to the research councils, but the reporting period does not overlap with the OA reporting period. That means even if all authors reliably linked all outputs to all relevant grants (this is a manual process) the information would not be sufficient to report on. Earlier this year Imperial College introduced a new workflow (for depositing outputs on acceptance) that encourages authors to link outputs and funding, but it will be a while until we can be reasonably confident that close enough to 100% of outputs are linked to all relevant grants.
Why do we not just manually go through all articles and speak to the authors? It is a question of scale – College academics publish between 10,000 and 12,000 articles and conference proceedings per year. We estimate that some 4,000 of these outputs may be linked to RCUK funding.
So how did we come to the compliance figures reported to RCUK? We analysed a sample of some 1,500 outputs we know to be linked to RCUK funding. Sadly, there is currently no reliable way to automatically establish the open access status of an output as publishers do not usually add licence information to output metadata and tracking outputs in repositories also creates problems. We do of course know how many outputs the College Library paid an APC for and also which outputs were deposited into the College repository Spiral. We do not know where other universities have paid an APC for an article, or where an author may have used departmental or other funds to pay an APC.
We were able to identify additional open access outputs by cross-referencing our data with the list from the Directory of Open Access Journals (DOAJ) and the Europe PubMed Central database. Even so we will have missed outputs, for example papers deposited into repositories like arXiv. We do track arXiv deposits, but there is currently no way of telling what version has been deposited. Even if we knew the version, deposits in repositories pose another problem: where an APC has been paid and the output deposited, do we report it as green or gold OA? In the case of RCUK we have decided to mark it as gold, as that is the preferred route for the UK research councils, but others may have decided differently.
I could go on much longer, but I hope the above gives you an idea of the issues that universities face when reporting on open access. Should you still want to compare university open access reports, make sure to check the data source and methods. The good news is that in the future these reports should become more meaningful, in particular when publishers and system vendors add funder, institutional and author identifiers (such as ORCID) to output metadata.
Finally, I would like to highlight two issues we raised with RCUK when submitting the report:
Many points made by the College in last year’s submission regarding policy implementation are still valid (see paragraphs 35 ff.). The College has made good progress in delivering support infrastructure (significantly reducing processing time for gold and green OA), but concerns about the wider policy landscape and publisher support for open access remain. In particular, we would like to highlight two points:
Hybrid open access remains significantly more expensive than full OA (~50% more per APC), even without taking into account “double dipping”. Processing APCs for hybrid journals continues to require more resource, i.e. in relation to licensing and invoicing. The Finch report saw hybrid as a means of transitioning from a subscription to a full OA model, but there is very little evidence of that transition taking place. The majority of OA funds are still spent on hybrid.
Differences in funder policies make it harder for academics to understand how to comply and increase the workload for support services. RCUK is encouraged to harmonise policy requirements with other funders, in particular with the Policy for open access in the post-2014 Research Excellence Framework. We note that HEFCE have made changes to align policies with regards to gold OA and we would encourage RCUK to consider a similar step for green OA.