The forest chatter has been clamorous since Microsoft announced that Microsoft Academic, and with it the Microsoft Academic Graph (MAG), will be retired at the end of 2021. Like many others, we at CORE have used MAG for a number of tasks: enhancing and enriching data quality, obtaining citation data, supporting our research in semantic typing of citations, and enriching MAG and Microsoft Academic Search by supplying direct links to full text content (much as we do for PubMed).
While discussing how to move forward, we have received many questions and much helpful input from current MAG users. As others begin throwing their hats into the MAG-replacement ring, we thought it was time to explain how CORE intends to contribute to the path forward.
CORE already provides a number of the capabilities MAG supplied, but, critically, CORE is not a direct replacement for MAG. In fact, there is no direct replacement for MAG; as OurResearch clearly said in their announcement, a “perfect” replacement is very hard to come by. However, as a team with expertise in processing and text mining large quantities of scientific literature, and having developed CORE's harvesting and research paper aggregation tools, we are well placed to contribute to a global effort to develop a solution that responds to the community's needs and requirements.
What does CORE replace?
Although we don’t have direct replacements for all of MAG’s features, we do have a number of components of the solution, as well as complementary data that was never part of MAG and that formed the basis of our cooperation with Kuansan Wang’s MAG team. CORE provides:
- More than 200 million metadata records, including 93 million abstracts, drawn from across the scholarly network of repositories and journals and continuously growing.
- Coverage of many records that are not available through the usual publishing routes but are deposited in institutional repositories: a diffuse treasure trove spanning from old book chapters to technical reports and theses (note that the majority of our records are not registered in Crossref, making CORE highly complementary).
- A global point of view allowing us to recognise different versions of a specific canonical manuscript available from a range of sources.
- Unique data on deposit dates, enabling CORE to track and monitor compliance with open access policies, such as the REF, UKRI or Plan S, which require research papers to be deposited into repositories, typically within a specific time limit.
- 25 million full texts ready to use for text and data mining; here CORE provides something that MAG never offered and never intended to offer. Indeed, we had a fruitful collaboration with the MAG team, cross-referencing the MAG records that power Microsoft Academic Search directly to CORE for full text access.
- A recommendation engine that draws on our large collection of full texts to suggest related papers.
- Integrated data enrichment features for papers with full texts, based on machine learning models. For instance, we use our system to detect research paper types (presentation, paper, thesis) and the languages of manuscripts, and we also work with some of the key players specialising in important text and data mining use cases.
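To make the document-type detection task above concrete, here is a deliberately simple keyword heuristic in Python. CORE's production system uses trained machine learning models, not keyword rules; this sketch, with made-up cue phrases, only illustrates the kind of surface signal such a classifier learns from.

```python
# Toy heuristic for document-type detection. The cue phrases below are
# illustrative assumptions, not CORE's actual features.
TYPE_KEYWORDS = {
    "thesis": ("thesis", "dissertation", "submitted in partial fulfilment"),
    "presentation": ("slides", "keynote", "presented at"),
}

def guess_document_type(text: str) -> str:
    """Return a coarse document type based on keyword cues (default: paper)."""
    lowered = text.lower()
    for doc_type, cues in TYPE_KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return doc_type
    return "paper"
```

A trained model replaces these hand-picked cues with features learned from labelled examples, but the classification target is the same.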
What is CORE not currently suited for?
(If you want to talk to us about how to fill any of these gaps, then we will be all ears!)
- Bibliometrics, especially the analysis of citation behaviour: we do not have data about who is citing what. However, CORE can serve as a source of raw reference data for citation extraction, and in this respect we are very keen to collaborate with OpenCitations, where we see strong complementarity and a logical processing pipeline that could enhance open citation data.
- Author disambiguation: we are largely leaving this task to ORCID to solve, although we might be tempted to close the gap in situations where ORCID iDs are sparse.
- Conference series and instances, which we have available as text only.
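As a minimal illustration of what "raw reference data for citation extraction" can yield, the sketch below pulls candidate DOIs out of unstructured reference strings with a regular expression. This is our own toy example, not CORE's or OpenCitations' pipeline; a real citation extraction system also needs title and author matching for the many references that carry no DOI at all.

```python
import re

# Simplified DOI pattern: "10.", a 4-9 digit registrant code, "/", then any
# non-space suffix. Trailing sentence punctuation is stripped afterwards,
# since reference strings usually end with "." or ",".
DOI_RE = re.compile(r'10\.\d{4,9}/\S+')

def extract_dois(reference: str) -> list[str]:
    """Return candidate DOIs found in a raw reference string."""
    return [m.group(0).rstrip('.,;)') for m in DOI_RE.finditer(reference)]
```

For example, `extract_dois("Smith, J. (2020). Example article. https://doi.org/10.1000/xyz123.")` returns `["10.1000/xyz123"]` (the `10.1000` prefix is the DOI handbook's example prefix, used here purely for illustration).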
What are we actively working on?
- Methods for detecting citation intent and purpose, applied to citation data. We recently organised the second shared task delivering state-of-the-art (SotA) methods in this area, and we are actively working to push performance beyond this SotA [1].
- Affiliation discovery from full text: we have a great deal of experience in matching affiliations and authors by analysing full text. We have not yet applied this at scale to the full CORE dataset, but it is definitely something we have been working on, and an area where we can significantly improve over the sparse data in Crossref. This is important for a wide range of use cases.
- Full-scale near-duplicate detection to recognise different versions of manuscripts: the first version of this system, which relies on identifiers, is already in place. In the next iteration, we will be able to match metadata and full texts that lack clear unique persistent identifiers (PIDs), or where different PIDs are assigned to the same document, an increasingly common situation.
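To illustrate the kind of matching needed when PIDs are absent or conflicting, here is a small character-shingling sketch in Python. The function names and the 0.8 threshold are our own illustrative choices; CORE's actual system works over full metadata and full texts, not just titles.

```python
import re

def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-gram shingles of a lightly normalised string."""
    norm = re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def likely_same_work(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
    """Flag two titles as probable versions of the same manuscript."""
    return jaccard(shingles(title_a), shingles(title_b)) >= threshold
```

Shingling tolerates the small variations (casing, punctuation, minor edits) that distinguish a preprint record from its published counterpart, which exact string comparison would miss.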
So what are we saying? If you are looking to build a MAG replacement, we can help. If you are scouting around for a MAG replacement tool, CORE might be the right place to come. As always, it depends on your needs, so get in touch and we will figure it out together.
[1] Suchetha Nambanoor Kunnath, David Pride, Bikash Gyawali, and Petr Knoth. 2020. Overview of the 2020 WOSP 3C citation context classification task. In Proceedings of the 8th International Workshop on Mining Scientific Publications, pages 75–83, Wuhan, China. Association for Computational Linguistics.
The CORE Team