Studies in Science of Science ›› 2025, Vol. 43 ›› Issue (11): 2324-2335.


Research on the Early Identification Method of Breakthrough Papers Based on Machine Learning

  

  • Received:2024-11-05 Revised:2025-02-27 Online:2025-11-15 Published:2025-11-15

基于机器学习的突破性论文早期识别方法研究

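Li Xin (李欣)1, Zhong Xiaofei (仲晓霏)1,2, Gao Ning (高宁)1,2, Cheng Haolun (程浩伦)1,2

  1. Beijing University of Technology (北京工业大学)
  2.
  • Corresponding author: Li Xin (李欣)
  • Funding:
    General Program of the National Natural Science Foundation of China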

Abstract: Breakthrough research is groundbreaking, forward-looking, and field-leading scientific research that may bring about profound changes in scientific paradigms. Breakthrough papers are an important carrier and manifestation of such research. Identifying them early helps lead frontier exploration in science, guide the efficient allocation of corporate R&D resources, and support forward-looking decisions in government innovation strategy. How to identify breakthrough papers efficiently and accurately has therefore become a research hotspot in the academic community. However, most current research on identifying breakthrough papers relies on indicator data drawn from the papers themselves. The selected indicators are only loosely connected to the essential characteristics of scientific breakthroughs, and an indicator system that identifies breakthrough papers from the perspectives of knowledge breakthrough, knowledge innovation, and knowledge interdisciplinarity is still lacking. In addition, most existing bibliometric identification methods are based on citation analysis, which introduces a time lag into the early identification of breakthrough papers. To address these shortcomings, this paper proposes a machine-learning-based method for the early identification of breakthrough papers. The method proceeds as follows. First, according to the nature of breakthrough papers, we constructed an indicator system for identifying them from three dimensions: knowledge breakthrough, knowledge innovation, and knowledge interdisciplinarity.
At the same time, the characteristics of whoever discovers or creates knowledge also strongly influence whether that knowledge is a breakthrough. To identify breakthrough papers more comprehensively and accurately, we therefore added author characteristics to the indicator system, yielding a comprehensive identification indicator system that covers the text characteristics, quantitative characteristics, and author characteristics of breakthrough papers. Second, we introduced machine learning into the identification of breakthrough papers. Exploiting the ability of machine learning to capture nonlinear relationships, we extracted the pattern of association between a paper's characteristics and its breakthrough nature, and used this pattern to identify breakthrough papers at an early stage, which addresses the time-lag problem. Finally, taking the biomedical field as an example, we verified the feasibility and effectiveness of the method. The case study shows that, starting from the essence of breakthrough research, integrating the text, quantitative, and author characteristics of a paper characterizes its breakthrough information more comprehensively and systematically, and that machine learning algorithms can effectively address the time lag in identification and achieve early identification of breakthrough papers. The method thus provides a new approach to the early identification of breakthrough papers.

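Abstract (Chinese version): Breakthrough research is pioneering, forward-looking, and field-leading scientific research that may bring about profound changes in scientific paradigms. Breakthrough papers are an important carrier and manifestation of breakthrough research. Identifying breakthrough papers as early as possible plays an important role in leading frontier exploration in scientific research, guiding the efficient allocation of corporate R&D resources, and supporting forward-looking decisions on government innovation strategy. To address shortcomings in current research on breakthrough-paper identification, namely that identification indicators are not closely tied to the essence of scientific breakthrough and that citation analysis suffers from a lag, this paper proposes a machine-learning-based method for the early identification of breakthrough papers. First, based on the nature of breakthrough papers and starting from knowledge breakthrough, innovation, and interdisciplinarity, the method constructs an evaluation indicator system for identifying breakthrough papers around three dimensions: text features, quantitative features, and author features. Second, it builds a machine learning model to capture the association pattern between paper features and breakthrough nature, and uses this pattern to identify breakthrough papers early, addressing the time-lag problem. Finally, taking the biomedical field as an example, it verifies the feasibility and effectiveness of the method, providing a new research method for the early identification of breakthrough papers.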
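
The workflow the abstract describes (combine text, quantitative, and author characteristics into one feature vector, learn the association between those features and breakthrough status from already-labelled papers, then apply it to new papers without waiting for citations) might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the indicators or the learning algorithm, so the feature values are invented toy numbers, the three-group dictionary layout is hypothetical, and a k-nearest-neighbour vote stands in for whatever model the authors actually used.

```python
from math import dist

# Toy labelled papers. Every number here is invented for illustration;
# the abstract specifies the three feature groups but not the indicators.
training_papers = [
    {"text": [0.90], "quantitative": [0.80], "author": [0.70]},
    {"text": [0.80], "quantitative": [0.90], "author": [0.60]},
    {"text": [0.85], "quantitative": [0.70], "author": [0.90]},
    {"text": [0.20], "quantitative": [0.10], "author": [0.30]},
    {"text": [0.10], "quantitative": [0.30], "author": [0.20]},
    {"text": [0.30], "quantitative": [0.20], "author": [0.10]},
]
training_labels = [1, 1, 1, 0, 0, 0]  # 1 = breakthrough, 0 = not


def to_vector(paper):
    """Concatenate the three feature groups into one flat vector."""
    return paper["text"] + paper["quantitative"] + paper["author"]


def fit_scaler(vectors):
    """Return per-column (min, max) so features can be scaled to [0, 1]."""
    return [(min(col), max(col)) for col in zip(*vectors)]


def scale(vector, bounds):
    """Min-max scale one vector; constant columns map to 0.0."""
    return [
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, (lo, hi) in zip(vector, bounds)
    ]


def predict(train_vectors, labels, bounds, paper, k=3):
    """Label a paper by majority vote of its k nearest training papers.

    k-NN is an assumed stand-in for the unspecified model in the abstract.
    """
    query = scale(to_vector(paper), bounds)
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: dist(train_vectors[i], query),
    )
    votes = sum(labels[i] for i in ranked[:k])
    return 1 if votes * 2 > k else 0


# "Training": scale the labelled papers once and keep them as the model.
raw_vectors = [to_vector(p) for p in training_papers]
bounds = fit_scaler(raw_vectors)
train_vectors = [scale(v, bounds) for v in raw_vectors]

# Early identification: only features available at publication are needed.
new_paper = {"text": [0.88], "quantitative": [0.75], "author": [0.80]}
ordinary_paper = {"text": [0.15], "quantitative": [0.20], "author": [0.25]}
print(predict(train_vectors, training_labels, bounds, new_paper))
print(predict(train_vectors, training_labels, bounds, ordinary_paper))
```

The point of the sketch is the data flow rather than the model choice: because every input is available at publication time, no citation window is required before a prediction can be made, which is the time-lag argument the abstract makes.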