Abstract:

This paper addresses the critical task of linking scientific terms in Russian texts to Wikipedia entities, a process vital for transforming unstructured text into structured knowledge. We introduce a novel linking algorithm and conduct a comparative analysis of RuBERT-tiny2 and spaCy, evaluating their performance across varying context window sizes and numbers of links. Our findings indicate that RuBERT-tiny2 excels with larger context windows, leveraging its deeper semantic understanding for superior disambiguation, although its performance degrades beyond 100 tokens as additional context introduces noise. Conversely, the spaCy-based approach is more robust in limited-context scenarios. This highlights a trade-off: while complex models such as RuBERT-tiny2 are highly context-dependent, simpler models remain competitive when contextual information is sparse. Error analysis reveals three primary failure modes: search errors (the correct entity is absent from the candidate set), ranking errors (suboptimal semantic scoring of retrieved candidates), and annotation errors (ambiguities in the ground truth). The study underscores the direct impact of knowledge base quality on system performance and suggests applying a semantic similarity threshold to suppress overconfident false links. We conclude that future advances in entity linking require not only improved algorithms but also better candidate retrieval, query formulation, and ranking strategies, together with robust handling of ambiguous annotations; we advocate adaptive thresholding, dynamic context selection, and domain-specific knowledge integration.
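To make the thresholding idea concrete, the sketch below ranks Wikipedia candidates for a mention by cosine similarity between RuBERT-tiny2 embeddings of the mention's context and each candidate's summary, and abstains when no score clears a cutoff. This is an illustration rather than the paper's exact pipeline: the `cointegrated/rubert-tiny2` checkpoint, mean pooling, the 100-token truncation, and the 0.6 threshold are all assumptions.

```python
# Minimal sketch (not the authors' exact pipeline): rank candidate entities
# by embedding cosine similarity, with a threshold to suppress overconfident
# false links. Checkpoint, pooling, truncation, and threshold are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cointegrated/rubert-tiny2"  # a public RuBERT-tiny2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding; truncation caps the context window."""
    batch = tokenizer(text, truncation=True, max_length=100, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

def link_mention(context: str, candidates: dict[str, str],
                 threshold: float = 0.6):
    """Return the best candidate title, or None if no score clears the threshold."""
    query = embed(context)
    best_title, best_score = None, threshold
    for title, summary in candidates.items():
        score = torch.cosine_similarity(query, embed(summary)).item()
        if score > best_score:
            best_title, best_score = title, score
    return best_title
```

Tuning the threshold on held-out data trades precision for recall: raising it reduces overconfident false links at the cost of leaving more mentions unlinked, which is the motivation for the adaptive thresholding discussed above.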
