Abstract:
This paper describes a pipeline for extracting the author’s terms and definitions from mathematical texts. We used two models: one detects mathematical formulas so that they can be removed from the text as noise, and the other converts formula images into LaTeX to restore the removed formulas. Experimental data show that noise removal is an essential step, because it improves all quality metrics. To recognize the author’s terminology, we applied a rule-based syntactic approach. The “negative” rules introduced here significantly increase final precision without substantially reducing recall. Overall, the results are encouraging: so far there are no other solutions for recognizing authors’ terms in mathematical texts, and the quality metrics obtained are comparable to those of more general methods, which would perform worse in the mathematical domain.
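The interplay of pattern rules and “negative” rules mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper's rules are syntactic, while the sketch substitutes plain regular expressions for them. The names `POSITIVE_RULES`, `NEGATIVE_RULES` and `extract_terms`, as well as the specific patterns and the stop list, are hypothetical and chosen only to show how a veto step can raise precision.

```python
import re

# Hypothetical "positive" rule: a surface pattern that often introduces a term.
# (Assumption: a regex stands in for the syntactic rules used in the paper.)
POSITIVE_RULES = [
    re.compile(
        r"\b(?:is|are) called (?:an? |the )?"
        r"(?P<term>[\w-]+(?: [\w-]+){0,3}?)"
        r"(?= if\b| when\b|[.,;]|$)"
    ),
]

# Hypothetical "negative" rules: each one vetoes a candidate that a positive
# rule has already matched.
NEGATIVE_RULES = [
    # A citation marker suggests the term is borrowed, not the author's own.
    lambda sentence, term: re.search(r"\[\d+\]", sentence) is not None,
    # Generic words that are not terms at all (illustrative stop list).
    lambda sentence, term: term.lower() in {"again", "upon", "for"},
]


def extract_terms(sentence, use_negative=True):
    """Return candidate terms found in one sentence."""
    terms = []
    for rule in POSITIVE_RULES:
        for match in rule.finditer(sentence):
            term = match.group("term")
            if use_negative and any(veto(sentence, term) for veto in NEGATIVE_RULES):
                continue  # candidate rejected by a negative rule
            terms.append(term)
    return terms


if __name__ == "__main__":
    examples = [
        "A group G is called abelian if its operation is commutative.",
        "Following [3], such a map is called continuous.",
        "In this case the procedure is called again.",
    ]
    for sentence in examples:
        print(extract_terms(sentence, use_negative=False), extract_terms(sentence))
```

Running the script prints, for each sentence, the candidates before and after the veto step. Under these assumptions only the first sentence keeps its candidate, which shows how negative rules remove false positives without touching true ones.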
Keywords:
DOI:
10.31144/BNCC.CS.2542-1972.2023.N47.P43-51
Issue
Pages:
43-51
File:
turnaev-a.-apanovich-z_0_0.pdf
(287.45 KB)