How can Python read a LaTeX-generated PDF file that contains equations?

Consider the following paper:
https://arxiv.org/pdf/2101.05907.pdf

It is a typical academic paper, and the PDF file contains only two images.

I use the following code to extract the text and equations from the paper:

# Related code explanation: https://dev59.com/yKTia4cB1Zd3GeqP7AAP
import io
import requests

url = "https://arxiv.org/pdf/2101.05907.pdf"  # the paper linked above
r = requests.get(url)
f = io.BytesIO(r.content)  # wrap the downloaded bytes in a file-like object

# Related code explanation: https://dev59.com/1FcO5IYBdhLWcg3wchMB
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)  # legacy API, removed in PyPDF2 3.0

# Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())

However, the result is not entirely correct:
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire

ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi

erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di

erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi

erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi

erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V
(
x
;
t
)
iswrittenas(forsimplicity,weset
}
=
1)
i
@ 
(
x
;
t
)
@
t
=

1
2
m
@
2
 
(
x
;
t
)
@
x
2
+
V
(
x
;
t
)
 
(
x
;
t
)
(1)
arXiv:2101.05907v1  [quant-ph]  14 Jan 2021

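As the output shows:

  1. The spacing has disappeared (in the title, for example), which makes the strings meaningless.
  2. The LaTeX equations come out wrong, and it gets worse on the second page.

How can I fix this and correctly extract the text and equations from a PDF file generated by LaTeX?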


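Did you solve your problem? I found a solution for my own PDF file, but the equations are out of range and I cannot parse them with PyPDF2 :/ - Guilherme Correa
What text output exactly do you expect for the math formulas in this input PDF? Do you want LaTeX source such as $\hbar=1$? That may work with some extractors, but anything beyond it (subscripts, fractions, summation signs, square roots) will probably not work. Implementing such an extraction is very hard (because the PDF contains no relevant structural information), the resulting code would be fragile (it would not work for all PDFs), and there are probably not enough buyers to fund such a feature. - pts
@pts The extractor could convert the math into a standard language such as LaTeX, or really any language, such as Microsoft Unicode, would do. The PDF's own representation code (the one the PDF reader uses to draw the equations on screen) would also be usable, since there is only a linear mapping between them. - ShoutOutAndCalculate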
1 Answer

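In the meantime, PyPDF2 has been deprecated. Use pypdf instead (I am the maintainer of both; see the migration guide).
We have no specific solution for equations, but general text extraction works: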
import io
import requests
from pypdf import PdfReader

# Download content
url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
f = io.BytesIO(r.content)

# Extract text
reader = PdfReader(f)
print(reader.pages[0].extract_text())

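The last paragraph of the first page (shown as a screenshot in the original post) is the one that contains equation (1).

For that paragraph, pypdf gives: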
The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x;t)
is written as (for simplicity, we set }=1)
i@ (x;t)
@t=1
2m@2 (x;t)
@x2+V(x;t) (x;t) (1)

As you can see, the plain text is fine, but the math characters and the structure of the equation are not represented well.
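For comparison, the extracted lines presumably correspond to the one-dimensional time-dependent Schrödinger equation with ħ = 1. The LaTeX below is a reconstruction, not output from any tool: the wavefunction symbol ψ and the minus sign do not survive in the extracted text and are filled in here from the standard form of the equation, so treat both as assumptions.

```latex
\begin{equation}
  i\,\frac{\partial \psi(x,t)}{\partial t}
  = -\frac{1}{2m}\,\frac{\partial^{2} \psi(x,t)}{\partial x^{2}}
  + V(x,t)\,\psi(x,t)
  \tag{1}
\end{equation}
```

Comparing the two makes the losses concrete: `@` stands in for ∂, `}` for ħ, `;` for the comma, and the fraction bars and the superscript position are gone entirely.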

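Extraction of math text will certainly remain less than ideal for a long time, but I have opened an issue to improve text extraction (partial, phi, and possibly hbar): https://github.com/py-pdf/pypdf/issues/2009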
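Until the extractor handles these symbols itself, a small post-processing step can patch the ones that are mis-decoded consistently. The following is a minimal sketch, assuming the substitutions visible in the sample output (`}` for ħ, `@` for ∂, and `;` for the comma inside `(x;t)`). The function name `repair_math_glyphs` and the mapping table are illustrative and not part of pypdf, and the mapping may well differ for other fonts or other PDFs.

```python
import re

# Assumed substitutions, inferred only from the sample output of this one paper:
# "}" appears where hbar is printed, "@" where the partial-derivative sign is printed.
GLYPH_FIXES = {"}": "\u210f", "@": "\u2202"}


def repair_math_glyphs(line: str) -> str:
    """Replace glyphs that were mis-decoded in a line known to contain math."""
    for wrong, right in GLYPH_FIXES.items():
        line = line.replace(wrong, right)
    # Assumption: "(x;t)" is really "(x,t)", i.e. the math-font comma was decoded as ";".
    return re.sub(r"\((\w);(\w)\)", r"(\1,\2)", line)


print(repair_math_glyphs("is written as (for simplicity, we set }=1)"))
print(repair_math_glyphs("@x2+V(x;t) (x;t) (1)"))
```

This restores individual symbols only. Fractions, superscripts and the layout of the equation stay lost, and a blanket replacement of `@` would also damage e-mail addresses, so it is best applied only to lines that are known to be math.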

See also: Why text extraction is hard. In summary: pypdf will hopefully get better at extracting Greek letters.

Full extraction with pypdf==3.13.0

Bohm potential for the time dependent harmonic oscillator
Francisco Soto-Eguibar1, Felipe A. Asenjo2, Sergio A. Hojman3and H ´ector M.
Moya-Cessa1
1Instituto Nacional de Astrof´ ısica, ´Optica y Electr´ onica, Calle Luis Enrique Erro No. 1, Santa Mar´ ıa Tonanzintla,
Puebla, 72840, Mexico.
2Facultad de Ingenier´ ıa y Ciencias, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
3Departamento de Ciencias, Facultad de Artes Liberales, Universidad Adolfo Ib´ a˜ nez, Santiago 7491169, Chile.
Departamento de F´ ısica, Facultad de Ciencias, Universidad de Chile, Santiago 7800003, Chile.
Centro de Recursos Educativos Avanzados, CREA, Santiago 7500018, Chile.
Abstract. In the Madelung-Bohm approach to quantum mechanics, we consider a (time dependent) phase that depends quadrati-
cally on position and show that it leads to a Bohm potential that corresponds to a time dependent harmonic oscillator, provided the
time dependent term in the phase obeys an Ermakov equation.
Introduction
Harmonic oscillators are the building blocks in several branches of physics, from classical mechanics to quantum
mechanical systems. In particular, for quantum mechanical systems, wavefunctions have been reconstructed as is the
case for quantized fields in cavities [1] and for ion-laser interactions [2]. Extensions from single harmonic oscillators
to time dependent harmonic oscillators may be found in shortcuts to adiabaticity [3], quantized fields propagating in
dielectric media [4], Casimir e 
                                ect [5] and ion-laser interactions [6], where the time dependence is necessary in order
to trap the ion.
Time dependent harmonic oscillators have been extensively studied and several invariants have been obtained [7, 8, 9,
10, 11]. Also algebraic methods to obtain the evolution operator have been shown [12]. They have been solved under
various scenarios such as time dependent mass [12, 13, 14], time dependent frequency [15, 11] and applications of
invariant methods have been studied in di 
                                          erent regimes [16]. Such invariants may be used to control quantum noise
[17] and to study the propagation of light in waveguide arrays [18, 19]. Harmonic oscillators may be used in more
general systems such as waveguide arrays [20, 21, 22].
In this contribution, we use an operator approach to solve the one-dimensional Schr ¨odinger equation in the Bohm-
Madelung formalism of quantum mechanics. This formalism has been used to solve the Schr ¨odinger equation for
di
  erent systems by taking the advantage of their non-vanishing Bohm potentials [23, 24, 25, 26]. Along this work,
we show that a time dependent harmonic oscillator may be obtained by choosing a position dependent quadratic time
dependent phase and a Gaussian amplitude for the wavefunction. We solve the probability equation by using operator
techniques. As an example we give a rational function of time for the time dependent frequency and show that the
Bohm potential has di 
                      erent behavior for that functionality because an auxiliary function needed in the scheme,
namely the functions that solves the Ermakov equation, presents two di 
                                                                       erent solutions.
One-dimensional Madelung-Bohm approach
The main equation in quantum mechanics is the Schrodinger equation, that in one dimension and for a potential V(x;t)
is written as (for simplicity, we set }=1)
i@ (x;t)
@t=1
2m@2 (x;t)
@x2+V(x;t) (x;t) (1)arXiv:2101.05907v1  [quant-ph]  14 Jan 2021
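
Two kinds of damage remain in the running text of this extraction: accents are emitted as separate spacing characters (`Astrof´ ısica`, `Schr ¨odinger`), and words with an "ff" ligature are split (`di` / `erent`, `e` / `ect`). The sketch below shows one possible cleanup under stated assumptions: the accent characters are U+00B4, U+00A8 and U+02DC, the gap left by the ligature consists of whitespace only (possibly a vertical tab, which is the slot the ff ligature occupies in TeX's OT1 font encoding), and the list of affected word fragments is hard-coded from this sample. The helper names are hypothetical and not part of pypdf.

```python
import re
import unicodedata

# Spacing accents seen in the sample, mapped to their combining counterparts.
COMBINING = {"\u00b4": "\u0301", "\u00a8": "\u0308", "\u02dc": "\u0303"}
ACCENTS = "".join(COMBINING)


def _compose(match):
    letter = match.group("letter")
    if letter == "\u0131":  # dotless i does not compose, so use a plain i
        letter = "i"
    return unicodedata.normalize("NFC", letter + COMBINING[match.group("accent")])


def repair_accents(text: str) -> str:
    """Merge 'accent + optional space + letter' into one precomposed letter."""
    # Heuristic: a space between a letter and an accent that precedes a
    # lowercase letter sits inside a word ("Schr ¨odinger"), so drop it.
    text = re.sub(rf"(?<=[^\W\d_]) (?=[{ACCENTS}] ?[a-z\u0131])", "", text)
    return re.sub(rf"(?P<accent>[{ACCENTS}]) ?(?P<letter>[^\W\d_])", _compose, text)


def repair_ff_ligature(text: str) -> str:
    """Rejoin the word fragments from this sample that lost their 'ff'."""
    return re.sub(r"\b(di|e)\s+(erent|ect)\b", r"\1ff\2", text)


sample = "Astrof\u00b4 \u0131sica, \u00b4Optica y Electr\u00b4 onica, di \n     erent regimes"
print(repair_ff_ligature(repair_accents(sample)))
```

Both functions are heuristics tuned to this one document. The ligature rule in particular would need a longer word list, or a dictionary check, before it could be trusted on other papers.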
