I have several PDF files, from various companies, which are scans of
historical books which have had some OCR (optical character recognition)
done on them. The result is a semi-searchable book, but of course OCR is an
imperfect science, so the resulting text is kind of flaky, and the search
capabilities provided by the standard Adobe Reader aren't really up to the
task of figuring out what the text is "supposed" to be.
For example - I'm searching for Daniel Cooper, and where that text appears
in the original document, the OCR gives me
Danjek Cooqer
Daniel Coo er
Danjel Coopes
Most of these PDF files are "locked" (meaning you can't easily extract the
images to run them through another OCR program) and in any case the scanned
images aren't of high enough resolution that another OCR program would do
any better. To be fair, the original books are often not in great shape, so
it's not the fault of the scanner or the company providing the PDFs that the
converted text isn't perfect.
Is there a program that can do searches on PDF files, that (1) knows a
little about the mistakes OCR software commonly makes, or (2) lets you
specify the text you're searching for with a "fuzziness" factor, so it catches
things similar to the searched-for text?
Thanks and Happy Holidays,
Chris
Better Search of PDF Files?
-
Keith Nuttle
Re: Better Search of PDF Files?
Do you have MS Office, or a computer that has an introductory version of
MS Office? On my HP there is a demo version that includes Microsoft
Office Document Image Writer. It works like a printer. You open the
document in its native application (PDF, JPG, etc.) and print the file to
the "printer". It works quite well. In fact, it's almost worth paying for.
Unless the file is locked against printing, it should overcome most locks
they place on the file.
Once the document is in the image writer file format, you can cut and paste
and search on fragments of words.
--
Keith Nuttle
3110 Marquette Court
Indianapolis, IN 46268
317-802-0699
-
Christopher Jahn
Re: Better Search of PDF Files?
Keith nuttle <keith_nuttle@sbcglobal.net> wrote in
news:IrYbj.999$se5.661@nlpi069.nbdc.sbc.com:
> Unless the file is locked against printing, it should overcome most
> locks they place on the file.
No, it won't. You'll only be creating a completely unsearchable
document by creating a PDF of a PDF. The PRINT TO PDF option
basically creates an image of the document that is inserted into
a PDF page.
Scanning the new PDF will only reveal the image, and not the
components of the image.
--
}:-) Christopher Jahn
{:-( http://manormaniac.blogspot.com/
Delicious and nutritious, tastes like chicken!
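
For reference, the "fuzziness factor" search asked about in the original post can be sketched in a few lines of Python. This is only a rough illustration, not a pointer to an existing PDF tool: it assumes the OCR text has already been extracted from the PDF into a plain string (which a locked file may not allow), and the function name `fuzzy_find` and the default threshold of 0.75 are made up for this example. It uses the standard library's `difflib.SequenceMatcher` to score each run of words against the search phrase.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, text, threshold=0.75):
    """Return (score, snippet) pairs for word runs in text that resemble query.

    threshold is the "fuzziness" factor: 1.0 means exact match only,
    lower values accept more OCR damage (and more false hits).
    """
    query_norm = query.lower()
    words = text.split()
    n = len(query.split())
    hits = []
    # Also try runs one word shorter and one word longer than the query,
    # because OCR can merge words or split one word in two ("Coo er").
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for i in range(len(words) - size + 1):
            snippet = " ".join(words[i:i + size])
            score = SequenceMatcher(None, query_norm, snippet.lower()).ratio()
            if score >= threshold:
                hits.append((round(score, 2), snippet))
    return hits


if __name__ == "__main__":
    # Hypothetical OCR output containing the three garbled forms from the post.
    sample = ("son of Danjek Cooqer of this parish; "
              "land sold by Daniel Coo er to Danjel Coopes in that year")
    for score, snippet in fuzzy_find("Daniel Cooper", sample):
        print(score, snippet)
```

Overlapping runs can each score above the threshold, so the same spot in the text may be reported more than once. Raising the threshold trims those extra hits at the cost of missing the more badly garbled names.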