Picking out text from a screenshot

Gene E. Bloch · Aug 13, 2013

Ed said:
I've just put FreeOCR to a rigorous test, and it passed with flying colours.

Ed

Good to know - thanks.

My all-in-one came with OCR, so I'm OK for now, but who knows what the
future holds?

Paul · Aug 13, 2013

Ed said:
I've just put FreeOCR to a rigorous test, and it passed with flying
colours.

Ed

So when you do a screen capture on your Windows 7, do you
see clean, fringe-free letters? That makes a difference.

You should post a link to a picture of your source material,
just so we can see why it passed with flying colors.

Paul

Ed Cryer · Aug 14, 2013

Paul said:
So when you do a screen capture on your Windows 7, do you
see clean, fringe-free letters? That makes a difference.

You should post a link to a picture of your source material,
just so we can see why it passed with flying colors.

Paul

Here you go;
http://tinyurl.com/k3x3hdg
http://tinyurl.com/m9vu5cr

Ed

Robin Bignall · Aug 14, 2013

Ed said:
Here you go;
http://tinyurl.com/k3x3hdg
http://tinyurl.com/m9vu5cr

Nice demo.

mick · Aug 14, 2013

Ed said:
I've just put FreeOCR to a rigorous test, and it passed with flying colours.

Ed

That is an excellent bit of software Ed. I had just added six A4 pages
of text with a few small images to my website as .jpg files a couple of
days ago because I could not be bothered to re-type it all. I really
wanted just the text but without the white paper background. I
downloaded FreeOCR and it has allowed me to grab the text then
copy/paste it straight into the website on a transparent background ;-)

My HP scanner and OCR had made such a mess of reading the text that it
was not worth changing all the mistakes, so FreeOCR is a godsend,
thanks for the heads up Ed.

Paul · Aug 14, 2013

Ed said:
Here you go;
http://tinyurl.com/k3x3hdg
http://tinyurl.com/m9vu5cr

Ed

What's interesting, is I do see fringes around the letters!
And it's not clear to me, what those fringes are for. It doesn't
look like ClearType. Almost like an attempt at a drop shadow.

And considering that, FreeOCR did do a good job. Probably better
than my ancient Paper Capture could manage.

My Paper Capture should have been able to capture "l" and "i"
letters, as they were pretty clean (solid enough black, to detect).
But it didn't.

Paul

Paul · Aug 14, 2013

mick said:
That is an excellent bit of software Ed. I had just added six A4 pages
of text with a few small images to my website as .jpg files a couple of
days ago because I could not be bothered to re-type it all. I really
wanted just the text but without the white paper background. I
downloaded FreeOCR and it has allowed me to grab the text then
copy/paste it straight into the website on a transparent background ;-)

My HP scanner and OCR had made such a mess of reading the text that it
was not worth changing all the mistakes, so FreeOCR is a godsend, thanks
for the heads up Ed.

It's interesting tracing down the history.

http://www.paperfile.net/

"OCR Engine

The included Tesseract OCR PDF engine is an open source product released
by Google. It was developed at Hewlett Packard Laboratories between 1985
and 1995. In 1995 it was one of the top 3 performers at the OCR accuracy
contest organized by University of Nevada in Las Vegas. The Tesseract
engine source code is now maintained by Google and the project can be
found here: http://code.google.com/p/tesseract-ocr/"

What I was looking for was leads to other projects that use it.

http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Quite a funny development history, in terms of all the things that use it.

I was hoping there'd be something purpose-built to just digitize the screen.
As that would eliminate a few steps.

Paul

J. P. Gilliver (John) · Aug 14, 2013

Metspitzer said:
On Mon, 12 Aug 2013 22:45:23 +0100, "J. P. Gilliver (John)"

Highlight and copy is all I want to do. Is there a way to do that
with a jpg image?

I meant, can't you highlight and copy at the point you're doing the
screen capture, rather than just doing a screen capture.

Win7 defaults to Windows photo viewer. What should I be using?

It seems from what others have been saying that (once you've got it
as an image), 7 has nothing built in. Lots of suggestions; I've
installed the IrfanView plugin, on the basis that I've generally found
anything to do with IV very easy to use, but haven't had occasion to try
it yet. Abbyy was good too last time I tried it, but that was some years
ago.

J. P. Gilliver (John) · Aug 14, 2013

OCR. Got it.
Thanks

I did a test, and you can see a "partial" result here.

http://imageshack.us/a/img849/3530/mak3.png

There is a problem with your idea. The problem with screen
captures, is things like ClearType. If your OS has
ClearType enabled, it puts "color fringes" around
the letters.

Well, though it's got snipped, screen capture was brought up by someone
with reference to the programmes used by the blind. They often turn
ClearType and similar off, as it gives them no benefit - though screen
capture usually doesn't try to work by OCR anyway, see below.
[]

I chose a couple ways to capture the web page. One was "Export to PDF",
which avoids ClearType and renders the web page into a PDF. That
gives a clean copy of the screen. I converted the PDF to an image, so
I could pretend that test file, came from a paper scanner.

Except that it would have perfect alignment. (Most modern OCR can handle
slight misalignment anyway, but does have to work harder.)
[]

Summary: Screen capture sucks as an information source, unless you're
very careful to turn off any screen anti-aliasing method.

[]
I think there you're using screen capture to mean just what it ought to,
i.e. grabbing a screenshot as an image. Screen capture - usually still
called screen readers, though that's inaccurate these days - as used by
the blind, usually tries to go behind and intercept whatever was writing
to screen (much like your "make PDF"). In fact I don't think any of the
common ones (JAWS, Window-Eyes, NVDA) use OCR at all.

Paul · Aug 14, 2013

J. P. Gilliver (John) said:
(much like your "make PDF")

I did not pass the PDF directly to my ancient OCR.

Print to PDF --> BMP --> PDF --> Acrobat 4 Paper Capture

The purpose of the first step (Print to PDF), was to get
a copy of the same page I used for the screen capture test.
Being a print, there'd be no ClearType.

Converting the PDF to BMP, is to remove all the text from the
PDF document. So that the second PDF document, just consists
of one big page image. That is what the Acrobat 4 Paper Capture
expects as input. The Paper Capture is intended (normally),
for usage with a scanner and its scan to PDF, and I was
working on a way to get a "clean" or "reference" copy of
the test page, into Paper Capture. And the Paper Capture
got that reference copy all correct.

The screen capture on the other hand, would go like this:

GIMP Acquire Screen ---> BMP ---> LibreOffice ---> Export to PDF
--> Acrobat 4 Paper Capture

The LibreOffice step, was to avoid information loss. Some of
my other tools, "de-res" the image, and I needed a way to fix that.

And that route, had zero recognition. And likely because of the
"noise" around each character caused by ClearType.

The test that Ed did, the characters still had some noise around
them (corona-like), but the OCR Ed used, didn't have a problem
with it. And I don't know why that noise is there. It's not
the same pattern as ClearType.
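
For anyone who wants to check a capture for coloured fringes before
feeding it to an OCR, here is a minimal sketch in plain Python. It
assumes the screenshot has already been loaded as rows of (R, G, B)
tuples; the function names and the tolerance value are made up for
illustration and are not taken from any of the tools mentioned in this
thread. It counts the pixels that are noticeably coloured rather than
gray, and collapses the image to gray levels so the colour is gone.

```python
def fringe_fraction(pixels, tolerance=16):
    """Fraction of pixels whose R, G and B values differ by more than
    `tolerance`. Black text on white should score near zero; coloured
    sub-pixel fringes push the score up."""
    total = 0
    coloured = 0
    for row in pixels:
        for r, g, b in row:
            total += 1
            if max(r, g, b) - min(r, g, b) > tolerance:
                coloured += 1
    return coloured / total if total else 0.0


def to_grayscale(pixels):
    """Collapse each RGB pixel to one luminance value (BT.601 weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


# Hypothetical one-row sample: white, orange fringe, black, blue fringe.
sample = [[(255, 255, 255), (255, 180, 120), (0, 0, 0), (120, 180, 255)]]
print(fringe_fraction(sample))
print(to_grayscale(sample))
```

A high score only says the capture contains coloured pixels; it cannot
tell ClearType apart from any other source of colour in the image.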

When I've (attempted to) use the Paper Capture for serious work,
I would sometimes "threshold" a scanned image, to try to get rid
of some of the noise. Even so, the error rate with that OCR was
pretty bad. The "O" versus "0" problem being one of them.
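
The "threshold" step can be sketched as follows. This is only an
illustration in plain Python, assuming the image is already a grid of
gray levels from 0 (black) to 255 (white); the function name and the
default cutoff of 128 are assumptions, not what any particular image
editor does.

```python
def threshold(gray, cutoff=128):
    """Force every pixel to pure black (0) or pure white (255).

    Pixels darker than `cutoff` become black, the rest become white,
    which wipes out faint gray noise around the characters."""
    return [[0 if value < cutoff else 255 for value in row] for row in gray]


# Hypothetical row: solid ink, gray fringe, light smudge, paper.
row = [[0, 90, 200, 255]]
print(threshold(row))             # the fringe at 90 is treated as ink
print(threshold(row, cutoff=64))  # a lower cutoff treats it as paper
```

Where the cutoff sits decides whether fringe pixels thicken the letters
or disappear, so it usually needs a bit of trial and error per scan.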

I've had a couple other OCR packages (Omnipage might have been one
of them), but somewhere along the way, I decided not to re-install
them on the next computer. That probably had something to do
with a certain "training" test, where I invested significant time
trying to train out the error rate - and the results ended up
being worse than the installation defaults. At some point, after
seeing the results, I ended up saying something like "why am I
doing this?"

I keep the Acrobat 4 installed, because I occasionally need the
services of Distiller (.ps to PDF). Having the Paper Capture to
torture once in a while, is a bonus.

Paul

Ed Cryer · Aug 14, 2013

Paul said:
What's interesting, is I do see fringes around the letters!
And it's not clear to me, what those fringes are for. It doesn't
look like ClearType. Almost like an attempt at a drop shadow.

And considering that, FreeOCR did do a good job. Probably better
than my ancient Paper Capture could manage.

My Paper Capture should have been able to capture "l" and "i"
letters, as they were pretty clean (solid enough black, to detect).
But it didn't.

Paul

I've downloaded the Irfanview Kadmos OCR plugin, and tested it on the
same .jpg.
http://tinyurl.com/qyegfn9

Not bad, but a couple of strange mistakes with "a".

Ed

Metspitzer · Aug 14, 2013

I meant, can't you highlight and copy at the point you're doing the
screen capture, rather than just doing a screen capture.

Nope. I have two programs I use that do not allow you to highlight
text. That would be easier.

Paul · Aug 14, 2013

Ed said:
I've downloaded the Irfanview Kadmos OCR plugin, and tested it on the
same .jpg.
http://tinyurl.com/qyegfn9

Not bad, but a couple of strange mistakes with "a".

Ed

Maybe it's one of those OCRs that uses "context" to aid in the conversion.
Like, attempting to make a working sentence out of the words.
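
A much more modest version of "context" works at the word level: try
swapping look-alike characters and keep the first variant that turns up
in a word list. The sketch below is purely illustrative. The look-alike
table, the function name and the tiny vocabulary are assumptions, and
nothing here is known about how Kadmos or any other engine actually
works.

```python
from itertools import product

# Hypothetical table of characters an OCR commonly confuses.
# Each entry lists the original character first, then its look-alikes.
CONFUSABLE = {
    "0": "0oO",
    "O": "O0",
    "1": "1lI",
    "l": "l1I",
    "I": "Il1",
}


def correct_word(word, vocabulary):
    """Return the first look-alike variant of `word` found in
    `vocabulary` (compared in lower case), or `word` unchanged."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    for candidate in product(*options):
        variant = "".join(candidate)
        if variant.lower() in vocabulary:
            return variant
    return word


vocabulary = {"colour", "flying", "link"}
print(correct_word("c0lour", vocabulary))
print(correct_word("f1ying", vocabulary))
```

The number of variants grows quickly with the number of ambiguous
characters, so this only suits short words and a small table.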

Paul

Ed Cryer · Aug 14, 2013

Paul said:
Maybe it's one of those OCRs that uses "context" to aid in the conversion.
Like, attempting to make a working sentence out of the words.

Paul

Your "working sentence" is (I believe) still way beyond what IT can
produce an algorithm for. Perhaps not quite as complicated as 3D
recognition, but the open nature of language presents real problems.

Anyway, even if it were using some basic rules of grammar and syntax,
you'd expect a noun to be preceded by an "a" rather than the weird
squiggles that it's come up with.

Ed
