Activity - @IDIC@mastodon.social @dangillmor@mastodon.social...

demerara, 8 months ago

@IDIC @dangillmor

For years a hobby of mine was to take stuff like this, run it through OCR to get editable text, and end up with a proper flowing text ebook. Unfortunately, my old eyes don't let me do much of this any more.

I took a sample page from the first catalog and experimented with it. The image quality is excellent for OCR; my setup gives very few scannos.

But because of the layout, this would have to be done almost paragraph by paragraph, and then formatted into something ereaders/tablets/phones could handle. And all the images! It would be a real labour of love to do just one of these volumes.

There is no automated way to take pages like these and just stuff them through OCR. The result would be horrible. Maybe that is something AI could learn to do!

If anyone would like to give it a try but doesn't know how, I'd be happy to share my experience. Tools I use, all on Linux: OCRFeeder with Tesseract, LibreOffice Writer, and for images, GIMP and ImageMagick, then Calibre Editor for EPUB making and HTML editing.
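
For anyone who prefers scripting, here is a rough sketch of what one paragraph-sized step might look like: cut a region out of the page with ImageMagick, OCR it with Tesseract, then join the hard-wrapped lines into flowing text. This is not my actual workflow, which goes through OCRFeeder's GUI. The function names, file names, and region coordinates are made up for illustration, and the ImageMagick command is assumed to be installed as `convert` (newer installs may call it `magick`).

```python
import re


def crop_command(page, region, out):
    """Build an ImageMagick command that cuts one region out of a page scan.

    region is (x, y, width, height) in pixels; the values are hypothetical.
    """
    x, y, w, h = region
    return ["convert", page, "-crop", f"{w}x{h}+{x}+{y}", "+repage", out]


def ocr_command(image, lang="eng"):
    """Build a Tesseract command that prints the recognised text to stdout.

    --psm 6 tells Tesseract to treat the crop as a single block of text.
    """
    return ["tesseract", image, "stdout", "-l", lang, "--psm", "6"]


def reflow(ocr_text):
    """Turn hard-wrapped OCR lines into flowing paragraphs.

    Assumption: a line ending in "-" is a word split across lines. This will
    also swallow genuine hyphens that happen to fall at a line end, so the
    result still needs proofreading.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", ocr_text.strip()):
        text = ""
        for line in (ln.strip() for ln in block.splitlines()):
            if not line:
                continue
            if text.endswith("-"):
                text = text[:-1] + line  # rejoin the split word
            elif text:
                text += " " + line
            else:
                text = line
        paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

The two command lists can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured Tesseract output fed to `reflow`. Choosing the regions is still the manual part, which is where the hours go.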

You would probably be looking at several hundred hours or more for one of these volumes.

#OCR #ebooks @bookstodon
