
Topic: pdf image to text (ocr software) SOLVED

sledgehammer

  • Vectorian
  • ****
  • Posts: 1451
pdf image to text (ocr software) SOLVED
« on: February 04, 2008, 11:21:54 am »
I can extract text from pdf files created by a word processor using pdfedit.  However, I can't extract text from pdf files that were created by scanning on a copier.  I believe I can convert the pdf images to tif or jpeg using gimp, but I can't extract text from those either.

If anyone has had luck using character recognition software on VL 5.8, I would appreciate some tips.

John
« Last Edit: April 14, 2008, 04:22:58 pm by sledgehammer »
VL7.0 xfce4 Samsung RF511

bad_gui

  • Member
  • *
  • Posts: 61
Re: pdf image to text (ocr software)
« Reply #1 on: February 29, 2008, 07:06:30 pm »

sledgehammer

  • Vectorian
  • ****
  • Posts: 1451
Re: pdf image to text (ocr software)
« Reply #2 on: March 03, 2008, 10:48:19 pm »
Though I can't yet get to the wiki for some reason, it looks like it's just what I need.  Thanks.  I will download it and try it soon.

John

sledgehammer

  • Vectorian
  • ****
  • Posts: 1451
Re: pdf image to text (ocr software)
« Reply #3 on: March 24, 2008, 10:01:24 pm »
So far I have been unable to install tesseract.  However, when time permits (or necessity requires) I am going to try the following site, which says it has tesseract already installed. 

http://www.abillionbillion.com/about/document-management-for-everyone

If anyone has tried it, I would appreciate a heads up.

sledgehammer

  • Vectorian
  • ****
  • Posts: 1451
Re: pdf image to text (ocr software)
« Reply #4 on: April 05, 2008, 02:57:41 pm »
Thanks bad_gui. 

Thanks a million!  I originally had trouble installing tesseract (I didn't know the makefile had to be run as root).  Eventually I got on to abillionbillion.com, followed the directions there, and it installed!  Then I had to learn to import pdfs into gimp as black and white, not color, and save them as tiff (tesseract only reads tif files).  Then it was simple: run tesseract from the prompt and it OCRs perfectly!
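
For anyone who wants to script that last step, here is a rough Python sketch of the "run tesseract from the prompt" part. It assumes the page has already been saved from gimp as a black and white tiff, and that tesseract is on the PATH. The file name scan.tif and the helper names are made up for illustration; the only thing taken from tesseract itself is the basic "tesseract image outputbase" form, where tesseract adds .txt to the output name.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, output_base):
    """Build the basic command line: tesseract <image> <output base>.

    Tesseract appends ".txt" to the output base on its own.
    """
    return ["tesseract", str(tiff_path), str(output_base)]


def ocr_tiff(tiff_path):
    """Run tesseract on one tiff and return the path of the text file."""
    tiff_path = Path(tiff_path)
    output_base = tiff_path.with_suffix("")  # scan.tif -> scan
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on the PATH")
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(tesseract_command(tiff_path, output_base), check=True)
    return output_base.with_name(output_base.name + ".txt")


if __name__ == "__main__":
    # "scan.tif" is a hypothetical file saved from gimp as black and white
    print(ocr_tiff("scan.tif"))
```

Running it next to scan.tif should leave the recognized text in scan.txt in the same directory, the same as typing the command by hand.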

Thanks, thanks, thanks, thanks! 

John

never_stop_learning

  • Vectorite
  • ***
  • Posts: 263
    • CigarWeekly
Re: pdf image to text (ocr software)
« Reply #5 on: April 05, 2008, 03:16:02 pm »
I happen to be visiting sledgehammer, watching the UCLA-Memphis game, and can attest to his exuberance... We thought we were going to have to get the defibrillator out... ;) ;D
Laptop: IBM X60s (Centrino/Duo, 2gb ram, 80gb hd) VL 6.0 Std
Netbook: HP Mini (Intel Atom 1ghz, 2gb ram, 16gb SSD + 8gb flash ) VL 6.0 Std
Desktop: Dell Dimension 5150 (P4 3ghz, 2gb ram, 80gb hd) VL 6.0 Std
Wife's Desktop: Gateway (P4 2ghz, 1gb ram, 80gb hd) VL 6.0 Std

789

  • Member
  • *
  • Posts: 26
Re: pdf image to text (ocr software) SOLVED
« Reply #6 on: May 22, 2009, 09:16:13 am »
Is there a compiled version of Tesseract available for download somewhere?

sledgehammer

  • Vectorian
  • ****
  • Posts: 1451
Re: pdf image to text (ocr software) SOLVED
« Reply #7 on: May 22, 2009, 09:26:49 am »

789

  • Member
  • *
  • Posts: 26
Re: pdf image to text (ocr software) SOLVED
« Reply #8 on: May 22, 2009, 10:01:15 am »
I don't compile... it is simpler to boot NT and use OmniPage.
Is there a compiled version available for download?