Two different blocks of text are merged together. Can I separate them if I know what one of them is?

I have tried several PDF → text methods to extract text from PDF documents. For one particular type of PDF that I have, neither PyPDF nor pdfMiner does a good job of extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.

I found that the PDF I am using contains transparent text, and that this text gets merged into the other text.

Some examples of the text blocks I get back are:

12324  35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12          Corrective             Object of Corrective                                                                                                                   
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position  
            C HAActRionT    N  Y  -NJ   - S A  N  D Y    H OO    K  ATcO tionLI T TLE EGG HARBOR.  Page/Side: N/A 
(Temp) indicates that the chart correction action is temporary in nature.  Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.       
 Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d  B Theuoy  5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W 
to     40-24-48.585N 074-00-05.967W 

and

12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W

and I found that the "ghost text" is ALWAYS:

 Corrective             Object of Corrective              Position
    Action                         Action

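(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.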

With that text removed, the result I am after is:

12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A 
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W

( ). , ( /) python. pyPDF, , . , , , . .

EDIT: , , .

, - - , , , . , , - .: -)

- . ;

<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
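
As a rough illustration of the template above, the first line could be matched with a regular expression once the text is clean. This is only a sketch: the name `HEADER_RE` and the group names are made up for the example, the ordinal suffix is widened to `st|nd|rd|th` because one sample in the question reads "35th", and the exact date formats are assumed from the two samples shown.

```python
import re

# Hypothetical pattern for the first template line; field formats are
# assumptions based on the two sample headers in the question.
HEADER_RE = re.compile(
    r"(?P<chart>\d+)\s+"                        # chart number, e.g. 12352
    r"(?P<edition>\d+)(?:st|nd|rd|th)\s+Ed\.\s+"  # edition, e.g. 33rd Ed.
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{2})\s+"     # edition date, e.g. 01-Mar-11
    r"Last LNM:\s*(?P<lnm>\d{2}/\d{2})\s+"      # last LNM as two numbers
    r"NAD\s+(?P<datum>\d+)\s+"                  # datum, e.g. NAD 83
    r"(?P<date2>\d{2}/\d{2})"                   # trailing pair, e.g. 04/12
)

line = "12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12"
match = HEADER_RE.search(line)
fields = match.groupdict() if match else {}
```

A pattern like this only helps on text that is already de-mangled; it can serve as a check that a candidate cleanup produced a well-formed header.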

, , , ( "Shinnecock Bay to East Rockaway Inlet" ) (, "State", "Boat", Daybeacon '), , levenshtein .
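
The Levenshtein idea mentioned above can be sketched as follows. This is a plain dynamic-programming edit distance plus a hypothetical helper, `closest_word`, that picks the nearest entry from a list of words you expect to see; the vocabulary is taken from the examples in this thread and is only an assumption about what a real word list would contain.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def closest_word(token, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to token."""
    return min(vocabulary, key=lambda word: levenshtein(token, word))


# Hypothetical word list built from the sample text in the question.
vocabulary = ["State", "Boat", "Channel", "Daybeacon"]
guess = closest_word("Daybeacn", vocabulary)
```

Edit distance works best when a token is only lightly damaged. For heavily interleaved runs such as "aSretat" the nearest word may well be wrong, so it is probably better used to score candidate cleanups than to repair text directly.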

poppler, pdftotext -layout, PDF . .
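
If you want to try poppler's `pdftotext -layout` from Python, a minimal sketch could look like this. It assumes the `pdftotext` binary is installed and on the PATH; the function names and file names are hypothetical, and whether the layout mode actually keeps the transparent text apart for this kind of PDF would have to be checked.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Build the argument list for poppler's pdftotext in layout mode."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_with_layout(pdf_path, txt_path):
    """Run pdftotext -layout; raises if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) was not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Hypothetical file names, shown only to illustrate the call:
# extract_with_layout("notice.pdf", "notice.txt")
```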

, " ..." ,

, - . , , , . .

():

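    def findAllOccurrences(text, letter, start=0):
        """Return every index >= start at which letter occurs in text."""
        return [i for i in range(start, len(text)) if text[i] == letter]

    def findPaths(mangledText, pattern, path, start=0):
        """Return all tuples of increasing indices where pattern is a subsequence of mangledText."""
        if len(pattern) == 0:  # end of pattern: this path is complete
            return [path]
        nextLetter = pattern[0]
        # all indices at or after start in mangledText that contain nextLetter
        locations = findAllOccurrences(mangledText, nextLetter, start)
        allPaths = []
        for loc in locations:
            # continue strictly after loc so the stored indices stay absolute
            paths = findPaths(mangledText, pattern[1:], path + (loc,), loc + 1)
            allPaths.extend(paths)
        return allPaths  # empty if no location for the next letter exists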

( , , )

    allPossiblePaths = findPaths(YourMangledText, "Corrective Object...", ())

allPossiblePaths , . , , , .
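
To make the use of such paths concrete, here is a small self-contained sketch that finds every way a known ghost string can be embedded in a mangled fragment and then deletes those characters. The names `find_paths` and `strip_path` are hypothetical, and the fragment is a short piece of the sample text from the question; on full lines the number of paths can grow very quickly, so this is an illustration rather than a ready solution.

```python
def find_paths(mangled, pattern, start=0, path=()):
    """Yield each tuple of increasing indices where pattern is a subsequence of mangled."""
    if not pattern:  # whole pattern placed: path is complete
        yield path
        return
    for loc in range(start, len(mangled)):
        if mangled[loc] == pattern[0]:
            # place the rest of the pattern strictly after loc
            yield from find_paths(mangled, pattern[1:], loc + 1, path + (loc,))


def strip_path(mangled, path):
    """Remove the characters at the given indices from mangled."""
    drop = set(path)
    return "".join(ch for i, ch in enumerate(mangled) if i not in drop)


mangled = "givCGenD 0in 1"  # fragment of the sample text in the question
paths = list(find_paths(mangled, "given in"))
candidates = [strip_path(mangled, p) for p in paths]
```

For this fragment there is exactly one path, and removing it leaves "CGD0 1", which is the wanted "CGD01" apart from a stray space. When several paths exist, each one yields a different candidate, and you still need some way to choose between them.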
