Symbolic representation of patterns in strings and the search for "similar" substructures

The string "abab" can be considered as a pattern of indexed characters "0101". And the string "bcbc" will also be represented by "0101". It is pretty elegant and makes for strong comparisons, but it quickly falls apart from perfect cases.

"babcbc" will be "010202". If I wanted to notice that it contains a pattern equal to β€œ0101” (part of bcbc), I can only think of some process of normalizing each index to β€œre-represent” a substring from n to length symbolically for comparison, and this It gets complicated if I try to understand if there is something in common with "babcbc" and "dababd" (010202 vs 012120). So inefficient!

How can this be done effectively, taking care of all possible enclosed cases? Note that I'm looking for similar patterns, not similar substrings in the actual text.

+5
source share
3 answers

Your algorithm has loss of information from compression of the original row data set, so I'm not sure that you can restore the complete data set without doing much more work than comparing the original row. In addition, while your dataset becomes easier to read, it takes up as much space as the original line, and the line difference map (where the values ​​are the distance between the previous character and the current character) can have more comparable set information.

, , , . - O (n * m), n m - . LCS on SO Wikipedia. , ( - abeab eabab ), LCS, .

, . LCS, , k- , . LCS, , k . .

, k, . , LCS k. , , , k 1 , , 4D , LCS. , 4D, , , LCS, , . , LCS , .

, , / , .

0

min (K, ), K - , babcbc dababd - KK2K22 KKK225. , .

+1

, Prolog , :

:-dynamic pattern_i/3.

test:-
  retractall(pattern_i(_,_,_)),
  add_pattern(abab),
  add_pattern(bcbc),
  add_pattern(babcbc),
  add_pattern(dababd),
  show_similarities.

show_similarities:-
  call(pattern_i(Word, Pattern, Maps)),
  match_pattern(Word, Pattern, Maps),
  fail.
show_similarities.

match_pattern(Word, Pattern, Maps):-
  all_dif(Maps), % all variables should be unique
  call(pattern_i(MWord, MPattern, MMaps)),
  Word\=MWord,
  all_dif(MMaps),
  append([_, Pattern, _], MPattern), % Matches patterns
  writeln(words(Word, MWord)),
  write('mapping: '),
  match_pattern1(Maps, MMaps). % Prints mappings

match_pattern1([], _):-
  nl,nl.
match_pattern1([Char-Char|Maps], MMaps):-
  select(MChar-Char, MMaps, NMMaps),
  write(Char), write('='), write(MChar), write(' '),
  !,
  match_pattern1(Maps, NMMaps).

add_pattern(Word):-
  word_to_pattern(Word, Pattern, Maps),
  assertz(pattern_i(Word, Pattern, Maps)).

word_to_pattern(Word, Pattern, Maps):-
  atom_chars(Word, Chars),
  chars_to_pattern(Chars, [], Pattern, Maps).

chars_to_pattern([], Maps, [], RMaps):-
  reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps):-
  member(Char-PChar, Maps),
  !,
  chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps):-
  chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

all_dif([]).
all_dif([_-Var|Maps]):-
  all_dif(Var, Maps),
  all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]):-
  dif(Var, MVar),
  all_dif(Var, Maps).

:

  • , char . : abcbc [X, Y, Z, Y, Z].
  • , . , abcbc zxzx, [X, Y, Z, Y, Z] [H, G, H, G]. , (H = Y, G = Z)
  • ( ), . , z = b, x = c

( abab, bcbc, babcbc, dababd):

?- test.

words(abab,bcbc)
mapping: a=b b=c 

words(abab,babcbc)
mapping: a=b b=c 

words(abab,dababd)
mapping: a=a b=b 

words(bcbc,abab)
mapping: b=a c=b 

words(bcbc,babcbc)
mapping: b=b c=c 

words(bcbc,dababd)
mapping: b=a c=b 
0

All Articles