How to group unknown text messages using an algorithm?

Below is an example of a dataset that I need to group. If you look carefully, the text strings are nearly identical and differ only in an ID or a person identifier.

Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 2X99999123 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 31X9393912 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 36X9393912 . Clear set not defined . Dump
Exception in thread "main" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded
Exception in thread "main" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded
Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump
Unexpected error:java.lang.RuntimeException:Data not found for person 6 . Clear set not defined . Dump
Exception in thread "main" java.lang.NullPointerException at TripleDESTest.encrypt(TripleDESTest.java:18)

I want to group them so that the end result looks like this:

Unexpected error:java.lang.RuntimeException:Data not found - 6
Exception in thread "main" javax.crypto.BadPaddingException - 2
Exception in thread "main" java.lang.NullPointerException at - 1

Is there an available API or algorithm to handle such cases?

Thanks in advance. Cheers, Shakti

+3
3 answers

This question is tagged as machine learning, so I'm going to suggest a classification approach.

tokenize - , .

, , () C4.5 - . , > 1.

, "" ! , .

, , , .

:

  • , , , msg (NPE ) - , , , BadPaddingException.
  • - weka - java
  • , , , 10- ?
+2

, , , ... , ....

, , , , StringTokenizer...

0

If you know the message format, the easiest way is to match each message against a regular expression and count the matches per pattern.

Regular expressions are fully supported in Java (the java.util.regex package), and using them is simpler and almost certainly faster than running a clustering algorithm.
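As a rough illustration of the regex idea, here is a minimal sketch. It assumes that the variable part of each message is any token that starts with a digit (such as 1X99999123 or 5); that rule, the class name MessageGrouper, and the <ID> placeholder are my own choices for the example and not part of the question. Each message is normalized by replacing those tokens with the placeholder, and the normalized strings are then counted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class MessageGrouper {
    // Assumption: an identifier is a token that starts with a digit.
    private static final Pattern ID_TOKEN = Pattern.compile("\\b\\d\\w*\\b");

    // Replace every identifier-like token with a fixed placeholder.
    static String normalize(String message) {
        return ID_TOKEN.matcher(message).replaceAll("<ID>").trim();
    }

    // Count how many messages share the same normalized form,
    // keeping the order in which each form was first seen.
    static Map<String, Integer> group(List<String> messages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String message : messages) {
            counts.merge(normalize(message), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> messages = List.of(
            "Unexpected error:java.lang.RuntimeException:Data not found for person 1X99999123 . Clear set not defined . Dump",
            "Unexpected error:java.lang.RuntimeException:Data not found for person 5 . Clear set not defined . Dump",
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 1 Given final block not properly padded",
            "Exception in thread \"main\" javax.crypto.BadPaddingException: ID 2 Given final block not properly padded");
        // Print each normalized message followed by its count.
        group(messages).forEach((text, count) -> System.out.println(text + " - " + count));
    }
}
```

Applied to the nine sample lines in the question, this rule should yield three groups with counts 6, 2 and 1. If identifiers can also start with a letter, the pattern would need to be adjusted to the real message format.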

0