I need to develop an algorithm that can find the position of a data item in some hierarchy. I have a hierarchy that classifies the elements of a certain data set. The hierarchy is taxonomic - the top element is the most general class that corresponds to any element of the data set, the deeper elements contain more specific classes that correspond to some subset of the data set.
For example, consider the hierarchy of yachts. We have a cool yacht upstairs. At the next level, we have a sailing yacht and a motor yacht. A sailing yacht consists of two children - yachts on cruise yachts and racing yachts. Cruisers can be divided into manufacturers, for example, Bavaria Yachts and Dufour Yachts. Then each of these classes can be divided into hull type, length, sails, etc.
This is an example from a dataset:
Drive Class Manufacturer Hull type Len Sails Area ... Model
Sailing Cruiser Bavaria Yachts Mono-hull 25ft 560sqft ... Bavaria 32
Sailing Cruiser Dufour Yachts Mono-hull 27ft 580sqft ... Dufour 32 Classic
I can easily match each pattern to a hierarchy by searching in depth order.
This is a simple search problem at first glance, but there are some difficulties.
First difficulty: data elements do not necessarily contain all elements. It is well known that a data element does not contain 10 to 50 percent of the elements. Many of these elements are not very significant, for example, the Drive yacht can only be Motor or Sail, so it does not bring much information (only 1 bit). These elements can be easily determined using more significant elements, for example, if we know the model of a yacht, we can enclose all other elements (or fields) of the data element.
The second difficulty: some elements may differ between different data elements, even if they correspond to the same place in the hierarchy (the same model of a yacht). For example, the Sails area can vary a lot because boat owners change their yacht platform differently or simply round the area.
, . . - , . , , , . , , Juliet 23, .
, . , 4 Juliet 23 , 25%.
, . , , . , , .