Class DuplicateByteSequenceSpotter
The minimum required length for a duplicate sequence detected is 6 bytes.
The design goals are to maximize speed of lookup while minimizing the space required to do so. This has led to a hybrid solution for representing the bytes that make up a sequence in the trie.
If we have 6 bytes in sequence e.g. abcdef then they are represented as object nodes in the tree as follows:
(a)-(b)-(c)-(def as an int)
 DuplicateByteSequenceSpotter.RootTreeNode objects are used for the first two levels of the tree
 (representing bytes a and b in the example sequence). The combinations of
 objects at these 2 levels are few so internally these objects allocate an
 array of 256 child node objects to quickly address children by indexing
 directly into the densely packed array using a byte value. The third level in
 the tree holds DuplicateByteSequenceSpotter.LightweightTreeNode nodes that have few children
 (typically much less than 256) and so use a dynamically-grown array to hold
 child nodes as simple int primitives. These ints represent the final 3 bytes
 of a sequence and also hold a count of the number of times the entire sequence
 path has been visited (count is a single byte).
 
The Trie grows indefinitely as more content is added and while theoretically it could be massive (a 6-depth tree could produce 256^6 nodes) non-random content e.g English text contains fewer variations.
In future we may look at using one of these strategies when memory is tight:
- auto-pruning methods to remove less-visited parts of the tree
 - auto-reset to wipe the whole tree and restart when a memory threshold is reached
 - halting any growth of the tree
 
- 
Field Summary
Fields - 
Constructor Summary
Constructors - 
Method Summary
Modifier and TypeMethodDescriptionshortaddByte(byte b)Add a byte to the sequence.longint[]intvoidReset the sequence detection logic to avoid any continuation of the immediately previous bytes. 
- 
Field Details
- 
TREE_DEPTH
public static final int TREE_DEPTH- See Also:
 - Constant Field Values
 
 - 
MAX_HIT_COUNT
public static final int MAX_HIT_COUNT- See Also:
 - Constant Field Values
 
 
 - 
 - 
Constructor Details
- 
DuplicateByteSequenceSpotter
public DuplicateByteSequenceSpotter() 
 - 
 - 
Method Details
- 
startNewSequence
public void startNewSequence()Reset the sequence detection logic to avoid any continuation of the immediately previous bytes. A minimum of dupSequenceSize bytes need to be added before any new duplicate sequences will be reported. Hit counts are not reset by calling this method. - 
addByte
public short addByte(byte b)Add a byte to the sequence.- Parameters:
 b- the next byte in a sequence- Returns:
 - number of times this byte and the preceding 6 bytes have been seen before as a sequence (only counts up to 255)
 
 - 
getEstimatedSizeInBytes
public final long getEstimatedSizeInBytes() - 
getNodesAllocatedByDepth
public int[] getNodesAllocatedByDepth()- Returns:
 - Performance info - the number of nodes allocated at each depth
 
 - 
getNodesResizedByDepth
public int getNodesResizedByDepth()- Returns:
 - Performance info - the number of resizing of children arrays, at each depth
 
 
 -