Knowledge compression is maybe an important characteristic of contemporary computation, enabling environment friendly storage and transmission of knowledge. Some of the well-known compression algorithms is Huffman coding. On this publish, we’re going to introduce a sophisticated model: a block-based, 2-symbol, two-pass Huffman algorithm in Golang. It could actually convey additional enhancements concerning the rise of compression effectivity in particular kinds of knowledge, as it is going to consider pairs of symbols as an alternative of particular person ones.
Algorithm Overview
The 2-pass Huffman algorithm in blocks of two symbols is an extension of the traditional Huffman coding. It processes enter knowledge in pairs of bytes, probably providing higher compression ratios for sure kinds of knowledge. Let’s break down the encoding course of step-by-step:
1. First Move: Frequency Counting
- Learn the enter file in blocks of two bytes (16 bits).
- Retailer these blocks in a map construction, the place:
- The hot button is the 2-byte block (represented as an
int16
). - The worth is a construction containing frequency info and different metadata.
- The hot button is the 2-byte block (represented as an
2. Construct the Huffman Tree
- Create an inventory of components from the map.
- Type the listing based mostly on the frequency of every block.
- Iteratively mix the 2 least frequent components:
- Create a brand new node that turns into the father or mother of those two components.
- The frequency of the brand new node is the sum of its kids’s frequencies.
- Retailer references to the unique two components throughout the new node.
- Repeat the mix course of till just one aspect (the foundation) stays.
3. Generate Huffman Codes
- Ranging from the foundation of the Huffman tree, traverse the tree:
- Assign
0
for left branches and1
for proper branches. - When a leaf node is reached, the trail from root to leaf turns into the code for that block.
- Assign
- Retailer the ensuing codes for every unique 2-byte block.
4. Second Move: Encoding
- Learn the enter file once more, this time changing every 2-byte block with its corresponding Huffman code.
- Write the encoded knowledge to the output file, little by little.
5. File Construction
The encoded file has a selected construction to permit for correct decoding:
- First 4 bytes: Variety of nodes within the Huffman tree
- Subsequent byte: Variety of helpful bits within the final byte of encoded knowledge
- Subsequent byte: Flag indicating if a zero byte was added to the tip of the enter file
- Tree construction: Encoded illustration of the Huffman tree
- Encoded knowledge: The compressed content material
Golang Implementation
Let’s dive into the important thing parts of the Golang implementation:
// Block represents a node within the Huffman tree
kind Block struct {
Worth uint16 // The two-byte block worth
Frequency int // Frequency of the block within the enter
Left *Block // Left youngster within the Huffman tree
Proper *Block // Proper youngster within the Huffman tree
Code string // Huffman code for this block
}
// EncodeFile orchestrates the complete encoding course of
func EncodeFile(inputPath, outputPath string) error {
// Depend block frequencies
blocks, err := countBlockFrequencies(inputPath)
if err != nil {
return err
}
// Construct the Huffman tree
root := buildHuffmanTree(blocks)
// Generate Huffman codes for every block
generateCodes(root, "")
// Encode the file utilizing the generated codes
return writeEncodedFile(inputPath, outputPath, root, blocks)
}
// countBlockFrequencies reads the enter file and counts the frequency of every 2-byte block
func countBlockFrequencies(inputPath string) (map[uint16]*Block, error) {
blocks := make(map[uint16]*Block)
file, err := os.Open(inputPath)
if err != nil {
return nil, err
}
defer file.Shut()
reader := bufio.NewReader(file)
for {
block, err := reader.ReadUint16(binary.BigEndian)
if err != nil {
if err == io.EOF {
break
}
return nil, err
}
if _, exists := blocks[block]; !exists {
blocks[block] = &Block{Worth: block, Frequency: 1}
} else {
blocks[block].Frequency++
}
}
return blocks, nil
}
// buildHuffmanTree constructs the Huffman tree from the frequency map
func buildHuffmanTree(blocks map[uint16]*Block) *Block {
pq := make(PriorityQueue, 0, len(blocks))
for _, block := vary blocks {
heap.Push(&pq, block)
}
for pq.Len() > 1 {
left := heap.Pop(&pq).(*Block)
proper := heap.Pop(&pq).(*Block)
father or mother := &Block{
Frequency: left.Frequency + proper.Frequency,
Left: left,
Proper: proper,
}
heap.Push(&pq, father or mother)
}
return heap.Pop(&pq).(*Block)
}
// generateCodes assigns Huffman codes to every block
func generateCodes(node *Block, code string) {
if node.Left == nil && node.Proper == nil {
node.Code = code
return
}
generateCodes(node.Left, code+"0")
generateCodes(node.Proper, code+"1")
}
// writeEncodedFile writes the compressed knowledge to the output file
func writeEncodedFile(inputPath, outputPath string, root *Block, blocks map[uint16]*Block) error {
inputFile, err := os.Open(inputPath)
if err != nil {
return err
}
defer inputFile.Shut()
outputFile, err := os.Create(outputPath)
if err != nil {
return err
}
defer outputFile.Shut()
// Write file header
binary.Write(outputFile, binary.BigEndian, uint32(len(blocks)))
// ... (write different header info)
// Write tree construction
writeTreeStructure(outputFile, root)
// Encode and write knowledge
reader := bufio.NewReader(inputFile)
author := bitio.NewWriter(outputFile)
for {
block, err := reader.ReadUint16(binary.BigEndian)
if err != nil {
if err == io.EOF {
break
}
return err
}
code := blocks[block].Code
for _, bit := vary code {
if bit == '0' {
author.WriteBit(bitio.Zero)
} else {
author.WriteBit(bitio.One)
}
}
}
author.Shut()
return nil
}
This implementation offers a strong basis for the two-pass Huffman algorithm in Golang. The EncodeFile
operate orchestrates the complete encoding course of, from frequency counting to writing the encoded file.
Decoding Course of
The decoding course of primarily reverses the encoding steps:
- Learn the file header to acquire:
- Variety of nodes within the Huffman tree
- Variety of helpful bits within the final byte of encoded knowledge
- The presence of a “zero byte” on the finish of the unique file
- Reconstruct the Huffman tree:
- Learn the encoded tree construction little by little.
- When a
1
is encountered, learn the subsequent 16 bits to get the unique block worth. - When a 0 is encountered, create an inner node.
- Learn the encoded knowledge:
- Course of the information little by little, traversing the Huffman tree.
- When a leaf node is reached, output the corresponding 2-byte block.
- Proceed till all knowledge is processed.
- Deal with the final byte:
- Use the details about helpful bits to keep away from outputting further knowledge.
- If a “zero byte” was added throughout encoding, take away it from the decoded output.
This is a skeleton of the decoding operate in Golang:
func DecodeFile(inputPath, outputPath string) error {
inputFile, err := os.Open(inputPath)
if err != nil {
return err
}
defer inputFile.Shut()
outputFile, err := os.Create(outputPath)
if err != nil {
return err
}
defer outputFile.Shut()
// Learn file header
var nodeCount uint32
binary.Learn(inputFile, binary.BigEndian, &nodeCount)
// ... (learn different header info)
// Reconstruct Huffman tree
root := reconstructTree(inputFile)
// Decode knowledge
reader := bitio.NewReader(inputFile)
author := bufio.NewWriter(outputFile)
currentNode := root
for {
bit, err := reader.ReadBit()
if err != nil {
if err == io.EOF {
break
}
return err
}
if bit == bitio.Zero {
currentNode = currentNode.Left
} else {
currentNode = currentNode.Proper
}
if currentNode.Left == nil && currentNode.Proper == nil {
binary.Write(author, binary.BigEndian, currentNode.Worth)
currentNode = root
}
}
author.Flush()
// Deal with final byte and nil byte if essential
// ...
return nil
}
Efficiency Evaluation
The supplied take a look at outcomes present the algorithm’s efficiency on numerous file sorts. Let’s analyze a few of these outcomes:
- Textual content information (books, papers, supply code):
- Achieved compression ratios of round 8-9 bits per phrase
- Instance:
e book.txt
compressed to eight.14 bits per phrase - Entropy values:
- H(X) = 4.53 (first-order entropy)
- H(X|X) = 3.58 (second-order entropy)
- Picture file (
image
):- Distinctive compression: 2.39 bits per phrase
- Entropy values a lot decrease than textual content information:
- H(X) = 1.21
- H(X|X) = 0.82
- Supply code information (
program1
,program2
,program3
):- Compression ratios between 8.00 and eight.80 bits per phrase
- Typically decrease entropy values in comparison with pure language textual content
- Geometric knowledge (
geometric
):- Larger compression ratio: 9.22 bits per phrase
- Larger entropy values:
- H(X) = 5.65
- H(X|X) = 4.26
These outcomes reveal that the two-pass Huffman algorithm in blocks of two symbols performs in a different way throughout numerous kinds of knowledge:
- It is significantly efficient for picture knowledge, attaining vital compression.
- For textual content and supply code, it offers average compression, with efficiency various based mostly on the content material’s complexity and redundancy.
- Geometric knowledge appears to be more difficult to compress with this technique.
The entropy values present insights into the theoretical limits of compression for every file kind. The distinction between the achieved compression (bits per phrase) and the first-order entropy H(X) signifies the potential for additional optimization.
For additional reference and verification, all information used within the efficiency evaluation, together with textual content information, picture information, supply code, and geometric knowledge, will be accessed by way of the next Google Drive link. This hyperlink offers direct entry to the information talked about on this report, permitting for a extra detailed overview of the compression outcomes and entropy values throughout numerous file sorts.
Conclusion
Huffman’s two-pass algorithm in blocks of two symbols is an fascinating strategy for conducting knowledge compression. It could yield higher compression ratios for sure kinds of knowledge in comparison with traditional single-symbol Huffman coding. The Golang implementation right here has proven effectiveness for a variety of file sorts and sizes.
Key Takeaways
- The perfect efficiency of the algorithm is on picture knowledge, which justifies utilizing it within the pipelines of picture compression.
- As regards to textual content and supply code, it offers average compression, which is an efficient stability between effectivity and complexity. Block-based permits catching some context-meaningful pairs of symbols-what will be higher in some situations. Simple to implement and combine into present Golang tasks.
- The block-based strategy permits for capturing some context (pairs of symbols), which may result in higher compression in sure situations.
- The implementation is versatile and will be simply built-in into present Golang tasks.
As with all compression technique, the effectiveness can differ relying on the enter knowledge traits. Additional optimizations and variations might probably improve its efficiency for particular use instances:
- Experimenting with totally different block sizes (e.g., 3 or 4 bytes) may yield higher outcomes for sure knowledge sorts.
- Implementing adaptive block sizes based mostly on knowledge traits might optimize compression ratios.
- Parallel processing of blocks might enhance encoding and decoding velocity for big information.
In conclusion, this two-pass Huffman implementation in Golang offers a strong basis for exploring superior knowledge compression methods. Its block-based strategy gives a singular stability between compression efficiency and algorithmic complexity, making it a invaluable software within the knowledge compression toolkit.
References
- Kudryashov, B.D. Data Principle. Larger Faculty of Economics, 2009.