Aho-Corasick Algorithm

Aho-Corasick Algorithm

Hey there, fellow tech enthusiasts! Let's dive into the fascinating Aho-Corasick algorithm—a super-efficient way to search for multiple patterns in a text. Whether you’re a programmer, data scientist, or just a curious mind, this algorithm is something you’ll want to know about. I’ll break it down into simple terms, show you how it works, and even give you code snippets in C# and Java to try out. Ready? Let's go!


What’s the Aho-Corasick Algorithm All About?

Imagine you have a bunch of keywords you want to find in a long piece of text. Instead of searching for each keyword one by one, the Aho-Corasick algorithm lets you search for all of them at once. This method is super efficient and does the job in linear time, meaning it processes each character of the text just once!

The Basics

Here’s the magic in three steps:

1. Build a Trie: Think of this as a tree where each branch represents a character in your keywords. By the end, you’ll have a tree that shows all the possible paths (keywords) you’re looking for. For instance - Here I have created a trie for keywords "his", "hers" and "she"

Article content
keywords - "his", "hers", "she"


2. Add Failure Links: This sounds more complicated than it is. If you hit a dead end while searching, these links tell you where to go next, so you don’t have to start over. It’s like having a GPS that redirects you when you miss a turn.

In this we identify the failure links. It is nothing but the longest proper suffix of the current string which is also a prefix of one of the keywords. For instance- The failure link from "h" in "she" to "h" in "his" allows the algorithm to efficiently redirect and continue searching in "his" when a mismatch occurs after matching "h" in "she".

Article content
with failure links (----)

Failure links are crucial for optimizing pattern matching efficiency by allowing the algorithm to recover quickly from mismatches. They enable the trie structure to handle multiple keywords effectively.

3. Search the Text: With your trie and failure links in place, you can now scan through your text. Each character in the text will guide you through the trie, and you’ll know immediately if you’ve found one of your keywords.

For example if our input text is "sushis" then at index 3 (1-based indexding) we will be at node s (in she) then h (in she) but now it fails as for she the next node is e but text has i(at index 5) so we traverse the failure link at h (in she) and reaches h(in his/hers). Finally we find the keyword "his" is present in the input text

Let's See It in Action

Here's a C# implementation to get you started:

using System;
using System.Collections.Generic;

public class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    private Node root;

    public void BuildTrie(List<string> patterns)
    {
        root = new Node();
        foreach (var pattern in patterns)
        {
            var current = root;
            foreach (var c in pattern)
            {
                if (!current.Children.ContainsKey(c))
                {
                    current.Children[c] = new Node();
                }
                current = current.Children[c];
            }
            current.Output.Add(pattern);
        }

        BuildFailureLinks();
    }

    private void BuildFailureLinks()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Children.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }

        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            foreach (var kvp in current.Children)
            {
                var c = kvp.Key;
                var child = kvp.Value;
                var fail = current.Fail;
                while (fail != null && !fail.Children.ContainsKey(c))
                {
                    fail = fail.Fail;
                }
                if (fail == null)
                {
                    child.Fail = root;
                }
                else
                {
                    child.Fail = fail.Children[c];
                    child.Output.AddRange(child.Fail.Output);
                }
                queue.Enqueue(child);
            }
        }
    }

    public List<(int, List<string>)> Search(string text)
    {
        var current = root;
        var results = new List<(int, List<string>)>();

        for (int i = 0; i < text.Length; i++)
        {
            while (current != null && !current.Children.ContainsKey(text[i]))
            {
                current = current.Fail;
            }
            if (current == null)
            {
                current = root;
                continue;
            }
            current = current.Children[text[i]];
            if (current.Output.Count > 0)
            {
                results.Add((i, new List<string>(current.Output)));
            }
        }
        return results;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        var patterns = new List<string> { "he", "she", "his", "hers" };
        var text = "ushers";
        var ac = new AhoCorasick();
        ac.BuildTrie(patterns);
        var matches = ac.Search(text);

        foreach (var match in matches)
        {
            Console.WriteLine($"Pattern found at index {match.Item1}: {string.Join(", ", match.Item2)}");
        }
    }
}
        

And here’s how you can do it in Java:

import java.util.*;

public class AhoCorasick {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;
        List<String> output = new ArrayList<>();
    }

    private Node root;

    public void buildTrie(List<String> patterns) {
        root = new Node();
        for (String pattern : patterns) {
            Node current = root;
            for (char c : pattern.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
            current.output.add(pattern);
        }
        buildFailureLinks();
    }

    private void buildFailureLinks() {
        Queue<Node> queue = new LinkedList<>();
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.add(child);
        }

        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> entry : current.children.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node fail = current.fail;
                while (fail != null && !fail.children.containsKey(c)) {
                    fail = fail.fail;
                }
                if (fail == null) {
                    child.fail = root;
                } else {
                    child.fail = fail.children.get(c);
                    child.output.addAll(child.fail.output);
                }
                queue.add(child);
            }
        }
    }

    public List<Map.Entry<Integer, List<String>>> search(String text) {
        Node current = root;
        List<Map.Entry<Integer, List<String>>> results = new ArrayList<>();

        for (int i = 0; i < text.length(); i++) {
            while (current != null && !current.children.containsKey(text.charAt(i))) {
                current = current.fail;
            }
            if (current == null) {
                current = root;
                continue;
            }
            current = current.children.get(text.charAt(i));
            if (!current.output.isEmpty()) {
                results.add(new AbstractMap.SimpleEntry<>(i, new ArrayList<>(current.output)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("he", "she", "his", "hers");
        String text = "ushers";
        AhoCorasick ac = new AhoCorasick();
        ac.buildTrie(patterns);
        List<Map.Entry<Integer, List<String>>> matches = ac.search(text);

        for (Map.Entry<Integer, List<String>> match : matches) {
            System.out.println("Pattern found at index " + match.getKey() + ": " + match.getValue());
        }
    }
}        

Complexity of the Algorithm

The beauty of the Aho-Corasick algorithm lies in its efficiency:

  • Trie Construction: Building the trie from the list of patterns takes O(m) time, where m is the total length of all the patterns combined.
  • Failure Link Construction: Creating the failure links also takes O(m) time.
  • Searching the Text: Searching through the text is O(n), where n is the length of the text.

Thus, the overall complexity of the algorithm is O(n+m). This makes it exceptionally fast for multi-pattern matching.

Cool Applications

  • Text Processing:

Search Engines: Quickly find multiple keywords in large documents.

Spam Filtering: Detect spammy phrases in emails.

  • Cybersecurity:

Intrusion Detection: Spot known attack patterns in network traffic.

Antivirus Software: Identify malware signatures in files.

  • Bioinformatics:

DNA Sequence Analysis: Locate multiple genetic patterns in DNA sequences.


Wrapping Up

The Aho-Corasick algorithm is like a Swiss Army knife for pattern matching. It’s quick, efficient, and incredibly useful across different fields. Now that you’ve got the basics down, try playing around with the code and see how it can help you in your projects. Happy coding!

To view or add a comment, sign in

More articles by Prerna Agarwal

Others also viewed

Explore content categories