The Group anagrams problem is a classic introduction to using hashing for efficient grouping. It is a favorite in interviews because it tests whether you can replace repeated pairwise comparisons with a single, consistent representation for each string. While checking anagrams directly may seem straightforward, this problem highlights why computing a canonical key (such as a character-frequency map or a sorted string) is crucial for efficient scaling.
Problem statement
Given an array of strings `strs`, your task is to group together the strings that are anagrams of each other. You can return the groups in any order.
Constraints:
- `1 <= strs.length <= 10^4`
- `0 <= strs[i].length <= 100`
- `strs[i]` consists of lowercase English letters.
Examples
| Input | Output | Explanation |
| ["arc", "car", "elbow", "below", "state", "taste", "cat"] | [["arc", "car"], ["elbow", "below"], ["state", "taste"], ["cat"]] | "arc" and "car" both share the letters {a, c, r}, so they form one group. "elbow" and "below" share {b, e, l, o, w}, so they form another. "state" and "taste" share {a, e, s, t, t}, so they form another. "cat" stands alone. |
| ["abc", "bca", "cab", "xyz", "zyx", "zxy", "foo"] | [["abc", "bca", "cab"], ["xyz", "zyx", "zxy"], ["foo"]] | "abc", "bca", and "cab" all share the letters {a, b, c}, so they form one group. "xyz", "zyx", and "zxy" share {x, y, z}, so they form another. "foo" stands alone. |
| ["aa", "aa", "a", "bb", "b", "b"] | [["aa", "aa"], ["a"], ["bb"], ["b", "b"]] | The two "aa" strings share the letters {a, a}, so they form one group. The two "b" strings share {b}, so they form another group. "a" and "bb" don't share the same multiset of letters with any other string, so each stands alone. |
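The first example above can be spot-checked with sorted-character signatures; this is a small illustrative snippet (the `signatures` name is my own), since two words are anagrams exactly when their sorted characters match:

```python
words = ["arc", "car", "elbow", "below", "state", "taste", "cat"]

# Map each word to its sorted-character signature.
signatures = {word: "".join(sorted(word)) for word in words}

print(signatures["arc"], signatures["car"])  # both "acr"
```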
Naive approach
A simple approach is to compare every pair of strings and decide if they are anagrams. That means repeatedly checking the same strings against many others. For example, given the strings: ["eat", "tea", "tan", "ate"], you end up comparing each string with every other string, i.e.,
- Comparing `"eat"` with `"tea"`
- Comparing `"eat"` with `"tan"`
- Comparing `"eat"` with `"ate"`
- Comparing `"tea"` with `"tan"`
- Comparing `"tea"` with `"ate"`
- Comparing `"tan"` with `"ate"`
That is 6 comparisons for just 4 strings. In general, if there are n strings, comparing every pair takes n(n−1)/2 checks, which is O(n²) time. With up to 10^4 strings, the number of comparisons becomes enormous, and that's before accounting for the cost of each anagram check.
Each pair check is also non-trivial: to confirm two strings are anagrams, you must process their characters, either by building frequency counts or by sorting. That makes the total runtime:
- If you check anagrams using character counts, the overall cost is O(n² × k) time and O(1) extra space (a fixed 26-entry count array), where n is the number of strings and k is the average string length.
- If you check anagrams by sorting each pair of strings, the cost becomes O(n² × k log k) time. This method typically requires creating sorted versions of the strings, which can use O(k) additional space per comparison depending on the implementation.
Even if you optimize by sorting strings before comparing them, the approach is still wasteful if you don’t reuse results. You repeatedly rebuild the same sorted form (the anagram signature) for the same strings across multiple comparisons, thereby repeating the most expensive part of the work.
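The pairwise strategy can be sketched as follows; this is a hedged sketch (the function name `group_anagrams_naive` is my own), in which each new string is compared against one representative of every existing group:

```python
from collections import Counter

def group_anagrams_naive(strs):
    groups = []  # each group is a list of mutually anagrammatic strings
    for s in strs:
        for group in groups:
            # Compare s against one representative of the group;
            # Counter equality means the same multiset of letters.
            if Counter(s) == Counter(group[0]):
                group.append(s)
                break
        else:
            # No existing group matched, so s starts a new one.
            groups.append([s])
    return groups
```

Note that `Counter(group[0])` is rebuilt on every comparison, which is exactly the repeated work the optimal solution eliminates.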
Optimal solution using Hash Maps
To overcome repeated comparisons, we need to redefine what it means to “match” two strings. Anagrams become straightforward to group once we convert each word into a canonical signature, a representation that’s guaranteed to be identical for all rearrangements of the same letters. Instead of repeatedly asking whether two strings are anagrams, we compute a single “identity” for each string that captures only its character makeup.
Once every string has a stable identity, grouping becomes a simple collection task. We compute the key for a string and drop it into a hash map bucket for that key. A clean key is the string with its characters sorted: "eat", "tea", and "ate" all normalize to "aet", so they automatically end up together under the "aet" entry without any pairwise checking.
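A minimal sketch of this sorted-string bucketing (the function name is illustrative) might look like:

```python
from collections import defaultdict

def group_anagrams_sorted_key(strs):
    groups = defaultdict(list)
    for s in strs:
        # Canonical signature: the string's characters in sorted order.
        key = "".join(sorted(s))
        groups[key].append(s)
    return list(groups.values())
```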
Step-by-step algorithm
This algorithm is implemented as follows:
- Create a dictionary `groups`.
- For each `word` in the input list `strs`:
    - Initialize a list `word_frequency` of size 26 with all `0`'s.
    - For each `character` in `word`:
        - Turn `character` into a number from 0 to 25 using `ord(character) - ord('a')`, where `ord()` in Python returns the ASCII value of a character (e.g., 97 for `'a'`, 98 for `'b'`). This subtraction gives us alphabetical offsets: `'a'` maps to 0, `'b'` to 1, and so on. Use that number as an index into `word_frequency` and increment it by `1` to count that letter's occurrence.
    - Initialize a tuple `key` from `word_frequency`, so it can be used as a dictionary key.
    - Append the string `word` to `groups[key]`.
- Once all words are processed, return the grouped anagrams as a list.
To better understand the solution, let’s walk through the algorithm using the illustration below:
Executable Python implementation
The code below implements the algorithm described above.
Code for the group anagrams problem
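A self-contained Python version of the frequency-count algorithm, reconstructed here as a sketch of the steps listed above:

```python
from collections import defaultdict

def group_anagrams(strs):
    groups = defaultdict(list)
    for word in strs:
        # Count occurrences of each lowercase letter.
        word_frequency = [0] * 26
        for character in word:
            word_frequency[ord(character) - ord('a')] += 1
        # Tuples are hashable, so the frequency profile can key the dict.
        key = tuple(word_frequency)
        groups[key].append(word)
    return list(groups.values())

print(group_anagrams(["arc", "car", "elbow", "below", "state", "taste", "cat"]))
# → [['arc', 'car'], ['elbow', 'below'], ['state', 'taste'], ['cat']]
```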
Time complexity
The time complexity of this solution is O(n × k), where n is the number of strings and k is the average length of a string. For each string, we count its characters in O(k) time and build a fixed-size 26-element key in O(1), and we do this once per string. (A variant that uses the sorted string as the key instead costs O(n × k log k), since sorting each string takes O(k log k).)
Space complexity
The space complexity of this solution is O(n × k), because in the worst case, we store all strings inside the hash map buckets and create keys whose total size can scale with the total input size.
Common pitfalls to avoid
- Sorting the entire list instead of sorting each string. You need a per-string identifier; sorting `strs` doesn't reveal anagram group membership.
- Using the sorted characters but forgetting to join them into a string key. Lists aren't hashable in Python, so you need a hashable key (a string or a tuple).
- Assuming output order matters. The problem allows any order, so don't waste time matching a specific ordering unless explicitly required.
- Not handling empty strings naturally. Avoid special-casing; this approach already handles them correctly.
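The hashability pitfall above can be demonstrated directly with a tiny illustrative snippet:

```python
groups = {}
key = [1, 0, 2]  # a list of letter counts

try:
    groups[key] = []  # lists are mutable, hence unhashable
except TypeError as error:
    print(error)  # TypeError: unhashable type: 'list'

groups[tuple(key)] = []  # tuples are immutable and work as dict keys
```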