This problem trips up many candidates because it sits at the intersection of string manipulation and frequency counting, wrapped inside a sliding window that moves in fixed-size chunks rather than one character at a time. The uniform word length is the key structural insight that most people overlook on their first attempt. Once you recognize that every word occupies the same number of characters, the problem transforms from a combinatorial nightmare into a disciplined window-based scan. Mastering this pattern builds strong intuition for problems where fixed-size blocks and frequency matching intersect.
Problem statement
You are given a string s and an array of strings words, where every string in words has the same length.
A concatenated substring is formed by joining every word in words exactly once, in any order, with no extra characters between them. For instance, if words = ["ab", "cd", "ef"], then "abcdef", "cdabef", "efcdab", and all other permutations of these three words joined together are valid concatenated substrings. However, something like "acdbef" is not valid because it cannot be decomposed into the original words in any arrangement.
Your goal is to find every starting index in s where such a concatenated substring begins. Return the list of starting indices in any order.
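To make the definition concrete, here is a small standalone checker (illustrative only, not part of the final algorithm; the function name is my own) that tests whether a candidate string is a valid concatenated substring:

```python
from collections import Counter

def is_concatenation(candidate: str, words: list[str]) -> bool:
    """Check whether `candidate` splits into the strings of `words`,
    each used exactly once, in some order."""
    word_len = len(words[0])
    if len(candidate) != word_len * len(words):
        return False
    # Slice the candidate into word-sized chunks and compare frequencies.
    chunks = [candidate[i:i + word_len] for i in range(0, len(candidate), word_len)]
    return Counter(chunks) == Counter(words)

print(is_concatenation("cdabef", ["ab", "cd", "ef"]))  # True
print(is_concatenation("acdbef", ["ab", "cd", "ef"]))  # False
```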
Constraints
- 1 ≤ s.length ≤ 10^3
- 1 ≤ words.length ≤ 1000
- 1 ≤ words[i].length ≤ 30
- s and words[i] consist of lowercase English letters
Examples
| Input | Output | Explanation |
|---|---|---|
| s = "aabbccaabbcc", words = ["aa","bb","cc"] | [0, 2, 4, 6] | The length-6 substrings at indices 0, 2, 4, and 6 are "aabbcc", "bbccaa", "ccaabb", and "aabbcc", each a valid concatenation of "aa", "bb", and "cc". |
| s = "abcdefgh", words = ["xyz","abc"] | [] | The word "xyz" never appears anywhere in the string, so no valid concatenation of length 6 exists. |
| s = "aaaaaa", words = ["aa","aa","aa"] | [0] | The entire string "aaaaaa" starting at index 0 splits into "aa", "aa", "aa", matching the required word frequencies exactly. |
| s = "catdogcatdogcat", words = ["cat","dog"] | [0, 3, 6, 9] | Starting at indices 0, 3, 6, and 9, the length-6 substrings "catdog", "dogcat", "catdog", and "dogcat" are each valid concatenations of "cat" and "dog". |
| s = "abab", words = ["ab"] | [0, 2] | With a single word "ab", the substrings at index 0 and index 2 each match. |
| s = "xyzxyzxyzxyz", words = ["xyz","xyz"] | [0, 3, 6] | Substrings of length 6 at indices 0, 3, and 6 are all "xyzxyz", a valid concatenation of the two required copies of "xyz". |
Why enumerating all possibilities doesn’t scale
The most direct approach would be to generate every permutation of the words array, join each permutation into a single string, and then search for each of those strings inside s. If there are k words, there are k! permutations. Even with just 10 words, that is over 3.6 million permutations, each requiring a substring search across s. With up to 1000 words, this approach is astronomically expensive and completely impractical.
A slightly better but still inefficient approach checks every possible starting position in s, extracts a substring of the required total length, and then repeatedly slices it into word-sized chunks while comparing against the frequency map. This avoids generating permutations but still performs redundant work. Each new starting position rebuilds the frequency map from scratch, even though consecutive windows share nearly all their content. The overlapping computation makes this quadratic in practice when the string is long relative to the word size.
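A sketch of this slower baseline (the helper name is my own), which rebuilds the chunk frequency map from scratch at every starting position:

```python
from collections import Counter

def find_substring_naive(s: str, words: list[str]) -> list[int]:
    """Quadratic baseline: re-count the word-sized chunks at every start index."""
    word_len = len(words[0])
    total_len = word_len * len(words)
    target = Counter(words)
    result = []
    for start in range(len(s) - total_len + 1):
        window = s[start:start + total_len]
        # Rebuild and compare a full frequency map for every position.
        chunks = Counter(window[i:i + word_len] for i in range(0, total_len, word_len))
        if chunks == target:
            result.append(start)
    return result

print(find_substring_naive("catdogcatdogcat", ["cat", "dog"]))  # [0, 3, 6, 9]
```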
The core issue in both approaches is the failure to reuse information from one window position to the next. Any efficient solution must find a way to incrementally update the state as the window shifts.
Optimized approach
The uniform word length is what makes this problem tractable. Because every word occupies exactly word_len characters, the string can be thought of as a sequence of word-sized blocks. A sliding window of exactly num_words blocks can glide across these blocks, and at each step, only one block leaves the window and one block enters. This means the frequency map can be updated incrementally rather than rebuilt.
There is a subtlety: the blocks depend on where you start. If word_len is 3, starting at index 0 gives blocks at positions 0, 3, 6, 9, …, while starting at index 1 gives blocks at 1, 4, 7, 10, …. These are entirely different decompositions. So the algorithm runs word_len separate passes, one for each starting offset from 0 to word_len - 1, and within each pass, it maintains a single sliding window.
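A quick illustrative snippet (the sample string is my own) showing how each offset induces a different block decomposition, and why a match at index 1 is only visible in the offset-1 pass:

```python
s = "xcatdogy"
word_len = 3

# Each starting offset carves s into a different sequence of word-sized blocks.
for offset in range(word_len):
    blocks = [s[i:i + word_len] for i in range(offset, len(s) - word_len + 1, word_len)]
    print(offset, blocks)
# 0 ['xca', 'tdo']
# 1 ['cat', 'dog']   <- the concatenation "catdog" only appears in this pass
# 2 ['atd', 'ogy']
```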
During each pass, the window tracks a seen counter that records how many times each word appears in the current window. When a new word enters from the right:
- If the word is not in the target word list, the window is invalid and must be completely reset.
- If the word is valid but appears more times than required, the left boundary advances until the excess is eliminated.
- If the window reaches the exact total length needed, the left index is recorded as a valid answer.
This strategy processes each character of s a constant number of times per offset, yielding efficient performance.
Why the optimized approach works well
The sliding window avoids redundant computation by carrying forward the frequency state from one position to the next. Instead of re-examining all words in the window every time it shifts, only the entering and exiting words are processed. The word_len separate passes guarantee that every possible alignment is covered, and within each pass, the two pointers only move forward, preventing any backtracking or repeated scanning.
The frequency map comparison is also handled incrementally. Rather than comparing two full maps at every step, the algorithm maintains a running count and only checks whether the window has reached the target length. This keeps each window adjustment at O(1) amortized cost (excluding the word slicing itself), making the overall approach both time-efficient and straightforward to implement.
Step-by-step algorithm
- Compute word_len as the length of the first word, num_words as the total count of words, and total_len as word_len * num_words.
- Build a frequency map word_count from the words array, counting how many times each word should appear.
- Initialize an empty result list.
- Iterate over each starting offset i from 0 to word_len - 1.
- For each offset, set left = i, right = i, and create an empty counter seen.
- While right + word_len does not exceed the length of s, extract the word s[right:right + word_len] and advance right by word_len.
- If the extracted word exists in word_count, increment its count in seen. If its count in seen exceeds the allowed count in word_count, advance left by word_len repeatedly, decrementing the exiting word's count each time, until the overcount is resolved.
- If the extracted word does not exist in word_count, clear the seen counter and move left to right to fully reset the window.
- After updating the window, check if right - left equals total_len. If so, append left to the result list.
- After all offsets are processed, return the result list.
Python implementation
Code for Substring With Concatenation Of All Words
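One way to implement the sliding-window algorithm, following the step-by-step outline above and using Python's collections.Counter:

```python
from collections import Counter

def findSubstring(s: str, words: list[str]) -> list[int]:
    word_len = len(words[0])
    num_words = len(words)
    total_len = word_len * num_words
    word_count = Counter(words)
    result = []

    # One pass per block alignment.
    for offset in range(word_len):
        left = offset
        right = offset
        seen = Counter()
        while right + word_len <= len(s):
            word = s[right:right + word_len]
            right += word_len
            if word in word_count:
                seen[word] += 1
                # Shrink from the left until this word is no longer overcounted.
                while seen[word] > word_count[word]:
                    seen[s[left:left + word_len]] -= 1
                    left += word_len
                # A window of exactly total_len characters holds every word
                # the required number of times.
                if right - left == total_len:
                    result.append(left)
            else:
                # Invalid word: discard the window entirely.
                seen.clear()
                left = right
    return result

print(findSubstring("catdogcatdogcat", ["cat", "dog"]))  # [0, 3, 6, 9]
```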
Time complexity
The algorithm runs word_len separate passes over the string. Within each pass, both the left and right pointers traverse the string from their starting offset to the end, and each pointer only moves forward. Extracting a word-sized substring costs O(word_len) due to string slicing. Across all offsets, every character in s is visited as part of a word extraction a constant number of times. This gives an overall time complexity of O(n * m), where n is the length of s and m is the length of each word. The inner while loop that shrinks the window does not add extra cost because each word can only be added and removed from the window once per pass.
Space complexity
The algorithm uses two Counter objects: word_count, which stores up to k unique words from the input, and seen, which stores at most k unique words at any point during a pass. Here k is the number of distinct words in the words array. The result list stores at most O(n / m) indices. No recursion is used, so there is no stack overhead. The overall space complexity is O(k), where k is the number of unique words.
Edge cases
- Single word in the array: The window only needs to match one word at a time, so the algorithm reduces to finding all occurrences of that single word in s.
- All words are identical: The frequency map has one entry with a count greater than one. The sliding window correctly tracks how many times the repeated word appears and ensures it matches the required count exactly.
- Total concatenation length exceeds the string length: When word_len * num_words > len(s), no valid substring can exist. The while loop condition right + word_len <= len(s) bounds the window, so it can never grow to total_len, no index is ever recorded, and the result is an empty list.
- No words from the array appear in the string: Every extracted word fails the word in word_count check, causing the window to reset continuously. The result is an empty list.
Common pitfalls
- Forgetting to iterate over all starting offsets: If the algorithm only starts from offset 0, it misses valid concatenations that begin at positions not aligned to the first block boundary. For example, with word_len = 3 and a valid match starting at index 1, only the offset-1 pass would catch it.
- Resetting the entire window when a word is merely overcounted: When a valid word appears one too many times, the correct action is to shrink the window from the left until the count is valid again, not to clear everything. Clearing the window discards valid partial matches unnecessarily.
- Comparing full frequency maps at every step: Building and comparing two complete maps each time the window shifts adds unnecessary overhead. The incremental approach of maintaining seen and checking window size is both simpler and faster.
- Not handling duplicate words in the array: If words contains the same string multiple times, the algorithm must track exact counts rather than just presence. Using a set instead of a counter would incorrectly accept windows where a duplicate word appears only once.
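A small demonstration of the last pitfall (the sample lists are my own): a set-based presence check cannot distinguish two multisets of words that differ only in multiplicity, while a Counter comparison can.

```python
from collections import Counter

words = ["ab", "ab", "cd"]          # "ab" must appear exactly twice
window_chunks = ["ab", "cd", "cd"]  # a window where "cd" is doubled instead

# A set-based presence check wrongly accepts the window:
print(set(window_chunks) == set(words))          # True
# A Counter comparison tracks exact multiplicities and rejects it:
print(Counter(window_chunks) == Counter(words))  # False
```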