Algorithms and Data Structures in Go: Strings

I’m generally interested in algorithms and data structures, and looking for a job, so I’m working through some questions in Cracking the Coding Interview (note: affiliate link). The code in the book is mostly Java, but I prefer Go, so I thought it would be fun to do a series of posts on algorithms and data structures in Go, since most content out there is in Python, Java, C++, and so on.

I also wanted to write down my thought process while working on these problems, at the risk of appearing unintelligent for not knowing the solutions immediately. I’m a software engineer with 8 years of industry experience, so perhaps these posts can serve as a reminder that most people, including professionals, need to brush up on CS before interviewing. If you catch yourself wondering, “why is he doing that? That’s wrong!” you’re probably right, and maybe this post isn’t for you.

I’m starting with a string problem: “implement an algorithm to check if a string has all unique characters.”

My first instinct is to loop over the string, use a map to increment a counter for each letter, then check the map to see if any count is > 1. If yes, we’ll stop and return false, but if every count is <= 1 we’ll return true. So let’s do that:

func HasOnlyUniqueChars(s string) bool {
	m := map[rune]int{}
	for _, r := range s {
		m[r] += 1
	}

	for _, v := range m {
		if v > 1 {
			return false
		}
	}

	return true
}

I will mention that in Go, when looping over a string, we get runes. I won’t go into detail about that here since it’s not entirely relevant to the algorithm itself, but take a look at that link if you’re interested.

So let’s write a test for this code:

var cases = []struct {
    in   string
    want bool
}{
    {"foo", false},
    {"abcdefghijklmnopqrstuvwxyz", true},
    {"", true},
}

func TestHasOnlyUniqueChars(t *testing.T) {
    for _, tt := range cases {
        if got := HasOnlyUniqueChars(tt.in); got != tt.want {
            t.Errorf("HasOnlyUniqueChars(%q) = %t, want %t", tt.in, got, tt.want)           
        }
    }       
}

and run the test:

➜  1 go test -cover
PASS
coverage: 100.0% of statements
ok  	github.com/shawnps/ctci/ch1/1	0.363s

It seems to be working. Let’s write a benchmark as well:

func BenchmarkHasOnlyUniqueChars(b *testing.B) {
    for i := 0; i < b.N; i++ {
        HasOnlyUniqueChars("abcdefghijklmnopqrstuvwxyz")
    }   
}

and run that:

➜  1 go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4   	  500000	      2086 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	1.451s

It seems fast – I imagine it runs in O(n) time since we’re checking each character in the string.

The book then asks, “what if you cannot use additional data structures?”

My first reaction to how to optimize it is to use a size 26 array of ints instead of the map, but that would be an additional data structure, so that’s wrong.

Then I thought, what if I sort the string first, then iterate over it and check whether the subsequent character is equal to the the character we’re currently checking. If it is, then we can return false.

I also realized at this point that any string greater than 26 characters long should return false, assuming we are only using lowercase English letters. I think it’s probably fair to ask an interviewer to clarify this before solving the problem. It’s too bad I didn’t think of it sooner :) There goes my chance at working for Google.

Let’s test my theory about sorting the string. I’m going to give myself the benefit of the doubt that converting a string to a slice of runes doesn’t count as using an additional data structure, but I’m not sure if an interviewer would be lenient about it. Here’s what I came up with:

func HasOnlyUniqueCharsSorted(s string) bool {
    if s == "" {
        return true
    }   

    if len(s) > 26 {
        return false
    }   

    rs := []rune(s)
    sort.Slice(rs, func(i, j int) bool { return r[i] < r[j] })

    for i := range rs {
        if i == len(rs) {
            break
        }   

        if rs[i] == rs[i+1] {
            return false
        }   
    }   

    return true
}

I copied the previous test and called the new method instead:

func TestHasOnlyUniqueCharsSorted(t *testing.T) {
    for _, tt := range cases {
        if got := HasOnlyUniqueCharsSorted(tt.in); got != tt.want {
            t.Errorf("HasOnlyUniqueCharsSorted(%q) = %t, want %t", tt.in, got, tt.want)
        }   
    }   
}

but when I ran it, I got an error:

➜  1 go test
# github.com/shawnps/ctci/ch1/1 [github.com/shawnps/ctci/ch1/1.test]
./1.go:30:46: undefined: r
FAIL	github.com/shawnps/ctci/ch1/1 [build failed]

My sort.Slice call was wrong. I should have been more careful about that. It should look like this:

sort.Slice(rs, func(i, j int) bool { return rs[i] < rs[j] })

Now running the test again I get a panic:

➜  1 go test
--- FAIL: TestHasOnlyUniqueCharsSorted (0.00s)
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 34 [running]:
testing.tRunner.func1(0xc0000b0200)
	/usr/local/go/src/testing/testing.go:830 +0x392
panic(0x1114040, 0x122c0d0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/shawnps/ctci/ch1/1.HasOnlyUniqueCharsSorted(0x113de4f, 0x1a, 0x1058900)
	/Users/shawn/mygo/src/github.com/shawnps/ctci/ch1/1/1.go:37 +0x15b
github.com/shawnps/ctci/ch1/1.TestHasOnlyUniqueCharsSorted(0xc0000b0200)
	/Users/shawn/mygo/src/github.com/shawnps/ctci/ch1/1/1_test.go:24 +0x8b
testing.tRunner(0xc0000b0200, 0x1142b60)
	/usr/local/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:916 +0x35a
exit status 2
FAIL	github.com/shawnps/ctci/ch1/1	0.288s

Line 37 is this check: if rs[i] == rs[i+1].

So I guess my if i == len(rs) is wrong. That makes sense because the index starts at 0 but the length of a single element slice would be 1, so we have to either add 1 to i when checking or subtract 1 from len(rs). I added 1 to i and the test finally passed. Now let’s see a benchmark for this function too:

func BenchmarkHasOnlyUniqueCharsSorted(b *testing.B) {
    for i := 0; i < b.N; i++ {
        HasOnlyUniqueCharsSorted("abcdefghijklmnopqrstuvwxyz")
    }
}

and run the benchmarks:

➜  1 go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4         	 1000000	      1984 ns/op
BenchmarkHasOnlyUniqueCharsSorted-4   	 3000000	       500 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	5.233s

It seems to be about 3x faster. This is surprising considering we’re sorting the string first. The BenchmarkHasOnlyUniqueCharsSorted should be slower.

Glancing at Go’s sort package source, it seems sort.Slice uses quicksort, which has a worst case time complexity of O(n2) and average case of O(n log n).

At this point I couldn’t think of any other ways to optimize the space usage of this solution, and I was unclear why the map solution was slower than the sorted solution, so I looked at the hints.

The first hint suggests using a hash table (for the first iteration of the algorithm), so I have that covered. The second hint suggests that a bit vector might be useful. I didn’t think of this, but a bit vector would be a separate data structure so perhaps this hint applies to the first part of the question? We’ll find out soon after looking at the answer at the end of the post. Finally the last hint asks whether it’s possible to do this in O(n log n) time. That looks awfully familiar, but let’s find out whether I was right in the end.

The Answer

Without giving away too much of what’s in the book (it’s worth buying!), I was right about checking the length of the string. The solution in the book assumes a 128-character (ASCII) alphabet.

I was also right about the bit vector hint: that hint applied to the first part of the question where extra data structures are allowed. Essentially it is used to reduce the space usage of the algorithm.

Finally, it seems I was correct in sorting the string for the second iteration, however I think my solution doesn’t count since I’m creating a separate slice of runes in order to sort the string. I’m curious to see if there’s a way to sort a string in-place in Go, but I’ll save that for another time.

Conclusion

I think I did okay on this problem, but not perfect. I didn’t check the length of the string in my first attempt. I didn’t think to use a bit vector to reduce the space. My solution which sorts the string uses extra space since I didn’t know how to sort a string in Go without instantiating a rune slice.

It seems my weakest point was not knowing to use a bit vector. I can’t remember a time I’ve used a bit vector at work, so that might be why it never pops into my head while attempting these interview questions. I’ll have to start asking myself whether a bit vector could be useful in the future.

I also realized that in my first solution, I could have done everything in a single loop instead of two separate loops. Something like this:

func HasOnlyUniqueChars(s string) bool {
    m := map[rune]int{} 
    for _, r := range s {
        m[r] += 1 
        if m[r] > 1 {
            return false
        }
    }   

    return true
}

Indeed changing the code in this way does seem to speed up its benchmark (for “abcdefghijklmnopqrstuvwxyz”):

➜  1 go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4         	 1000000	      1595 ns/op
BenchmarkHasOnlyUniqueCharsSorted-4   	 3000000	       463 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	3.753s

But still not enough to be as fast as the sorted implementation. If we remove the > 26 check from BenchmarkHasOnlyUniqueCharsSorted and run both with the string “abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz”, we get:

➜  1 go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4         	 1000000	      1627 ns/op
BenchmarkHasOnlyUniqueCharsSorted-4   	 1000000	      1342 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	3.055s

For an even longer string (“abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz”) we get:

➜  1 go test -bench=.                                                 
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4         	 1000000	      1680 ns/op
BenchmarkHasOnlyUniqueCharsSorted-4   	  200000	      6917 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	3.444s

Note that this is still without the > 26 length check in HasOnlyUniqueCharsSorted.

This result makes sense, because the moment HasOnlyUniqueChars sees a character with a count > 1 in the map, it will return, so it returns after it sees the 27th character (the second a).

Out of curiosity I reverted the HasOnlyUniqueChars to the original implementation that loops over the string and puts every character in a map, then loops over the map. These were the results:

➜  1 go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/shawnps/ctci/ch1/1
BenchmarkHasOnlyUniqueChars-4         	  300000	      4778 ns/op
BenchmarkHasOnlyUniqueCharsSorted-4   	  200000	      7099 ns/op
PASS
ok  	github.com/shawnps/ctci/ch1/1	3.022s

References