You will often come across problems in life that require you to sort stuff. Sometimes its your unwieldy comic book collection. Other times its arrays haphazardly filled with data. Fortunately Python wins out on real-life once again and helps us sort things really easily.
A toy problem to illustrate our example: write a function that returns the N longest lines from a given input file.
We will need to know how to sort lists in order to solve this problem. Your clue should have been the word, “longest,” in the problem description. In order to determine which is the longest line, we have to know how long all the other lines in the file are. Since our solution must return as many of the longest lines as we want, the simplest way to do so is to sort the lines into order from longest to shortest (or vice-versa) and return as many lines from either end as needed.
Now, first things first: go read about the various sorting algorithms on Wikipedia. Then go look for implementations of those algorithms in Python. Then store that information away somewhere that you can get at it when someone quizzes on you them later. It won’t be me, but guaranteed it will come up at some point.
Fortunately Python has a sort method available to list instances. If you’re curious, it’s an implementation of a supposedly complicated algorithm known as timsort. It takes a few keyword arguments, but we’re only really interested in one: key. You may be tempted to use cmp, but you will find that your sorts will slow down quite a bit. This was the old way of doing sorting and it’s no good anymore. The reason being that cmp will be run consecutively many times over pairs of values in the list. It’s gone in Python 3.x. One last thing of note on sort — it will mutate the list in-place which can have consequences that you should read about in the Python documentation later on (the alternative is to use sorted which returns a new iterable).
So enough talk; let’s see one possible solution:
# This example assumes Python >= 2.6
def n_longest_lines(infile, n):
"""Return `n` longest lines from `infile`"""
lines = []
with open(infile) as f:
for line in f:
lines.append((line, len(line)))
lines.sort(key=lambda x: x[1]) # sort by line-length
return lines[-n:]
Pretty handy, eh? The key trick here is the data-structure we sort on. Take a look at the lines in the with block. Can you figure out what it looks like?
lines = [("The quick brown", 15), ("fox", 3)]
It’s a list of tuples. The first field of the tuple is a string read as a single line from the file and the second field is the string’s length. This structure then allows us to use the sort method to sort our list. All we have to tell sort is which field to look at.
The sort method takes a parameter, key, which accepts a function that will be called once on every value in the list. In this example I’ve passed it an anonymous function (hint: lambda) which simply returns the second element of the tuple. It’s job is to return a value that the sort method can work with.
Side note: You could improve the readability of this line by looking at the itemgetter, attrgetter, and methodgetter methods from the operator module. Read up on them. Handy stuff.
Finally, to round out our function we use Python’s array-splicing awesomeness to return the requested number of longest lines. If you don’t know how to splice arrays in Python, be sure to look it up. Here we specify a negative number in the from field in order to return the n number of elements from the end of the array.
That’s all there is to it! Sorting in Python is pretty easy and as of Python 2.3, guaranteed to be stable. There are only a few gotchas to remember: sort sorts the list in-place (meaning that it changes the original list and you won’t be able to get access to the original list) while sorted returns a new iterable (leaving the original list intact); try to ignore cmp as it’s not used anymore. Also, don’t think you can completely forget all your sorting algorithms! Try writing a few implementations on your own, test them out with large samples of test data, time them, and see if you can make them faster (or compare the different approaches). Sometimes Python makes things too easy and sorting is one of those things. It’s a double edged sword, but it’s a nice one to have!
If you want to know more about Python sorting, check out this mini how-to.