Emacs sorting with sort-regexp-fields

Table of Contents

img

Emacs is really an endless world of commands, options, and use-cases, especially for dealing with text. Sorting text is a really common thing and often plain-old M-x sort-lines does the trick in one fell swoop. But what if you want something more sophisticated? It turns out my eyes have been opened and emacs has led me to think about sorting things in ways I never even thought about before.

Use case: Sorting sql dumps from two different tables with a single common field

There I was: on an old app I had an “instutions” table and a “schools” table from my database. The question was, how much did they overlap? The tables have different structures, but in each of them, the second field was the name of the place. So I wanted to sort them all into the same list, based on names so I could see if one of the tables was derived from the other with lots of duplicates. Time to gird up my loins and try out a command I’ve always known about but have never worked out the gumption to learn: sort-regexp-fields.

The data look like this (just picture a lot more of them):

,('15492ae9-2f1b-4a2b-905f-b90b7a727e32', 'Bellarmine University', NULL, NULL, '2020-11-12 15:40:28.08977')
,('e392f38f-79b4-4aad-8af8-1e90c9e766e4', 'Alfred University', NULL, NULL, '2020-11-12 15:40:28.08977')

Solution

When I first read describe-function sort-regexp-fields, I didn’t quite grok all this:

sort-regexp-fields is an interactive compiled Lisp function.

(sort-regexp-fields REVERSE RECORD-REGEXP KEY-REGEXP BEG END)

Sort the text in the region lexicographically. If called interactively, prompt for two regular expressions, RECORD-REGEXP and KEY-REGEXP.

RECORD-REGEXP specifies the textual units to be sorted. For example, to sort lines, RECORD-REGEXP would be “^.*$”.

KEY-REGEXP specifies the part of each record (i.e. each match for RECORD-REGEXP) to be used for sorting. If it is “\\digit”, use the digit’th “\(…\)” match field specified by RECORD-REGEXP. If it is “\\&", use the whole record. Otherwise, KEY-REGEXP should be a regular expression with which to search within the record. If a match for KEY-REGEXP is not found within a record, that record is ignored.

With a negative prefix arg, sort in reverse order.

The variable ‘sort-fold-case’ determines whether alphabetic case affects the sort order.

For example: to sort lines in the region by the first word on each line starting with the letter “f”, RECORD-REGEXP would be “^.*$” and KEY would be “\\<f\\w*\\>”

In particular, the double-escaped characters I wasn’t sure about: were they double escaped for the sake of documentation, or do I need to actually write them out that way? Further, what does all that mean? Only one way to find out: use emacs as the giant REPL it is!

First, choose which part of the lines you want to sort. I want to sort the entire line, but I also need to mark which part of the line I am going to sort based upon.

Regexp specifying records to sort: ^.*?, '\(.*?\)',.*$

Last, choose which part of that regexp I’m using for the sort.

Regexp specifying key within record: \1

That’s it – no double escape; just like a regexp-find-replace (you do use that almost daily like I do, right?) you just tell it which group from the step-1 regexp you want.

Result

Like this, but over a lot of rows (all the ones you’ve selected, or everything below cursor):

,('e392f38f-79b4-4aad-8af8-1e90c9e766e4', 'Alfred University', NULL, NULL, '2020-11-12 15:40:28.08977')
,('15492ae9-2f1b-4a2b-905f-b90b7a727e32', 'Bellarmine University', NULL, NULL, '2020-11-12 15:40:28.08977')

Summary

Gone are the days of only thinking I can sort alphabetically or numerically-initially. It’s not that hard to use regexp-find-replace and with it you can sort some or all of some or all lines by specific portions of that line! Super powers unleashed!

Tory Anderson avatar
Tory Anderson
Full-time Web App Engineer, Digital Humanist, Researcher, Computer Psychologist