Grep, the best metadata tool ever

January 25, 2009

Ah, UNIX. So handy, so terrifying.

There’s something about a command line that strikes fear into the heart of everyone who grew up in Windows. That black box, that white writing, that awful font… it’s like standing outside an abandoned factory with a ‘WARNING: DANGER OF DEATH’ sign hanging askew from the locked and rusted front gates.

And while I appreciate the efforts of the good people who take the time to write lovely tutorials, most of them are too damn passionate to make sense to bewildered newbies like myself. (I do recommend, however, this one if you’re interested/bewildered: http://www.nceas.ucsb.edu/scicomp/Dloads/UnixProg/UnixTutor/index.html)

Getting started in UNIX requires a little brain-breaking, a slight tweaking and rejigging of what you think you know about computers. I’m not going to do that today. So, here is a mechanics-only guide to Doing One Handy Thing. Just one – nothing fancy, just handy. Specifically, extracting information you want.

Don’t try this at home. You’ll only delete your wedding photos. Also – sys admins know all about this and will usually help in exchange for food.

Using grep

Here’s what we want to do: We want to go through a file of book metadata and find all the things that were published in the 1990s.

Step 1: For goodness’ sake, if you’re doing this for real, make a backup/test directory to do this in. NEVER MESS WITH YOUR ORIGINAL FILES. Seriously.
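If you’d like to make that test directory from the command line, here is a minimal sketch. The folder names originals and grep_test are made up for illustration, and the first two commands only fake an original file so the example runs anywhere.

```shell
# 'originals' and 'grep_test' are made-up folder names for this sketch.
# These two lines fake an original file so the example can run anywhere.
mkdir -p originals
printf 'Title\tAuthor\tYearPublished\n' > originals/metadata.txt

# Make a separate test folder and copy (not move) the file into it.
mkdir -p grep_test
cp originals/metadata.txt grep_test/

# List the test folder to check the copy arrived.
ls grep_test
```

Because cp copies the file and doesn’t move it, the original stays where it was, and anything that goes wrong only happens to the copy.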

Step 2: Get your files together. Unix works in folders, so we’re going to put the files we want to use in the same folder. There are two files we need:

The metadata file (metadata.txt in this example) 

The file that has in it the terms you want to pull out (selections.txt in this example).

The metadata file

Here’s a sample to get you started. It’s not fancy, but if you copy it and save it as .txt, it will work. (Note: it’s tab separated, which is the best way unless you’re messing with XML and things like that.)

Title Author YearPublished

The Arrival Sean Tan 2006

Introducing Baudrillard Chris Horrocks 1999

Moab is my Washpot Stephen Fry 1997

The Unbearable Lightness of Being in Aberystwyth 2005
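Copying from a web page can quietly turn tabs into spaces, so here is one way to build the same sample file on the Unix side with real tabs. This is a sketch, not the only way: it assumes the file name metadata.txt from this example, and it assumes the last record simply has an empty author field, since the sample doesn’t list one.

```shell
# Write the sample records with a real tab character (\t) between fields.
# Assumption: the last record has no author, so that field is left empty.
printf 'Title\tAuthor\tYearPublished\n' > metadata.txt
printf 'The Arrival\tSean Tan\t2006\n' >> metadata.txt
printf 'Introducing Baudrillard\tChris Horrocks\t1999\n' >> metadata.txt
printf 'Moab is my Washpot\tStephen Fry\t1997\n' >> metadata.txt
printf 'The Unbearable Lightness of Being in Aberystwyth\t\t2005\n' >> metadata.txt

# Print the file to check it.
cat metadata.txt
```

The single > on the first line creates the file (or empties an existing one), and >> on the later lines adds to the end of it.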

The selection file

Unix works line by line, so the selection terms should be one per line. We want to find records from the 1990s, and we know how that info is expressed in metadata.txt, so, the content of the file looks like this:

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999
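If typing ten years by hand feels like a chore, the list can be generated. A small sketch, assuming your Unix box has the seq program (most Linux and Mac machines do, but it isn’t guaranteed everywhere; if it’s missing, typing the list by hand works just as well):

```shell
# Write the numbers 1990 to 1999, one per line, into selections.txt.
seq 1990 1999 > selections.txt

# Print the file to check it.
cat selections.txt
```

One thing to watch for: grep typically treats a blank line in the selection file as a term that matches everything, so a stray empty line in selections.txt can return every record in the metadata file.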

Step 3: Log on to the Unix box (‘box’ meaning ‘computer’ here). Ask your sys admin how.

Step 4: Change directories (directory = folder). In Windows, this is the bit where you double-click the folder and go into a sub-folder. In Unix, it’s all text-based, so we use ‘cd’. ‘c’hange ‘d’irectory, see?

In Unix, most symbols mean something. Spaces are crucial: they tell the machine where commands and filenames start and stop (so don’t put spaces in your filenames or directories or this won’t work; I’ve written [spacebar] where the spaces go in these instructions to make it clear).

Having logged on, we are faced with a prompt, which is usually a percent sign (%). (The % at the start of the commands below is that prompt, so don’t type it.) We want to get to our directory, so we type:

%cd[spacebar]directory_I_want/subdirectory_I_want/directory_where_my_files_are

(If you get lost, you can go backwards by typing %cd[spacebar].. You can also find out where you are by typing in %pwd. Extra handy: if you start typing the directory and press ‘tab’, it will autocomplete. If it doesn’t, you’re probably in the wrong place.)

(It might also help to follow your location in a normal window, so you can see where you are and what’s going on in a way you’re used to. That’s what I do.)
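Here is how that moving around looks when typed for real, with ordinary spaces and no prompt. The path projects/books/grep_test is made up for this sketch, and the first command only creates it so the example can run.

```shell
# 'projects/books/grep_test' is a made-up path for this sketch.
mkdir -p projects/books/grep_test   # create the folders so the example runs

cd projects/books/grep_test   # go down into the folder
pwd                           # print where you are now
cd ..                         # go back up one level, into projects/books
pwd                           # confirm the new location
cd ../..                      # go up two more levels, back to where we started
```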

Step 5: OK, technically getting complicated but mechanically pretty simple. Type this:

%grep[spacebar]-f[spacebar]selections.txt[spacebar]metadata.txt>selected_metadata.txt

‘grep’ is the command (actually, it’s called a program in this environment) that means ‘extract’.

‘-f’ means ‘extract the stuff listed in the file I’m about to tell you about’

‘selections.txt’ is the file that lists the stuff you want extracted

‘metadata.txt’ is the file that you want to extract the stuff from

‘>’ means ‘when you’ve got an answer, send it to the file I’m about to tell you about’

‘selected_metadata.txt’ means ‘send the answers to this file please’

So, when you go and open up selected_metadata.txt, you find this:

Introducing Baudrillard Chris Horrocks 1999

Moab is my Washpot Stephen Fry 1997

Ta da!
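For anyone who wants to try the whole thing in one go, here is a sketch that rebuilds the two sample files and runs the command as you would actually type it, with real spaces and without the prompt. It assumes the seq program is available for building the selection file, and that the last sample record has an empty author field.

```shell
# Rebuild the sample metadata file (tabs between fields).
printf 'Title\tAuthor\tYearPublished\n' > metadata.txt
printf 'The Arrival\tSean Tan\t2006\n' >> metadata.txt
printf 'Introducing Baudrillard\tChris Horrocks\t1999\n' >> metadata.txt
printf 'Moab is my Washpot\tStephen Fry\t1997\n' >> metadata.txt
printf 'The Unbearable Lightness of Being in Aberystwyth\t\t2005\n' >> metadata.txt

# Rebuild the selection file: 1990 to 1999, one year per line.
seq 1990 1999 > selections.txt

# Extract every line of metadata.txt that contains any term from
# selections.txt, and send the result to selected_metadata.txt.
grep -f selections.txt metadata.txt > selected_metadata.txt

# Show what ended up in the output file.
cat selected_metadata.txt
```

Note that grep looks for the terms anywhere on the line, not just in the year column, so a title containing ‘1995’ would be picked up too.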

That’s a basic example, but it has some huge applications.

Here are some final points:

1. Notice we lost the header line, which had Title/Author/YearPublished in it. That’s because it didn’t have a 1990s number in it.

2. No, you don’t have to use the -f option. It’s just the handiest way. If you only wanted one term, 1997 for example, you could just use:

%grep[spacebar]1997[spacebar]metadata.txt>selected_metadata.txt

and that would output the Stephen Fry record.

3. No, you don’t have to output to a text file – if you want the results to display on screen, you can use this:

%grep[spacebar]-f[spacebar]selections.txt[spacebar]metadata.txt

Bear in mind, though, that you can’t do much with the output later. Reading it on screen is about all you can do with it. 
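On the lost header from point 1: if you want it back, one possible approach is to copy the first line of the metadata file into the output before adding the matches. This is a sketch rather than the only way, and it assumes the header is always the first line of metadata.txt and that seq is available to build the selection file.

```shell
# Rebuild the sample files so the example runs on its own.
printf 'Title\tAuthor\tYearPublished\n' > metadata.txt
printf 'The Arrival\tSean Tan\t2006\n' >> metadata.txt
printf 'Introducing Baudrillard\tChris Horrocks\t1999\n' >> metadata.txt
printf 'Moab is my Washpot\tStephen Fry\t1997\n' >> metadata.txt
printf 'The Unbearable Lightness of Being in Aberystwyth\t\t2005\n' >> metadata.txt
seq 1990 1999 > selections.txt

# Copy the first line (the header) into a fresh output file.
head -n 1 metadata.txt > selected_metadata.txt

# Append the matching records after the header.
# >> adds to the end of the file; a single > would wipe the header out.
grep -f selections.txt metadata.txt >> selected_metadata.txt

# Show the result: header first, then the matches.
cat selected_metadata.txt
```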

Does it get more complex than this? You betcha. You can search and replace inside files, all sorts of things. But at the starters level, this is still a pretty powerful way to start messing with your data.
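As a small taste of ‘more complex’, here is a sketch that isn’t part of the steps above: a single pattern can stand in for the whole selection file. It assumes the year is the last thing on each line and that the file has Unix line endings (a file saved on Windows may carry an invisible extra character at the end of each line that stops this from matching).

```shell
# Rebuild the sample metadata file so the example runs on its own.
printf 'Title\tAuthor\tYearPublished\n' > metadata.txt
printf 'The Arrival\tSean Tan\t2006\n' >> metadata.txt
printf 'Introducing Baudrillard\tChris Horrocks\t1999\n' >> metadata.txt
printf 'Moab is my Washpot\tStephen Fry\t1997\n' >> metadata.txt
printf 'The Unbearable Lightness of Being in Aberystwyth\t\t2005\n' >> metadata.txt

# 199[0-9] means '199 followed by any one digit'.
# The $ pins the match to the end of the line, where the year lives.
# The single quotes stop the shell from interpreting the symbols itself.
grep '199[0-9]$' metadata.txt > selected_metadata.txt

# Show the matching records.
cat selected_metadata.txt
```

Because the pattern is tied to the end of the line, a title that happens to contain a 1990s number won’t be picked up by mistake.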
