Thursday, July 26, 2012

The Vim of Happiness

Joshu Edits
A vimmer told Joshu: `I have just entered the channel. Please teach me.'
Joshu asked: `Have you completed vimtutor?'
The vimmer replied: `I have completed it.'
Joshu said: `Then you had better start editing.'
At that moment the vimmer was enlightened.
The Vim of Happiness T-Shirt Collection





Tuesday, July 3, 2012

On Skinning Cats and Eating Elephants


All characters appearing in this work are fictitious. Even mine. And certainly rovaednez. I mean, who names their son rovaednez?! How do you even pronounce rovaednez? So, yeah, completely fictitious. Any resemblance to real persons, living or dead, is purely coincidental. Totally.

So, check it:

There we were, chilling in the name of #vim, keeping it proper and prim, filling it up to the brim, chattin' it epic like the brothers Grimm. Then who the fuck should storm in? rovaednez, dressed like a pimp! Gold chains and a fist full of bling, throwin' down smack like a curdled fling.
"Yo, bitchies, I can haz moniez for data fixies! All i gots to do iz kick these glitchies. Then pappy's gonna spend his hard earned richies!  Any you fools can help me stitch it?" 
Flashing a cheerful grin at our brash and fearless kin I blithely bid of him: "Fetch a python to fight this battle, cast a perl among the rabble or send in ruby with her paddle. There's many ways to filch your chattel." 
Alas, but all he did was stand and blink with shoulders slumped and mettle kinked; he mumbled feebly, "In all of these, I stink." "Know you sed and awk?" I cried. He puled, "These too I have not eyed." 
The crowd gasped as one, each looking at none, hoping our little one was merely poking fun. The thunderous peal of silence shook us from our trance. It was quite without a shred of piety with which I stammered my next enquiry: "What pray you wield when beasts you need to slash?  What craft employed to earn your clients' cash?" 
Expecting this to rile the squawk, I gave a nod for him to talk. I feared I might have crushed his soul but it seemed to only steel his resolve. 
His guile aroused, he sat erect and puffed his cheeks and pitched his chest. Holding his manner coy with hands of fleeting joy, he proceeded to destroy: "What do I use to mash my foes?  What is the tool I lash and throw? How do I thrash my enemies so? I'll tell you, bro! Here's the flash: When inserts I need to hash and indents against the rocks I dash, when code I gnash and jobs I stash, there is for me but one lass: my fair maiden, lady bash!"
Eyes widened; noses bled. Osse wept and innocents fled. If need you munge data dread, use you not your tail and head and myriad filters all unwed. Steer you clear of the hell that is scripting in the shell.  Turn your hand and heart instead towards the light of awk and sed.

From this point on, we turn from the apocryphal tale of rovaednez and his blind devotion to the lady of the shell, to an illustrated walkthrough of a typical problem that is better handled by tools purpose-built for these sorts of tasks. In particular, we'll be looking at sed and awk.

The problem

Input

We have a file (let's call it the input_file) with records of the form:

  (...),(...),(...)

with NO newlines anywhere. None. Not even one at the very end.

Each (...) record looks like this:


  (012345,67890,'_im_a_file_record','yyyy/mm/0123456789abcdef0123456789abcdef.ext')

Output

The output_file needs to have each (...) record modified to include an absolute prefix in the trailing path:

  (012345,67890,'_im_a_file_record','/an/abs/path/yyyy/mm/0123456789abcdef0123456789abcdef.ext')

Processing

Ostensibly, the task here is simple and was summed up in the expected output statement above, repeated here for reference:

The output_file needs to have each (...) record modified to include an absolute prefix in the trailing path.

How would you achieve that? Take a few seconds (or minutes if you need) to consider how you would solve this problem.

Done? Let's take a look at how you went.

  • Did you write any code?
  • Did you build some sample data files to throw at your code?
  • Did you solve the problem?

If you did any and all of those things, jolly for you. They are all fairly reasonable things to do. Doing them at this stage of the game might be a tad premature, but there are worse moves you could have made.

Let's look at an alternative way to approach this problem: start by asking questions and establishing working knowns and assumptions.

Some things to ask:

  • what absolute path prefix needs to be prepended? Is it constant or does it depend on something, like the content of one or more fields in the record?
  • do other records have any bearing on how changes are to be made to the current record?
  • and... how many records are we talking about, anyway?

It turns out, that last question actually packs a bit more punch than it would seem at face value. More on that later.

It's crucial to understand your environment before you start moving around. If you were parachuted into a dark LZ in enemy territory, you'd want to take a few good looks around before you waddled off and stumbled into a booby trap or hostile patrols. Same thing when you're coding. If you run off half-cocked and start banging out gobs of 'solution code' before you even understand the problem, you may as well have stepped on a land mine and lost your left leg and both nuts.  We would do well to abide by the old carpenter's adage here: measure twice, cut once.

The Answers

  • It's constant. We'll use the one shown in the Output section: '/an/abs/path/'   (I decided to keep this easy because complicating this part was not the point of the exercise)
  • Thankfully, no. Seriously, if the answer to this is yes, and you're looking at anything more than simple associations or record numbers exceeding a-few-seconds-to-process-all-told then you should almost certainly drop sed & awk and reach for the Big Boy's Shelf and grab yourself a can of Perl, Python or Ruby and make sure you bundle the Database modules to boot. (Other serious language solutions are acceptable here too - not wanting to alienate our lesser spotted brethren out there... you C# and .Net freaks can just stay bloody well ostracised, though.)
  • This is the kicker (and the main focus of the solution space in this article). The input file is 1.2 Gigabytes in size. And remember, there's not a single newline in sight. Stop for a minute and think how your 'solution' above will cope with those numbers now.
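Answering that last question needn't be guesswork. Here's a minimal sketch of how you might size up such a file before writing any solution code. The count_records helper and the tiny sample are hypothetical, and a regex RS is a gawk/mawk extension rather than something POSIX awk promises:

```shell
# Hypothetical helper: count records by treating "),(" as the record separator.
# Assumes an awk that accepts a regex RS (gawk and mawk do).
count_records() {
  awk -v RS='\\),\\(' 'END { print NR }' "$1"
}

# Build a three-record sample with no newline anywhere.
sample=$(mktemp)
printf "%s" "(1,2,'a','y/m/f.ext'),(3,4,'b','y/m/g.ext'),(5,6,'c','y/m/h.ext')" > "$sample"

wc -c < "$sample"         # size in bytes
wc -l < "$sample"         # newline count; zero for this format
count_records "$sample"   # number of (...) records
rm -f "$sample"
```

A newline count of zero alongside a byte count in the billions is your cue that line-oriented tools are about to have a bad day.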

Skinning the Cat

Here are some possible ways to tackle the unencumbered (before we knew about the sheer size of the file) problem:

A Simple Sed

A simple sed that ignores any 'record' notion and simply assumes that all strings needing modification include, say, a slash and that no other field ever contains a slash. For small file sizes and if your client confirms that the records do indeed adhere to that assumption, then you're good to go. Using sed on a 1.2 GB file, though, may keep you at your desk a little longer, I'm afraid.
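For the curious, such a simple sed might look something like this. It's a sketch under exactly the assumptions stated above (only the path field ever contains a slash, and no field contains an embedded single quote); the prefix_slashed name is made up for illustration:

```shell
# Hypothetical helper: prefix every quoted string that contains a slash.
# Assumes no other field contains a slash and no field embeds a single quote.
prefix_slashed() {
  sed -e "s|'\([^'/]*/[^']*'\)|'/an/abs/path/\1|g"
}

printf "%s" "(012345,67890,'_im_a_file_record','yyyy/mm/0123456789abcdef0123456789abcdef.ext')" |
  prefix_slashed
```

Using | as the s command delimiter saves escaping every slash in the path.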

Testing Aside

It might seem unnecessary to some for me to point this out, but here it is for the greener horned among you: don't use your 1.2 GB file for all your tests. In fact, I tend to have several sized test files:

  • a tiny one with about half a dozen lines
  • a small one with a hundred odd
  • a medium one with a few thousand
  • and the big one

How many intermediate ones you have between the tiny one and the real one is up to you; they are used to speed-test your solution as well as to fail sanity tests quickly. If your solution fails to swallow a few thousand records, it ain't gonna eat the million rec bunny, and depending on your solution and where it fails, you might not know that as quickly on the huge file.

Other variations on test files (that I didn't need to cover here) cater for cases where the contents of the records themselves affect the operation of the algorithm.
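If you don't have small samples lying around, you can fake some. The following is a rough sketch, not the generator I used: make_sample is a hypothetical helper and the field values are invented, with only the shape of the records matching the ones shown earlier.

```shell
# Hypothetical generator: emit N records, comma-joined, with no newline anywhere.
# The field values are made up; only the record shape follows the article.
make_sample() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf("%s(%06d,%05d,\047_im_a_file_record\047,\0472012/07/%032x.ext\047)",
             i > 1 ? "," : "", i, i, i)
  }'
}

# Print a two-record sample to stdout.
make_sample 2
```

Redirect the output to a file for each size you want, for example make_sample 6 for the tiny one and make_sample 5000 for the medium one.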

Size Matters, Even to a Gnu

The GNU sed doco says it can handle any line size your box can malloc(). Even though my box could indeed accommodate that, my tests proved fruitless. The worst fail I saw sed give me was on this expression:

  sed -e 's/),(/),\n(/g'

which returned after only six seconds (when it should have taken at least three times as long as that) with an exit code of zero, and a matching output file size. o_O   Silent Fail.   Bad.

A Brief Excursion on Eating Elephants

Why was I trying the sed statement above? To answer that question, I first have to go back to the make-believe story about rovaednez above.  His bash solution had a nest of loops, each O-ier than the last, and on the innermost level of hell was a sed expression so evil that it made your eyes bleed and left you infertile for a week. Kids, instead of attempting to write one behemoth regex to rule them all, chunk the problem down into bite sized pieces. Just as you wouldn't dislocate your jaw squeezing an elephant's trunk down your gob, nor should you bust a neuron squeezing an oversized regex in your noggin. You'd slice both up and chew on the little pieces, swallowing them at a steady, comfortable and digestible pace.

There is no shame in cutting a bigger operation into smaller, almost trivial sub-steps. In fact, it's the smart thing to do. This takes practise to get good at, especially if you've prided yourself for years on writing strings of dense regexes to prove your manliness.

The sed expression:

  sed -e 's/),(/),\n(/g'

was aiming to introduce newlines so the next link in the chain could operate in a standard (Linux tool philosophy) and familiar way.  Although this didn't work in this particular case because sed choked on the elephant's trunk, this is still a good lesson to take away: chunk your problem space down into bite-sized pieces. Rethink the problem and ask yourself, would I look at this differently if the input data was shaped a little differently? And better, what would the input data look like to make this a trivial solve?
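On input small enough for sed to swallow, the chunked pipeline could be sketched as below. This is an illustration of the idea rather than what I ended up running; chunked_fix is a hypothetical name, the \n in the replacement is a GNU sed feature, and the second step assumes the path field contains neither a comma nor a single quote.

```shell
# Hypothetical three-step pipeline (fine for small inputs only):
#   1. split into one record per line (GNU sed expands \n in the replacement)
#   2. prefix the last quoted field on each line
#   3. delete the newlines again so the output matches the input format
chunked_fix() {
  sed -e 's/),(/),\n(/g' |
  sed -e "s|,'\([^',]*\)')|,'/an/abs/path/\1')|" |
  tr -d '\n'
}

printf "%s" "(1,2,'a','y/m/f.ext'),(3,4,'b','y/m/g.ext')" | chunked_fix
```

Each step is nearly trivial on its own, which is the whole point.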

So much for sed. Let's move on.

The Evolution of a Sysop's Tool

When a sysop needs a tool to solve a common task she does on a regular basis, it almost invariably follows an evolution similar to this:

  1. scribble something quickly in shell functions and aliases.
  2. turn them into a bona-fide shell script.
  3. re-write the whole thing in sed & awk.
  4. scrap that and do it all again in perl (or ruby/python/yourscriptinglanguageofchoice)

Well, according to that, we're up to the awk stage of evolution here, so in we jump.

Awk really is a handy little tool. It's centuries old now which you might think renders it useless but that is simply not the case. This little powerhouse belongs in the same group of timeless tools that our beloved vim hangs out in.

The Final Awk Code

  awk -v FS=',' -v RS='\\),\\(' -v ORS='),(' -v OFS=',' \
      -v q="'" -v abspath="'/an/abs/path/" \
      '{sub(q, abspath, $4); printf("%s%s", NR>1 ? ORS : "", $0)}' \
      input_file > output_file

A brief walkthrough

Awk uses some all-caps internal variables to control its runtime behaviour. The ones I'm using are:

  • FS = input field separator
  • OFS = output field separator
  • RS = record separator
  • ORS = output record separator
  • NR = current record number as we're moving through the input files. Some awk guides refer to this erroneously as the current line number, but that is only true when using RS="\n" (which is the default).

Records and fields are split according to these settings and provided to the awk inner loop (the '{sub(q, abspath, $4) ; printf("%s%s", NR>1 ? ORS : "", $0)}' part) in the form $1 - $4 (for each of the 4 fields) and $0 for the whole record. Of course, you get more $n vars for having more fields in your input records.
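To see the splitting for yourself, here's a small sketch that prints the record number, field count, first field and last field of each record. The show_fields name and the two-record sample are hypothetical, and it assumes an awk that accepts a regex RS (gawk and mawk do):

```shell
# Hypothetical helper: show how awk carves the stream into records and fields.
# Prints: NR, NF, first field, last field.
show_fields() {
  awk -v FS=',' -v RS='\\),\\(' '{ printf("%d %d %s %s\n", NR, NF, $1, $NF) }'
}

printf "%s" "(1,2,'a','y/m/f.ext'),(3,4,'b','y/m/g.ext')" | show_fields
# prints:
#   1 4 (1 'y/m/f.ext'
#   2 4 3 'y/m/g.ext')
```

Note that the record separator itself is consumed, so only the very first record keeps its leading ( and only the very last keeps its trailing ).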

I am using two variables:

  • q = '
  • abspath = '/an/abs/path/   (note the leading single quote; it puts back the one that sub() replaces)

Separating the single quote value of q out into a separate variable is a common awk idiom to avoid complex quote escaping needed to bury ' within " within ' within... yeah.

I don't feel I need to explain the input_file > output_file part. See man awk and shell redirection.

So that leaves the meat:

  '{sub(q, abspath, $4) ; printf("%s%s", NR>1 ? ORS : "", $0)}'

Read this as:

  • substitute the single-quote at the start of field 4 with the abspath.
  • if this is the first record in the file (NR == 1 and therefore *not* greater than 1) do not print the ORS, but do print the first (now modified by that substitute) record.
  • for all other records in the file, print the ORS and then the (now modified) record.

This printf() trick with prepending the ORS to all but the first record is one of awk's ways of altering the behaviour of the last record in the file. No, don't email me. Don't rush down to the comments area to tell me I flubbed this one. I meant what I wrote. Awk has a hard time knowing how many records your file has before it's read them all. There are other solutions, usually involving you telling awk how many records it has to process, which is, quite frankly, for the birds.  This approach 'appends' the ORS to every record in the file *except* the last one. Think about it.
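Here's the same awk wrapped in a function and fed a tiny two-record sample, so you can watch the separator land between records and nowhere else. The add_prefix name and the sample data are hypothetical; the awk program itself is the one shown above, and it assumes an awk with regex RS support (gawk or mawk):

```shell
# Hypothetical wrapper around the article's awk, reading stdin instead of a file.
add_prefix() {
  awk -v FS=',' -v RS='\\),\\(' -v ORS='),(' -v OFS=',' \
      -v q="'" -v abspath="'/an/abs/path/" \
      '{ sub(q, abspath, $4); printf("%s%s", NR>1 ? ORS : "", $0) }'
}

printf "%s" "(1,2,'a','y/m/f.ext'),(3,4,'b','y/m/g.ext')" | add_prefix
# prints (with no trailing separator and no trailing newline):
#   (1,2,'a','/an/abs/path/y/m/f.ext'),(3,4,'b','/an/abs/path/y/m/g.ext')
```

Had we used a plain print with ORS instead, the output would end in a stray ),( after the final record.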

See the online tutorials and the awk man page for a deeper explanation of bits I didn't cover well enough for you.

Note: Senior Orcs will advise you that   -F ','   and   -v q=\'   are more idiomatic than the respective:   -v FS=','   and   -v q="'"   . I used the more verbose and regular forms here for clarity.

I would throw some links here to awk learning resources, tutorials and reference material but I really don't need to. Google has your back there. There's a metric fuck ton of awk goodies just waiting for you to read them. So, off you pop and learn you an awk of great utility.

Update

My apologies to e36freak on #awk for forgetting to thank him for looking over my awk script. He pointed out that I should use $NF instead of $4. NF is another special awk internal variable that holds the number of fields in the current record, so $NF always refers to the last field. Remembering that we set the field separator to a single comma, given the format of the records shown, it's probably reasonable to assume we'll always therefore get four fields. Assumption is the mother of all fuckups though and if any earlier field were to contain a comma then $4 will refer to the wrong field in our expression. Knowing that the field we always want to change is the final field, we can make our code a little more robust by using $NF instead. This is not foolproof, though. If the last field (a path) happened to contain a comma, we're dead in the water again. Mistakes like these should ideally be checked for in testing phases. It might be infeasible to test a full 1.2 GB data file to find a possible tweet in the stream.
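A sketch of the $NF variant, fed a record whose third field contains a comma (the case where $4 would point at the wrong field). The add_prefix_nf name and the sample record are hypothetical, and as before it assumes an awk that accepts a regex RS:

```shell
# Hypothetical variant of the article's awk: always edit the last field ($NF).
add_prefix_nf() {
  awk -v FS=',' -v RS='\\),\\(' -v ORS='),(' -v OFS=',' \
      -v q="'" -v abspath="'/an/abs/path/" \
      '{ sub(q, abspath, $NF); printf("%s%s", NR>1 ? ORS : "", $0) }'
}

# The third field contains a comma, so this record splits into five fields.
printf "%s" "(1,2,'a,b','y/m/f.ext')" | add_prefix_nf
# prints:
#   (1,2,'a,b','/an/abs/path/y/m/f.ext')
```

With $4 in place of $NF, the prefix would have been glued onto the b' fragment of the third field instead of the path.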

An example of sanity checking your data file to ensure no record has more than four fields:

  awk -v RS='\\),\\(' -v FS=',' 'NF > 4 {print NR, $0}' input_file

Any records having more than four fields (more than three commas) will be printed out, preceded by their record number within the file. 
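If you want the check to be usable in a script, one possible extension is to flag any record that doesn't have exactly four fields and report the result through the exit status. This is a sketch of my own devising, not something from the #awk crowd; check_fields is a hypothetical name and the sample data is invented:

```shell
# Hypothetical checker: print offending records and exit non-zero if any exist.
# Assumes an awk that accepts a regex RS (gawk and mawk do).
check_fields() {
  awk -v RS='\\),\\(' -v FS=',' '
    NF != 4 { bad++; print NR, $0 }   # report the record number and the record
    END     { exit (bad ? 1 : 0) }    # non-zero status when anything was reported
  '
}

# The second record has a comma inside its third field, so it gets flagged.
printf "%s" "(1,2,'a','y/m/f.ext'),(3,4,'b,c','y/m/g.ext')" | check_fields ||
  echo "found malformed records" >&2
```

Run it against the smaller test files first; a clean bill of health there says nothing about the big one, but a failure saves you the wait.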

Cunning Linguists

It's the simple pleasures in life...
On a spot of R&R
In the land of #awk afar,
Fortune gave a chance to rant
With a local inhabitant.
If you thought the Orcs morose,
Let me show you one jocose:

       bairui | of course, you could sub awk out for perl|python|ruby|
              | ... but interestingly in my tests, sed and perl
              | (one-liner forms) both failed hard on the 1.5 GB sample
              | data file. awk just chewed it up. Sure, a proper perl
              | script could deal with the problem better. Point is:
              | embrace polygramy.
    zendeavor | i am an archfag
    zendeavor | who cares about polygamy
    zendeavor | :)
       bairui | and i said  polygRamy   ;)
    zendeavor | typo
       bairui | play on words: polygamy = more than one sexual partner;
              | polygramy = more than one language
    zendeavor | i got it
galaxywatcher | polylingual
       bairui | but that lacks the pun
galaxywatcher | multilingual even.
galaxywatcher | You seem like a cunning linguist
       bairui | gram = grammar; ah, i enjoy your cunning stunt, sir.
galaxywatcher | I thought it was a punt
       bairui | if that's what the play calls for, then punt;
              | if it's rough and tumble, employ a stunt
galaxywatcher | grunts
       bairui | well, that was blunt
galaxywatcher | and plump
       bairui | sir, i do not mean to affront
       bairui | let's not let this come to thumps
       bairui | i'm sure we can be friends if we could just get over
              | this little hump
galaxywatcher | That's worse than mumps.
galaxywatcher | Good year blimps fly over humps like runts tumble
              | on nips.
       bairui | it seems you've glimpsed my lumps to strike blunts
              | and grumble your lips
galaxywatcher | The pair of lumps that straddle the cage? Or the pair
              | of lumps that dangle near change?
       bairui | would you sir have me in rage? I demand at once to
              | know your gauge.


I especially liked his cage/change line. Well played, sir. :-)