Thursday, December 20, 2012

Learn Vimscript the Hard Way, Commentary II

As I wrote in my first commentary on Steve Losh’s Learn Vimscript the Hard Way, I think it is a worthwhile book for people to read to get more familiar with the mechanics of customising their Vim environment.

What about those who want to use it to actually learn Vimscript?

It does an acceptable job of that too.

The chapters on Vimscript itself (19-27 and 35-40) cover the syntax and semantics of the language with examples and exercises spread throughout to give the learner necessary hands-on experience.

I want to stress here that I feel Steve didn’t intend the what of those examples to be used literally in anyone’s vimrc files or personal plugins (in fact, I believe that to be true of the whole book at large) but rather the how of techniques shown. Don’t create your own little maps in your ~/.vimrc file for commenting lines in various filetypes and don’t write your own toy snippets system — very good plugins exist for these purposes already. Do learn that you can do these sorts of things so that when the time comes for you to really write something new, you will know how to.

Steve also provides two larger exercises starting respectively at chapters 32 and 41. The first is a new operator to grep for the motioned text, and the second is a full-blown plugin for a new programming language. Both serve as good models for the sort of larger works of the practising VimLer.

I do recommend this book because it’s freely available to read online. Another resource I would recommend for learning Vimscript is Damian Conway’s five part developerWorks article series, Scripting the Vim Editor — that’s how I first got into VimL (VimL is short for Vim Scripting Language and is another name for Vimscript). Vim’s built-in :help usr_41 is the user guide to writing Vim scripts and :help eval.txt is the reference manual on VimL’s expression evaluation.

Do you have a favourite resource for learning VimL?

Oops... I forgot to add my remarks on some of the technical aspects of Steve's work:

  • As Steve says, always use :help nore maps until you know you need otherwise.
  • In the same vein, always use :normal! (instead of the oft shown :normal) to avoid user-defined keymaps on the right hand side.
  • The :echom command (and friends) expects the evaluations of its expressions to be of type string. Use :help string( to coerce lists and dictionaries to strings for use in these commands. E.g.   :echom string(getline(1, '$'))
  • Vim's help system is context aware based on the format of the help tag. See :help help-context for the list of formats.
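The first two remarks can be seen in action with a small experiment. This is only a sketch: the mapping of x to dd is a made-up example of a user-defined map, and everything happens in a throwaway scratch buffer.

```vim
" Work in a scratch buffer so nothing of yours is touched.
new
setlocal buftype=nofile
call setline(1, ['one', 'two', 'three'])

" A hypothetical user map that redefines x to delete a whole line.
" Being a nore map, the dd on the right-hand side is never remapped.
nnoremap <buffer> x dd

" :normal! ignores the map: the built-in x deletes one character.
1
normal! x
let after_bang = getline(1, '$')

" :normal honours the map: x now deletes the whole line.
normal x
let after_plain = getline(1, '$')

bwipeout!

echo after_bang
" → ['ne', 'two', 'three']
echo after_plain
" → ['two', 'three']
```

Comments sit on their own lines here because :normal treats the rest of its line as keys to type.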

Learn Vimscript the Hard Way, Commentary I

Steve Losh has written a book called Learn Vimscript the Hard Way.

It’s badly titled, imho. His definition of the book explains why I say that: a book for users of the Vim editor who want to learn how to customize Vim. With that description in mind, I think the book achieves its goal — a goal that all vimmers would aspire to master. However with the title of the book, I fear even many proficient vimmers would assume that the material is out of their reach or too dense to absorb right now with their busy schedules, relegating it to a later reading pile, at best.

For all you up and coming Vimmers looking to read something to take you beyond all of the beginner tutorials out there, read chapters: 0-18, 28-32, 43-48, 50 & 56.

For those of you who picked the book up specifically because of its title, that review is coming soon. :-)

Sunday, December 16, 2012

call() for a Good Time

Simple functions in Vim are declared like this:

function! A(a, b, c)
  echo a:a a:b a:c
endfunction

call A(1, 2, 3)

There’s probably nothing surprising there except for the a:a syntax, which is how Vim insists on accessing the function’s arguments (mnemonic: a: for argument).
Just as simple is calling function A() from another function, B(), passing its arguments directly along to A():

function! B(a, b, c)
  return A(a:a, a:b, a:c)
endfunction

call B(1, 2, 3)

Nothing surprising there at all. But we’ve just laid the groundwork for the main attraction tonight. In VimL, you can call a function using the library function call(func, arglist) where arglist is a list. If you’re calling a function that takes multiple arguments, collect them in an actual list like this:

function! C(a, b, c)
  return call("A", [a:a, a:b, a:c])
endfunction

call C(1, 2, 3)

If you already have the elements in a list, no need to wrap it in an explicit list:

function! D(a)
  return call("A", a:a)
endfunction

call D([1, 2, 3])

Let’s step it up a notch. What if you want to be able to accept the args as either separate arguments or as a list? Vim has your back with variadic functions cloaked in a syntax similar to C’s:

Variadics in the key of V:
  • a:0 is a count of the variadic arguments
  • a:000 is all of the variadic arguments in a single list
  • a:1 to a:20 are positional accessors to the variadic arguments
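A quick way to watch these three accessors at work is a function that simply reports them. ShowArgs() is a made-up name for illustration only.

```vim
" Report the count, the full list and the first variadic argument.
function! ShowArgs(...)
  " a:1 only exists when at least one argument was passed.
  let first = a:0 > 0 ? a:1 : 'none'
  return [a:0, a:000, first]
endfunction

echo ShowArgs('x', 'y')
" → [2, ['x', 'y'], 'x']
echo ShowArgs()
" → [0, [], 'none']
```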

So now it doesn’t matter how we receive the arguments — standalone or in a list — we can keep Vim happy and call A() appropriately.

function! E(...)
  if a:0 == 1
    return call("A", a:1)
  else
    return call("A", a:000)
  endif
endfunction

call E(1, 2, 3)
call E([1, 2, 3])

Ok. That’s not too bad; it’s perhaps a little awkward. We’re calling A() directly here, but it shouldn’t be a surprise to see that we can call C() in the same way too:

function! F(...)
  if a:0 == 1
    return call("C", a:1)
  else
    return call("C", a:000)
  endif
endfunction

call F(1, 2, 3)
call F([1, 2, 3])

Pretty straightforward. What about calling D() instead which expects a single list argument? Hmm… if Vim wants a list, give him a list:

function! G(...)
  if a:0 == 1
    return call("D", [a:1])
  else
    return call("D", [a:000])
  endif
endfunction

call G(1, 2, 3)
call G([1, 2, 3])

It’s worth stopping briefly here to consider what call() is doing to that arglist: It’s splatting it (extracting the list’s items and passing them as separate arguments to the called function). Nice. Wouldn’t it be nice if we could splat lists ourselves? Well, be envious of Ruby coders no more because we can splat lists in VimL!

To splat a list into separate variables (a, b and c here):
let [a, b, c] = somelist
Read :help :let-unpack for the juicy extras.

I like the splatting approach because it gives us variable names to play with inside our function:

function! H(...)
  if a:0 == 1
    let [a, b, c] = a:1
  else
    let [a, b, c] = a:000
  endif
  return D([a, b, c])
endfunction

call H(1, 2, 3)
call H([1, 2, 3])

Of course, it works just as well for calling functions with explicit multiple arguments, like C():

function! I(...)
  if a:0 == 1
    let [a, b, c] = a:1
  else
    let [a, b, c] = a:000
  endif
  return C(a, b, c)
endfunction

call I(1, 2, 3)
call I([1, 2, 3])

You’ll notice that the splat semantics are identical between H() and I() and only the call of D() and C() change, respectively. This is very neat, I think.
So far we’ve been calling through to functions that call A() directly. Happily, we can call through to one of these dynamic functions (like E(), but any would work as well) and have it Just Work too:

function! J(...)
  if a:0 == 1
    let [a, b, c] = a:1
  else
    let [a, b, c] = a:000
  endif
  return E(a, b, c)
endfunction

call J(1, 2, 3)
call J([1, 2, 3])

So, that’s it. Vim has variadic functions and splats. And splats are my recommended pattern for handling deep call chains between variadic functions.
There’s one last, cute, little thing about splats: you can collect a certain number of explicit arguments as you require, and then have any remaining arguments dumped into a list for you. The rest variable here will be a list containing [4, 5, 6] from the subsequent calls:

function! K(...)
  if a:0 == 1
    let [a, b, c; rest] = a:1
  else
    let [a, b, c; rest] = a:000
  endif
  echo "rest: " . string(rest)
  return E(a, b, c)
endfunction

call K([1, 2, 3, 4, 5, 6])
call K(1, 2, 3, 4, 5, 6)

And I thought this was going to be a short post when I started. I almost didn’t bother posting it because of that reason.

Saturday, December 1, 2012

Rapid Programming Language Prototypes with Ruby & Racc, Commentary

I just watched a tolerable ruby conference video by Tom Lee on Rapid Programming Language Prototypes with Ruby & Racc.

What he showed he showed fairly well. His decision to "introduce compiler theory" was, he admitted, last-minute and the hesitation in its delivery bore testimony to that. The demonstration of the compiler pipeline using his intended tools (ruby and racc) was done quite well with a natural progression through the dependent concepts along the way. By the end of the talk he has a functional compiler construction tool chain going from EBNF-ish grammar through to generated (and using gcc, compiled) C code.

I was surprised that nobody in the audience asked the question I was burning to ask from halfway through the live-coding session: Why not use Treetop? (or the more generic: why not use a PEG parser generator or a parser generator that does more of the heavy lifting for you?)

The whole point of Tom's presentation is: use ruby+racc because it saves you from all the headaches of setting up the equivalent tool chain in C/C++. And it does, he's right. But it feels to me that Treetop does even more of that hard work for you, allowing you to more quickly get to the fun part of actually building your new language. I'm angling for simplicity here.

I could be wrong, though, so let me ask it here (as Confreaks seems to not allow comments): Why not treetop (or an equally 'simple' parser generator) for something like this? (and answers along the lines of EBNF > PEG are not really what I'm after, but if you have a concrete example of that I'd like to hear it too.)

On a completely separate note: Tom, you need to add some flying love to your Vim habits. :-)

Thursday, November 8, 2012

Vim Motions

One of the more frequent admonishments delivered on #vim to the whining novice or the curious journeyman is to master the many motions within the editor. Previously, a bewildering list of punctuation and jumbled letters was unceremoniously dumped on the complainant with the misguided expectation that they'd then take themselves off and get right to the task of memorising the eighty-odd glyphs. We mistook their silence for compliance but I rather suspect it was more bewilderment or repulsion or sheer paralysis. In an attempt to friendly that mess up, I have started an infographic series intended to cover the twelve major categories, probably spread over six separate infographics.

The Vim Motions Infographic Series (in 9 parts):

1. Line & Buffer
2. Column
3. Word
4. Find
5. Search
6. Large Objects
7. Marks, Matches & Folds
8. Text Objects (not motions, but mesh nicely at this point)
9. Creating your own Text Objects

I plan to have a different expression on the chibi's face in each of the pages. I'll move the crying one from the Large Object page (as shown below) to page 1 and then progressively improve her mood through the remaining pages: something like -- crying, disappointment, resignation, hope, amazement, happiness, confidence, smugness and something devilish. As an update on that, I have inked five of the chibis now. I look forward to having them all up in their own infographics.

I decided to have the background colour change to suit the mood of the chibi, starting from black in image number one to represent depression and despair. I will roughly follow the same colour spread I used on the How Do I Feel graphic.

I have no experience in putting together a multi-page piece like this. Feedback certainly welcome. I was vaguely thinking of having it a bit like a magazine or comic book spread, but I don't know how to do that or whether it's the right or even a good approach.

Green indicates cursor origin before issuing the motion.
Red indicates cursor destination at the end of the motion.
Orange shows the area covered by the motion. This would be the same area highlighted in Vim if a visual operator was used with these motions.

1. Line & Buffer Motions

2. Column Motions

6. Large Object Motions

The Many Faces of % in Vim

Pity the poor Vimmer for he has so many a face to put to percent:

Help Topic     Description
N%             go to {count} percentage in the file
%              match corresponding [({})] (enhanced with matchit.vim plugin)
g%             enhanced match with matchit.vim plugin: cycle backwards through matches
:%             as a range, equal to :1,$ (whole file)
:_%            used as an argument to an :ex command as the name of the current file
"%             as a register, the name of the current file
expr-%         in VimL as modulo operator
expand(), printf() and bufname()
               in VimL: expand() and bufname() take % as the name of the current file; printf() uses % in its format specifiers
'grepformat', 'errorformat', 'shellredir', 'printheader' and 'statusline'
               various options use % as a printf-like format specifier

Regular Expression Atoms:
  Match locations:
    \%#        cursor position
    \%'        position of a mark
    \%l        specific line
    \%c        specific column
    \%v        specific virtual column
  \%(          non-capturing group (creates no backreference)
  \%[          sequence of optionally matched atoms
  Numeric character specifier in matches:
    \%d        decimal
    \%o        octal
    \%x        hex (2 digits)
    \%u        hex (4 digits)
    \%U        hex (8 digits)
  Absolute file or string boundaries:
    \%^        start of file (or start of string)
    \%$        end of file (or end of string)
  \%V          match inside visual area
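
A few of these faces can be checked straight from the command line. This is only a sketch; the sample numbers and strings are made up for illustration.

```vim
" expr-%: the modulo operator
echo 7 % 3
" → 1

" printf(): % introduces a format item and %% is a literal percent sign
echo printf('%d%%', 50)
" → 50%

" \%( groups without capturing, so \(bar\) is still submatch 1
echo matchlist('foobar', '\%(foo\)\(bar\)')[1]
" → bar

" \%d65 matches the character with decimal code 65
echo 'A' =~# '\%d65'
" → 1

" \%^ and \%$ anchor to the start and end of a string
echo 'abc' =~# '\%^abc\%$'
" → 1
```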

Thursday, November 1, 2012


A true and accurate hysteri of the Rise Of The House Buffalo.

This chronicle begins on the 3rd day of the 7th month in the time of Our Vim where people lived a modal life of happiness and abided the :ex commands with pious fervour and so were blessed with edits most excellent and help abundant. It was a time of peace, prosperity and personal productivity.

Then one fateful day a churlish stranger peddled into town, seated upon a contraption most vile and contemptuous. Perched high atop its mechanical crown he screeched down upon the startled fray: “Tarry not betwixt thy :buffers and switch thee not so slowly as with a :buffer number or partial match thereof! Hark the word of reason and join your wizened brethren in celebration of the wheel! Cycling is thy salvation!” So bold was the orator and so balanced he atop his levered contrivance that several among the crowd, wide eyed and jowls agape, moved toward the wretched apparatus with minds numbed and coveting limbs trembling outstretched in wanton avarice.

Lost were these souls on the dull carousel of endlessly needing to :bnext to their buffers; pitiful prisoners of self-constrained linear, cyclic thinking. Trapped they were in the dungeons of their own device, tormented by the clink of their own chains, damned to traverse the wheel of life for eternity, forever spun without liberation.

Forever, that is, until the mavericks started flying.

Unconvinced by the rhetoric of the Church of the Wheel, various voracious vimmers revolted against the Cyclic Dogma and instead embraced a more direct buffer navigation strategy they dubbed flying. This upstart movement quickly gathered an ardent band of kindred spirits who championed the righteousness of flying over cycling.

Regretfully, the zealous were much harder to shake free from their demonic wheel worship. Skirmishes frequently led to larger battles, some of which erupted into full-blown flame wars involving some very hurtful name calling. Slowly waged this war of ideologies, its opponents forever locked in a struggle for vimmer mindshare.

That all changed when Brother Raimondi rode into town astride a bullock of majestic poise and serious presence. The unassuming fellow dismounted without word, turned to the gathered townsfolk and, lifting his feathered cap in measured civility, said, “I bring you The Buffalo.”

No ordinary ox was this tireless beast! Fast, it was! And nigh on omniscient — inferring your very intention from the merest mumble of your desires. So stunning was the stuff of this beefy buffer bouncer that even acolytes of the Church of the Wheel were leaving the order and forsaking their old cycling ways as sin against good sense and refined taste.

For many millions of clock cycles did the mighty buffalo reign over the land of Vim with an ever brisk gait and unerring (ok, only slightly erring) eye toward buffer discretion. Though happy were the citizens with their bovine bureaucrats, they shared a secret longing for simpler governance, clearer models, a more transparent core. Their collective desires created an exaltation of excellence within the very genes of the tenacious bison herd.

Indeed, the metamorphosis was nothing short of a total paradigm shift. Thus dawned the era of the formidable SkyBison — a wondrous hoofed though winged beast, swooping down from aloft in clean and graceful yet swift and precise arcs of buffer selection. May the SkyBison reign righteously and with longevity.

The buffalo is dead; long live the buffalo! All hail the SkyBison!

And if you’re still cycling when you should be flying… may SkyBison gorge on your artless cud.

Note This is a work of fiction only. Any semblance to peoples either living, dead or pretending to be so is purely coincidental and should not be taken as pertaining to them in any way whatsoever. Unless you feel flattered by the events described herein, in which case it’s totally about you. Don’t mention it. You’re welcome. You’re worth it.

Thursday, October 11, 2012

Chewed Out

"peddling a double standard"

Tony Abbott squirmed in his front-row seat through a thumping of Gillard gall, having his asshole chewed out for him as Julia demonstrated numerous accounts of sexist and misogynistic quotes from the leader of the opposition in a political maneuver that highlighted Abbott's hypocrisy in calling for the resignation of Peter Slipper over an alleged series of lurid text messages he sent to a former staffer.

Perhaps Slipper should be removed from "high office" for such offences.
Perhaps Gillard shouldn't have defended such uncouth behaviour.

Maybe. Maybe not. Not mine to opine.

What I can say, though, is Gillard got her smack on that day and Abbott was left with the definite look of someone put firmly in their place. Ouch.

I haven't enjoyed parliament time as much since the raunchy days of unbridled Keating.

Enjoy the full fifteen minute schooling Gillard gives Abbott in documenting insults and slurs along with the date and place each shameful comment was said and unleashing them in a tirade of witty and ferocious criticisms.

A Little Drop of Prudence

I like my Vim with a little drop of prudence

I frequently create temporary macros or maps for ad hoc edits when I find myself having to do the same job more than a few times over. This is a good thing to spend some time reflecting on in your editing. If you’re in the heat of the moment and don’t want to break concentration or waste time on R&D right now, make a note to come back when you have time to look at your current editing inefficiency. I recommend setting up a practise file (something I mention in learnvim) which you can quickly jump to using a global bookmark.

Setting up a Practise File
  1. :e ~/vim-practise.txt
  2. mP
Note This sets a global mark that can be jumped to from anywhere within vim using the normal mode ' command. (:help 'A)
Jumping to your Practise File
  • 'P

So, there you are editing away on another dreary Tuesday and in a moment of lucidity you realise you’ve just mashed the same key pattern a dozen times over — you’ve just discovered an inefficiency! Awesome.

Quick check: “Do I have time to investigate and optimise this now?”

No: :-( Sucks to be you. Quick! To the Practise File! Make a quick note about this so that you can come back to it on your morning tea break.

Yes: :-) You soldier! Yank a sample snippet of the problem at hand and then… Quick! To the Practise File! Paste in your snippet and start experimenting with ways to optimise the necessary changes. Is there a map or macro you can make to wrap these steps up into a fast and simple solution? Take your macro/map back to the real work you were doing before this R&D diversion and finish the rest of those lame edits with the genuine vim you should be applying to life.

“But couldn’t I have just experimented in my original file?”

Sure… but then you lose the problem. You’re left with just a finished solution. A useless page of sterile, problemless text. That might please your boss and clients, but it’s just no good for your continued development as a vimmer. You’ll grow more by squirreling away interesting little nasties like this that you find in the wild so that you can revisit them during quieter moments as part of your Deliberate Practise regimen.

Thursday, October 4, 2012

Eat Me A Camel!

Eat Me A Camel!

Bah… it never rains; it pours! I failed the other day to secure myself a feed of camel from our new Xin Jiang restaurant in the mall opposite our housing estate because the chef had closed the kitchen after lunch. To avoid a disappointing repeat, I went at the socially approved time today, only to be thwarted once again! This time, a foyer full of waiting brethren, all salivating onto their flat bread consolation prizes, briefly averted their collective gaze from the neon camel sign to eyeball the most recent dromedary devouring devotee to idly dawdle through the double doors. Eager as I was to finally sink my teeth into a prime piece of camel, I took the bold step of calmly walking away from the fray. Another day, humped wildebeest! Those sneaky xinjiangians… it’s probably llama anyway. :-/

Monday, September 17, 2012

Ice skating in the desert

It doesn't matter how well you figure-skate on the ice hockey rink.

I have seen this error appear a number of times in my life. Sometimes I'm watching others make it; sometimes I'm waking up to my own blunders.

Covey had a similar phrase: It doesn't matter how fast you climb the wrong ladder.

My Data Communications lecturer at uni (love ya, Terry) used to tell the best anecdotes. He had one about a group of coders who'd been corrected by a senior programmer. They were chiding his solution for being slower than theirs. The problem was, theirs just didn't work. Terry instilled in me then the lesson: It doesn't matter how fast you can calculate the wrong answer.

A few years ago I was teaching ESL at a university in China. I was getting increasingly frustrated at my lessons - the kids just weren't getting on board! I woke up one morning with the revelation: I was playing basketball on the soccer field. D'Oh! So I changed up my game to address their actual needs and the lessons started flowing a lot better. I dropped my preconceptions about what I thought they should want at that stage of their education and instead looked at (and listened to!) what they really needed.

Sometimes we need to be reminded that it doesn't matter how much we know or how much we've done in a particular field or endeavour if that game isn't the one we're being asked to play right now.

Sunday, September 16, 2012

Vim is Like a Big School

Vim is like a big school. When the first-graders come they are shown their playground and classrooms, the washrooms and the canteen. They are happy and content running around in their little world. When they accidentally stumble into the 6th graders' hall, they o_O and run away in terror. It’s only when they are walking alongside a grown-up that they happily follow along, walking right through the scary hall without realising it. Soon those same kids are running through all the halls without fear.

Don’t be afraid to explore your Vim grounds. Sure, you may stumble into uncharted territory and see something really scary - but it’s mostly harmless and there are only a few rooms with auto-closing doors… And the basement is a little tricky to get out of… And you might trip and stumble or run into something sharp and painful. You might even end up running away, screaming. Wear your brown pants and a buffer you can afford to lose and you’ll be just fine.

If you need a grown-up’s hand to hold, knock on the #vim office door - there are plenty of cheerful guides in there who are happy to help.

Two of the scariest rooms to try:
  • Enter the basement from normal mode with gQ
    • The basement is not like any other room in Vim… You can’t leave with :q but must instead scream the school’s Latin name: :vi!
  • Wait in the principal’s office with q:
    • You might also find yourself being sent there for swearing (ctrl-f) on the : command line.

Sunday, September 9, 2012

Genetic Algorithms in VimL (Part I)

Burak Kanber is into machine learning. I was entertained by his Hello World genetic algorithm example and, in alignment with his implementation language agnosticism, I thought I'd write a version in VimL:

This depends on vim-rng for the random number stuff.

let s:Chromosome = {}

function! s:Chromosome.New(...)
  let chromosome = copy(self)
  let chromosome.code = ''
  let chromosome.cost = 9999
  let chromosome.pivot = 0

  " An optional first argument supplies the initial code.
  if a:0
    let chromosome.code = a:1
    let chromosome.pivot = (strchars(chromosome.code) / 2) - 1
  endif

  return chromosome
endfunction

function! s:Chromosome.random(length)
  let self.code = RandomString(a:length)
  let self.pivot = (a:length / 2) - 1
  return self
endfunction

" With probability a:chance, nudge one random character up or down by one.
function! s:Chromosome.mutate(chance)
  if (RandomNumber(100) / 100.0) < a:chance
    let index = RandomNumber(1, strchars(self.code)) - 1
    let upOrDown = RandomNumber(100) <= 50 ? -1 : 1
    let exploded = split(self.code, '\zs')
    let change = nr2char(char2nr(exploded[index]) + upOrDown)
    if index == 0
      let self.code = change . join(exploded[index+1:], '')
    else
      let self.code = join(exploded[0:index-1], '') . change . join(exploded[index+1:], '')
    endif
  endif
  return self
endfunction

" Single-point crossover: swap the tails of the two parents at the pivot.
function! s:Chromosome.mate(chromosome)
  let child1 = strpart(self.code, 0, self.pivot) . strpart(a:chromosome.code, self.pivot)
  let child2 = strpart(a:chromosome.code, 0, self.pivot) . strpart(self.code, self.pivot)
  return [s:Chromosome.New(child1), s:Chromosome.New(child2)]
endfunction

" Cost is the sum of squared differences between character codes.
function! s:Chromosome.calcCost(compareTo)
  let total = 0
  let i = 0
  while i < strchars(self.code)
    let diff = char2nr(self.code[i]) - char2nr(a:compareTo[i])
    let total += diff * diff
    let i += 1
  endwhile
  let self.cost = total
  return self
endfunction

function! s:Chromosome.to_s()
  return self.code . ' (' . string(self.cost) . ')'
endfunction

let s:Population = {}

function! s:Population.New(goal, size)
  let population = copy(self)
  let population.members = []
  let population.goal = a:goal
  let population.generationNumber = 0
  let population.solved = 0

  " Fill the population with random chromosomes as long as the goal.
  let size = a:size
  let length = strchars(population.goal)
  while size > 0
    let chromosome = s:Chromosome.New()
    call chromosome.random(length)
    call add(population.members, chromosome)
    let size -= 1
  endwhile

  return population
endfunction

" Replace the current buffer's contents with the population listing.
function! s:Population.display()
  % delete
  call setline(1, "Generation: " . self.generationNumber)
  call setline(2, map(copy(self.members), 'v:val.to_s()'))
  return self
endfunction

function! s:Population.costly(a, b)
  return float2nr(a:a.cost - a:b.cost)
endfunction

function! s:Population.sort()
  call sort(self.members, self.costly, self)
endfunction

function! s:Population.generation()
  call map(self.members, 'v:val.calcCost(self.goal)')

  call self.sort()
  call self.display()

  " Mate the two fittest members; their children replace the two weakest.
  let children = self.members[0].mate(self.members[1])
  let self.members = extend(self.members[0:-3], children)

  let i = 0
  while i < len(self.members)
    call self.members[i].mutate(0.5)
    call self.members[i].calcCost(self.goal)
    if self.members[i].code == self.goal
      call self.sort()
      call self.display()
      let self.solved = 1
    endif
    let i += 1
  endwhile

  let self.generationNumber += 1

  return self
endfunction

let population = s:Population.New('Hello, world!', 20)
while population.solved != 1
  call population.generation()
endwhile

To see it run save it in a file and type   :so %   from within vim.
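
If you don't have vim-rng installed, the two deterministic pieces of the script can still be tried on their own. The following is only a sketch: CostSketch() and MateSketch() are made-up names that mirror calcCost() and mate() as plain functions, and they assume ASCII-only strings of equal length.

```vim
" Sum of squared differences between character codes; 0 is a perfect match.
function! CostSketch(code, goal)
  let total = 0
  let i = 0
  while i < strlen(a:code)
    let diff = char2nr(a:code[i]) - char2nr(a:goal[i])
    let total += diff * diff
    let i += 1
  endwhile
  return total
endfunction

" Single-point crossover using the same pivot formula as the script.
function! MateSketch(mum, dad)
  let pivot = (strlen(a:mum) / 2) - 1
  let child1 = strpart(a:mum, 0, pivot) . strpart(a:dad, pivot)
  let child2 = strpart(a:dad, 0, pivot) . strpart(a:mum, pivot)
  return [child1, child2]
endfunction

echo CostSketch('Hello', 'Hello')
" → 0
echo CostSketch('Gdllo', 'Hello')
" → 2
echo MateSketch('aaaaaa', 'bbbbbb')
" → ['aabbbb', 'bbaaaa']
```

A lower cost means a closer match, which is why sorting by cost puts the best breeding candidates at the front of the population.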

Why, dear bairui, you ask? Well, at this stage... I don't know. It just looked like fun. However, a couple of wild thoughts occurred to me: finding the ideal (good enough; as in 'correct' enough) combination of various vim options to achieve a desired look and behaviour. Take for example the various C indenting styles - what mad combination of &cinoptions, &cinkeys and &cinwords would you need to achieve Frankenstein's Indentation Style? What about getting &formatlistpat right for your preferred markup style? Sure, these might be totally hare-brained ideas -- but they might give you an idea for something less hairy and actually useful. Either way, I plan to keep playing with Burak's tutorial as he progresses through it. Thanks, Burak! :-)

Thursday, August 16, 2012

Cadger's Comeuppance

Phone rings out alone
Mind closed off to callers bent
Moocher baits elsewhere

Wednesday, August 15, 2012

A PEG Parser Generator for Vim

Barry Arthur
v1.2, August 15, 2012
v1.1, October 10, 2011

What is VimPEG?

VimPEG is a Parser Generator which uses the newer Parsing Expression Grammar formalism to specify parse rules.

Why VimPEG?

Vim is a powerful editor. It has lots of features baked right in to make it an editor most awesome. It has a deliciously potent regular expression engine, jaw-dropping text-object manipulations, and fabulous scriptability -- just to name a few of its aces.

One thing our little Vim still lacks, though, is an actual parser. Regular expressions will only get you so far when you're trying to analyse and understand complex chunks of text. If your text is inherently infinite or recursive, then regular expressions become at best cumbersome, and at worst, useless.

So, Vim needs a parser. I've needed one myself several times when wanting to build a new plugin:
Awesome! This idea will so rock! Now all I need to do is parse <SomeLanguage> and I'll be able to... awww... :-(
I've seen people ask on #vim: How can I <DoSomethingThatNeedsAParser>? And invariably the answer is: You can't. Not easily, anyway. Vimscript is a capable enough language to write your own parser in, but a little alien to most to do so.

You could also use one of the many language bindings that Vim comes bundled with these days to use a parser library in your favourite scripting language. The problem being that your code will only then run on a similarly compiled Vim (not everyone enables these extra language bindings) and with your parser library dependencies.

Beyond those two options, the world of parsing in Vim is quite scant. There exist a small handful of purpose-built recursive descent parsers that target a specific task (like parsing json), but for the general case -- a parser-generator -- you're out of luck. Until now. VimPEG aims to solve this problem.

VimPEG aims to be a 100% VimL solution to your parsing needs.

What would I use VimPEG for?

  • You've come to that paralysing sinkhole in your Vimming when you've said to yourself, "Damn... I wish Vim had a parser."
  • You've asked for something on #vim and the reply is "you can't do that because Vim doesn't have a parser."
  • You're up to your neck in recklessly recursive regexes.

Some ideas:

  • An expression calculator (the beginnings of which we explore here.)
  • Expanding tokens in typed text (think: snippets, abbrevs, maps.)
  • Semantic analysis of code -- for refactoring, reindenting (but sadly not syntax highlighting yet.)
  • C Code bifurcation based on #define values -- want to see what the code would look like with #define DEBUG disabled?
  • Coffeescript for Vim -- sugar-coating some of the uglies in VimL -- this example will be presented in a subsequent VimPEG article.

In fact, most of these ideas have been explored in part inside the examples/ directory of the VimPEG plugin.

For the purposes of introducing VimPEG and parsing in general (if you're new to it), let's consider a fairly easy example of reading and understanding (perhaps calculating) a sum series of integers. They look like this:

    1 + 2 + 12 + 34

NOTE: Vim can already do this for you, so writing a parser for it here is purely pedagogical -- it's a simple enough example without being utterly devoid of educational value. I hope.

The list can be any (reasonable) length, from a single integer upwards. So, this is a valid input to our parser:

    1

As are all of the following:

    1 + 2
    3 + 4 + 5
    123 + 456 + 789

Stop. Right now. And think: How would you parse such an arbitrarily long series of integers separated by + operators? What tool would you reach for? What if you had to do it in Vim? And :echo eval('1 + 2 + 3') is cheating. :-p

We'll continue to use this example throughout this article and eventually show you how VimPEG solves this little parsing requirement.

But first, let's make sure we're all on the same page about the question: What is parsing?


Feel free to skip to the next section if you're comfortable with the following concepts:
  • parsing
  • parser generators
  • (E)BNF and PEGs

Let's begin by defining some terms:

What is 'Parsing'?

Parsing is making sense of something. When we want a computer to understand something we've written down for it to do, it needs to 'parse' that writing.  Without going into too much detail yet, let's consider a sentence uttered at one time or another by your parental unit: "Take the rubbish out!". When you (eventually -- after you unplug your iPod, put down your PS3 controller, pocket your smart-phone and wipe the disdain off your face) parse this sentence, your brain goes through two processes:

firstly, syntax recognition:

  • it scans the words to make sure they're legitimate:
    • they're in a language you know
    • they're all valid words, and
    • they're all in the right order

and secondly, semantic analysis:

  • it filters out the 'meaning' and presents that to a higher actor for further deliberation

In this case, the parser would extract the verb phrase 'take out' and the noun 'rubbish'. Your higher self (sarcasm aside) knows where this magic 'out' place is. We'll come back to these two processes ('syntax recognition' and 'semantic analysis') later.

In the case of our sum series of integers, syntax recognition would involve collecting the sequence of digits that comprise an integer, skipping unnecessary whitespace and expecting either an end of input or a + character and another integer and... so on. If the input contained an alphabetic character it would fail in this phase -- alphabetic characters are just not expected in the input. If the lexical recogniser found two integers separated by whitespace or two + characters in a row...  it would not fail in this phase -- these are all valid tokens in 'this' lexical recogniser.

I am describing the more general process in which lexical recognition is a separate stage from semantic analysis, which is typical of a lot of parsers. PEG parsers, however, do not have separate phases as described here -- they are quite strict about not only what shape the next token must have, but also its purpose in this place (context) of the input. Having two consecutive integers or two consecutive + characters will upset a PEG parser expecting a sum series of integers -- it's just that it gets upset all in its single parse phase.

The semantic analysis phase is all about doing something "meaningful" with the collected integers. Maybe we should sum them? Maybe we just want to pass back a nested list structure representing the parse tree, like this:

    [1, '+', [2, '+', [3, '+', 4]]]

given this input:

    1 + 2 + 3 + 4

Either way, whatever is done, it's the job of the semantic analysis phase to do so. In our example in this article, we produce a sum of the collected integer series. So, our parser would return: 10 for the example input given above.

What is a 'Parser Generator'?

Writing a parser is not easy. Well, it's not simple. It's fussy. It's messy. There's a lot of repetition and many edge cases and minutiae that bore a good coder to tears. Sure, writing your first recursive descent parser is better than sex, but writing your second one isn't. Writing many is as much fun as abstinence. Enough said.

So, we (as fun loving coders) want a better alternative. Parser generators provide that alternative. They generate parsers; which means they do all the boring, tedious, repetitive hard-labour and clerical book-keeping stuff for us. I hope I've painted that with just the right amount of negative emotion to convince you on a subliminal level that Parser Generators are a Good Thing(TM).

How do they generate a parser? or What's a 'PEG'?

Parser Generators are told what to expect (what is valid or invalid) through a grammar -- a set of rules describing the allowed constructs in the language it's reading. Defining these rules in a declarative form is much easier, quicker and less error-prone than hand-coding the equivalent parser.

Bryan Ford recently (circa 2004) described a better way to declare these rules in the form of what he called Parsing Expression Grammars -- PEGs.

NOTE: We used to declare these parsing rules in EBNF, intended for a recursive descent parser (or an LL or LALR or other parser). And before you drown me in comments of "They so still use that, dude!" -- I know. They do.

In a nutshell, PEGs describe what is expected in the input, rather than the (E)BNF approach of describing what is possible. The difference is subtle but liberating. We'll not go too much into that now -- except to say: PEGs offer a cleaner way to describe languages that computers are expected to parse. If you want to re-program your 13 year old brother, you might not reach for a PEG parser generator, but as we're dabbling here in the confines of computers and the valley of vim, PEGs will do just fine.

A major benefit to PEG parsers is that there is no separate lexical analysis phase necessary. Because PEG parsers 'expect' to see the input in a certain way, they can ask for it in those expected chunks. If it matches, great, move on. If it doesn't match, try another alternative. If all the alternatives fail, then the input doesn't match. Allow for backtracking, and you have all you need to parse 'expected' input.

NOTE: VimPEG is not a memoising (packrat) parser -- not yet, anyway.

A brief overview of the PEG parsing rule syntax

  • Terminal symbols are concrete and represent actual strings (or in the case of VimPEG, Vim regular expressions) to be matched.
  • Non-terminal symbols are names referring to combinations of other terminal and/or non-terminal symbols.
  • Each rule is of the form:   A ::= e -> #s
    • A is a non-terminal symbol
    • e is a parsing expression
    • s (optional) is a semantic transformation (data-munging callback)
  • Each parsing expression is either: a terminal symbol, a non-terminal symbol or the empty string.
  • Given the parsing expressions, e1 and e2, a new parsing expression can be constructed using the following operators:
    • Sequence: e1 e2
    • Ordered choice: e1 / e2
    • Zero-or-more: e*
    • One-or-more: e+
    • Optional: e?
    • And-predicate: &e
    • Not-predicate: !e

A Conceptual Model of VimPEG

There are three players in the VimPEG game:
  1. The VimPEG Parser Generator (Vim plugin)
  2. The Language Provider
  3. The Client

The VimPEG Parser Generator

This is a Vim plugin you'll need to install to both create and use VimPEG based parsers.

The Language Provider

This is someone who creates a parser for a new or existing language or data-structure. They create the grammar, data-munging callbacks, utility functions and a public interface into their 'parser'.

The Client

This is someone who wants to 'use' a parser to get some real work done. Clients can either be Vim end-users or other VimL coders using a parser as a support layer for even more awesome and complicated higher-level purposes.

There are five pieces to VimPEG

  1. The VimPEG library (plugin)
  2. A PEG Grammar (provider-side)
  3. Callbacks and utility functions [optional] (provider-side)
  4. A public interface (provider-side)
  5. Client code that calls the provider's public interface. (client-side)

Our Parsing Example

Let's return to our parsing example: recognising (and eventually evaluating) a sum series of integers.

Examples of our expected Input

  • 123
  • 1 + 2 + 3
  • 12 + 34 + 56 + 78

A traditional CFG style PEG for a Sum Series of Integers:

  Expression  ::= Sum | Integer
  Sum         ::= Integer '+' Expression
  Integer     ::= '\d\+'

In the above PEG for matching a Sum Series of Integers, we have:
  • Three non-terminal symbols: 'Integer', 'Sum' and 'Expression'
  • Two terminal symbols: \d\+ and  '+'
  • One use of Sequence with the three pieces: 'Integer' '+' 'Expression'
  • One use of Ordered choice: 'Sum' | 'Integer'

NOTE: The original (and actual) PEG formalism specifies the fundamental expression type as a simple string. VimPEG shuns (at probable cost) this restriction and allows regular expressions as the fundamental expression type. Original PEG grammars use / to indicate choice, but VimPEG uses | instead.

Anyone familiar with CFG grammar specifications will feel right at home with that example PEG grammar above. Unfortunately, it isn't idiomatic PEG. The thing to be parsed here is a list. PEGs have a compact idiomatic way of expressing that structure:

  Expression   ::= Integer (('+' | '-') Integer)*
  Integer      ::= '\d\+'

Here the arguably simpler concept of iteration replaces the CFG use of recursion to describe the desired list syntax to be parsed. It's so much simpler that it seemed a waste not to bundle subtraction in with the deal. Now our parser can evaluate a series of integer add and subtract operations.


  peg.e(expression, options)                  "(Expression)
  peg.and(sequence, options)                  "(Sequence)
  peg.or(choices, options)                    "(Ordered Choice)
  peg.maybe_many(expression, options)         "(Zero or More)
  peg.many(expression, options)               "(One or More)
  peg.maybe_one(expression, options)          "(Optional)
  peg.has(expression, options)                "(And Predicate)
  peg.not_has(expression, options)            "(Not Predicate)

Defining the Series of Integer Add and Subtract Operations PEG

  let p = vimpeg#parser({'skip_white': 1})
  call p.e('\d\+', {'id': 'integer'})
  let expression =
        \ p.and(
        \   [ 'integer',
        \     p.maybe_many(
        \       p.and(
        \         [ p.or(
        \           [ p.e('+'),
        \             p.e('-')]),
        \           'integer'])),
        \     p.e('$')],
        \   {'on_match': 'Expression'})

This example demonstrates several aspects of VimPEG's API:
  1. Elements that have been 'identified' (using the 'id' attribute) can be referred to in other expressions. 'Integer' is identified in this case and referenced from 'Expression'.
  2. Only root-level elements need to be assigned to a Vim variable. In this case, the 'expression' element is considered to be a root element -- we can directly call on that element now to parse a series of integer add and subtract operations.
  3. Intermediate processing (for evaluations, reductions, lookups, whatever) is achieved through callback functions identified by the 'on_match' attribute.  The 'Expression' rule uses such a callback to iterate the list of add or subtract operations to evaluate their final total value. Here is that callback function:

  function! Expression(args)
    " initialise val with the first integer in the series
    let val = remove(a:args, 0)
    " remaining element of a:args is a list of [ [<+|->, <int>], ... ] pairs
    let args = a:args[0]
    while len(args) > 0
      let pair = remove(args, 0)
      let val = (pair[0] == '+') ? (val + pair[1]) : (val - pair[1])
    endwhile
    return val
  endfunction

The public API interface

  function! EvaluateExpression(str)
    let res = g:expression.match(a:str)
    if res.is_matched
      return res.value
    else
      return res.errmsg
    endif
  endfunction

The res object holds a lot of information about what was actually parsed (and an errmsg  if parsing failed). The value element will contain the cumulative result of all the 'on_match' callbacks as the input was being parsed.

Using it

  echo EvaluateExpression('123')
  echo EvaluateExpression('1 + 2')
  echo EvaluateExpression('1 + 2 + 3')
  echo EvaluateExpression('4 - 5 + 6')
  echo EvaluateExpression('1 - a')

NOTE: The last example there will return the error message: 'Failed to match Sequence at byte 2'. This might seem unexpected -- we might have been hoping for something more meaningful about not expecting an alphabetic character when looking for an integer digit. It's telling us that (after gracefully falling back out of the optional series of add and subtract operations) it can't match '$' (end of line) at byte 2 because a '-' character is in the way.

Not terribly exciting, granted, but hopefully this serves as a reasonable introduction to the VimPEG Parser Generator. What can you do with it? I look forward to seeing weird and wonderful creations and possibilities in Vim now that real parsing tasks are more accessible.

What's Next?

As beautiful (ok, maybe not, but I've seen more hideous interfaces) as VimPEG's API is, she could do with a touch of lipstick. Instead of calling the API directly, it would be nice to be able to declare the rules using the PEG formalism. That's exactly what Raimondi has done in one of his contributions to VimPEG and that's what we'll be talking about in the next article.

In a future article I will show an example of sugar-coating the VimL language to make function declarations both a little easier on the eyes and fingers as well as adding two long-missing features from VimL -- default values in function parameters and inline function declarations, a la if <condition> | something | endif .

colour me a dragon

Wanna learn the colours in Chinese?

Not a visual? Don't wanna just absorb? Got a kino itch? Perhaps colouring-in is more your style:

Get your colour on.

Sunday, August 5, 2012

The 23rd Harm

Although the bible and the Wife User Manual do have a lot in common, it's the little differences that catch you out: An excerpt from the 23rd Harm:
Yea, as I walk through the valley of the shadow of death, I will fear upheaval for thou art with me; Thy rod and thy staff, they come for me. Thou preparest a table before me in the presence of mine mother in law; Thou annoyest my head with toil; My cup runneth empty. 
Just Kidding! :-D ...honey? We're cool, right?

I Do Not Like To Travel, Man

I Do Not Like To Travel, Man
Barry Arthur, 2012

I do not like to ride a train;
To go by bus or take a plane.
Breathing smoke; Sneezing folk;
Whining kids; Locked in grids.
No, no, no! If it's all the same,
I think right here I shall remain. 
The 'net has much for me to see;
Without the need to cross the sea.
I go over, on or under it
While in my comfy chair I sit.
I wouldn't care if it were free;
Travelling is simply not for me. 
I like to be in places strange,
But getting there is such a pain.
If only I could blaze around,
'T'would my desires make unbound:
Lunch in Paris on la Seine;
Cena di Roma; Home by ten. 
Alas right now 'tis but a dream;
To see the world upon a beam.
And instead we're made to take
Wretched, infernal carts of freight.
Until the tech can help me here,
I'll surf the net and sip my beer.
So, thank you for the journey plan, — but,
I do not like to travel, man.
ode to osse

Thursday, July 26, 2012

The Vim of Happiness

Joshu Edits
A vimmer told Joshu: `I have just entered the channel. Please teach me.'
Joshu asked: `Have you completed vimtutor?'
The vimmer replied: `I have completed it.'
Joshu said: `Then you had better start editing.'
At that moment the vimmer was enlightened.
The Vim of Happiness T-Shirt Collection

Tuesday, July 3, 2012

On Skinning Cats and Eating Elephants

All characters appearing in this work are fictitious. Even mine. And certainly rovaednez. I mean, who names their son rovaednez?! How do you even pronounce rovaednez? So, yeah, completely fictitious. Any resemblance to real persons, living or dead, is purely coincidental. Totally.

So, check it:

There we were, chilling in the name of #vim, keeping it proper and prim, filling it up to the brim, chattin' it epic like the brothers Grimm. Then who the fuck should storm in? rovaednez, dressed like a pimp! Gold chains and a fist full of bling, throwin' down smack like a curdled fling.
"Yo, bitchies, I can haz moniez for data fixies! All i gots to do iz kick these glitchies. Then pappy's gonna spend his hard earned richies!  Any you fools can help me stitch it?" 
Flashing a cheerful grin at our brash and fearless kin I blithely bid of him: "Fetch a python to fight this battle, cast a perl among the rabble or send in ruby with her paddle. There's many ways to filch your chattel." 
Alas, but all he did was stand and blink with shoulders slumped and mettle kinked; he mumbled feebly, "In all of these, I stink." "Know you sed and awk?" I cried?  He puled, "These too I have not eyed." 
The crowd gasped as one, each looking at none, hoping our little one was merely poking fun. The thunderous peal of silence shook us from our trance. It was quite without a shred of piety with which I stammered my next enquiry: "What pray you wield when beasts you need to slash?  What craft employed to earn your clients' cash?" 
Expecting this to rile the squawk, I gave a nod for him to talk. I feared I might have crushed his soul but it seemed to only steel his resolve. 
His guile aroused, he sat erect and puffed his cheeks and pitched his chest. Holding his manner coy with hands of fleeting joy, he proceeded to destroy: "What do I use to mash my foes?  What is the tool I lash and throw? How do I thrash my enemies so? I'll tell you, bro! Here's the flash: When inserts I need to hash and indents against the rocks I dash, when code I gnash and jobs I stash, there is for me but one lass: my fair maiden, lady bash!"
Eyes widened; noses bled. Osse wept and innocents fled. If need you munge data dread, use you not your tail and head and myriad filters all unwed. Steer you clear of the hell that is scripting in the shell.  Turn your hand and heart instead towards the light of awk and sed.

From this point on, we turn from the apocryphal tale of rovaednez and his blind devotion to the lady of the shell, to an illustrated walkthrough of a typical problem that is better handled by purpose built tools for these sorts of tasks. In particular, we'll be looking at sed and awk.

The problem


We have a file (let's call it the input_file) with records of the form:


with NO newlines anywhere. None. Not even one at the very end.

Each (...) record looks like this:



The output_file needs to have each (...) record modified to include an absolute prefix in the trailing path:



Ostensibly, the task here is simple and was summed up in the expected output statement above, repeated here for reference:

The output_file needs to have each (...) record modified to include an absolute prefix in the trailing path.

How would you achieve that? Take a few seconds (or minutes, if you need) to consider how you would solve this problem.

Done? Let's take a look at how you went.

  • Did you write any code?
  • Did you build some sample data files to throw at your code?
  • Did you solve the problem?

If you did any and all of those things, jolly for you. They are all fairly reasonable things to do. Doing them at this stage of the game might be a tad premature, but there are worse moves you could have made.

Let's look at an alternative way to approach this problem: start by asking questions and establishing working knowns and assumptions.

Some things to ask:

  • what absolute path prefix needs to be prepended? Is it constant or does it depend on something, like the content of one or more fields in the record?
  • do other records have any bearings on how changes are to be made to the current record?
  • and... how many records are we talking about, anyway?

It turns out, that last question actually packs a bit more punch than it would seem at face value. More on that later.

It's crucial to understand your environment before you start moving around. If you were parachuted into a dark LZ in enemy territory, you'd want to take a few good looks around before you waddled off and stumbled into a booby trap or hostile patrols. Same thing when you're coding. If you run off half-cocked and start banging out gobs of 'solution code' before you even understand the problem, you may as well have stepped on a land mine and lost your left leg and both nuts.  We would do well to abide the old carpenter's adage here: measure twice, cut once.

The Answers

  • It's constant. We'll use the one shown in the Output section: '/an/abs/path/'   (I decided to keep this easy because complicating this part was not the point of the exercise)
  • Thankfully, no. Seriously, if the answer to this is yes, and you're looking at anything more than simple associations or record numbers exceeding a-few-seconds-to-process-all-told then you should almost certainly drop sed & awk and reach for the Big Boy's Shelf and grab yourself a can of Perl, Python or Ruby and make sure you bundle the Database modules to boot. (Other serious language solutions are acceptable here too - not wanting to alienate our lesser spotted brethren out there... you C# and .Net freaks can just stay bloody well ostracised, though.)
  • This is the kicker (and the main focus of the solution space in this article). The input file is 1.2 Gigabytes in size. And remember, there's not a single newline in sight. Stop for a minute and think how your 'solution' above will cope with those numbers now.

Skinning the Cat

Here are some possible ways to tackle the unencumbered (before we knew about the sheer size of the file) problem:

A Simple Sed

A simple sed that ignores any 'record' notion and simply assumes that all strings needing modification include, say, a slash and that no other field ever contains a slash. For small file sizes and if your client confirms that the records do indeed adhere to that assumption, then you're good to go. Using sed on a 1.2 GB file, though, may keep you at your desk a little longer, I'm afraid.
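For the curious, here is a rough sketch of what such a simple sed might look like. The two-record sample and its field layout are my own invention for illustration (the real input_file may well differ), and the file names are hypothetical too. It leans entirely on the stated assumption that only the path field ever contains a slash.

```shell
# ASSUMPTION: hypothetical sample data and record layout, not the real input_file.
workdir=$(mktemp -d)
printf '%s' "(1,'foo',42,'some/rel/file.txt'),(2,'bar',7,'other/dir/pic.png')" \
  > "$workdir/tiny_input"

# Prefix every single-quoted string that contains at least one slash.
# -E selects extended regexes; '#' is the delimiter so the slashes need no escaping.
sed -E "s#'([^']*/[^']*)'#'/an/abs/path/\1'#g" "$workdir/tiny_input" \
  > "$workdir/tiny_output"

cat "$workdir/tiny_output"
```

On a handful of records this is instant. Whether it survives one enormous newline-free line is a different question.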

Testing Aside

It might seem unnecessary to some for me to point this out, but here it is for the greener horned among you: don't use your 1.2 GB file for all your tests. In fact, I tend to have several sized test files:

  • a tiny one with about half a dozen lines
  • a small one with a hundred odd
  • a medium one with a few thousand
  • and the big one

How many intermediate files you keep between the tiny one and the real one is up to you. They let you speed-test your solution and fail sanity tests quickly. If your solution fails to swallow a few thousand records, it ain't gonna eat the million rec bunny, and depending on your solution and where it fails, you might not know that as quickly on the huge file.
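If you don't have sample data lying around, something like the following can fake it. This is only a sketch: make_records is a hypothetical helper, and the record layout it emits is an assumption made up for illustration, so adjust it to match whatever your real records look like.

```shell
# make_records N: write N hypothetical records to stdout, joined by commas,
# with no newline anywhere (not even at the very end).
make_records() {
  awk -v n="$1" -v q="'" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf("%s(%d,%sname%d%s,%d,%sdir%d/file%d.txt%s)",
             (i > 1 ? "," : ""), i, q, i, q, i * 10, q, i, i, q)
  }'
}

testdir=$(mktemp -d)
make_records 6    > "$testdir/tiny"
make_records 100  > "$testdir/small"
make_records 5000 > "$testdir/medium"
```

Bump the count until the generated file is in the same ballpark as the real one and you have your 'big' test file as well.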

Other variations on test files (that I didn't need to cover here) cater for cases where the contents of the records themselves affect the operation of the algorithm.

Size Matters, Even to a Gnu

The GNU sed doco says it can handle any line size your box can malloc(). Given that my box could indeed accommodate that, my tests proved fruitless. The worst fail I saw sed give me was on this expression:

  sed -e 's/),(/),\n(/g'

which returned after only six seconds (when it should have taken at least three times as long as that) with an exit code of zero, and a matching output file size. o_O   Silent Fail.   Bad.

A Brief Excursion on Eating Elephants

Why was I trying the sed statement above? To answer that question, I first have to go back to the make-believe story about rovaednez above.  His bash solution had a nest of loops, each O-ier than the last, and on the innermost level of hell was a sed expression so evil that it made your eyes bleed and left you infertile for a week. Kids, instead of attempting to write one behemoth regex to rule them all, chunk the problem down into bite sized pieces. Just as you wouldn't dislocate your jaw squeezing an elephant's trunk down your gob, nor should you bust a neuron squeezing an oversized regex in your noggin. You'd slice both up and chew on the little pieces, swallowing them at a steady, comfortable and digestible pace.

There is no shame in cutting a bigger operation into smaller, almost trivial sub-steps. In fact, it's the smart thing to do. This takes practise to get good at, especially if you've prided yourself for years on writing strings of dense regexes to prove your manliness.

The sed expression:

  sed -e 's/),(/),\n(/g'

was aiming to introduce newlines so the next chain in the stream could operate in a standard (linux tool philosophy) and familiar way.  Although this didn't work in this particular case because sed choked on the elephant's trunk, this is still a good lesson to take away: chunk your problem space down into bite-sized pieces. Rethink the problem and ask yourself, would I look at this differently if the input data was shaped a little differently? And better, what would the input data look like to make this a trivial solve?
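To make the chunking idea concrete, here is a sketch of the sort of pipeline I had in mind, run against a tiny made-up sample. The sample data and file names are hypothetical, and the \n in the sed replacement assumes GNU sed. Once every record sits on its own line, the middle stage becomes a boring line-at-a-time edit, and the last stage glues the records back together.

```shell
chunkdir=$(mktemp -d)
# ASSUMPTION: hypothetical two-record sample, not the real data.
printf '%s' "(1,'foo',42,'a/b.txt'),(2,'bar',7,'c/d.png')" > "$chunkdir/in"

# Stage 1: one record per line (GNU sed understands \n in the replacement).
# Stage 2: plain line-oriented awk; field 4 holds the quoted path.
# Stage 3: remove the newlines again to restore the original shape.
sed -e 's/),(/),\n(/g' "$chunkdir/in" |
  awk -F ',' -v OFS=',' -v q="'" '{ sub(q, q "/an/abs/path/", $4); print }' |
  tr -d '\n' > "$chunkdir/out"

cat "$chunkdir/out"
```

Each stage is trivial to test by itself, which is rather the point.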

So much for sed. Let's move on.

The Evolution of a Sysop's Tool

When a sysop needs a tool to solve a common task she does on a regular basis, it almost invariably follows an evolution similar to this:

  1. scribble something quickly in shell functions and aliases.
  2. turn them into a bona-fide shell script.
  3. re-write the whole thing in sed & awk.
  4. scrap that and do it all again in perl (or ruby/python/yourscriptinglanguageofchoice)

Well, according to that, we're up to the awk stage of evolution here, so in we jump.

Awk really is a handy little tool. It's centuries old now which you might think renders it useless but that is simply not the case. This little powerhouse belongs in the same group of timeless tools that our beloved vim hangs out in.

The Final Awk Code

  awk -v FS=',' -v RS='\\),\\(' -v ORS='),(' -v OFS=',' \
      -v q="'" -v abspath="'/an/abs/path/" \
      '{sub(q, abspath, $4); printf("%s%s", NR>1 ? ORS : "", $0)}' \
      input_file > output_file

A brief walkthrough

Awk uses some all-caps internal variables to control its runtime behaviour. The ones I'm using are:

  • FS = input field separator
  • OFS = output field separator
  • RS = record separator
  • ORS = output record separator
  • NR = current record number as we're moving through the input files. Some awk guides refer to this erroneously as the current line number, but that is only true when using RS="\n" (which is the default).

Records and fields are split according to these settings and provided to the awk inner loop (the '{sub(q, abspath, $4) ; printf("%s%s", NR>1 ? ORS : "", $0)}' part) in the form $1 - $4 (for each of the 4 fields) and $0 for the whole record. Of course, you get more $n vars for having more fields in your input records.
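If you want to see the splitting for yourself, a throwaway script along these lines prints what awk hands to the inner loop. The sample data is a hypothetical stand-in for the real file, and a regex RS assumes an awk that supports one (gawk and mawk do).

```shell
splitdir=$(mktemp -d)
# ASSUMPTION: hypothetical two-record sample, not the real data.
printf '%s' "(1,'foo',42,'a/b.txt'),(2,'bar',7,'c/d.png')" > "$splitdir/sample"

# Report the record number, the field count, and the first and fourth fields.
split_report=$(awk -v FS=',' -v RS='\\),\\(' \
  '{ printf("record %d: NF=%d first=[%s] last=[%s]\n", NR, NF, $1, $4) }' \
  "$splitdir/sample")
echo "$split_report"
```

Notice that the opening ( of the file stays glued to the first field of the first record and the closing ) stays on the last field of the last record. The separator only eats the ),( between records, so those two strays pass through untouched.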

I am using two variables:

  • q = '
  • abspath = /an/abs/path/

Separating the single quote value of q out into a separate variable is a common awk idiom to avoid complex quote escaping needed to bury ' within " within ' within... yeah.

I don't feel I need to explain the input_file > output_file part. See man awk and shell redirection.

So that leaves the meat:

  '{sub(q, abspath, $4) ; printf("%s%s", NR>1 ? ORS : "", $0)}'

Read this as:

  • substitute the single-quote at the start of field 4 with the abspath.
  • if this is the first line in the file (NR == 1 and therefore *not* greater than 1) do not print the ORS, but do print the first (now modified by that substitute) record.
  • for all other lines of the file, print the ORS and then the (now modified) record.

This printf() trick with prepending the ORS to all but the first line is one of awk's ways of altering the behaviour of the last line in the file. No, don't email me. Don't rush down to the comments area to tell me I flubbed this one. I meant what I wrote. Awk has a hard time knowing how many records your file has before it's read them all. There are other solutions usually involving you telling awk how many records it has to process, which is quite frankly, for the birds.  This approach 'appends' the ORS to every line in the file *except* the last line. Think about it.
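Stripped of everything else, the trick looks like this. It's a minimal sketch using a made-up three-line input and a ", " separator standing in for the ORS:

```shell
# Print the separator before every record except the first,
# which leaves no trailing separator after the last one.
joined=$(printf 'a\nb\nc\n' | awk '{ printf("%s%s", (NR > 1 ? ", " : ""), $0) }')
echo "$joined"
```

The separator lands between the records and nowhere else, without awk ever needing to know which record is the last.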

See the online tutorials and the awk man page for a deeper explanation of bits I didn't cover well enough for you.

Note: Senior Orcs will advise you that   -F ','   and   -v q=\'   are more idiomatic than the respective:   -v FS=','   and   -v q="'"   . I used the more verbose and regular forms here for clarity.

I would throw some links here to awk learning resources, tutorials and reference material but I really don't need to. Google has your back there. There's a metric fuck ton of awk goodies just waiting for you to read them. So, off you pop and learn you an awk of great utility.


My apologies to e36freak on #awk for forgetting to thank him for looking over my awk script. He pointed out that I should use $NF instead of $4. NF is another special awk internal variable that holds the number of fields in the current record, so $NF always refers to the last field. Remembering that we set the field separator to a single comma, given the format of the records shown, it's probably reasonable to assume we'll always therefore get four fields. Assumption is the mother of all fuckups though and if any field were to contain a comma then $4 will refer to the wrong field in our expression. Knowing that the field we always want to change is the final field, we can make our code a little more robust by using $NF instead. This is not foolproof, though. If the last field (a path) happened to contain a comma, we're dead in the water again. Mistakes like these should ideally be checked for in testing phases. It might be infeasible to test a full 1.2 GB data file to find a possible tweet in the stream.

An example of sanity checking your data file to ensure no record has more than four fields:

  awk -v RS='\\),\\(' -v FS=',' 'NF > 4 {print NR, $0}' input_file

Any records having more than four fields (three commas) will be printed out, preceded by their record number within the file.
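Here is a sketch of both ideas working together on a deliberately nasty sample. The data is hypothetical: I've planted a comma inside the second field of the first record. The sanity check should flag that record, and the $NF variant of the script should still patch the right field.

```shell
nfdir=$(mktemp -d)
# ASSUMPTION: hypothetical data; field 2 of record 1 contains a comma.
printf '%s' "(1,'foo, inc',42,'a/b.txt'),(2,'bar',7,'c/d.png')" > "$nfdir/input_file"

# The sanity check reports record 1, which splits into five fields.
awk -v RS='\\),\\(' -v FS=',' 'NF > 4 {print NR, $0}' "$nfdir/input_file"

# Same script as before, but targeting the last field rather than the fourth.
awk -v FS=',' -v RS='\\),\\(' -v ORS='),(' -v OFS=',' \
    -v q="'" -v abspath="'/an/abs/path/" \
    '{sub(q, abspath, $NF); printf("%s%s", NR>1 ? ORS : "", $0)}' \
    "$nfdir/input_file" > "$nfdir/output_file"

cat "$nfdir/output_file"
```

With $4 in place of $NF, the first record would have had its prefix bolted onto the 42 field's neighbour instead of the path, which is exactly the sort of thing the sanity check is there to warn you about.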