Hello,
I'm a web-scraping enthusiast, and I write short one-liners in bash using sed, awk, perl, grep, tail, head, tr and similar tools. Here's a really cool perl one-liner that extracts the text content of any XML (or HTML) tag. You should try it. Can you make it shorter, or more powerful?
Cheers,
Guillaume

curl http://www.cnn.com | perl -ne 'm/>([^<].*?[^>])<\// && print$1."\n"'

Refactorings

V

November 13, 2007 22:57

I didn't test it, so this could be wrong.

curl http://www.cnn.com | perl -ne 'm/>([^<>]*?)<\// && print$1."\n"'

griflet

November 20, 2007 18:00

Tested. Works. I also added a sed command to remove blank lines. Anyone care to fold that into the perl one-liner, for sport?

curl http://www.cnn.com | perl -ne 'm/>([^<>]*?)<\// && print$1."\n"' | sed -e '/^$/d'

pascal.charest

February 5, 2008 21:19

Here is another version.

The curl -s flag enables silent mode, so no progress meter shows up in your terminal output.

Using +? instead of *? removes a lot of the empty lines that were matched by consecutive tags.

curl -s http://www.cnn.com | perl -ne 'm/>([^<>]+?)<\// && print$1."\n"' 