Regular expressions and Solaris 8

Chuck Yerkes chuck+baylisa at snew.com
Wed Jul 21 17:41:49 PDT 2004


I'll point to the egrep/grep in /usr/xpg4/bin/
and perhaps /usr/sfw/bin/ (Sun freeware).

Sun's grep and diff packages are lovely and should be kept.
In a museum.  On any Solaris machine I control, these are two
packages that quickly get replaced with something from the past
decade.

(Note also that Sun's egrep is about 10x faster than Sun's grep,
so I just "alias grep=egrep".)

So I'd just offer that trying a proper grep (BSD's if it will
compile, GNU-grep which will compile) might make you happier.

Now anyone here work at Sun wanna talk to them about "doing an
Apple" and perhaps taking a lot of the userland apps from BSD?
whoish, diff, grep, REAL z-utils (zgrep, zcat).  OpenBSD has
a BSD licenced diff and non-FSF gzip and friends (libz had a BSD
ok license and all the routines were there waiting...).

I tire of working around Solaris' tools.

But have I mentioned that NetBSD's www.pkgsrc.org stuff makes
me happier on Solaris? (and AIX and MacOS and...)

Quoting David Wolfskill (david at catwhisker.org):
> As (some of) you may recall, in my role as postmaster at baylisa.org I make
> use of a couple of different approaches to try to squelch spam at
> BayLISA's MTA.
> 
> One of those approaches is a content filter that uses regular
> expressions.  The bulk of the specification I use for it is intended to
> look for certain "spamvertized" domains.  (The census of these is now at
> about 3975.)
> 
> Thus, a typical regex deployed for this use looked like
> 
> 	`([^-0-9a-z]|([=%]2[ef]))2LD(=2E|\.)TLD`ie
> 
> where:
> * the ` are the delimiters -- I didn't use / because sometimes I specify
>   more of a URL, and they often have / characters in them.
> 
> * "2LD" is the second-level domain
> 
> * "TLD" is the top-level domain
> 
> * "ie" (after the closing delimiter) denotes case-insensitive matching
>   and extended regular expression syntax.
> 
> 
> Well, this morning, I received a spam that mentioned a known
> spamvertized domain.  On looking at the spam a bit more closely, I saw
> that the domain name in question was left-anchored on the line; thus,
> the above regex would not match (because it's looking for some sort of
> delimiter to the left of the domain name).
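
The failure described above can be reproduced outside the milter. This is
only a rough sketch in Python: "example" and "com" are hypothetical
stand-ins for the 2LD and TLD placeholders, and Python's re.IGNORECASE
stands in for the "i" flag. The milter's own regex library may behave
differently.

```python
import re

# Hypothetical stand-ins for the "2LD" and "TLD" placeholders.
SLD, TLD = "example", "com"

# Same shape as the quoted pattern: something must precede the domain,
# either a non-hostname character or an encoded dot/slash (=2E, %2F, ...).
pattern = re.compile(
    r"([^-0-9a-z]|([=%]2[ef]))" + SLD + r"(=2E|\.)" + TLD,
    re.IGNORECASE,
)

def hits(line):
    """Return True if the pattern matches anywhere in the line."""
    return pattern.search(line) is not None

# Domain preceded by a delimiter (the "." after "www"): matches.
print(hits("visit http://www.example.com/ now"))

# Domain at the very start of the line: there is no character for the
# leading group to consume, so the pattern cannot match.
print(hits("example.com/ has what you need"))
```

The second call is the left-anchored case: the leading group always has to
consume at least one character, and at the start of a line there is none.
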
> 
> So I poked around in Jeffrey Friedl's _Mastering Regular Expressions_
> and found that the construct "\<" may be used to serve as a "left
> word-anchor" ... in some regular expression implementations.
> 
> I then tried using "egrep" on one of my FreeBSD boxen (running the same
> flavor of FreeBSD as my home firewall/MTA) and found that a regex of the
> form
> 
> 	`\<2LD(=2E|\.)TLD`ie
> 
> fed to egrep appeared to work.
> 
> Then I got a little more adventurous:  some spammers like to use encoding
> constructs for the URLs; I tried
> 
> 	`(\<|([=%]2[ef]))2LD(=2E|\.)TLD`ie
> 
> and that appeared to work very nicely.
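
A rough illustration of why the encoded alternative is still needed next to
the word anchor. Python's re module has no "\<", so this sketch assumes "\b"
as the nearest equivalent (directly before a letter it can only match at the
start of a word). "example" and "com" are again hypothetical stand-ins for
2LD and TLD.

```python
import re

# Hypothetical stand-ins for the "2LD" and "TLD" placeholders.
SLD, TLD = "example", "com"
TAIL = SLD + r"(=2E|\.)" + TLD

# Word anchor only ("\b" is assumed here as the analog of "\<").
anchored = re.compile(r"\b" + TAIL, re.IGNORECASE)

# Word anchor OR an encoded dot/slash in front of the domain.
with_encoding = re.compile(r"(\b|([=%]2[ef]))" + TAIL, re.IGNORECASE)

left_anchored = "example.com has what you need"
encoded = "http://www=2Eexample=2Ecom/"

# The word anchor handles a domain at the start of the line.
print(bool(anchored.search(left_anchored)))

# After "=2E" the preceding character is "E", a word character, so there
# is no word boundary in front of "example" and the anchor alone fails.
print(bool(anchored.search(encoded)))

# The encoded alternative consumes "=2E" and lets the match proceed.
print(bool(with_encoding.search(encoded)))
```
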
> 
> (The next step, assuming all works OK, is to use
> 
> 	`(\<|([=%]2[ef]))2LD(=2E|\.)TLD\>`ie
> 
> though that's not really foolproof.)

> However, when I tried the same egrep test on the BayLISA machine, it
> failed to find the lines in question -- so I thought that maybe Solaris
> 8 didn't have support for \< and \> in its regex library.
> 
> But the regexp(5) man page seems to indicate that the construct is
> recognized.
> 
> Anyone have any clue whether this ought to work or not?  (Note that the
> application is a "milter," not egrep per se.)


