Expanding the include/exclude options

Donovan Baarda [email protected]
Fri, 29 Mar 2002 21:21:44 +1100


--liOOAslEiF7prFVr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Howdy, 

me again...

I posted that I had rsync include/exclude list code available for
this. I think the rsync method is perfect for this, and see no reason to
re-invent something else.

While fiddling around implementing this, I've hit and solved various
issues along the way.

Someone suggested forgetting about include/exclude lists, and just using
a list of files from something like 'find'. This is fine, except when you
want to selectively restore files, you can't use 'find' to search through a
backed-up file list, so your "select for backup" tool ends up different to
your "select for restore" tool. This can be a pain.

So what you need is something that can do two things: efficiently scan
directories for matches, and quickly find matches from a list of backed-up
files. It is nice if you have a method of specifying include/exclude lists
that can differentiate between files and directories, and is well suited to
specifying file paths.
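
The point about using one selection mechanism for both jobs can be sketched
roughly as below. This is only an illustration in Python 3:
select_from_tree, select_from_list and the wanted predicate are hypothetical
names, not part of the attached code, and the predicate stands in for
whatever include/exclude matcher ends up being used.

```python
import os

def select_from_list(names, wanted):
    """Pick entries out of a stored list of backed-up paths."""
    return [n for n in names if wanted(n)]

def select_from_tree(root, wanted):
    """Walk a directory tree and collect relative file paths accepted by wanted()."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, '/')   # compare paths with '/' everywhere
            if wanted(rel):
                found.append(rel)
    return sorted(found)
```

Because both functions take the same predicate, the rule that chose files at
backup time can be reused unchanged to choose files from the backed-up list
at restore time.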

Additionally, there are policy choices: whether to include or exclude by
default, whether directories must be explicitly included for their contents
to be included, and whether the directories of included files are implicitly
included. These have subtle implications for how everything works.

For example, "--include /home/*/Mail/** --exclude **". If you require
directories to be explicitly included, then this will match nothing because
"/home" is excluded. If you implicitly include dirs, then this will match
"/home/", '/home/*', '/home/*/Mail/', and anything in "/home/*/Mail/". If you
don't implicitly include dirs and don't require dirs to be explicitly
included, this will only match everything in "/home/*/Mail/". Note that
actually _finding_ the files that match these different criteria may require
you to scan through directories that are not included, just because their
contents might be.
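
A minimal Python 3 sketch of the three policies, applied to a fixed list of
paths rather than a real filesystem. Everything here is an assumption made
for illustration: the helper names (pattern_regex, selected, ancestors,
explicit, implicit, direct) are invented, only '*' and '**' are handled, '/'
is hard-coded as the separator, and the first matching rule wins with include
as the default.

```python
import re

def pattern_regex(pat):
    # '**' matches anything, '*' matches anything except '/'; the rest is literal
    parts = re.split(r'(\*\*|\*)', pat)
    return ''.join('.*' if p == '**' else '[^/]*' if p == '*' else re.escape(p)
                   for p in parts)

def selected(path, rules):
    """First matching rule wins; unmatched paths are included by default."""
    for sign, pat in rules:
        if re.fullmatch(pattern_regex(pat), path):
            return sign == '+'
    return True

def ancestors(path):
    """'/home/abo/Mail/inbox' -> ['/home/', '/home/abo/', '/home/abo/Mail/']"""
    parts = path.rstrip('/').split('/')[1:-1]
    return ['/' + '/'.join(parts[:i + 1]) + '/' for i in range(len(parts))]

def explicit(paths, rules):
    # a path counts only if it and every ancestor directory are included
    return [p for p in paths
            if selected(p, rules) and all(selected(a, rules) for a in ancestors(p))]

def direct(paths, rules):
    # directories are neither required nor implied; each path stands alone
    return [p for p in paths if selected(p, rules)]

def implicit(paths, rules):
    # directly selected paths drag their ancestor directories in with them
    keep = set()
    for p in direct(paths, rules):
        keep.add(p)
        keep.update(ancestors(p))
    return [p for p in paths if p in keep]

rules = [('+', '/home/*/Mail/**'), ('-', '**')]
paths = ['/home/', '/home/abo/', '/home/abo/Mail/', '/home/abo/Mail/inbox',
         '/home/abo/notes.txt', '/etc/', '/etc/passwd']
```

With these rules, explicit(paths, rules) returns nothing because "/home/" is
excluded, implicit(paths, rules) returns the Mail contents plus "/home/",
"/home/abo/" and "/home/abo/Mail/", and direct(paths, rules) returns only what
the include pattern matches. In this sketch '**' may match the empty string,
so "/home/abo/Mail/" itself counts as a direct match.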

rsync's extended unix-wildcard syntax is nice: directories end in '/', '*'
matches anything except '/', '?' matches any single character except '/',
'**' matches anything including '/', and '??' matches any single character
including '/'. rsync has taken the "require directories to be explicitly
included" approach, which means you need to do things like "--include /home/
--include /home/*/ --include /home/*/Mail/ --include /home/*/Mail/**
--exclude **" to get what you really wanted from the above example. It also
allows a sort of shorthand where anything without a '/' matches a filename
with any directory prefix.
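
As a rough illustration of the wildcard rules described above, the following
Python 3 sketch translates a pattern into a regular expression. The names
to_regex and matches are made up for this example, '/' is hard-coded as the
separator, and character classes and the no-'/' shorthand are left out; it is
not the attached efnmatch.py.

```python
import re

def to_regex(pat):
    """Translate an extended wildcard pattern into a regex string."""
    out = []
    i = 0
    while i < len(pat):
        two = pat[i:i + 2]
        if two == '**':          # any run of characters, '/' included
            out.append('.*')
            i += 2
        elif two == '??':        # any single character, '/' included
            out.append('.')
            i += 2
        elif pat[i] == '*':      # any run of characters without '/'
            out.append('[^/]*')
            i += 1
        elif pat[i] == '?':      # any single character except '/'
            out.append('[^/]')
            i += 1
        else:                    # everything else is matched literally
            out.append(re.escape(pat[i]))
            i += 1
    return ''.join(out)

def matches(path, pat):
    """True if the whole path matches the pattern."""
    return re.fullmatch(to_regex(pat), path) is not None
```

For example, matches('/home/abo/Mail/inbox', '/home/*/Mail/**') is true, while
the same pattern does not match '/home/abo/src/Mail/inbox' because '*' cannot
cross a '/'.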

I have code to do all of the above. The extended unix-wildcard "efnmatch.py"
is complete and attached. The include/exclude list matching and directory
scanning code is complete, but is different from rsync in that it takes the
"don't require directories to be included, don't implicitly include them"
approach. I was going to expand this to handle all of the above before I
posted them, but I thought I'd better post it now before someone re-invents
something worse :-). I'll post a "Usage" blurb + help info to anyone that
asks. I'll also update it to do pretty much anything you want.

The future of this code is up in the air. I would like to maintain it and make
it publicly available under the GPL. I have a few small Python projects on
freshmeat that I maintain this way, but this one is so small I'd feel
embarrassed creating a project out of it. Any suggestions as to the best way
to support and advertise this code are welcome :-)

-- 
----------------------------------------------------------------------
ABO: finger [email protected] for more info, including pgp key
----------------------------------------------------------------------

--liOOAslEiF7prFVr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="efnmatch.py"

"""Filename matching with extended shell patterns.

efnmatch(FILENAME, PATTERN) matches according to the local convention.
efnmatchcase(FILENAME, PATTERN) always takes case into account.

The functions operate by translating the pattern into a regular
expression.  They cache the compiled regular expressions for speed.

The function translate(PATTERN) returns a regular expression
corresponding to PATTERN.  (It does not compile it.)
"""

import re

_cache = {}

def efnmatch(name, pat):
    """Test whether FILENAME matches PATTERN.

    Patterns are an extended Unix shell style:

    **      matches everything including os.sep
    *       matches everything except os.sep
    ?       matches any single character except os.sep
    ??      matches any single character including os.sep
    [seq]   matches any character in seq
    [!seq]  matches any char not in seq

    An initial period in FILENAME is not special.
    Both FILENAME and PATTERN are first case-normalized
    if the operating system requires it.
    If you don't want this, use efnmatchcase(FILENAME, PATTERN).
    """
    import os
    name = os.path.normcase(name)
    pat = os.path.normcase(pat)
    return efnmatchcase(name, pat)

def efnmatchcase(name, pat):
    """Test whether FILENAME matches PATTERN, including case.

    This is a version of efnmatch() which doesn't case-normalize
    its arguments.
    """
    if pat not in _cache:
        res = translate(pat)
        _cache[pat] = re.compile(res)
    return _cache[pat].match(name) is not None

def translate(pat,sep=None):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """
    import os,string
    if not sep: sep=os.sep
    sep=re.escape(sep)

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c,s = pat[i],pat[i:i+2]
        i = i+1
        if s == '**':
            res = res + '.*'
            i = i + 1
        elif c == '*':
            res = res + '[^' + sep + ']*'
        elif s == '??':
            res = res + '.'
            i=i+1
        elif c == '?':
            res = res + '[^' + sep + ']'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\','\\\\')
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = res + '[' + stuff + ']'
        else:
            res = res + re.escape(c)
    return res + "$"

--liOOAslEiF7prFVr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="dirscan.py"

#!/usr/bin/env python
"""
Selective directory scanning

Selects is a list of extended unix filename wildcards prefixed by + (include)
or - (exclude). This is processed in order, and the first match is used to
include or exclude a file. The default is to include if no match is made.
Directories will only match selects ending in '/'.

When scanning directories, selects are used to prune directories when
possible. A directory is not pruned unless all possible files and directories
within it are excluded. This means "-home/,+**.c" will not prune a directory
"home/" (but the directory itself will be excluded), since it is possible that
there could be a file ending in '.c' included somewhere within it. It is
dangerously easy to use selects that do not allow any directories to be
pruned. Any entry prefixed by '+**' will force scanning of all directories not
pruned by an entry earlier in the list. The safest way to explicitly prune a
directory is with a "-<directory>**" before any includes.

Thoughts

Implicit vs explicit inclusion/pruning;

a) files can be included, irrespective of whether their parent directories are.
b) any excluded subdirectory implies excluding all files within it

a) requires complex piece-by-piece pattern compares to see if a dir can be
pruned without excluding any included files. This makes filematching easy, but
pruning hard.
b) requires piece-by-piece filename compares to see if a file is excluded
because of directory pruning. This makes matching hard, but pruning easy.

Of the two, b) is probably easier because piece-by-piece processing of proper
filenames is easier than piece-by-piece processing of extended unix wildcards.
However, one may be more intuitive and/or flexible than the other. It is
possible to exclude parent directories of included files using a), but not b).

+**.c,-** will find all *.c files using a), but b) will only include *.c files
in the start directory, due to all directories being pruned.

+**/,+**.c,-** will perform the above for b), but will also include all
directories.

default include vs default exclude...
should the default be to include or exclude?

"""
import sys,os,re
from stat import *
from efnmatch import efnmatch

# regexes that match os.sep
exclpat=r"\[![^\%s][^]\%s]*\]" % (os.sep,os.sep)        # an exclude
inclpat=r"\[(?:[^!][^]]*)?\%s[^]]*\]" % os.sep          # an include
eseppat=r"\*\*|\?\?|" + exclpat + '|' + inclpat         # any type of wildcard
esplitpat="^(.*)("+ eseppat + ")(.*?)$"                 # esplit regex
esplitre=re.compile(esplitpat)

def esplit(pat):
    """ Performs an os.path.split() operation on an extended shell pattern
        returns <path>,<sep>,<name> because <sep> can be a variety of wildcards.
    """
    seppos = pat.rfind(os.sep)
    match = esplitre.match(pat)
    if match and match.end(2) > seppos:
        return match.group(1,2,3)
    elif seppos>=0:
        return pat[:seppos],os.sep,pat[seppos+1:]
    else:
        return "","",pat
    
def edirname(pat):
    """ Performs an os.path.dirname() operation on extended shell patterns"""
    return esplit(pat)[0]    

def ebasename(pat):
    """ Performs an os.path.basename() operation on extended shell patterns"""
    # esplit() returns a 3-tuple (path, sep, name); the name is at index 2
    return esplit(pat)[2]

def filematch(path,selects):
    """tests if a file should be included when scanning"""
    for pat in selects:
        if efnmatch(path,pat[1:]):
            return pat[0]=='+'
    #default is include
    return 1

def prunematch(path,selects):
    """tests if a path can be pruned when scanning directories"""
    for pat in selects:
        sig,pat=pat[0],pat[1:]
        # if path matches a wildcard ended pattern, prune depends on sig
        if pat[-2:]=='**' and efnmatch(path,pat):
            return sig=='-'
        # if path matches part of an include, it cannot be pruned
        elif sig=='+':
            pat,sep,fil=esplit(pat)
            if fil=="": sep=""
            while pat+sep:
                if efnmatch(path,pat+sep):
                    return 0
                pat,sep,fil=esplit(pat)
    #default is don't prune
    return 0

def __scan__((filelist,selects,match,prune),dirname,names):
    """Used by scan as the method for os.path.walk"""
    for n in names[:]:
        file=os.path.join(dirname,n)[2:]
        if os.path.isdir(file):
            file=os.path.join(file,"")
        if match(file,selects):
            filelist.append(file)
        if os.path.isdir(file) and prune(file,selects):
            names.remove(n)
        
def scan(startdir='.',selects=[],match=filematch,prune=prunematch):
    """Adds files to an index by scanning directories"""
    olddir=os.getcwd()
    os.chdir(startdir)
    filelist=[]
    if match('.',selects):
        filelist.append('.')
    os.path.walk('.',__scan__,(filelist,selects,match,prune))
    os.chdir(olddir)
    return filelist

def filematch2(path,selects):
    """tests if a path matches multiple select lists"""
    for s in selects:
        if not filematch(path,s):
            return 0
    return 1

def prunematch2(path,selects):
    """tests if a path can be pruned for multiple select lists"""
    for s in selects:
        if prunematch(path,s):
            return 1
    return 0

def scan2(startdir='.',selects=[[]]):
    """Scans directories using multiple select lists"""
    return scan(startdir,selects,filematch2,prunematch2)
    

--liOOAslEiF7prFVr--