Friday, May 6, 2011

Writing/parsing a fixed width file using Python

I'm a newbie to Python and I'm looking at using it to write some hairy EDI stuff that our supplier requires.

Basically they need an 80-character fixed width text file, with certain "chunks" of the field filled with data and others left blank. I have the documentation so I know what the length of each "chunk" is. The response that I get back is easier to parse since it will already have data and I can use Python's "slices" to extract what I need, but I can't assign to a slice - I tried that already because it sounded like a good solution, and it didn't work since Python strings are immutable :)

Like I said I'm really a newbie to Python but I'm excited about learning it :) How would I go about doing this? Ideally I'd want to be able to say that range 10-20 is equal to "Foo" and have it be the string "Foo" with 7 additional whitespace characters (assuming said field has a length of 10) and have that be a part of the larger 80-character field, but I'm not sure how to do what I'm thinking.

From stackoverflow
  • It's a little difficult to parse your question, but I'm gathering that you are receiving a file or file-like object, reading it, and replacing some of the values with the results of some business logic. Is this correct?

    The simplest way to overcome string immutability is to write a new string:

    # Won't work:
    test_string[3:6] = "foo"
    
    # Will work:
    test_string = test_string[:3] + "foo" + test_string[6:]
    

    Having said that, it sounds like it's important to you that you do something with this string, but I'm not sure exactly what that is. Are you writing it back to an output file, trying to edit a file in place, or something else? I bring this up because the act of creating a new string (which happens to have the same variable name as the old string) should emphasize the necessity of performing an explicit write operation after the transformation.
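
If the goal is to read records, change a field, and write the result back out, a minimal sketch might look like the following. The helper name `patch_line`, the file names, the sample data, and the field position (columns 10 to 20) are all assumptions made up for illustration.

```python
import os
import tempfile


def patch_line(line, start, end, value):
    """Return a copy of line with line[start:end] replaced by value.

    value is padded with spaces (or truncated) to exactly end - start
    characters, so the overall line length never changes.
    """
    width = end - start
    return line[:start] + value.ljust(width)[:width] + line[end:]


# Hypothetical input file holding two 80-character records.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "request.txt")
out_path = os.path.join(workdir, "patched.txt")
with open(in_path, "w") as f:
    f.write("A" * 80 + "\n" + "B" * 80 + "\n")

with open(in_path) as source, open(out_path, "w") as target:
    for raw in source:
        record = patch_line(raw.rstrip("\n"), 10, 20, "Foo")
        # Nothing changes on disk until the new string is written out.
        target.write(record + "\n")

with open(out_path) as f:
    lines = f.read().splitlines()
```

Writing to a second file rather than editing in place keeps the original intact if something goes wrong halfway through.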

  • You don't need to assign to slices, just build the string using % formatting.

    An example with a fixed format for 3 data items:

    >>> fmt="%4s%10s%10s"
    >>> fmt % (1,"ONE",2)
    '   1       ONE         2'
    >>>
    

    Same thing, field width supplied with the data:

    >>> fmt2 = "%*s%*s%*s"
    >>> fmt2 % (4,1, 10,"ONE", 10,2)
    '   1       ONE         2'
    >>>
    

    Separating data and field widths, and using zip() and str.join() tricks:

    >>> widths=(4,10,10)
    >>> items=(1,"ONE",2)
    >>> "".join("%*s" % i for i in zip(widths, items))
    '   1       ONE         2'
    >>>
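
The formats above right-justify each value. For the layout described in the question (text followed by trailing spaces), a leading `-` flag left-justifies, and a precision cuts off values that are too long. A small sketch, with made-up field widths and values chosen only so that they add up to 80 columns:

```python
# Hypothetical layout: four fields whose widths add up to 80 columns.
widths = (10, 10, 30, 30)
items = ("", "Foo", "bar", "baz")

# "%-*.*s" consumes (width, precision, value): "-" left-justifies,
# the width pads with spaces, the precision truncates over-long text.
record = "".join("%-*.*s" % (w, w, item) for w, item in zip(widths, items))
```

Here `record[10:20]` is "Foo" followed by seven spaces, and the whole record is 80 characters long.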
    
  • Hopefully I understand what you're looking for: some way to conveniently identify each part of the line by a simple variable, but output it padded to the correct width?

    The snippet below may give you what you want:

    class FixWidthFieldLine(object):
    
        fields = (('foo', 10),
                  ('bar', 30),
                  ('ooga', 30),
                  ('booga', 10))
    
        def __init__(self):
            self.foo = ''
            self.bar = ''
            self.ooga = ''
            self.booga = ''
    
        def __str__(self):
            return ''.join([getattr(self, field_name).ljust(width) 
                            for field_name, width in self.fields])
    
    f = FixWidthFieldLine()
    f.foo = 'hi'
    f.bar = 'joe'
    f.ooga = 'howya'
    f.booga = 'doin?'
    
    print(f)
    

    This yields:

    hi        joe                           howya                         doin?
    

    It works by storing a class-level variable, fields, which records the order in which each field should appear in the output, together with the number of columns that field should have. There are correspondingly named instance variables in the __init__ that are set to an empty string initially.

    The __str__ method outputs these values as a string. It uses a list comprehension over the class-level fields attribute, looking up the instance value for each field by name, and then left-justifying its output according to the columns. The resulting list of fields is then joined together by an empty string.

    Note this doesn't parse input, though you could easily override the constructor to take a string and parse the columns according to the field and field widths in fields. It also doesn't check for instance values that are longer than their allotted width.
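
One way the two missing pieces (parsing and a length check) could be added is sketched below. The class name `FixedWidthLine` is hypothetical, the layout is the same made-up one as above, and the sketch assumes trailing spaces in a field are padding rather than data.

```python
class FixedWidthLine(object):
    # (name, width) pairs in output order; same made-up layout as above.
    fields = (("foo", 10), ("bar", 30), ("ooga", 30), ("booga", 10))

    def __init__(self, line=""):
        # Walk the layout, slicing one column range per field.
        # An empty line leaves every field as an empty string.
        offset = 0
        for name, width in self.fields:
            setattr(self, name, line[offset:offset + width].rstrip())
            offset += width

    def __str__(self):
        parts = []
        for name, width in self.fields:
            value = getattr(self, name)
            if len(value) > width:
                raise ValueError("%s is longer than %d characters" % (name, width))
            parts.append(value.ljust(width))
        return "".join(parts)


f = FixedWidthLine()
f.foo = "hi"
f.booga = "doin?"
text = str(f)

# Round trip: parsing the output gives the same field values back.
parsed = FixedWidthLine(text)
```

Raising on an over-long value is a design choice; silently truncating is the alternative, but in an EDI file that could hide bad data.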

  • You can convert the string to a list and do the slice manipulation.

    >>> text = list("some text")
    >>> text[0:4] = list("fine")
    >>> text
    ['f', 'i', 'n', 'e', ' ', 't', 'e', 'x', 't']
    >>> text[0:4] = list("all")
    >>> text
    ['a', 'l', 'l', ' ', 't', 'e', 'x', 't']
    >>> "".join(text)
    'all text'
    
    S.Lott : Interesting. You don't need to convert to a list to extract. That's silly. But building a list and then collapsing to a string... it gives you what looks a little bit like a "mutable string" -- only if you pre-allocate enough space.
    Skurmedel : Actually you don't need to preallocate anything if you don't care all too much about performance. The list type will automatically allocate more space if the range sliced is assigned a bigger range.
    Skurmedel : Also the list-conversion is there for clarity. Of course it might be better if he read data straight into a list from the beginning, but that's not what I wanted to show.
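
Applied to the 80-character record from the question, the list approach might look like this sketch. The column range 10 to 20 and the value "Foo" come from the question; everything else is an assumption.

```python
# Start from an 80-character blank record held as a list of characters.
record = [" "] * 80

# Pad the value to the slice width first: a shorter or longer
# replacement would change the list length and shift later columns.
record[10:20] = "Foo".ljust(10)

# Collapse the list back into an ordinary string.
line = "".join(record)
```

Assigning a string to a list slice works because the string is iterated one character at a time.
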
  • It is easy to write a function to "modify" a string.

    def change(string, start, end, what):
        # Pad (or truncate) the replacement so it exactly fills string[start:end].
        length = end - start
        what = what.ljust(length)[:length]
        return string[:start] + what + string[end:]
    

    Usage:

    test_string = 'This is test string'
    
    print(test_string[5:7])
    # is
    test_string = change(test_string, 5, 7, 'IS')
    # This IS test string
    test_string = change(test_string, 8, 12, 'X')
    # This IS X    string
    test_string = change(test_string, 8, 12, 'XXXXXXXXXXXX')
    # This IS XXXX string
    
  • You can use the string methods ljust, rjust and center to left-justify, right-justify and center a string in a field of a given width.

    'hi'.ljust(10) -> 'hi        '
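
A quick sketch of all three methods, with made-up values. The optional second argument is the fill character, which can be handy for zero-padded numeric columns.

```python
left = "hi".ljust(10)        # pad on the right with spaces
right = "42".rjust(10, "0")  # pad on the left, here with zeros
middle = "hi".center(10)     # pad on both sides

# None of these truncate: a value longer than the width comes back unchanged.
too_long = "abcdefghijkl".ljust(10)
```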
    
