Monday, March 28, 2011

Textual Irregularities

Does anybody know of a library or piece of software that will locate irregularities in text? For example, let's say I have...

1. Name 1, Comment
2. Name 2, Comment
3. Name 3 , Comment
5. Name 10, Comment

This software or library would first group the portions of text that it finds similar, much as compression software finds repeated runs of text to encode. Using a configurable error tolerance, it would also treat near-matches as similar, and then, like a text comparison or diff/merge tool, highlight what it sees as different. I'm thinking about building this tool myself, but I don't want to reinvent the wheel. If anything out there is even remotely capable of this, I'd really like to know, either so I can help on that project or so I know not to start my own. An answer could also help other people hunting for the same thing. I would expect the demand to be high, which is why it boggles my mind that I can't find anything at all.

From stackoverflow
  • If you are into Python, you might try difflib.

    It's not an exact solution to your problem, but it might be helpful.
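As a rough illustration of that suggestion, here is a minimal sketch that uses difflib's SequenceMatcher to score each line against a reference line and list the character-level differences. Using the first line as the reference, and the helper name char_diffs, are assumptions made for the example; they are not part of the original answer.

```python
import difflib

lines = [
    "1. Name 1, Comment",
    "2. Name 2, Comment",
    "3. Name 3 , Comment",
    "5. Name 10, Comment",
]


def char_diffs(reference, candidate):
    """Return (tag, reference_piece, candidate_piece) for each differing region."""
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    return [
        (tag, reference[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Assumption: the first line is a "good" line to compare the others against.
reference = lines[0]
for line in lines[1:]:
    similarity = difflib.SequenceMatcher(None, reference, line).ratio()
    print(f"{similarity:.2f}  {line!r}  {char_diffs(reference, line)}")
```

The similarity ratio could serve as the error tolerance from the question: lines scoring above some chosen threshold would be treated as the same kind of line, and the opcodes show what to highlight.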

  • This problem changes radically depending on what sort of real-life irregularities you want to find or correct.

    Here is your example updated with real text:

    1. Lazarus Long, Get the first shot off fast.
    2. Hiro Protagonist, Greatest swordfighter[sic] in the world.
    3. Alice , Down the rabbit hole.
    5. Orem, Sink of power.
    

    In this example the errors could be fixed with a decent text editor and find and replace. Text editors and hex editors can work miracles if you get creative with wildcards. The problem remains simple as long as your delimiters (. or ,) are present. As you probably already know, as soon as one of those is missing the problem becomes much more complex.

    Example of a hard problem:

    1. Lazarus Long, Get the first shot off fast.
     2. Hiro Protagonist  Greatest swordfighter[sic] in the world.
    3. Alice , Down the rabbit hole.
    5 . Orem, , Sink of power.
    

    I would probably attack this in a few steps. 1. Clean up extra spaces. 2. Find key statistics such as the number of delimiters per line and the average number of words or characters per delimited column. Most names are one or two words; comment length is unknown or limited by the input. 3. Find lines with a statistically improbable number of key features. 4. Try your best to correct them.

    I understand that this does not directly solve your problem, but maybe one of these ideas can patch it over for a bit. It is possible that past wheelwrights never completed any designs.
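The first three steps of this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the single feature used (the number of commas per line) and the helper names normalize, features and flag_outliers are hypothetical choices, and "statistically improbable" is simplified to "differs from the most common value".

```python
import re
from collections import Counter

lines = [
    "1. Lazarus Long, Get the first shot off fast.",
    " 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5 . Orem, , Sink of power.",
]


def normalize(line):
    """Step 1: trim, collapse runs of whitespace, drop spaces before '.' or ','."""
    line = re.sub(r"\s+", " ", line.strip())
    return re.sub(r" +([.,])", r"\1", line)


def features(line):
    """Step 2: per-line statistics. Assumption: only the comma count is used."""
    return (line.count(","),)


def flag_outliers(raw_lines):
    """Step 3: report (line number, cleaned line) where the features are atypical."""
    cleaned = [normalize(line) for line in raw_lines]
    feats = [features(line) for line in cleaned]
    typical, _ = Counter(feats).most_common(1)[0]
    return [
        (number, line)
        for number, (line, feat) in enumerate(zip(cleaned, feats), start=1)
        if feat != typical
    ]


for number, line in flag_outliers(lines):
    print(number, line)
```

On the hard example this flags the second line (no comma) and the fourth line (two commas). Step 4, the actual correction, is left out because it depends on what the data is supposed to look like.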

  • Sounds basically like you'd want to use Regex to create an "ideal response" then compare the rest of the lines against it.

    Or you could write a more complicated program which would boil each line down into a Regex query, and then compare the queries to each other to see which ones are different.
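Both ideas in this answer can be sketched in a few lines. The pattern IDEAL is an assumed description of a well-formed line (number, period, name, comma, comment), not something given in the question. The signature function is a simplified stand-in for "boiling each line down into a Regex query": it reduces letter runs to A and digit runs to 9 so that lines with the same shape compare equal. All names here are hypothetical.

```python
import re
from collections import Counter

lines = [
    "1. Name 1, Comment",
    "2. Name 2, Comment",
    "3. Name 3 , Comment",
    "5. Name 10, Comment",
]

# Assumed "ideal" shape: number, ". ", a name with no stray spaces, ", ", comment.
IDEAL = re.compile(r"\d+\. [^\s,](?:[^,]*[^\s,])?, \S.*")


def mismatches(rows):
    """Idea 1: rows that do not match the hand-written ideal pattern."""
    return [row for row in rows if not IDEAL.fullmatch(row)]


def signature(row):
    """Idea 2: reduce a row to its shape (letter runs -> A, digit runs -> 9)."""
    row = re.sub(r"[A-Za-z]+", "A", row)
    return re.sub(r"\d+", "9", row)


def odd_ones_out(rows):
    """Rows whose shape differs from the most common shape."""
    shapes = [signature(row) for row in rows]
    common, _ = Counter(shapes).most_common(1)[0]
    return [row for row, shape in zip(rows, shapes) if shape != common]


print(mismatches(lines))
print(odd_ones_out(lines))
```

Neither check notices that the numbering jumps from 3 to 5; catching that kind of irregularity would need a comparison between consecutive lines rather than a per-line pattern.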
