Analyzing (un)structured data to convert it into a structured, normalized format.


2
votes
2answers
75 views

Resources for writing a parser combinator library

I often use parser combinator libraries, but I've never written one. What are good resources for getting started? In case it matters, I'm using Julia, a functional (but not lazy) language.
1
vote
2answers
153 views

Should I use a formal grammar for my interpreted scripting language

I have a scripting engine I just published as an open source project. It's been sitting on my hard drive waiting for about a year. My engine of course isn't complete in any way, but it does work for ...
12
votes
3answers
292 views

Implementing the Visitor Pattern for an Abstract Syntax Tree

I'm in the process of creating my own programming language, which I do for learning purposes. I already wrote the lexer and a recursive descent parser for a subset of my language (I currently support ...
1
vote
2answers
75 views

How are line/column position data dealt with in parser combinator libraries?

I'm building a parser using a parser combinator library. I need to keep track of where AST nodes started and ended in the textual input -- line and column numbers. How is this problem approached ...
6
votes
3answers
546 views

Generic file parser design in Java using the Strategy pattern

I am working on a product in which the responsibility of one of the modules is to parse XML files and dump the required content in a database. Even though the present requirement is only to parse XML ...
0
votes
2answers
106 views

Creating a text input simplification tool

I have been working on a project for several months now to write a web-based tool that will help me at work. I work at a call center and the CRM software that is used is wretched; we basically have to ...
9
votes
4answers
644 views

Using a “dead man's switch” to manage time-sensitive code

In our software environment, we often run a/b tests, as is probably good practice. However, our environment is set up such that, in very short order, the code starts to become very crufty with dead ...
4
votes
1answer
729 views

First and Follow Sets for a Grammar

I'm studying for a Compiler Construction module I'm doing and I have a sample question as follows: Calculate the FIRST and FOLLOW sets for the following grammar.. S -> uBDz B -> Bv B -> w D ...
3
votes
1answer
202 views

Showing a grammar is ambiguous

I have the following question taken from a compilers course exam: Show that the following grammar is ambiguous. S = XcY X = a Y = b | Z Z = bW W = d | ϵ I drew the following tree: Am I correct ...
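A minimal sketch of one way to check the ambiguity claim for the grammar quoted in the excerpt above: count the distinct parse trees for a candidate string. The names `GRAMMAR` and `count_trees` are hypothetical, and the brute-force recursion assumes a grammar with no left recursion or cycles, which holds for this one. A string with more than one parse tree witnesses ambiguity.

```python
from functools import lru_cache

# Grammar from the excerpt; an empty tuple stands for the ϵ production.
GRAMMAR = {
    "S": [("X", "c", "Y")],
    "X": [("a",)],
    "Y": [("b",), ("Z",)],
    "Z": [("b", "W")],
    "W": [("d",), ()],
}

@lru_cache(maxsize=None)
def count_trees(symbols, text):
    """Count distinct parse trees deriving `text` from the symbol sequence."""
    if not symbols:
        return 1 if text == "" else 0
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        # Terminal: it must match the start of the remaining text.
        if text.startswith(head):
            return count_trees(rest, text[len(head):])
        return 0
    total = 0
    # Nonterminal: try every split of the text between `head` and `rest`.
    for split in range(len(text) + 1):
        prefix, suffix = text[:split], text[split:]
        ways_head = sum(count_trees(prod, prefix) for prod in GRAMMAR[head])
        if ways_head:
            total += ways_head * count_trees(rest, suffix)
    return total

print(count_trees(("S",), "acb"))  # prints 2
```

The string `acb` has two trees because `Y` can derive `b` directly or through `Z`, with `W` deriving ϵ.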
1
vote
2answers
127 views

How to extract operators from the grammar productions for conflict resolution in an LALR parser?

Is there some standardized or widely accepted algorithm for picking up operators in shift/reduce conflicts in an LALR parser? The question is naive, my problem is not with implementing my solution, but ...
6
votes
2answers
193 views

Language parsing to find important words

I'm looking for some input and theory on how to approach a lexical topic. Let's say I have a collection of strings, which may just be one sentence or potentially multiple sentences. I'd like to ...
1
vote
2answers
97 views

Extracting useful information from free text

We filter and analyse seats for events. Apparently writing a domain query language for the floor people isn't an option. I'm using C# 4.0 & .NET 4.0, and have relatively free rein to use ...
3
votes
1answer
152 views

How to add precedence to an LALR parser like in YACC?

Please note, I am asking about writing an LALR parser, not writing rules for an LALR parser. What I need is... ...to mimic YACC precedence definitions. I don't know how it is implemented, and below I ...
2
votes
1answer
198 views

LL(∞) and left-recursion [closed]

I want to understand the relation between LL/LR grammars and the left-recursion problem (for each question I partially know the answer, but I ask them as if I knew nothing, because I am a little ...
3
votes
1answer
196 views

Extracting color profile information from JPEG files

I'm trying to look up info about reading a JPEG's color profile info, and to my surprise there's very little open, specific how-to information in that regard, but rather lots of general explanation on ...
8
votes
4answers
238 views

How should I implement a command processing application?

I want to make a simple, proof-of-concept application (REPL) that takes a number and then processes commands on that number. Example: I start with 1. Then I write "add 2", it gives me 3. Then I ...
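A minimal sketch of the kind of command dispatch the excerpt above describes, assuming each command is a name followed by one integer argument. Only `add` comes from the excerpt; the function name `apply_command` and the handler table are hypothetical.

```python
def apply_command(value, line):
    """Apply one textual command such as "add 2" to the current value."""
    name, _, argument = line.strip().partition(" ")
    # Table of known commands; further commands would be registered here.
    handlers = {
        "add": lambda current, operand: current + operand,
    }
    if name not in handlers:
        raise ValueError(f"unknown command: {name!r}")
    return handlers[name](value, int(argument))

print(apply_command(1, "add 2"))  # prints 3
```

A REPL would call `apply_command` in a loop, feeding each result back in as the next starting value.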
11
votes
5answers
312 views

How can I best manage making open source code releases from my company's confidential research code?

My company (let's call them Acme Technology) has a library of approximately one thousand source files that originally came from its Acme Labs research group, incubated in a development group for a ...
4
votes
1answer
173 views

Scripting custom drawing in Delphi application with IF/THEN/ELSE statements?

I'm building a Delphi application which displays a blueprint of a building, including doors, windows, wiring, lighting, outlets, switches, etc. I have implemented a very lightweight script of my own ...
7
votes
2answers
259 views

Persisting natural language processing parsed data

I've recently started experimenting with natural language processing (NLP) using Stanford's CoreNLP, and I'm wondering what are some of the standard ways to store NLP parsed data for something like a ...
1
vote
1answer
110 views

Does JAXP natively parse HTML?

So, I whip up a quick test case in Java 7 to grab a couple of elements from random URIs, and see if the built-in parsing stuff will do what I need. Here's the basic setup (with exception handling etc ...
4
votes
4answers
713 views

Can the csv format be defined by a regex?

A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, ...
2
votes
3answers
268 views

How do I translate user input into a fictitious language?

For experimental reasons, I am trying to convert user input into a fictitious language. All of the translation can be 1:1. I would prefer if I could accomplish this with PHP. Should I use gettext ...
0
votes
1answer
112 views

How do I parse a header with two different versions [ID3] avoiding code duplication?

I really hope you can give me some interesting viewpoints for my situation, because I am not satisfied with my current approach. I am writing an MP3 parser, starting with an ID3v2 parser. Right now ...
2
votes
2answers
141 views

parsing terminology: comments+whitespaces vs actual code

In languages like C/C++ spacing and comments are ignored and only actual code gets into the compiler. I'm interested if there is an accepted way of naming these two things? comments & spacing ...
5
votes
4answers
564 views

Programming Language Parser (in Java) - What would be a better design alternative for a special case?

Background I'm currently designing my own programming language as a research project. I have most of the grammar done and written down as context-free grammar, and it should be working as is. - Now ...
2
votes
1answer
280 views

Picture Parsing

If I open a picture file, let's say with a PNG extension, I will see a bunch of code. Now let's say I want to get some information from the picture mechanically. So the question here is what is the first ...
6
votes
2answers
177 views

Parsing multiple file formats/protocols

We are starting a project where we will need to write parsers for a bunch of binary file formats, each of them representing very similar data (time-value series from different measurement devices). ...
1
vote
2answers
292 views

How to create a Semantic Network like WordNet based on Wikipedia?

I am an undergraduate student and I have to create a Semantic Network based on Wikipedia. This Semantic Network would be similar to WordNet (except that it is based on Wikipedia and is concerned with ...
5
votes
4answers
285 views

How are comments expressed in programming language grammars?

I'm learning how to build parsers using grammars, but I got stuck trying to express comments, because they can appear almost anywhere. This indicates that comments can be stripped from the token ...
0
votes
2answers
179 views

Using a parser to locate faulty code

Lately I've been working a lot in PHP and have run into an abnormally large number of parsing errors. I realize these are my own fault and a result of sloppy initial coding on my part, but it's ...
6
votes
5answers
2k views

Getting data from a webpage in a stable and efficient way

Recently I've learned that using a regex to parse the HTML of a website to get the data you need isn't the best course of action. So my question is simple: What then, is the best / most efficient and ...
6
votes
6answers
762 views

Best way to parse a file

I'm trying to find a better solution for making a parser for some of the well-known file formats out there, such as EDIFACT and TRADACOMS. If you aren't familiar with these standards then check out this ...
5
votes
1answer
441 views

What is this algorithm for converting strings into numbers called?

I've been doing some work in Parsec recently, and for my toy language I wanted multi-based fractional numbers to be expressible. After digging around in Parsec's source a bit, I found their ...
3
votes
4answers
945 views

What is the simplest human readable configuration file format?

Current configuration file is as follows: mainwindow.title = 'test' mainwindow.position.x = 100 mainwindow.position.y = 200 mainwindow.button.label = 'apply' mainwindow.button.size.x = 100 ...
0
votes
2answers
214 views

How do I capture information from a website that doesn't provide an API?

Do you know any good tutorials, frameworks, anything that can help me to write code that captures information from a website that doesn't have a public API, or hasn't been written in a RESTful way? ...
0
votes
3answers
202 views

What is a good parsing reference?

I am working on a project that needs parsing and text processing functionality. I searched the web about parsing and I found that my best choice for parsing is Python. What is a good, fast, and ...
1
vote
1answer
190 views

Extracting text from various file formats

I want to extract text from various files. I used Apache POI for parsing Microsoft documents. It's working and now I want to parse PDFs and extract text from them. Is there a Java API that I could ...
3
votes
1answer
304 views

How can I test a parser for a bespoke XML schema?

I'm parsing a bespoke XML format into an object graph using .NET 4.0. My parser is using the System.Xml namespace internally; I'm then interrogating the relevant properties of XmlNodes to create my ...
5
votes
5answers
888 views

Are separate parsing and lexing passes good practice with parser combinators?

When I began to use parser combinators my first reaction was a sense of liberation from what felt like an artificial distinction between parsing and lexing. All of a sudden everything was just ...
22
votes
6answers
931 views

What are the arguments against parsing the Cthulhu way?

I have been assigned the task of implementing a Domain Specific Language for a tool that may become quite important for the company. The language is simple but not trivial, it already allows nested ...
12
votes
12answers
2k views

How to write a command interpreter/parser?

Problem: Run commands in the form of a string. command example: /user/files/ list all; equivalent to: /user/files/ ls -la; another one: post tw fb "HOW DO YOU STOP THE TICKLE MONSTER?;" ...
2
votes
1answer
231 views

Idea of an algorithm to detect a website's navigation structure?

Currently I am in the process of developing an importer of any existing, arbitrary (static) HTML website into the upcoming release of our CMS. While downloading the files is solved successfully, ...
3
votes
3answers
309 views

Any good reason to open files in text mode?

(Almost-)POSIX-compliant operating systems and Windows are known to distinguish between 'binary mode' and 'text mode' file I/O. While the former mode doesn't transform any data between the actual file ...
2
votes
3answers
908 views

What programming language is most suitable for handling unstructured data?

I'm trying to automate the application of metadata to a huge amount of text, but I'm not sure what language would make this task easier (if there is one). What programming language is most suitable ...
0
votes
2answers
1k views

Fastest C++ XML parsing library

I have thousands of .xml files ranging in size from 1 MB to 45 MB (no DTDs). I need to parse and further manipulate these XML files before generating separate .xml files with the results of my regex. What the ...
6
votes
3answers
383 views

C++ XML Parsing: Suggestions on Approach for Parsing and Storing data

I am looking into developing a C++ application to parse XML (using the rapidxml framework), and I would like some advice on how to approach this. The file I want to parse is an XML game file that ...
6
votes
2answers
975 views

Algorithm for formatting SQL code

I need a tool (for in-house usage) that will format SQL code (SQL Server/MySQL). There are various third-party tools and online websites that do it, but none does it exactly how I need it. So I want to write my ...
12
votes
3answers
590 views

How should I specify a grammar for a parser?

I have been programming for many years, but one task that still takes me inordinately long is to specify a grammar for a parser, and even after this excessive effort, I'm never sure that the grammar ...
3
votes
3answers
554 views

Parsing scripts that use curly braces

To get an idea of what I'm doing, I am writing a Python parser that will parse DirectX .x text files. The problem I have deals with how the files are formatted. Although I'm writing it in Python, I'm ...
7
votes
1answer
192 views

Should I let my users write BnfExpressions to extend my grammar?

Preface I'm designing a templating language (please skip the don't/why?? speech). One of the major goals of this language is to be extensible. There are 2 main elements in my language. "Tags" and ...
