
I am trying to figure out how to reformat a file full of wrongly formatted dates. The source looks like this:

{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}
{"_id":"","timestamp":"Mon Apr 20 08:32:25 +0000 2015"}
{"_id":"","timestamp":"Mon Apr 20 08:35:39 +0000 2015"}

There are about 3 million entries like this. I need the timestamp to be formatted in the ISO-8601 format, that is: YYYY-MM-DDTHH:mm:ss.mmm<+/-Offset>

I have tried this, which works:

date -d "Mon Apr 20 08:35:39 +0000 2015" +%FT%T%z

Because I am obviously not going to go through all 3 million entries manually, I looked into using sed for this:

cat input.json | sed "s|\"timestamp\":\"\(.*\)\"|\"timestamp\":\"$(date -d \1 +%FT%T%z)\"|g" > output.json

However, this prints the same (wrong) output every time, namely 2015-05-08T01:00:00+0000. Is there maybe a brilliant mind somewhere around here who can help me out?


3 Answers

It would be possible to use sed for this but awk is more natural:

awk -F'"' -v OFS='"'  '$8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)} 1' input.json
{"_id":"","timestamp":"2015-04-20T01:30:55-0700"}
{"_id":"","timestamp":"2015-04-20T01:32:25-0700"}
{"_id":"","timestamp":"2015-04-20T01:35:39-0700"}

The output above shows an offset of -07:00. This reflects the system's default timezone. Setting the environment variable TZ changes the timezone that date uses.
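
For example, TZ can be set for a single command so that the offsets come out as +0000, like the input. This is a minimal sketch, assuming GNU date (for -d); the sample file made with mktemp is only there for illustration:

```shell
# Hypothetical sample file holding one record from the question.
sample=$(mktemp)
printf '%s\n' '{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}' > "$sample"

# TZ=UTC applies only to this awk run and to the date commands it spawns.
utc_out=$(TZ=UTC awk -F'"' -v OFS='"' '$8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)} 1' "$sample")
printf '%s\n' "$utc_out"

rm -f "$sample"
```

Prefixing the command this way leaves the timezone of the rest of the shell session untouched.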

How it works

  • -F'"' -v OFS='"'

    This sets both the input and output field separators to ".

  • $8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)}

    With " as the field separator, the date is field number 8. This creates a string with the correct date command and then runs the command, capturing the output into an updated field 8.

    The $8 out in front means that this part will only be run if there is a non-empty value for field 8. This allows, for example, empty lines to pass through unmolested.

  • 1

    This is awk's cryptic shorthand for "print this line."
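
To see why the date ends up in field 8, it can help to print every field that awk produces for one line. A small sketch, using a record copied from the question (the variable names are just for illustration):

```shell
line='{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}'

# Split on double quotes and print each field with its number.
fields=$(printf '%s\n' "$line" | awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

Field 4 is the empty _id value, and the last field is the closing brace, so the timestamp is both field 8 and the second-last field.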

Handling Extra Double-Quotes

Since we are using " as the field separator, any additional " characters earlier in the line shift the field numbers. Suppose that there are a variable number of " before the timestamp. In that case, we need to refer to the timestamp as the second-last field, $(NF-1), rather than the eighth field, $8. The condition also changes from $8 to NF>2, because $8 may now be empty or hold unrelated text:

awk -F'"' -v OFS='"'  'NF>2 {cmd="date -d \""$(NF-1)"\" +%FT%T%z"; cmd | getline $(NF-1); close(cmd)} 1' input.json
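
As a quick check, here is the second-last-field approach applied to a record with escaped quotes in a text field, like the one described in the comments. It is a sketch that assumes GNU date, that the timestamp is always the last key in the record, and it uses NF > 2 as the guard so that lines without quotes pass through untouched:

```shell
quoted='{"_id":"","text":"This is \"great\", really!","timestamp":"Mon Apr 20 08:35:39 +0000 2015"}'

# The extra quotes change the field count, but the timestamp stays second-last.
quoted_out=$(printf '%s\n' "$quoted" | TZ=UTC awk -F'"' -v OFS='"' 'NF > 2 {cmd="date -d \""$(NF-1)"\" +%FT%T%z"; cmd | getline $(NF-1); close(cmd)} 1')
printf '%s\n' "$quoted_out"
```

Because the input and output separators are the same character, the rest of the line, including the escaped quotes, is reassembled unchanged.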

Adding Custom Formatting to the Date Field

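To wrap the converted date in extra text, assign to the field again after getline has filled it in:

awk -F'"' -v OFS='"'  'NF>2 {cmd="date -d \""$(NF-1)"\" +%FT%T%z"; cmd | getline $(NF-1); close(cmd);$(NF-1)="{$date:" $(NF-1) "}"} 1' input.json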
{"_id":"","timestamp":"{$date:2015-04-20T01:30:55-0700}"}
{"_id":"","timestamp":"{$date:2015-04-20T01:32:25-0700}"}
{"_id":"","timestamp":"{$date:2015-04-20T01:35:39-0700}"}
    
You are great!! One problem though: when the file is just a few entries long (for testing purposes), everything goes smooth. However, when I try to run it on the actual database, I get this error when trying to save the output: awk: cmd. line:1: (FILENAME=Apple.json FNR=2407) fatal: cannot open pipe `date -d "Mon Apr 20 08:56:56 +0000 2015" +%FT%T%z' (Too many open files) Any thoughts? –  Diederik May 8 at 21:20
    
@Diederik Oops. I updated the answer to add a close command. –  John1024 May 8 at 21:24
    
Thanks a million! –  Diederik May 8 at 21:31
    
Sorry to bother you again, but one last question. The actual dataset contains more fields, like a text field where people can also use the quote-sign (i.e. "text":"This is \"great\", really!"). Is there any way to solve this? Like, is it possible to select only the last " as the separator? –  Diederik May 8 at 21:46
    
Using $(NF-1) instead of $8 or any other number does the trick. Thanks again! –  Diederik May 8 at 21:52

Using jq and node:

cat /tmp/what \
| jq '.timestamp' \
| while read line ; do \
  node -e "console.log(new Date($line).toISOString())" ;\
  done

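This starts one node process per line and prints only the converted timestamps; toISOString() always reports UTC, with milliseconds. A single Node.js program could process the whole file for better performance, but that might be getting too far into JS-land. (Ping if you want details.)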
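
The single-process idea does not have to mean JavaScript. A rough sketch in awk that spawns nothing per line, assuming every timestamp has exactly the six-part layout shown in the question and is the last key in the record. Note that it only rearranges the text and keeps the offset as written, so no timezone conversion happens:

```shell
# Two records in the layout from the question; the second has escaped quotes.
fast_out=$(printf '%s\n' \
  '{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}' \
  '{"_id":"","text":"a \"b\" c","timestamp":"Mon Apr 20 08:32:25 +0000 2015"}' |
awk -F'"' -v OFS='"' '
BEGIN {
    # Map month names to two-digit numbers.
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= n; i++) month[names[i]] = sprintf("%02d", i)
}
NF > 2 {
    # t[1]=weekday t[2]=month t[3]=day t[4]=time t[5]=offset t[6]=year
    if (split($(NF-1), t, " ") == 6 && (t[2] in month))
        $(NF-1) = t[6] "-" month[t[2]] "-" t[3] "T" t[4] t[5]
}
1')
printf '%s\n' "$fast_out"
```

Lines whose second-last field does not split into six parts with a known month name are printed unchanged.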


If you can guarantee that the input is formatted exactly as you have shown, you can do it entirely in sed.  It's a little brute force-ish:

sed \
        -e 's/"timestamp":"... Jan/"timestamp":"01/' \
        -e 's/"timestamp":"... Feb/"timestamp":"02/' \
        -e 's/"timestamp":"... Mar/"timestamp":"03/' \
        -e 's/"timestamp":"... Apr/"timestamp":"04/' \
        -e 's/"timestamp":"... May/"timestamp":"05/' \
        -e 's/"timestamp":"... Jun/"timestamp":"06/' \
        -e 's/"timestamp":"... Jul/"timestamp":"07/' \
        -e 's/"timestamp":"... Aug/"timestamp":"08/' \
        -e 's/"timestamp":"... Sep/"timestamp":"09/' \
        -e 's/"timestamp":"... Oct/"timestamp":"10/' \
        -e 's/"timestamp":"... Nov/"timestamp":"11/' \
        -e 's/"timestamp":"... Dec/"timestamp":"12/' \
        -e 's/"timestamp":"\(..\) \(..\) \(..:..:..\) \(.....\) \(....\)"/"timestamp":"\5-\1-\2T\3\4"/' \
        input.json > output.json

It replaces the name of the month with its number, and then it carves up the (now) all-numeric date string, and puts the pieces back together in the desired order.  If you want to add .000 for milliseconds, insert it between \3 and \4.
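
To illustrate the milliseconds variant, here is only the final expression with .000 added between \3 and \4. This sketch starts from a line whose month has already been turned into a number, which is what the earlier -e expressions would produce:

```shell
# Hypothetical intermediate line, after the month name has been replaced.
numeric='{"_id":"","timestamp":"04 20 08:30:55 +0000 2015"}'

with_ms=$(printf '%s\n' "$numeric" | sed 's/"timestamp":"\(..\) \(..\) \(..:..:..\) \(.....\) \(....\)"/"timestamp":"\5-\1-\2T\3.000\4"/')
printf '%s\n' "$with_ms"
```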
