Reformat date string

Question

I am trying to figure out how to reformat a file full of wrongly formatted dates. The source looks like this:

{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}
{"_id":"","timestamp":"Mon Apr 20 08:32:25 +0000 2015"}
{"_id":"","timestamp":"Mon Apr 20 08:35:39 +0000 2015"}

There are about 3 million entries like this. I need the timestamp to be formatted in the ISO-8601 format, that is: YYYY-MM-DDTHH:mm:ss.mmm<+/-Offset>

I have tried this, which works:

date -d "Mon Apr 20 08:35:39 +0000 2015" +%FT%T%z

Because I am obviously not going to go through all 3 million entries manually, I looked into using sed for this:

cat input.json | sed "s|\"timestamp\":\"\(.*\)\"|\"timestamp\":\"$(date -d \1 +%FT%T%z)\"|g" > output.json

However, this prints the same (wrong) output every time, namely 2015-05-08T01:00:00+0000. Is there maybe a brilliant mind somewhere around here who can help me out?

John1024 · Accepted Answer · 2015-05-08 22:18:09Z

It would be possible to use sed for this but awk is more natural:

awk -F'"' -v OFS='"'  '$8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)} 1' input.json
{"_id":"","timestamp":"2015-04-20T01:30:55-0700"}
{"_id":"","timestamp":"2015-04-20T01:32:25-0700"}
{"_id":"","timestamp":"2015-04-20T01:35:39-0700"}

The output above shows an offset of -7:00 hours, which reflects the system's default timezone. Setting the environment variable TZ changes the timezone that date uses.
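For example, to keep the timestamps in UTC, TZ can be set for just this one command. This is a minimal sketch, assuming GNU date; the function name to_iso_utc and the POSIX-style zone string UTC0 are my own choices, not something from the question:

```shell
# Run the same awk command with TZ set to UTC. The setting is inherited
# by every date process that awk starts.
# Assumption: GNU date (for -d); UTC0 is a POSIX-style zone string.
to_iso_utc() {
    TZ=UTC0 awk -F'"' -v OFS='"' \
        '$8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)} 1' "$@"
}

printf '%s\n' '{"_id":"","timestamp":"Mon Apr 20 08:35:39 +0000 2015"}' | to_iso_utc
# prints {"_id":"","timestamp":"2015-04-20T08:35:39+0000"}
```

With a file, it would be called as to_iso_utc input.json > output.json.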

How it works

-F'"' -v OFS='"'

This sets both the input and output field separators to ".

$8 {cmd="date -d \""$8"\" +%FT%T%z"; cmd | getline $8; close(cmd)}

With " as the field separator, the date is field number 8. This creates a string containing the correct date command, runs that command, and captures its output into an updated field 8.

The $8 out in front means that this part will only be run if there is a non-empty value for field 8. This allows, for example, empty lines to pass through unmolested.

1

This is awk's cryptic shorthand for "print this line."
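To see why the date ends up in field 8, it may help to print every field awk produces for one of the sample lines. A small sketch (the helper name show_fields is made up for illustration):

```shell
# List every field awk sees when '"' is the field separator.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }' "$@"
}

printf '%s\n' '{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}' | show_fields
# 1=[{]
# 2=[_id]
# 3=[:]
# 4=[]
# 5=[,]
# 6=[timestamp]
# 7=[:]
# 8=[Mon Apr 20 08:30:55 +0000 2015]
# 9=[}]
```

Field 4 is empty because the two quotes of "" are adjacent, and the date is both field 8 and the second-to-last field.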

Handling Extra Double-Quotes

Because " is the field separator, extra double quotes elsewhere on the line shift the field numbers. Suppose that there is a variable number of " before the timestamp. In that case, we need to refer to the timestamp as the second-to-last field, $(NF-1), rather than the eighth field, $8:

awk -F'"' -v OFS='"'  '$8 {cmd="date -d \""$(NF-1)"\" +%FT%T%z"; cmd | getline $(NF-1); close(cmd)} 1' input.json
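The effect of extra quotes can be checked on a single line. In this sketch the sample line with a "text" field is made up (modelled on the example in the comments), and compare_fields is a hypothetical helper name:

```shell
# Print field 8 and the second-to-last field of each line.
compare_fields() {
    awk -F'"' '{ print "field 8:    " $8; print "field NF-1: " $(NF-1) }' "$@"
}

line='{"_id":"","text":"This is \"great\", really!","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}'
printf '%s\n' "$line" | compare_fields
# field 8:    This is \
# field NF-1: Mon Apr 20 08:30:55 +0000 2015
```

Field 8 now holds part of the text, while $(NF-1) still holds the date. This only helps as long as the timestamp is the last quoted value on the line; any quotes after it would shift $(NF-1) too.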

Adding Custom Formatting to the Date Field

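To wrap the converted date in extra text, such as {$date:...}, assign the new value to $(NF-1) after the date command has run:

awk -F'"' -v OFS='"'  '$8 {cmd="date -d \""$(NF-1)"\" +%FT%T%z"; cmd | getline $(NF-1); close(cmd);$(NF-1)="{$date:" $(NF-1) "}"} 1' input.json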
{"_id":"","timestamp":"{$date:2015-04-20T01:30:55-0700}"}
{"_id":"","timestamp":"{$date:2015-04-20T01:32:25-0700}"}
{"_id":"","timestamp":"{$date:2015-04-20T01:35:39-0700}"}
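The commands above start one date process per line, which may be slow for about 3 million entries. A possible alternative is to let a single date process convert all timestamps and merge the results back in afterwards. This is only a sketch under several assumptions: GNU date (for its -f option, which reads one date per line), every line carries the timestamp as its last quoted value, and the file has no blank lines. The function name convert_batch is made up:

```shell
# Convert all timestamps with a single date process.
# Assumptions: GNU date (for -f), every line of the file holds a
# timestamp as its last quoted value, and there are no blank lines.
convert_batch() {
    awk -F'"' '{ print $(NF-1) }' "$1" |
        date -f - +%FT%T%z |
        awk -F'"' -v OFS='"' '
            NR == FNR { iso[NR] = $0; next }   # first input (stdin): converted dates
            { $(NF-1) = iso[FNR] } 1           # second input: the original file
        ' - "$1"
}

# Usage (file names as in the question):
# convert_batch input.json > output.json
```

The second awk keeps all converted dates in memory before it reads the file, so this trades memory for speed. It reads the file twice, so it needs a regular file rather than a pipe.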

You are great!! One problem though: when the file is just a few entries long (for testing purposes), everything goes smooth. However, when I try to run it on the actual database, I get this error when trying to save the output: awk: cmd. line:1: (FILENAME=Apple.json FNR=2407) fatal: cannot open pipe `date -d "Mon Apr 20 08:56:56 +0000 2015" +%FT%T%z' (Too many open files) Any thoughts? — Diederik, May 8 at 21:20
@Diederik Oops. I updated the answer to add a close command. — John1024, May 8 at 21:24
Sorry to bother you again, but one last question. The actual dataset contains more fields, like a text field where people can also use the quote-sign (i.e. "text":"This is \"great\", really!"). Is there any way to solve this? Like, is it possible to select only the last " as the separator? — Diederik, May 8 at 21:46
Using $(NF-1) instead of $8 or any other number does the trick. Thanks again! — Diederik, May 8 at 21:52

Anko · Answer 2 · 2015-05-08 22:23:59Z

Using jq and Node.js:

cat /tmp/what \
| jq '.timestamp' \
| while read line ; do \
  node -e "console.log(new Date($line).toISOString())" ;\
  done

Could also use a single Node.js program to process the whole file for performance, but that might be getting too far into JS-land. (Ping if you want details.)

G-Man · Answer 3 · 2015-05-09 02:30:05Z

If you can guarantee that the input is formatted exactly as you have shown, you can do it entirely in sed. It's a little brute force-ish:

sed \
        -e 's/"timestamp":"... Jan/"timestamp":"01/' \
        -e 's/"timestamp":"... Feb/"timestamp":"02/' \
        -e 's/"timestamp":"... Mar/"timestamp":"03/' \
        -e 's/"timestamp":"... Apr/"timestamp":"04/' \
        -e 's/"timestamp":"... May/"timestamp":"05/' \
        -e 's/"timestamp":"... Jun/"timestamp":"06/' \
        -e 's/"timestamp":"... Jul/"timestamp":"07/' \
        -e 's/"timestamp":"... Aug/"timestamp":"08/' \
        -e 's/"timestamp":"... Sep/"timestamp":"09/' \
        -e 's/"timestamp":"... Oct/"timestamp":"10/' \
        -e 's/"timestamp":"... Nov/"timestamp":"11/' \
        -e 's/"timestamp":"... Dec/"timestamp":"12/' \
        -e 's/"timestamp":"\(..\) \(..\) \(..:..:..\) \(.....\) \(....\)"/"timestamp":"\5-\1-\2T\3\4"/' \
        input.json > output.json

It replaces the name of the month with its number, and then it carves up the (now) all-numeric date string, and puts the pieces back together in the desired order. If you want to add .000 for milliseconds, insert it between \3 and \4.
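As an illustration of the milliseconds variant, here is the same idea reduced to the April rule only, with .000 inserted between \3 and \4. The function name add_millis_apr is made up, and real data would still need all twelve month rules from above:

```shell
# Month-name replacement (April only, for brevity) followed by the
# reordering step, with ".000" inserted between \3 (time) and \4 (offset).
add_millis_apr() {
    sed \
        -e 's/"timestamp":"... Apr/"timestamp":"04/' \
        -e 's/"timestamp":"\(..\) \(..\) \(..:..:..\) \(.....\) \(....\)"/"timestamp":"\5-\1-\2T\3.000\4"/' \
        "$@"
}

printf '%s\n' '{"_id":"","timestamp":"Mon Apr 20 08:30:55 +0000 2015"}' | add_millis_apr
# prints {"_id":"","timestamp":"2015-04-20T08:30:55.000+0000"}
```

Unlike the date-based commands, this keeps the original +0000 offset and does not convert to the local timezone.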
