I've come to realize that using a SQL database (Postgres) is one of the most efficient ways to port my relational data (40+ GB across 24 CSV files) into Elasticsearch with a nested structure. However I'm still having a couple issues with the formatting of my JSON output from Postgres: 1) undesired line feeds (\n), 2) undesired header line and 3) undesired date format. Here is a basic example to demonstrate:
file1
id,age,gender,wave
1,49,M,1
2,72,F,0
file2
id,time,event1
1,2095-04-20 12:28:55,V39
1,2095-04-21 2:27:45,T21
2,2094-05-17 18:17:25,V39
file3
id,time,event2
1,2095-04-22 3:48:53,P90
2,2094-05-18 1:28:23,RT4
2,2094-05-18 4:23:53,W3
after adding these CSVs to a schema named forum and running this SQL code:
with f_1 as(
SELECT id, json_agg(file1.*) AS tag
FROM forum.file1
GROUP BY id
), f_2 as (
SELECT id, json_agg(file2.*) AS tag
FROM forum.file2
GROUP BY id
), f_3 as (
SELECT id, json_agg(file3.*) AS tag
FROM forum.file3
GROUP BY id
)
SELECT ('{"id":' || a.id), ('"file1":' || a.tag), ('"file2":' || b.tag), ('"file3":' || c.tag ||'}')
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id;
I get this output (pgAdminIII - Export to file - no quoting):
?column?,?column?,?column?,?column?
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20T12:28:55","event1":"V39"},
{"id":1,"time":"2095-04-21T02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17T18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18T01:28:23","event2":"RT4"},
{"id":2,"time":"2094-05-18T04:23:53","event2":"W3"}]}
You can see that for a given id there is data on multiple lines. I need all of the data to be on one line for a given id (i.e. no \n's). A couple other minor issues which I haven't spent much time on but would like to change are the first row isn't needed, I'd like to get rid of the ?column?,?column?,?column?,?column?
without having to open the file after it's done processing. Ideally I'd also prefer that there was no T
in the date output. I should be able to accommodate the T
in Elasticsearch but thus far haven't gotten it to accept it. This is the output I desire from Postgres which works for input into Elasticsearch (using stream2es and a nested mapping structure):
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}
Adding to_json
does fix the undesired line feeds but it adds \"
in place of "
which the stream2es parser doesn't like:
SELECT to_json('{"id":' || a.id), to_json('"file1":' || a.tag::json), to_json('"file2":' || b.tag::json), to_json('"file3":' || c.tag::json ||'}')
"{\"id\":1","\"file1\":[{\"id\":1,\"age\":49,\"gender\":\"M\",\"wave\":1}]"
...
es2stream exception:
Exception in thread "stream dispatcher" java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IPersistentMap