
I've come to realize that using a SQL database (Postgres) is one of the most efficient ways to port my relational data (40+ GB across 24 CSV files) into Elasticsearch with a nested structure. However, I'm still having three issues with the formatting of the JSON output from Postgres: 1) undesired line feeds (\n), 2) an undesired header line, and 3) an undesired date format. Here is a basic example to demonstrate:

file1
id,age,gender,wave
1,49,M,1
2,72,F,0

file2
id,time,event1
1,2095-04-20 12:28:55,V39
1,2095-04-21 2:27:45,T21
2,2094-05-17 18:17:25,V39

file3
id,time,event2
1,2095-04-22 3:48:53,P90
2,2094-05-18 1:28:23,RT4
2,2094-05-18 4:23:53,W3

After adding these CSVs to a schema named forum and running this SQL:

WITH f_1 AS (
   SELECT id, json_agg(file1.*) AS tag
   FROM forum.file1
   GROUP BY id
), f_2 AS (
   SELECT id, json_agg(file2.*) AS tag
   FROM forum.file2
   GROUP BY id
), f_3 AS (
   SELECT id, json_agg(file3.*) AS tag
   FROM forum.file3
   GROUP BY id
)
SELECT ('{"id":' || a.id), ('"file1":' || a.tag), ('"file2":' || b.tag), ('"file3":' || c.tag || '}')
FROM f_1 AS a, f_2 AS b, f_3 AS c
WHERE b.id = a.id AND c.id = a.id;

I get this output (pgAdminIII - Export to file - no quoting):

?column?,?column?,?column?,?column?
{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20T12:28:55","event1":"V39"}, 
 {"id":1,"time":"2095-04-21T02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22T03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17T18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18T01:28:23","event2":"RT4"}, 
 {"id":2,"time":"2094-05-18T04:23:53","event2":"W3"}]}

You can see that the data for a given id spans multiple lines; I need all of the data for a given id on one line (i.e. no \n's). Two other minor issues, which I haven't spent much time on but would like to fix: the first row (?column?,?column?,?column?,?column?) isn't needed, and I'd like to drop it without having to open the file after it's done processing. Ideally I'd also prefer no T in the date output; Elasticsearch should be able to accommodate the T, but so far I haven't gotten it to accept it. This is the output I want from Postgres, which works as input to Elasticsearch (using stream2es and a nested mapping structure):

{"id":1,"file1":[{"id":1,"age":49,"gender":"M","wave":1}],"file2":[{"id":1,"time":"2095-04-20 12:28:55","event1":"V39"},{"id":1,"time":"2095-04-21 02:27:45","event1":"T21"}],"file3":[{"id":1,"time":"2095-04-22 03:48:53","event2":"P90"}]}
{"id":2,"file1":[{"id":2,"age":72,"gender":"F","wave":0}],"file2":[{"id":2,"time":"2094-05-17 18:17:25","event1":"V39"}],"file3":[{"id":2,"time":"2094-05-18 01:28:23","event2":"RT4"},{"id":2,"time":"2094-05-18 04:23:53","event2":"W3"}]}

Adding to_json does fix the undesired line feeds, but it emits \" in place of ", which the stream2es parser rejects:

SELECT to_json('{"id":' || a.id), to_json('"file1":' || a.tag::json), to_json('"file2":' || b.tag::json), to_json('"file3":' || c.tag::json ||'}')

"{\"id\":1","\"file1\":[{\"id\":1,\"age\":49,\"gender\":\"M\",\"wave\":1}]"...

stream2es exception: Exception in thread "stream dispatcher" java.lang.ClassCastException: java.lang.String cannot be cast to clojure.lang.IPersistentMap
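
This turns out to be expected behavior rather than a bug: to_json on a text value encodes it as a JSON string, escaping the embedded quotes, whereas casting text that is already valid JSON to json parses it with no escaping. A minimal illustration:

SELECT to_json('{"id":1}'::text);  -- "{\"id\":1}"  (text encoded as a JSON string)
SELECT '{"id":1}'::json;           -- {"id":1}      (parsed as JSON, no escaping)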

1 Answer


Select everything as one column (instead of four). The function format() will help you write it more clearly. Use

regexp_replace (str, '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g')

to correct the date format and

replace (str, e' \n ', '')

to strip the newline characters.
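
A quick sanity check of both transformations on a made-up string that mirrors json_agg's output format:

SELECT replace(
           regexp_replace(
               e'[{"time":"2095-04-20T12:28:55"}, \n {"time":"2095-04-21T02:27:45"}]',
               '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
           e' \n ', '');
-- [{"time":"2095-04-20 12:28:55"},{"time":"2095-04-21 02:27:45"}]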

Use the COPY command to write the result without the header row or quoting:

COPY (   -- COPY TO emits no header row and does no quoting: fixes issue 2
    WITH f_1 AS (
       SELECT id, json_agg(file1.*) AS tag
       FROM forum.file1
       GROUP BY id
    ), f_2 AS (
       SELECT id, json_agg(file2.*) AS tag
       FROM forum.file2
       GROUP BY id
    ), f_3 AS (
       SELECT id, json_agg(file3.*) AS tag
       FROM forum.file3
       GROUP BY id
    )
    SELECT
        replace(                -- issue 1: strip the newlines json_agg adds
            regexp_replace(     -- issue 3: drop the T in the timestamps
                format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
                    a.id, a.tag, b.tag, c.tag),
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
            e' \n ', '')
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
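
Note that COPY ... TO writes the file on the database server and needs the corresponding privileges there. If you are exporting from a client machine, psql's \copy meta-command takes the same parenthesized query (written on a single line) and writes the file where psql runs instead:

-- client-side variant; '...' stands for the same query as above, on one line
\copy (SELECT ...) TO 'output.json'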

To prepend each data line with a command line, you can use a trick: a set-returning function that emits two rows per input. Part of the formatting can be moved into the function while you're at it.

create or replace function format_data_line(command text, data_str text)
returns setof text language plpgsql as $$
begin
    -- first output row: the command line, unchanged
    return next command;
    -- second output row: the data line, with the date format fixed
    -- and the newlines stripped
    return next
        replace(
            regexp_replace(data_str,
                '(\d\d\d\d-\d\d-\d\d)T', '\1 ', 'g'),
            e' \n ', '');
end $$;
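
A quick way to see what the function emits, two rows per call (the command and data strings here are made up for the test):

SELECT format_data_line(
    '{"index":{"_id":1}}',
    e'[{"id":1}, \n {"time":"2095-04-20T12:28:55"}]');
-- {"index":{"_id":1}}
-- [{"id":1},{"time":"2095-04-20 12:28:55"}]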

COPY (
    WITH f_1 AS (
       SELECT id, json_agg(file1.*) AS tag
       FROM forum.file1
       GROUP BY id
    ), f_2 AS (
       SELECT id, json_agg(file2.*) AS tag
       FROM forum.file2
       GROUP BY id
    ), f_3 AS (
       SELECT id, json_agg(file3.*) AS tag
       FROM forum.file3
       GROUP BY id
    )
    SELECT
        format_data_line(
            'my command',   -- placeholder for your command line
            format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
                a.id, a.tag, b.tag, c.tag))
    FROM f_1 AS a, f_2 AS b, f_3 AS c
    WHERE b.id = a.id AND c.id = a.id
) TO '/full/path/to/your/file';
  • This seems to be doing exactly what I asked for, thanks! However, stream2es seems to break when I feed it a large file, so I'm planning to go back to using the Bulk API. For that I need to add a command line ({command}\n) above each data line. I've been trying to figure out how to add line breaks to the above but haven't had any luck (E'\n' and standard_conforming_strings formatting enter \n as a literal string that isn't recognized as a line feed). Commented Nov 12, 2015 at 21:42
  • Thanks again! Unfortunately my example didn't include the case where a given subject is missing from a file. With the structure above, the ANDs eliminate those subjects. I figured out how to use LEFT OUTER JOIN instead of WHERE to fix that, but the leftover string from the format statement is an issue. For example, if you remove the two lines for id 1 from file2.csv, the output gives "file2":,"file3":[{"id":1, which isn't valid JSON. I've tried forcing "file2":{},"file3" in the NULL case but can't get it to work with CASE or COALESCE (a sketch using COALESCE follows these comments). Commented Nov 14, 2015 at 17:04
  • Also, I should note that I added a format() call since I wanted the id to be dynamic in my command line: format('{"index":{"_index":"forum_sql_bulk","_type":"subject","_id":%s}}', a.id), format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}', a.id, a.tag, b.tag, c.tag). One other thing: my output file size is blowing up (10 MB + 30 MB + 750 MB tables = 1.8 GB output instead of 790 MB). I think this is because the number of rows (and therefore the line width) for some ids in some files is much larger. I'll revise the code here to perform my BULK update differently. Commented Nov 14, 2015 at 17:10
  • It's far beyond the scope of the original question. Please consider asking a new question. Commented Nov 14, 2015 at 17:43
  • OK, if adding the {} in the case of a null json_agg(filex.*) isn't straightforward, I can start a new thread. Thanks for the help! Commented Nov 14, 2015 at 18:23
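
For completeness, a minimal sketch of the empty-array fallback asked about in the comments (untested against the original data; it reuses the f_1/f_2/f_3 CTEs from above). LEFT JOIN keeps ids that are absent from file2 or file3, and since COALESCE only tests for NULL it works fine on json values:

SELECT format('{"id":%s,"file1":%s,"file2":%s,"file3":%s}',
           a.id,
           a.tag,
           COALESCE(b.tag, '[]'::json),  -- empty array when the id is missing from file2
           COALESCE(c.tag, '[]'::json))  -- ... or from file3
FROM f_1 AS a
LEFT JOIN f_2 AS b ON b.id = a.id
LEFT JOIN f_3 AS c ON c.id = a.id;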
