The following script changes the start and end dates of several queries saved as files in the folder /app/queries. It is used by executing the script with two date strings as arguments, for example:
./config_query.sh 2022-03-28 2022-03-30
This would replace the start dates of all queries in the folder /app/queries with the first argument and the end dates with the second argument.
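For example, the invocation above turns the range line of the leads query (shown in full further below) from
|> range(start: 2022-03-11T00:00:00.000Z, stop: 2022-03-15T00:00:00.000Z)
into
|> range(start: 2022-03-28T00:00:00.000Z, stop: 2022-03-30T00:00:00.000Z)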
I would like some feedback on the readability of the script. The sed commands in particular are very hard to read, which makes the script hard to maintain and extend. Is there a better way to change the start and end dates of the queries? (I sketch one possible direction right after the script.)
This is the script:
#!/usr/bin/env bash
#: Title : config_query.sh
#: Date : 2022-04-06
#: Version : 1.0
#: Description : Change start and end dates of queries to values of parameters $1 and $2.
#: Options : None
# Note: The script checks that both date strings match the format YYYY-mm-dd,
# but it does not validate the date values themselves, i.e. valid dates have
# to be passed to generate consistent queries.
ROOT="/app"
QUERIES="${ROOT}/queries"
arr=( "${QUERIES}/leads.flux" "${QUERIES}/subs.flux" "${QUERIES}/gads.sql" "${QUERIES}/ms.sql" "${QUERIES}/li.sql" )
# Check number of arguments passed to script:
if [ "$#" -ne 2 ]; then
echo "Illegal number of arguments. Program expects two date strings (YYYY-mm-dd)."
exit 1
fi
START_DATE=$1
END_DATE=$2
# Check if start and end are formatted correctly:
for date in "$START_DATE" "$END_DATE"
do
format="^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
if ! [[ $date =~ $format ]]
then
echo "Date $date is incorrectly formatted (not YYYY-mm-dd)."
exit 1
fi
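# (The regex above only validates the format; a date *value* check could be
# added here as well, e.g. with GNU date:
#   date -d "$date" > /dev/null 2>&1 || { echo "Invalid date: $date"; exit 1; })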
done
START_DATE_DECR=$(date '+%Y-%m-%d' -d "$START_DATE - 1 day")
START_DATE_INCR=$(date '+%Y-%m-%d' -d "$START_DATE + 1 day")
END_DATE_INCR=$(date '+%Y-%m-%d' -d "$END_DATE + 1 day")
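# START_DATE_DECR/END_DATE_INCR widen the subs query's range by one day on each
# side; START_DATE_INCR is used by its trailing range() call, which drops the
# first row again.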
# Replaces 'start' and 'stop' in Influx leads query:
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}/start: ${START_DATE}T00:00:00.000Z, stop: ${END_DATE}/g" "${arr[0]}"
# Replaces 'start' and 'stop' in Influx subs query:
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}/start: ${START_DATE_DECR}T00:00:00.000Z, stop: ${END_DATE_INCR}/g" "${arr[1]}"
sed -i -E "s/start: .{4}-.{2}-.{2}T00:00:00.000Z, stop: .{4}-.{2}-.{2}.*delete.first.row/start: ${START_DATE_INCR}T00:00:00.000Z, stop: ${END_DATE_INCR}T00:00:00.000Z) \/\/ delete first row/g" "${arr[1]}"
# Replaces 'START_DATE' and 'END_DATE' in BigQuery queries:
for ((i=2; i<${#arr[@]}; i++))
do
sed -i -E "s/\('.{4}-.{2}-.{2}', DATE_SUB\('.{4}-.{2}-.{2}',/\('$START_DATE', DATE_SUB\('$END_DATE',/g" "${arr[$i]}";
sed -i -E "s/>= '.{4}-.{2}-.{2}'/>= '$START_DATE'/g" "${arr[$i]}";
sed -i -E "s/< '.{4}-.{2}-.{2}'/< '$END_DATE'/g" "${arr[$i]}";
done
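This is the kind of refactoring I have in mind: factoring the repeated patterns into variables so each substitution reads more clearly (a sketch only, not part of the current script; it assumes the same GNU sed as above and shows only the leads query):
D='[0-9]{4}-[0-9]{2}-[0-9]{2}'   # date pattern (YYYY-mm-dd)
TS="${D}T00:00:00\.000Z"         # Flux timestamp pattern
# Same substitution as for the leads query above, with the patterns named:
sed -i -E "s/start: ${TS}, stop: ${D}/start: ${START_DATE}T00:00:00.000Z, stop: ${END_DATE}/g" "${arr[0]}"
Would something along these lines be an improvement, or is there a more idiomatic approach?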
These are the queries for reference (I'm not seeking feedback on the queries; I just included them for completeness):
/app/queries/gads.sql:
WITH dates AS (
SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), gads_data AS (
SELECT DATE(__hevo_report_date) AS date,
campaign,
ad_group,
sum(impressions) AS impressions,
sum(clicks) AS clicks,
ROUND(sum(cost)/1000000, 2) AS spend
FROM `table1`
WHERE DATE(__hevo_report_date) >= '2022-03-11' AND DATE(__hevo_report_date) < '2022-03-15'
GROUP BY date, campaign, ad_group
)
SELECT dates.date, campaign, ad_group, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN gads_data
ON dates.date = gads_data.date
ORDER BY dates.date
/app/queries/li.sql:
WITH dates AS (
SELECT date
FROM
UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), gads_data AS (
SELECT DATE(TIMESTAMP_MILLIS(date)) AS date,
campaign_id,
sum(impressions) AS impressions,
sum(clicks) AS clicks,
sum(cost_in_local_currency) AS spend
FROM `table2`
WHERE DATE(TIMESTAMP_MILLIS(date)) >= '2022-03-11' AND DATE(TIMESTAMP_MILLIS(date)) < '2022-03-15'
GROUP BY date, campaign_id
), names AS (
SELECT id, name AS campaign
FROM `ten-x-studio-data.threema.li_campaign`
)
SELECT dates.date, campaign, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN gads_data
ON dates.date = gads_data.date
LEFT OUTER JOIN names
ON gads_data.campaign_id = names.id
ORDER BY dates.date
/app/queries/ms.sql:
WITH dates AS (
SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2022-03-11', DATE_SUB('2022-03-15', INTERVAL 1 DAY), INTERVAL 1 DAY)) AS date
), ms_data AS (
SELECT DATE(date) AS date,
campaign_name AS campaign,
ad_group_name AS ad_group,
sum(impressions) AS impressions,
sum(clicks) AS clicks,
sum(spend) AS spend
FROM `table3`
WHERE DATE(date) >= '2022-03-11' AND DATE(date) < '2022-03-15'
GROUP BY date, campaign_name, ad_group_name
)
SELECT dates.date, campaign, ad_group, impressions, clicks, spend
FROM dates
LEFT OUTER JOIN ms_data
ON dates.date = ms_data.date
ORDER BY dates.date
/app/queries/leads.flux:
from(bucket: "bucket1")
|> range(start: 2022-03-11T00:00:00.000Z, stop: 2022-03-15T00:00:00.000Z)
|> filter(fn: (r) => r["env"] == "productive")
|> duplicate(as: "val", column: "_value")
|> difference(columns: ["val"], nonNegative: true, keepFirst: true)
|> fill(column: "val", value: 0.0)
|> group(columns: ["_field", "path"])
|> aggregateWindow(every: 1d, fn: sum, column: "val")
|> yield(name: "mean")
/app/queries/subs.flux:
// Get added values
a = from(bucket: "bucket2")
|> range(start: 2022-03-10T00:00:00.000Z, stop: 2022-03-16T00:00:00.000Z)
|> filter(fn: (r) => r["env"] == "productive")
|> duplicate(column: "_value", as: "added_values")
|> difference(columns: ["added_values"], nonNegative: true, keepFirst: true)
|> fill(column: "added_values", value: 0.0)
|> aggregateWindow(every: 1d, fn: sum, column: "added_values")
|> map(fn:(r) => ({r with _time: string(v: r._time)}))
|> drop(columns:["_measurement", "_start", "_stop", "env"])
// Get total values
b = from(bucket: "bucket2")
|> range(start: 2022-03-10T00:00:00.000Z, stop: 2022-03-16T00:00:00.000Z)
|> filter(fn: (r) => r["env"] == "productive")
|> aggregateWindow(every: 1d, fn: last, column: "_value")
|> map(fn:(r) => ({r with _time: string(v: r._time)}))
|> drop(columns:["_measurement", "_start", "_stop", "env"])
// Get subtracted values
c = from(bucket: "bucket2")
|> range(start: 2022-03-10T00:00:00.000Z, stop: 2022-03-16T00:00:00.000Z)
|> filter(fn: (r) => r["env"] == "productive")
|> duplicate(column: "_value", as: "sub_values")
|> difference(columns: ["sub_values"], nonNegative: false, keepFirst: true)
|> fill(column: "sub_values", value: 0.0)
|> map(fn: (r) => ({r with sub_values: if r.sub_values < 0.0 then r.sub_values else 0.0}))
|> aggregateWindow(every: 1d, fn: sum, column: "sub_values")
|> map(fn:(r) => ({r with _time: string(v: r._time)}))
|> drop(columns:["_measurement", "_start", "_stop", "env"])
temp = join(tables: {a:a, b:b},
on: ["_field", "_time", "isTrial", "offer", "segment"]
)
res = join(tables: {temp:temp, c:c},
on: ["_field", "_time", "isTrial", "offer", "segment"]
)
|> map(fn: (r) => ({ r with _time: time(v: r._time) }))
|> range(start: 2022-03-12T00:00:00.000Z, stop: 2022-03-16T00:00:00.000Z) // delete first row
|> group()
|> sort(columns: ["_time", "_field", "isTrial", "offer", "segment"], desc: false)
|> yield()