Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

Following on from this question, where I extracted 10 random lines from a file, I now also want the remaining 90 lines as a separate file.

Since the document has 100 lines, indexed 1 to 100, the problem boils down to finding the complement of the set ind within {1, 2, ..., 100}, where

ind=$(shuf -i 1-100 -n 10 | sort -n)

So my questions are:

  1. How can I generate the sequence 1, 2, ..., 100 efficiently?
  2. It seems this can be done with comm. If so, how do I run comm on arrays (not files)?
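For reference, seq generates the sequence, and comm can consume "arrays" through bash process substitution rather than temporary files. A sketch, assuming bash and GNU coreutils; note that comm needs both inputs in the same collation, so each side is sorted lexically first:

```shell
#!/usr/bin/env bash
ind=$(shuf -i 1-100 -n 10 | sort -n)
# comm -13 keeps lines unique to the second input, i.e. the complement;
# both sides are re-sorted lexically so comm's order requirement holds
rest=$(comm -13 <(printf '%s\n' $ind | sort) <(seq 100 | sort))
echo "$rest" | wc -l    # 90
```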
Why not use the same sed command with d instead of p? –  muru May 27 at 17:10
@muru Could you please elaborate why? I tried, using d gives me an empty file –  Sibbs Gambling May 28 at 2:24
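(For future readers: the empty file comes from keeping sed's -n flag. p needs -n to suppress the default output, but d relies on that default output - every line it does not delete gets printed - so with d the -n must be dropped. A sketch, assuming infile is the 100-line input:)

```shell
seq 100 > infile
# build a delete-script such as "5d", "17d", ... (one command per picked line)
shuf -i 1-100 -n 10 | sed 's/$/d/' > pick.sed
sed -f pick.sed infile > rest.txt   # no -n: the 90 unpicked lines print
wc -l < rest.txt                    # 90
```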

3 Answers


Based on my proposal from the other thread:

awk '
  BEGIN { srand(); do a[int(100*rand()+1)]; while (length(a)<10) }
  NR in a
' ~/orig.txt > ~/short.txt

this could be changed to create both files:

awk -v range=100 -v offset=1 -v amount=10 '
  BEGIN { srand(); do a[int(range*rand()+offset)]; while (length(a)<amount) }
  NR in a    { print > "short.txt" }
  !(NR in a) { print > "rest.txt" }
' ~/orig.txt

(Note that inside awk you cannot use ~ as a shorthand for your home directory. It's possible, though, to read HOME from ENVIRON[], as in: print > (ENVIRON["HOME"] "/short.txt"), and likewise print > (ENVIRON["HOME"] "/rest.txt").)

This is elegant, but I am wondering why I get awk: syntax error at source line 2 context is BEGIN { srand(); do >>> a[int(${ <<< awk: illegal statement at source line 2 awk: illegal statement at source line 2. –  Sibbs Gambling May 28 at 2:43
@Sibbs Gambling; Where does that $ and { come from that I see in your error message? (Are you using a modified version?) - In case you were trying to use shell variables in awk code (which doesn't work) I'll edit my answer to reflect that. –  Janis May 28 at 2:45
Yes, I changed 100 and 1 into two variables ${a} and ${b}. –  Sibbs Gambling May 28 at 2:47
@Sibbs Gambling; I added the parameter passing mechanism. Now you can replace the numbers in the call by your shell variables. –  Janis May 28 at 2:50
Great, it works now! One very quick follow-up: what if I also need to use variable in { print > "short.txt" }? I tried -v dir=$runDir and then { print > "${dir}/short.txt" }, but it is not working. Thanks a lot! –  Sibbs Gambling May 28 at 3:12
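(For the record: awk never expands ${dir} - the file name after > is an ordinary awk string expression, so the shell variable has to be passed with -v and concatenated. A sketch of the fix, assuming gawk for length() on an array, with a hypothetical output directory standing in for $runDir:)

```shell
dir=$(mktemp -d)    # hypothetical stand-in for $runDir
seq 100 | awk -v dir="$dir" '
  BEGIN { srand(); do a[int(100*rand()+1)]; while (length(a)<10) }
  NR in a    { print > (dir "/short.txt") }   # string concatenation, not ${dir}
  !(NR in a) { print > (dir "/rest.txt") }
'
```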

Ok, on second thought - I worked way too hard on that. You just need this:

shuf -i 1-100 -n10 |
sed 's/$/{p;b\n}/' |
sed -nf - -e 'w separate_file' infile >outfile

Though you might need a literal newline in place of the n in the sed substitution. Anyway, that does the same as the script below - it just doesn't bother spelling out the other 90 lines; they fall into place because they're in the file, so they need no special handling.

Here's the whole deal:

set  " $(shuf -i 1-100 -n 10) "
while [ "$((i+=1))" -le 100 ]
do    [ -z "${1##*[!0-9]$i[!0-9]*}" ]
      printf "$i%.$((!$?))s%.$?s\n" p H 
done| sed -nf - -e '$!d;x;s/.//p' <infile >outfile

There - we just basically write a sed script that looks like:

1H
2H
3H
4p
5H
...
90p
91H
...

And so on through to 100. On the last line - after all of the randomly selected lines have already been printed, we exchange into Hold space, s///ubstitute away the first inserted \newline character, and print the lot of the rest.
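The mechanism is easier to see with a fixed selection. Here's the same script shape on a 10-line input, with lines 3 and 7 "picked" (a sketch, POSIX sed):

```shell
seq 10 > infile
# picked lines get p (print now); the rest get H (append to hold space)
printf '%s\n' 1H 2H 3p 4H 5H 6H 7p 8H 9H 10H > script.sed
sed -nf script.sed -e '$!d;x;s/.//p' infile
# prints 3 and 7 first, then 1 2 4 5 6 8 9 10 out of the hold space
```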

To do this without the shell loop you could do:

set  "$(shuf -i 1-100 -n 10)"
{ seq 100 | grep -Fxv "$1"; echo "$1"; } |
sed '1,90s/$/H/;91,$s/$/p/' |
sed -nf - -e '$!d;x;s/.//p' <infile >outfile

But I'm not sure whether on this scale that would be beneficial at all.

Anyway, I used a seq 100 output file as a test, and after running it through it printed...

3
4
5
19
57
63
64
73
80
88
1
2
6
7
8
9
10
11
12
13
14
15
16
...

...and on through to 100 for all of the lines not included in the initial random 10.

Interesting approaches... I would have used printf '%s{w random10\nd}\n' $(shuf -i 1-100 -n 10) | sed -f - infile > remaining90 –  don_crissti May 28 at 0:36
@don_crissti: yeah - I probably would have as well - and definitely considered it - but I've already taken a stern scolding about $IFS today and didn't want to hear it again. Anyway, it was also to demonstrate how to find the complement - that's why I used grep -Fxv on a full set and the [ -z ... ] test. –  mikeserv May 28 at 0:41
@don_crissti - Oh, and I guess I didn't realize at first that two outbuffers were wanted - so the bottom stuff all writes to the same stream, with the random selection output first. –  mikeserv May 28 at 0:50

I believe you can do everything on the command line, but some problems are better solved with an actual programming language. As an example, a Python-based solution to your problem would be:

import random
import pprint

with open("file.txt", "w") as f:
  # create a file filled with numbers from 00 to 99
  f.writelines(map(lambda x: "%02d\n" % x, range(100)))

with open("file.txt") as f:
  # read it and assign each line to array, strip newlines 
  ar = set(map(lambda x: x.strip(), f.readlines()))

selection = set(random.sample(sorted(ar), 10))  # sample() needs a sequence, not a set
rest = ar - selection

pprint.pprint(selection)
pprint.pprint(rest)
