Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

Following on from this question, where I extracted 10 random lines from a file, I now also want the remaining 90 lines as a separate file.

Since the document has 100 lines, indexed 1 to 100, the problem boils down to finding the complement of the set ind within {1, 2, ..., 100}, where

ind=$(shuf -i 1-100 -n 10 | sort -n)

So my questions are:

  1. How can I generate the sequence 1, 2, ..., 100 efficiently?
  2. It seems this can be done with comm. If so, how do I run comm on arrays (not files)?
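For reference, seq generates the sequence, and comm can consume "arrays" through bash process substitution rather than temporary files. A sketch, assuming bash and GNU coreutils; note that comm needs both inputs in the same collation, so each side is sorted lexically first:

```shell
#!/usr/bin/env bash
ind=$(shuf -i 1-100 -n 10 | sort -n)
# comm -13 keeps lines unique to the second input, i.e. the complement;
# both sides are re-sorted lexically so comm's order requirement holds
rest=$(comm -13 <(printf '%s\n' $ind | sort) <(seq 100 | sort))
echo "$rest" | wc -l    # 90
```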
Why not use the same sed command with d instead of p? –  muru May 27 at 17:10
@muru Could you please elaborate why? I tried, using d gives me an empty file –  Sibbs Gambling May 28 at 2:24
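(For future readers: the empty file comes from keeping sed's -n flag. p needs -n to suppress the default output, but d relies on that default output - every line it does not delete gets printed - so with d the -n must be dropped. A sketch, assuming infile is the 100-line input:)

```shell
seq 100 > infile
# build a delete-script such as "5d", "17d", ... (one command per picked line)
shuf -i 1-100 -n 10 | sed 's/$/d/' > pick.sed
sed -f pick.sed infile > rest.txt   # no -n: the 90 unpicked lines print
wc -l < rest.txt                    # 90
```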

3 Answers


Based on my proposal from the other thread:

awk '
  BEGIN { srand(); do a[int(100*rand()+1)]; while (length(a)<10) }
  NR in a
' ~/orig.txt > ~/short.txt

this could be changed to create both files:

awk -v range=100 -v offset=1 -v amount=10 '
  BEGIN { srand(); do a[int(range*rand()+offset)]; while (length(a)<amount) }
  NR in a    { print > "short.txt" }
  !(NR in a) { print > "rest.txt" }
' ~/orig.txt

(Note that inside awk you cannot use ~ as a shorthand for your home directory. It's possible, though, to read HOME from ENVIRON[], as in: print > (ENVIRON["HOME"] "/short.txt"), and likewise print > (ENVIRON["HOME"] "/rest.txt").)

This is elegant, but I am wondering why I get awk: syntax error at source line 2 context is BEGIN { srand(); do >>> a[int(${ <<< awk: illegal statement at source line 2 awk: illegal statement at source line 2. –  Sibbs Gambling May 28 at 2:43
@Sibbs Gambling; Where does that $ and { come from that I see in your error message? (Are you using a modified version?) - In case you were trying to use shell variables in awk code (which doesn't work) I'll edit my answer to reflect that. –  Janis May 28 at 2:45
Yes, I changed 100 and 1 into two variables ${a} and ${b}. –  Sibbs Gambling May 28 at 2:47
@Sibbs Gambling; I added the parameter passing mechanism. Now you can replace the numbers in the call by your shell variables. –  Janis May 28 at 2:50
Great, it works now! One very quick follow-up: what if I also need to use variable in { print > "short.txt" }? I tried -v dir=$runDir and then { print > "${dir}/short.txt" }, but it is not working. Thanks a lot! –  Sibbs Gambling May 28 at 3:12
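(For the record: awk never expands ${dir} - the file name after > is an ordinary awk string expression, so the shell variable has to be passed with -v and concatenated. A sketch of the fix, assuming gawk for length() on an array, with a hypothetical output directory standing in for $runDir:)

```shell
dir=$(mktemp -d)    # hypothetical stand-in for $runDir
seq 100 | awk -v dir="$dir" '
  BEGIN { srand(); do a[int(100*rand()+1)]; while (length(a)<10) }
  NR in a    { print > (dir "/short.txt") }   # string concatenation, not ${dir}
  !(NR in a) { print > (dir "/rest.txt") }
'
```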

Ok, on second thought - I worked way too hard on that. You just need this:

shuf -i 1-100 -n10 |
sed 's/$/{p;b\n}/' |
sed -nf - -e 'w separate_file' infile >outfile

Though you might need a literal newline in place of the n in the sed substitution. Anyway, that does the same as the script below - it just doesn't bother spelling out the other 90 lines; they fall into place because they're in the file, so they need no special handling.

Here's the whole deal:

set  " $(shuf -i 1-100 -n 10) "
while [ "$((i+=1))" -le 100 ]
do    [ -z "${1##*[!0-9]$i[!0-9]*}" ]
      printf "$i%.$((!$?))s%.$?s\n" p H 
done| sed -nf - -e '$!d;x;s/.//p' <infile >outfile

There - we just basically write a sed script that looks like:

1H
2H
3H
4p
5H
...
90p
91H
...

And so on through to 100. On the last line - after all of the randomly selected lines have already been printed, we exchange into Hold space, s///ubstitute away the first inserted \newline character, and print the lot of the rest.
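The mechanism is easier to see with a fixed selection. Here's the same script shape on a 10-line input, with lines 3 and 7 "picked" (a sketch, POSIX sed):

```shell
seq 10 > infile
# picked lines get p (print now); the rest get H (append to hold space)
printf '%s\n' 1H 2H 3p 4H 5H 6H 7p 8H 9H 10H > script.sed
sed -nf script.sed -e '$!d;x;s/.//p' infile
# prints 3 and 7 first, then 1 2 4 5 6 8 9 10 out of the hold space
```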

To do this without the shell loop you could do:

set  "$(shuf -i 1-100 -n 10)"
{ seq 100 | grep -Fxv "$1"; echo "$1"; } |
sed '1,90s/$/H/;91,$s/$/p/' |
sed -nf - -e '$!d;x;s/.//p' <infile >outfile

But I'm not sure whether on this scale that would be beneficial at all.

Anyway, I used a seq 100 output file as a test, and after running it through it printed...

3
4
5
19
57
63
64
73
80
88
1
2
6
7
8
9
10
11
12
13
14
15
16
...

...and on through to 100 for all of the lines not included in the initial random 10.

Interesting approaches... I would have used printf '%s{w random10\nd}\n' $(shuf -i 1-100 -n 10) | sed -f - infile > remaining90 –  don_crissti May 28 at 0:36
@don_crissti: yeah - I probably would have as well - and definitely considered it - but I've already taken a stern scolding about $IFS today and didn't want to hear it again. Anyway, it was also to demonstrate how to find the complement - that's why I used grep -Fxv on a full set and the [ -z ... ] test. –  mikeserv May 28 at 0:41
@don_crissti - Oh, and I guess I didn't realize at first that two outbuffers were wanted - so the bottom stuff all writes to the same stream, with the random selection output first. –  mikeserv May 28 at 0:50

I believe you can do everything on the command line, but some problems are better solved with an actual programming language. As an example, a Python-based solution to your problem would be:

import random
import pprint

with open("file.txt", "w") as f:
  # create a file filled with numbers from 00 to 99
  f.writelines(map(lambda x: "%02d\n" % x, range(100)))

with open("file.txt") as f:
  # read it and assign each line to array, strip newlines 
  ar = set(map(lambda x: x.strip(), f.readlines()))

selection = set(random.sample(sorted(ar), 10))  # sample() needs a sequence, not a set
rest = ar - selection

pprint.pprint(selection)
pprint.pprint(rest)
