/* Sample Data Set Without Replacement Author: *===========================================================* Dr David Baird Biometrician Mail: AgResearch, PO Box 60, Lincoln, NEW ZEALAND Phone: +64 3 325 6900 Fax: +64 3 325 2946 Direct Dial: +64 3 325 6901 3975# Internet: BairdD@AgResearch.CRI.NZ *===========================================================* Date: Tue, 29 Aug 95 Provided without guarantee for public non-commercial use. To extract only 10% of the total rows (without replacement): follow the following steps. */ /* Obtain random permutation if vector x */ proc permute(x) ; retp(submat(sortc(x~rndu(rows(x),1),2),0,1)) ; endp ; /* Load ascii data matrix */ load data[n,k] = datafile.txt; /* Number of samples to take */ nsamples = int(0.10*n); /* Obtain random row numbers */ rownos = submat(permute(seqa(1,1,n)),seqa(1,1,nsamples),0); /* Take selected rows from data matrix */ sample = data[rownos,.]; /********************************************************* PROC: sample Author: Diderik Lund Diderik.Lund@econ.uio.no Date: 14 Oct 96 Provided without guarantee for public non-commercial use. *********************************************************/ PROC sample(h,k); /* Samples h numbers between 1 and k (incl.) without replacement. */ LOCAL rando; /* One uniform random number is multiplied by k, an integer > 1. It is then uniform over the interval [0,k]. Rounding up by CEIL makes it an integer, all integers in (0,k] have equal probabilities. */ rando = CEIL(RNDU(1,1)*k); /* In loop, attempts are made to draw the remaining random numbers, multiplying and rounding them as the first one. However, replication is now possible, and replicates are discarded by the UNIQUE command. It is likely that the first attempt leaves rando short of h elements by some percentage. The next attempt preserves the unique numbers from the previous one(s), and is likely to add some new, thereby reducing that percentage. After a few more attempts, rando will have h unique integers. */ DO UNTIL ROWS(rando) == h; rando = UNIQUE(CEIL(RNDU(h-ROWS(rando),1)*k)|rando,1); ENDO; RETP(rando); ENDP; /********************************************************* PROC: sample Author: Paul Fackler paul_fackler@ncsu.edu Date: 16 Oct 96 Provided without guarantee for public non-commercial use. *********************************************************/ /* Sampling without replacement. Returns a set of m numbers selected randomly from {1..n}. Uses "fastest" of two methods (read commentary). A few days ago I provided a method for sampling without replacement that contained a loop and Ted Thompson provided a loopless procedure. Here is Thompson's proc: fn srswor(m,n)=submat(rankindx(rndu(n,1),1),seqa(1,1,m),1); Ted Thompson tat5@CCDDDT1.EM.CDC.GOV A comparison of the two is instructive because it illustrates the tradeoffs involved in using an interpreted vectorized language like GAUSS. The problem here is how to generate m random numbers from the integers 1..n without replacement. The version I provided has a loop of length m which involves swapping elements of the vector on the first n integers (seqa(1,1,n)) based on vector of m random numbers. Thompson's method involves generating n random numbers and sorting them (using the external function rankindx). The relative speed of these algorithms depends on the relationship between m and n. The problem in using a language like Gauss is that it uses an interpretor to parse the commands. Consequently, each time it goes through a loop, it must reinterpret the commands, which slows down execution. The general rule is to avoid loops where possible, particularly ones that must be executed many times. In the case of sampling without replacement, one should avoid a looping procedure like mine if m is large. It turns out, however, that it is better to loop m times when m is small relative to n than to generate n random numbers and sort them (this can be very time consuming when n is large). I include below a sampling without replacement proc that compares m and n and uses the looping procedure when m is small and the loopless sorting procedure when m is large. The cutoff is based on a quick, unscientific comparision for n on the order of 5000-25000; the optimal cutoff is probably machine dependent anyway. If speed is important for a particular application (if you use this proc in a resampling procedure for example) you may want to experiment with which algorithm is faster for your application. This is a clear case where having an internal function or a REX or DLL implementation would be preferable to coding in GAUSS as there seems to be no way to efficiently write this procedure using compiled GAUSS functions. */ proc without(m,n); local s,i,x; if m>n; "Error: cannot select m numbers from {1..n} if m>n"; stop; endif; if ln(n) > (1.14*ln(m)); s=ceil(seqa(n,-1,m).*rndu(m,1))+seqa(0,1,m); x=seqa(1,1,n); i=0; do while i