SAS: sum over subset of a table column (multiple times)
SAS: sum over subset of a table column (multiple times)
(I'm new to SAS, and I struggle with how involved it is to do stuff column-wise which would be quite easy in a "normal" language. Please bear with me if this is extremely basic.)
I have a table with values type, a1-a10 and b1-b10, and I would like to find (for each N) the sum of bN for those rows where aN is positive. I can do it one variable at a time, e.g. with something like this:
proc sql;
create table work.test1 as
select type, b1
from work.table
where (a1 >0);
run;
and then sum over all those tables and then merge them, but that's going to be a lot of code and a bit of a faff. Is there a nice and compact way to do this?
Edit: The output I'd like is a table with values type, sum1-sum10, where sumN is the sum described above.
Sample data:
type | a1 | a2 | ... | b1 | b2 | ...
------------------------------------
cat 10 14 ... 1 2 ...
cat -5 3 ... 1 1 ...
dog 35 -1 ... 9 3 ...
dog 9 2 ... 0.5 1 ...
Desired output:
type | sum1 | sum2 | ...
------------------------
cat 1 3 ...
dog 9.5 1 ...
So for each type and N sums those bN where the aN on the same row is positive.
No, I'd like one table with variables type, sum1, sum2, etc.
– Ketil Tveiten
Jun 29 at 9:15
Can you add a sample data? Do you want the sum to be done within each type?
– John Doe
Jun 29 at 9:23
Yes, I'd like the sum by type. I'll add some sample data.
– Ketil Tveiten
Jun 29 at 9:25
I stuck with your initial idea. There are other ways of solving the problem. Let me know if that does it for you.
– John Doe
Jun 29 at 9:52
4 Answers
4
You can do this in a single SQL
select
with one case
clause for each variable.
SQL
select
case
Data
data have; input
type $ a1 a2 b1 b2 ; datalines;
cat 10 14 1 2
cat -5 3 1 1
dog 35 -1 9 3
dog 9 2 0.5 1
run;
SQL Example
proc sql;
create table want_way1
as select
type
, sum (case when a1 > 0 then b1 else 0 end) as sum1
, sum (case when a2 > 0 then b2 else 0 end) as sum2
from have
group by type
;
SQL does not have arrays, so a macro would be written to generate either the entire SQL or only the needed clauses for each of the N a
and b
variable pairs. The macro can examine the data's metadata if you wanted to the macro to discover N itself.
a
b
DATA Step example
Using DOW loop for group processing and arrays for item processing. Quite compact if you remove the comments.
data want_way2 (keep=type sum:);
do until (last.type);
set have;
by type;
* array statement is non-executable, but associates PDV variables with the array reference;
* array statement will create new variables in PDV if needed;
array a a1-a2; /* connects existing variables with array */
array b b1-b2; /* connects existing variables with array */
array s sum1-sum2; /* creates new variables and connects them with array */
* repurpose _n_ as simply an automatic variable that does not need to be dropped;
do _n_ = 1 to dim(a);
s(_n_) = sum ( s(_n_) , ifn ( a(_n_) > 0, b(_n_), 0 ) );
end;
end;
run;
This is exactly what I was looking for, thanks!
– Ketil Tveiten
Jun 29 at 10:43
The DOW example relies on tacit automatic actions of the
DATA
step implicit loop. Use with caution if adding that example into your coding bag of tricks.– Richard
Jun 29 at 10:55
DATA
Yeah, I find that kind of trickery is way above my level; I don't like doing things I don't understand fully. I prefer the SQL example, that is more or less what I was trying to do in the first place, but didn't know how.
– Ketil Tveiten
Jun 29 at 10:58
Here's a proc summary
approach. This isn't quite as direct as the array approach, but it's much easier to generalise to other statistics that might interest you.
proc summary
data have;
input type $ a1 a2 b1 b2 ;
datalines;
cat 10 14 1 2
cat -5 3 1 1
dog 35 -1 9 3
dog 9 2 0.5 1
;
run;
/*Create a view of the dataset with suitable weight columns*/
data t_have / view = t_have;
set have;
array a[*] a1-a2;
do i = 1 to dim(a);
a[i] = a[i] > 0;
end;
run;
/*Use proc summary to sum across rows*/
proc summary nway data = t_have;
class type;
var b1 /weight=a1; /*You could macro-ise this bit to avoid excessive repetition*/
var b2 /weight=a2;
output out= want(drop=_:) sum= mean= /autoname;
run;
Good example of the features of PROC means/summary. But I reckon you already know that.
– data _null_
Jun 29 at 14:16
You can use arrays to perform this task in a single step, with a single output. An array stores values across columns in memory, which can then be looped through to do the calculations.
In this code, I've created 3 arrays, one for A1-A2, one for B1-B2 and one for the new variables SUM1-Sum2. Obviously in your real data you would change the range to A10, B10, SUM10.
I've added a load of comments to describe what the code is doing, but I also recommend reading up on arrays to get a better understanding.
/* create input data */
data have;
input type $ a1 a2 b1 b2;
datalines;
cat . 14 1 2
cat -5 3 1 1
dog 35 -1 9 3
dog 9 2 0.5 1
;
run;
/* sort data by type (needed for next step) */
proc sort data=have;
by type;
run;
data want;
set have;
by type; /* data neds to be sorted by this */
array var_a{2} a1-a2; /* store the values of a in an array */
array var_b{2} b1-b2; /* store the values of b in an array */
array sumvar{2} sum1-sum2; /* set up an array of sum variables (will also create physical variables) */
if first.type then do; /* set sum variables to zero when type changes */
do i = 1 to dim(sumvar);
sumvar{i} = 0;
end;
end;
do j=1 to dim(var_a); /* loop through each var_a value and add var_b to sum_N if var_a>0 */
if var_a{j}>0 then sumvar{j}+var_b{j}; /* syntax var1 + var2 retains value across rows */
end;
keep type sum: ; /* only keep required variables */
if last.type then output; /* only output last record for each type, with the total sum */
run;
Amended code to reset initial value of SUM_N variables to 0 instead of missing. Therefore if no VAR_A values are positive then the sum will show as 0
– Longfish
Jun 29 at 11:00
If I understand what you want then I think putting your code in a macro would do the trick. For example in the one below, I use a %do
loop to generate 10 different datasets, one for each N
. All I did was wrap your code up in a macro so I'm hoping your code does what you want already. And I replaced your run
by quit
otherwise proc sql
won't stop.
%do
N
run
quit
proc sql
Edit:
options symbolgen mprint mlogic;
%macro Y(N=);
%macro compute;
%do i = 1 %to &N.;
proc sql;
create table work.test&i. as
select type, sum(b&i.) as sum&i.
from work.table
where (a&i. >0)
group by type
order by type;
quit;
%end;
%mend;
%compute;
data want;
%do i = 1 %to &N.;
merge test&i.;
%end;
run;
%mend;
%Y(N=10);
This seems to work, except it fails if there are some missing values (my actual aN are often zero, so I guess the sum becomes '.' if none are positive?), is there a way around this? I don't quite understand how the sum() function handles missing values.
– Ketil Tveiten
Jun 29 at 10:02
@KetilTveiten look at the documentation for the SUM function to see how it handles missing values.
– data _null_
Jun 29 at 14:13
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Do you want each output to be in a different tables e.g test1, test2 etc.?
– John Doe
Jun 29 at 8:43