How to aggregate on a column while grouping by several columns values using CoPPer?
How to aggregate on a column while grouping by several columns values using CoPPer?
I have a dataset with the current stock for some products:
+--------------+-------+
| Product | Stock |
+--------------+-------+
| chocolate | 300 |
| coal | 70 |
| orange juice | 400 |
+--------------+-------+
and the sales for every product over the years for the current month and the next month in another dataset:
+--------------+------+-------+-------+
| Product | Year | Month | Sales |
+--------------+------+-------+-------+
| chocolate | 2017 | 05 | 55 |
| chocolate | 2017 | 04 | 250 |
| chocolate | 2016 | 05 | 70 |
| chocolate | 2016 | 04 | 200 |
| | | | | | | | |
| coal | 2017 | 05 | 40 |
| coal | 2017 | 04 | 30 |
| coal | 2016 | 05 | 50 |
| coal | 2016 | 04 | 20 |
| | | | | | | | |
| orange juice | 2017 | 05 | 400 |
| orange juice | 2017 | 04 | 350 |
| orange juice | 2016 | 05 | 400 |
| orange juice | 2016 | 04 | 300 |
+--------------+--------------+-------+
I want to compute the stock that I will need to order for the next month, by computing the expected sales over the current month and the next month, using the following formula:
ExpectedSales = max(salesMaxCurrentMonth) + max(salesMaxNextMonth)
The orders will then be
Orders = ExpectedSales * (1 + margin) - Stock
Where margin is, for example, 10%.
I tried to group by several columns using GroupBy
, as in the following, but it seems to aggregate by Stock
instead of Product
:
GroupBy
Stock
Product
salesDataset
.groupBy(Columns.col("Month"), Columns.col(“Product”))
.agg(Columns.max(“Sales”).as(“SalesMaxPerMonth”))
.agg(Columns.sum(“SalesMaxPerMonth”).as(SalesPeriod))
.withColumn(
“SalesExpected”,
Columns.col(“SalesPeriod”).multiply(Columns.literal(1 + margin)))
.withColumn(
“Orders”,
Columns.col(“SalesExpected”).minus(Columns.col(“Stock”)))
.withColumn(
“Orders”,
Columns.col(“Orders”).map((Double a) -> a >= 0 ? a: 0))
.doNotAggregateAbove()
.toCellSet()
.show();
1 Answer
1
You got the logic correct in terms of aggregation but there is another way to build your CellSet
, where you provide a map to describe the location of the query which generates it.
CellSet
salesDataset
.groupBy(Columns.col("Month"), Columns.col(“Product”))
.agg(Columns.max(“Sales”).as(“SalesMaxPerMonth”))
.agg(Columns.sum(“SalesMaxPerMonth”).as(SalesPeriod))
.withColumn(
“SalesExpected”,
Columns.col(“SalesPeriod”).multiply(Columns.literal(1 + margin)))
.withColumn(“Orders”, Columns.col(“SalesExpected”).minus(Columns.col(“Stock”)))
.withColumn(“Orders”, Columns.col(“Orders”).map((Double a) -> a >= 0 ? a: 0))
.doNotAggregateAbove()
.toCellSet(
Empty.<String, Object>map()
.put(“Product”,null)
.put(“Stock”, null))
.show();
Where null
in a location represents the wildcard *
.
null
*
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Comments
Post a Comment