A plea to Microsoft
A while back I posted a plea on the Microsoft website wanting them to update Excel to have a (better than) functionality to GoogleDocs with an SQL function. That didn't include an example of why it would be good to have but recently I was exploring the best way of emulating parts of a database in Access (yes, I know, horses for course, but there is a reasonably good reason behind this). So here are some examples of why we need =RUNSQL() in Excel.
The challenge was to take an input table which had the following fields (this is simplified a bit for the purpose of illustration, but essentially):
[Key] Unique index; [Name] text; [StartDate] date; [Order] number
where each row of data has a unique combination of [name]+[startdate]+[order] (referred to by the [key]) but individually neither [name], [startdate] nror [order] is unique.
The output is a table which gives, for each name, the latest date. Where there is more than one entry for the name with that date, the output must have the highest [order] number.
The SQL in the original database is pretty simple to follow:
SELECT T.Name, T.StartDate
FROM Mytable AS T
WHERE T.StartDate <= [SetDate] AND
T.StartDate = (SELECT MAX(T3.StartDate)
FROM Mytable AS T3
WHERE T3.Name = T.Name
AND T3.StartDate <= [SetDate]) AND
T.[Order] = (SELECT MAX(T2.[Order])
FROM Mytable AS T2
WHERE T2.Name = T.Name
AND T2.StartDate = T.StartDate
AND T2.StartDate <= [SetDate])
ORDER BY T.Name;
I'm using BI/Get and Transform to do the manipulation of the data in Excel, so the corresponding M-formula code for Excel is:
Source = Excel.CurrentWorkbook(){[Name="Mytable"]}[Content],
ChangedType = Table.TransformColumnTypes(Source,
{{"Key", Int64.Type}, {"Name", type text},
{"StartDate", type date}, {"Order", Int64.Type}}),
FilteredByDate = Table.SelectRows(ChangedType,
each [StartDate] <= SetDate),
GroupedByName = Table.Group(FilteredByDate,{"Name"},
{{"MaxStartDate", each List.Max([StartDate]), type date},
{"AllRows", each _, type table}}),
ExpandedRows = Table.ExpandTableColumn(GroupedByName,"AllRows",
{"Key", "StartDate", "Order"},{"Key", "StartDate",
"Order"}),
ChangedType1 = Table.TransformColumnTypes(ExpandedRows,
{{"Key", Int64.Type},{"StartDate", type date},
{"Order", Int64.Type}}),
FilteredMaxDate = Table.SelectRows(ChangedType1,
each [StartDate] = [MaxStartDate]),
GroupedByNameAndDate = Table.Group(FilteredMaxDate,
{"Name", "StartDate"},{{"MaxOrder",
each List.Max([Order]), Int64.Type},
{"AllRows", each _, type table}}),
ExpandedRows1 = Table.ExpandTableColumn
(GroupedByNameAndDate,"AllRows",
{"Key", "Order"},{"Key", "Order"}),
FilteredByOrder = Table.SelectRows(ExpandedRows1,
each [Order] = [MaxOrder]),
RemoveColumns = Table.RemoveColumns(FilteredByOrder,
{"Key","Order","MaxOrder"}),
ReorderedColumns = Table.ReorderColumns(RemoveColumns,
{"Name", "StartDate"}),
SortedRows = Table.Sort(ReorderedColumns,
{{"Name", Order.Ascending}})
in SortedRows
It's would be easier to read if LinkedIn gave wider columns. And its a bit cumbersome - I'm sure somebody can do better, but it works. Of course, this isn't dynamic so when the input data changes, the query has to be refreshed, but that is fine and a couple of lines of VBA do that automatically.
Now of course you can do it as an Excel formula, which will recalculate dynamically, and return the results as a dynamic array rather than a table, but who on earth would spend their time writing this:
Recommended by LinkedIn
=SORT(IFERROR(INDEX(Mytable[[Name]:[StartDate]], MATCH(LET(names,UNIQUE(Mytable[Name]),dates,MAXIFS(Mytable[StartDate],Mytable[Name],names,Mytable[StartDate],"<="&SetDate),XMATCH(names&"-"&dates&"-"&MAXIFS(Mytable[Order],Mytable[Name],names,Mytable[StartDate],dates),Mytable[Name]&"-"&Mytable[StartDate]&"-"&Mytable[Order],0)), Mytable[Key], 0), {1,2}), ""),1)
(Oh, I did, just to prove it could be done, but by tomorrow I will have forgotten what it does and nobody else would have a clue. So really not a good approach.)
Then there is putting the code into Python as a =PY() function which will recalculate dynamically but requires an internet connection to run:
=PY("
import pandas as pd
from datetime import datetime
dM = xl("Mytable")
dM.columns = ['Key', 'Name', 'StartDate', 'Order']
dM['StartDate'] = pd.to_datetime(dM['StartDate'], errors='coerce')
setdate = pd.to_datetime(xl("SetDate"), errors='coerce')
dM_filtered = dM[dM['StartDate'] <= setdate].copy()
max_startdate = dM_filtered.groupby('Name')['StartDate'].max().reset_index()
max_startdate = max_startdate.rename(columns={'StartDate': 'MaxStartDate'})
dM_merged = dM_filtered.merge(max_startdate, on='Name')
dM_merged = dM_merged[dM_merged['StartDate'] == dM_merged['MaxStartDate']]
max_order = dM_merged.groupby(['Name', 'StartDate'])['Order'].max().reset_index()
max_order = max_order.rename(columns={'Order': 'MaxOrder'})
result = dM_merged.merge(max_order, on=['Name', 'StartDate'])
result = result[result['Order'] == result['MaxOrder']]
result = result[['Name', 'StartDate']].sort_values(by='Name').values
result
")
So, in summary:
SQL would be the clear winner ... if only Microsoft had an SQL interpreter. GoogleSheets does. It has QUERY() which takes a SQL query and executes it. Excel doesn't. And Microsoft could make it better than Google docs by allowing it to execute multiline statements and interpret range names rather than just hard coded cell addresses. So come on, Microsoft, up your game. I'd miss the enjoyment of working out all this fun programming (yes, I do enjoy it) but we would all be more productive, wouldn't we?
Malcolm
P.S. It's even shorter in APL, but I'm not that unreasonable in my expectations.
latestByName ← {
names ← ∪ data[;1]
⊃{
name ← ⍵
rows ← data[data[;1] = name;]
maxDate ← ⌈/ rows[;2]
maxDateRows ← rows[rows[;2] = maxDate;]
maxOrder ← ⌈/ maxDateRows[;3]
maxRow ← maxDateRows[maxDateRows[;3] = maxOrder;]
⊃maxRow
}¨ names
}