kj
Posts:
188
Registered:
8/17/09


porting Python fn to MATLAB
Posted:
Jan 22, 2013 7:16 AM


I have a Python function, called "collect", that basically groups records according to some userspecified criterion, computes some userspecified aggregate function of each of these groups, and returns the sequence of these aggregate values.
I would like to port this "collect" function to "idiomatic" MATLAB.
An important aspect of "collect" is that it operates on sequences of *heterogeneous* tuples, aka "records", containing both string and numeric fields. (The number of fields and their datatypes are constant across all records.) This means that the MATLAB version of "collect" needs to operate either on arrays of <1xN cell>s or on an <MxN cell>. (I'm agnostic on which of these two alternatives would be preferable, and open to suggestions.)
In slightly more detail, the Python version of "collect" takes as arguments a sequence of records, and two functions, called "key" and "aggregate", respectively. The "key" function, as the name suggests, computes a key value for each record. The records are then grouped into the "equivalence classes" induced by this "key" function. (IOW, two records r and r' are in the same equivalence class if and only if key(r) == key(r').)
(By far, the most common type of "key" function is one that extracts from the record the value of some specific *string* field, possibly normalizing it in some way, e.g. converting it to upper case, removing leading and trailing white space, etc.)
Once all the equivalence classes are populated, the value that gets returned by "collect" is the sequence obtained by applying the "aggregate" function to each of these equivalence classes.
One final detail is that the order in which records appear in the input determines both the ordering of the records in each individual equivalence class as well as the ordering of the set of all the equivalence classes. This ordering, in turn, is preserved in the value returned by the "collect" function. (The "aggregate" function, also, may conceivably make use of the ordering of the records in each equivalence class, although in practice I have not run into this usecase yet.)
The question is how to implement something like "collect" in MATLAB.
My Python implementation (see below) uses a hash to implement the equivalence classes. I know that MATLAB also has support for hash tables, but I don't know if, in the case of MATLAB, a hash table is the best approach.
(NB: Python's syntax and standard data structures happen to be particularly well optimized for this sort of function. The MATLAB version of this function need not be equally succinct, and that's perfectly OK.)
Any suggestions would be much appreciated.
Thanks in advance!

(Unless the description above was not sufficiently clear, there's no need to read the rest of this post.)
PS: FWIW, here's the implementation of the function in Python
import collections as co
def collect(records, key, aggregate):
coll = co.OrderedDict() for rec in records: coll.setdefault(key(rec), []).append(rec)
return tuple(aggregate(accum) for accum in coll.values())
There are only a couple of nonobvious lines in this code. One of them is the line
coll.setdefault(key(rec), []).append(rec)
Here, the possibly obscure bit is the subexpression
coll.setdefault(key(rec), [])
It returns the element coll[key(rec)]; if this element does not already exist in the coll dictionary, it gets initialized to the empty list ([]).
The other somewhat strangelooking bit is the expression
tuple(aggregate(accum) for accum in coll.values())
It represents the tuple whose elements correspond to applying the aggregate function to each value of the coll dictionary. (Since the coll dict is an OrderedDict, these values are visited in the order of "first appearance" of their corresponding keys.)
And here is an example of the function above in action:
data = ( ('C', 70), ('I', 42), ('m', 29), ('M', 69), ('A', 62), ('a', 15), ('A', 61), ('A', 67), ('i', 64), ('h', 71), ('H', 40), ('u', 68), ('U', 60), ('u', 70), ('O', 26), ('H', 20), ('c', 20), ('T', 57), ('T', 69), ('t', 69), ('t', 69), ('T', 60), ('s', 70), ('S', 69), ('s', 4), ('h', 11), ('a', 34) )
for r in collect(data, key=lambda r: r[0].upper(), aggregate=lambda recs: (recs[0][0], sum(r[1] for r in recs))): print r
In this example, the input records are grouped caseinsensitively by their first element, and aggregated by summing together all the second elements in each group.
Thus, the output of the above is:
('C', 90) ('I', 106) ('m', 98) ('A', 239) ('h', 142) ('u', 198) ('O', 26) ('T', 324) ('s', 143)
Here, the key function extracts the string stored as the first element (r[0]) of each record r and returns its uppercase transform (r[0].upper()).
The aggregate function returns a new "summary" record whose first element is the same as the first element of the first record in its argument, and whose second element is the sum of the second elements of all the records in its argument.

