A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding).
Use the non-thresholded version of that linear classifier output as one additional feature-dimension over which you learn a decision tree. Then wrap this whole thing up as a system of boosted trees (that is, with more short trees added if needed).
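A minimal sketch of that recipe (dataset and hyperparameters are made up for illustration; assumes scikit-learn): train a linear model, append its raw decision score as one extra column, then boost short trees on the augmented matrix.

```python
# Sketch of the trick: linear score as an extra feature, then boosted trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# decision_function gives the non-thresholded linear output
# (signed distance to the separating hyperplane).
X_tr_aug = np.column_stack([X_tr, lin.decision_function(X_tr)])
X_te_aug = np.column_stack([X_te, lin.decision_function(X_te)])

# Short trees (max_depth=3) boosted on the augmented features.
gbt = GradientBoostingClassifier(max_depth=3, n_estimators=100,
                                 random_state=0).fit(X_tr_aug, y_tr)
print(gbt.score(X_te_aug, y_te))
```

The first split of the trees will typically land on that synthetic column, which is the point of the exercise.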
One of the reasons why it works so well is that it plays to the strengths of both models:
(i) Decision trees have a hard time fitting linear functions (they have to stair-step a lot, and therefore need many internal nodes), and
(ii) linear functions are terrible where equi-label regions have a recursively partitioned structure.
In the decision tree building process, the first cut would usually be on the added synthetic linear feature, which earns the tree the linear classifier's accuracy right away, leaving the DT algorithm to work on the part where the linear classifier is struggling. This idea is not that different from boosting.
One could also consider different (random) rotations of the data to form a forest of trees built using the steps above, though this was usually not necessary. Or rotate the axes so that they are all orthogonal to the linear classifier learned.
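A hedged sketch of the random-rotation variant (everything here is illustrative, assuming scikit-learn and numpy): draw a random orthogonal matrix per tree, rotate the features, and vote over the resulting small forest.

```python
# Sketch: a tiny "rotation forest" -- one tree per random rotation of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

trees, rotations = [], []
for _ in range(5):
    # Random orthogonal matrix via QR decomposition of a Gaussian draw.
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], X.shape[1])))
    rotations.append(Q)
    trees.append(DecisionTreeClassifier(max_depth=5,
                                        random_state=0).fit(X @ Q, y))

# Predict by majority vote over the rotated trees.
votes = np.stack([t.predict(X @ Q) for t, Q in zip(trees, rotations)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print((pred == y).mean())
```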
One place where DTs struggle is when the features themselves are very (column-)sparse: there are not many places to put a cut.
If you think about it, this is what reinforcement learning folks are doing. They take the vanilla state and lift it to an observed state by doing some additional computation on the original state data.
For example, you start with the raw coordinates of the snake in a snake game, but you can then calculate how many escape routes the snake has, and train on that.
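A toy illustration of that lifting step (all names and the grid setup here are made up): derive an "escape routes" feature from raw snake coordinates by counting free cells next to the head.

```python
# Lift raw coordinates into a derived feature: open escape routes for the head.
def escape_routes(head, body, walls, width, height):
    """Count free cells adjacent to the snake's head on a grid."""
    blocked = set(body) | set(walls)
    free = 0
    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        x, y = head[0] + dx, head[1] + dy
        if 0 <= x < width and 0 <= y < height and (x, y) not in blocked:
            free += 1
    return free

# Head in a corner of a 5x5 grid, body blocking one neighbor:
print(escape_routes((0, 0), [(0, 1)], [], 5, 5))  # -> 1
```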
That’s correct
I’ve spent most of the last 20 years doing reinforcement learning and it is exceptionally simple conceptually
The challenge is data acquisition and the right types of process frameworks
I think it's worth mentioning that the Achilles heel of DTs is, in fact, data (more specifically, feature) engineering. If one does not spend significant time cleaning and engineering the features, the results will be much worse than those of, say, a "black box" model like an NN. This is the catch. Ironically, an NN can detect such latent features, but it is very difficult to interpret why.
This varies wildly from domain to domain. Highly structured data (time series, photos, audio, etc.) typically has a metric boatload of feature extraction methodology. Neural networks often draw on and exploit that structure (e.g. convolutions). You could even get some pretty good results with manually extracted, neural-network-esque features handed off to a random forest. This heuristic begins to fall off with deep learning though, which, imo, is a precursor to LLMs and showed that emergent complexity is possible with machine learning.
But non-structured data? Pretty pointless to hand off to a neural network imo.
Ah! Your comment helped me understand the parent comment so much more. I thought it was more about data hygiene needs.
Yes, a DT on raw pixel values, or a DT on raw time-series values, will in general be quite terrible.
That said, the convolutional structure is hard-coded into those neural nets; only the weights are learned. It is not that the network discovered on its own that convolutions are a good idea. So NNs too rely on human insight and structuring, upon which they can then build.
That has not been my experience though, apart from the standard data hygiene that one has to maintain for any ML exercise.
Normalising the numeric features to a common range has been adequate. Strictly speaking, even this is not necessary for DTs, but it is pretty much mandatory for linear models. (DTs are very robust to scale differences among features; linear models are quite vulnerable to the same.)
One can think of each tree path from root to leaf as a data-driven formulation/synthesis of a higher-level feature built out of logical conjunctions (the 'AND' operation).
These auto-synthesized/discovered features are then ORed at the top. DTs are good at capturing multi-feature interactions that single-layer linear models can't.
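This is easy to see on a fitted tree (a small sketch, assuming scikit-learn; the dataset is just for illustration): each printed root-to-leaf path reads as an AND of threshold tests, and the leaves are ORed together at the top.

```python
# Each root-to-leaf path in a fitted tree is a conjunction of threshold tests.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Prints rules like "feature_2 <= 2.45" nested under each other;
# following one branch to a leaf gives one ANDed higher-level feature.
print(export_text(tree))
```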
NNs certainly synthesize higher-level features, but what does not get highlighted enough is that the learning-theory-motivated AdaBoost algorithm and its derivatives do that too.
Their modus operandi is "BYOWC, bring your own weak classifier; I will reweight the data in such a way that your unmodified classifier will discover a higher-level feature on its own. Later you can combine these higher-level features linearly, by voting, or by averaging".
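That BYOWC pattern in a few lines (a sketch, assuming scikit-learn; the dataset is invented): AdaBoost's default weak learner is a depth-1 stump, and the reweighting drives each new stump to a different part of the data.

```python
# AdaBoost with its default weak learner (a depth-1 decision stump):
# each boosting round reweights the data so the next stump focuses on
# the examples the ensemble so far gets wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=0)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))
```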
I personally favor differentiable models over trees, but have to give credit where it's due, DTs work great.
What leaves me intellectually unsatisfied about decision trees is that the space of trees cannot be searched in any nice principled way.
Or, to describe it in terms of feature discovery: in a DT there is no notion of progressing smoothly towards better high-level features that track the end goal at every step of the synthesis (in fact, the greedy hill climbing can hurt performance in the case of decision trees). DTs use a combinatorial search over feature-space partitionings, essentially by trial and error.
Neural nets have a smooth way of devising these higher-level features, informed by the end goal: roll infinitesimally in the direction of steepest progress. That's so much more satisfying than trial and error.
As for classification performance, where DTs have struggled is on cases where the columns are sparse (low entropy to begin with).
Another weakness of DTs is the difficulty of achieving high-throughput runtime performance unless the DT is compiled into machine code. Walking a runtime tree structure with not-so-predictable branching probabilities doesn't run very fast on our standard machines. Compared to DNNs, though, this is a laughable complaint: the throughput of DTs is one or two orders of magnitude higher.
There are decision trees for what you want to do:
oblique decision trees, model trees (M5 trees, for example), logistic model trees (LMT), or hierarchical mixtures of experts (HME).
Yes.
I mention restricted oblique trees in passing in my original comment. In my experience, oblique trees tend to add considerable complexity, and the others even more so. Of course, whether the complexity is warranted will depend on the dataset.
The merit of what I used is in its simplicity. Any simple ML library would have a linear classifier and a tree learner.
Super easy to implement, train, maintain, and debug. A one- to two-person team can handle this fine.
> (ii) linear functions are terrible where equi-label regions have a partitioned structure.
Could you explain what "equi-label regions having a partitioned structure" means?
I missed the word "recursively", which I have now edited into my original comment.
Consider connected regions in the domain that have the same label, much like countries on a political map. The situation where this has a short description in terms of recursive subdivision of space is what I am calling a recursively partitioned structure. It's really rather tautological.
"recursively partitioned" sounds like a fractal to me. Not sure what you really mean.
Taken to the limit you are absolutely right.
It turns out many datasets have such a fractal-like nature, but the partitioning needs to be cut off at a certain depth rather than continued to infinity.
Is this not the heart of the IRM paper by Arjovsky, Bottou et al.?
Oh theirs is a far more sophisticated and deeper idea, to look for invariant representations.
Mine is a quick but effective practical hack fuelled by a little bit of insight.