Man, I’ve been there. Tried throwing BERT at enzyme data once—looked fine in eval, totally flopped in the wild. Classic overfit-on-vibes scenario.
Honestly, for straight-up classification? I’d pick SVM or logistic any day. Transformers are cool, but unless your data’s super clean, they just hallucinate confidently. Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.
Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
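If anyone wants the rough shape of that recipe, here's a minimal sketch, assuming sentence-transformers and scikit-learn; the model name and the toy texts/labels are placeholders, not recommendations:

    # Frozen embeddings from a big model + a "dumb" classifier on top.
    # Placeholder model and toy data; swap in your own.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    texts = [
        "cleaves peptide bonds in proteins", "hydrolyses ester linkages",
        "breaks down triglycerides", "degrades collagen fibres",
        "transfers a phosphate group to serine", "adds a methyl group to DNA",
        "moves an acetyl group onto histones", "phosphorylates tyrosine residues",
    ]
    labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = hydrolase-ish, 1 = transferase-ish (toy)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # the big model stays frozen
    X = encoder.encode(texts)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.25, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))

The simple classifier trains in milliseconds, so you can actually cross-validate it properly, which is most of the "less drama" part.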
Appreciate this post. Needed that reality check before I fine-tune something stupid again.
Transformers will ace your test set, then faceplant the second they meet reality. I've also done the "wow, 92% accuracy!" dance only to realize later I just built a very confident pattern-matcher for my dataset quirks.
Honestly, if your accuracy/performance metrics are too good, that's almost a sure sign that something has gone wrong.
Source: bitter, bitter experience. I once predicted the placebo effect perfectly using a random forest (just got lucky with the train/test split). Although I'd left academia at that point, I often wonder if I'd have dug in deeper if I'd needed a high impact paper to keep my job.
I believe it's very common. At some point I thought about publishing a paper analyzing some studies with good results (published in journals) and showing where the problem with each lies, but eventually I just gave up. I figured I would only make the original authors unhappy, and everybody else would not care.
> I believe it's very common.
Yeah, me too. There was a paper doing the rounds a few years back (claiming computer programming is more related to language skill than to maths), so I downloaded the data and looked at their approach, and it was garbage. Like, polynomial regression on 30 datapoints kind of bad.
And based on my experience during the PhD this is very common. It's not surprising though, given the incentive structure in science.
Peer Review is a thankless job
but that’s how science advances
there should be an arxiv for rebuttals maybe
> Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
You may know this but many don't -- this is broadly known as "transfer learning".
Is it, even when applied to trivial classifiers (possibly "classical" ones)?
I feel that we're wrong to be focusing so much on the conversational/inference aspect of LLMs. The way I see it, the true "magic" hides in the model itself. It's effectively a computational representation of understanding. I feel there's a lot of unrealized value hidden in the structure of the latent space itself. We need to spend more time studying it, make more diverse and hands-on tools to explore it, and mine it for all kinds of insights.
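To make that a bit more concrete: even the dumbest possible "hands-on tool" is just comparing points in the embedding space directly, no generation involved. A toy sketch, assuming sentence-transformers; the model name and phrases are placeholders:

    # Poke at the latent space directly: embed a few phrases and compare them.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    phrases = [
        "the enzyme cleaves peptide bonds",
        "this protease hydrolyses proteins",
        "the stock market rallied today",
    ]
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    vecs = encoder.encode(phrases)

    sims = cosine_similarity(vecs)
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            print(f"{sims[i, j]:.2f}  {phrases[i]!r} vs {phrases[j]!r}")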
For this and sibling -- yes. Essentially, using the output of any model as an input to another model is transfer learning.
ohhh yeah that’s the interpretability game. not just crank model size and pray it grows a brain. everyone's hyped on scale but barely anyone’s thinking glue. anthropic saw it early. their interpretability crew? scary smart folks, some I know personally. zero chill, just pure signal.
if you wanna peek where their heads at, start here https://www.anthropic.com/research/mapping-mind-language-mod... not just another ai blog. actual systems brain behind it.
I agree. Isn't this just utilizing the representation learning that's happened under the hood of the LLM?
Ironically, this comment reads like it was generated from a Transformer (ChatGPT to be specific)
oh yes i recently became a transformer too
is it the em dashes?
>Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
Sure but this is still indirectly using transformers.
Yes, but it's using the understanding they acquired to guide a more reliable tool, instead of also making them generate the final answer, which they're likely to hallucinate in this problem space.
How does this work?
I’m not sure anyone I know could make an em dash with their keyboard off the top of their head.
[meta] Here’s where I wish I could personally flag HN accounts.
a lot of applications auto convert -- to an em dash
and a bunch of phone/tablet keyboards do so, too
I like em dashes. I had considered installing a plugin to reliably turn -- into an em dash in the past; if I hadn't discarded that idea, you would have seen some in this post ;)
And I think I have seen at least one spell-checking browser plugin which does stuff like that.
Oh, and some people use 3rd-party interfaces to interact with HN, some of which auto-convert consecutive dashes to em dashes.
In the places where I have been using AI from time to time, it's also not super common to see em dashes.
So IMHO an em dash isn't a telltale sign of something being AI-written.
But then, wrt. the OP comment, I think you might be right anyway. Its writing style is ... strange. Like a writing style taken from a novel, and not just any writing style, but one that over-exaggerates that a story is currently being told inside a story, yet filled with the semantics of a HN comment. Like what you might get if you ask an LLM to "tell a story" from your set of bullet points.
But this opens a question: if the story still comes from a human, isn't it fine? Or is it offensive that they didn't just give us compact bullet points?
Putting that aside, there is always the option that the author is just very well read/written, maybe a book author, maybe a hobby author, and picked up such a writing style.
A lot of phones do this automatically when doing double dash -- -> —
The Android client I use, Harmonic, has a shortcut to report a user, although it just prefills an email to hn@ycombinator.com.
option-shift-minus on a Mac (option-minus for an en dash).
> I’m not sure anyone I know could make an em dash with their keyboard off the top of their head.
I have endash bound to ⇧⌥⌘0, and emdash bound to ⇧⌥⌘=.
What kind of data did you run this on?
> Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.
If I gave a classroom of undergrad students a multiple choice test where no answers were correct, I can almost guarantee almost all the tests would be filled out.
Should GPT and other LLMs refuse to take a test?
In my experience it will answer with the closest answer, even if none of the options are even remotely correct.
Not refuse, but remark that none of the answers seems correct. After all, we are only 2 days away from an AGI pro gamer researcher AI according to experts[1], so I would expect this behavior at least.
1: People who have a financial stake in the AI hype
In multiple choice, if you don't know, then a random guess is your best answer in most cases. On a few tests a blank is scored better than a wrong answer, but that is rare and the professors will tell you.
As such I would expect students to put in something. However, after class they would talk about how bad they think they did, because they are all self-aware enough to know where they guessed.
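The arithmetic behind "guess unless there's a penalty" is short enough to write down; illustrative numbers only (4 options, 1 point per correct answer, the classic -1/(n-1) penalty some tests use):

    # Expected value of a blind guess on a 4-option question (illustrative numbers).
    n_options = 4
    p_correct = 1 / n_options

    # No penalty for wrong answers: guessing strictly beats leaving it blank.
    ev_no_penalty = p_correct * 1 + (1 - p_correct) * 0
    print(ev_no_penalty)        # 0.25 > 0, so always guess

    # Classic -1/(n-1) penalty: a blind guess and a blank break even.
    penalty = -1 / (n_options - 1)
    ev_with_penalty = p_correct * 1 + (1 - p_correct) * penalty
    print(ev_with_penalty)      # 0.0, so guessing only pays once you can rule out an option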
I would love to see someone try this. I would guess 85-90% of undergrads would fill out the whole test, but not everyone. There are some people who believe the things that they know.
I think the issue is the confidence with which it lies to you.
A good analogy would be if someone claimed to be a doctor and when I asked if I should eat lead or tin for my health they said “Tin because it’s good for your complexion”.
Yes, it should refuse.
Humans have made progress by admitting when they don’t know something.
Believing an LLM should be exempt from this boundary of “responsible knowledge” is an untenable path.
As in, if you trust an ignorant LLM then by proxy you must trust a heart surgeon to perform your hip replacement.
Just on a practical level, adding a way for the LLM to bail if it can detect that things are going wrong saves a lot of trouble. Especially if you are constraining the inference. You still get some false negatives and false positives, of course, but giving the option to say "something else" and explain can save you a lot of headaches when you accidentally send it down the wrong path entirely.
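A rough sketch of what that escape hatch can look like; call_llm() is a hypothetical stand-in for whatever model/API you're constraining, and the option letters and parsing are illustrative:

    # Constrained multiple-choice answering with an explicit bail-out option.
    # call_llm() is a hypothetical stand-in for your real model client.
    def call_llm(prompt: str) -> str:
        # Simulated reply: the model takes the bail-out option.
        return "E) None of these labels fits the input."

    def classify(question: str, options: list[str]) -> dict:
        letters = [chr(ord("A") + i) for i in range(len(options))]
        bail_letter = chr(ord("A") + len(options))
        menu = "\n".join(f"{letter}) {opt}" for letter, opt in zip(letters, options))
        prompt = (
            f"{question}\n{menu}\n"
            f"{bail_letter}) Something else / none of the above -- say why.\n"
            "Answer with a single letter, then a short justification."
        )
        reply = call_llm(prompt).strip()
        choice = reply[:1].upper()
        if choice in letters:
            return {"answer": options[letters.index(choice)], "bailed": False}
        # Bail-out (or unparseable) path: flag for review instead of forcing a pick.
        return {"answer": None, "bailed": True, "raw": reply}

    print(classify("Which EC class best fits this sequence?",
                   ["Oxidoreductase", "Transferase", "Hydrolase", "Lyase"]))

Even if the parsing is crude, routing the "bailed" cases to a human (or a retry with more context) beats letting the model force-pick a wrong label.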