hacker news

Hi HN,

My team and I are building Tabstack to handle the "web layer" for AI agents. Launch Post: https://tabstack.ai/blog/intro-browsing-infrastructure-ai-ag...

Maintaining a complex infrastructure stack for web browsing is one of the biggest bottlenecks in building reliable agents. You start with a simple fetch, but quickly end up managing a complex stack of proxies, handling client-side hydration, and debugging brittle selectors. and writing custom parsing logic for every site.

Tabstack is an API that abstracts that infrastructure. You send a URL and an intent; we handle the rendering and return clean, structured data for the LLM.

How it works under the hood:

- Escalation Logic: We don't spin up a full browser instance for every request (which is slow and expensive). We attempt lightweight fetches first, escalating to full browser automation only when the site requires JS execution/hydration.

- Token Optimization: Raw HTML is noisy and burns context window tokens. We process the DOM to strip non-content elements and return a markdown-friendly structure that is optimized for LLM consumption.

- Infrastructure Stability: Scaling headless browsers is notoriously hard (zombie processes, memory leaks, crashing instances). We manage the fleet lifecycle and orchestration so you can run thousands of concurrent requests without maintaining the underlying grid.

On Ethics: Since we are backed by Mozilla, we are strict about how this interacts with the open web.

- We respect robots.txt rules.

- We identify our User Agent.

- We do not use requests/content to train models.

- Data is ephemeral and discarded after the task.

The linked post goes into more detail on the infrastructure and why we think browsing needs to be a distinct layer in the AI stack.

This is obviously a very new space and we're all learning together. There are plenty of known unknowns (and likely even more unknown unknowns) when it comes to agentic browsing, so we’d genuinely appreciate your feedback, questions, and tips.

Happy to answer questions about the stack, our architecture, or the challenges of building browser infrastructure.

sippeangelo ・ 20 hours ago

With all respect to Mozilla, "respects robots.txt" makes this effectively DoA. AI agents are a form of user agent like any other when initiated by a human, no matter the personal opinion of the content publisher (unlike the egregious automated /scraping/ done for model training).

MrTravisB ・ 19 hours ago

This is a valid perspective. Since this is an emerging space, we are still figuring out how to show up in a healthy way for the open web.
We recognize that the balance between content owners and the users or developers accessing that content is delicate. Because of that, our initial stance is to default to respecting websites as much as possible.
That said, to be clear on our implementation: we currently only respond to explicit blocks directed at the Tabstack user agent. You can read more about how this works here: https://docs.tabstack.ai/trust/controlling-access
- x3haloed ・ 14 hours ago
  
  This tension is so close to a fundamental question we’re all dealing with, I think: “Who is the web for? Humans or machines?”
  I think too often people fall completely on one side of this question or the other. I think it’s really complicated, and deserves a lot of nuance. I think it mostly comes down to having a right to exert control over how our data should be used, and I think most of it’s currently shaped by Section 230.
  Generally speaking, platforms consider data to be owned by the platform. GDPR and CCPA/CPRA try to be the counter to that, but those are also too-crude a tool.
  Let’s take an example: Reddit. Let’s say a user is asking for help and I post a solution that I’m proud of. In that act, I’m generally expecting to help the original person who asked the question, and since I’m aware that the post is public, I’m expecting it to help whoever comes next with the same question.
  Now (correct me if I’m wrong, but) GDPR considers my public post to be my data. I’m allowed to request that Reddit return it to me or remove it from the website. But then with Reddit’s recent API policies, that data is also Reddit’s product. They’re selling access to it for … whatever purposes they outline in the use policy there. That’s pretty far outside what a user is thinking when they post on Reddit. And the other side of it as well — was my answer used to train a model that benefits from my writing and converts it into money for a model maker? (To name just an example).
  I think ultimately, platforms have too much control, and users have too little specificity in declaring who should be allowed to use their content and for what purposes.
Findecanor ・ 16 hours ago

There is still a difference between "fetch this page for me and summarise" and "go find pages for me, and cross-reference". And what makes you think that all AI agents using Tabstack would be directly controlled in real time with a 1:1 correspondence between human and agent, and not in some automated way?
I'm afraid that Tabstack would be powerful enough to bypass some existing countermeasures against scrapers, and once allowed in its lightweight mode be used to scrape data it is not supposed to be allowed to. I'd bet that someone will at least try.
Then there is the issue of which actions and agent is allowed to do on behalf of a user. Many sites have in their Terms of Service that all actions must be by done directly by a human, or that all submitted content be human-generated and not from a bot. I'd suppose that an AI agent could find and interpret the ToS, but that is error-prone and not the proper level to do it at. Some kind of formal declaration of what is allowed is necessary: robots.txt is such a formal declaration, but very coarsely grained.
There have been several disparate proposals for formats and protocols that are "robots.txt but for AI". I've seen that at least one of them allow different rules for AI agents and machine learning. But these are too disparate, not widely known ... and completely ignored by scrapers anyway, so why bother.
mossTechnician ・ 19 hours ago

I agree with you in spirit, but I find it hard to explain that distinction. What's the difference between mass web scraping and an automated tool using this agent? The biggest differences I assume would be scope and intent... But because this API is open for general development, it's difficult to judge the intent and scope of how it could be used.
- jakelazaroff ・ 18 hours ago
  
  What's difficult to explain? If you're having an agent crawl a handful of pages to answer a targeted query, that's clearly not mass scraping. If you're pulling down entire websites and storing their contents, that's clearly not normal use. Sure, there's a gray area, but I bet almost everyone who doesn't work for an AI company would be able to agree whether any given activity was "mass scraping" or "normal use".
  
  1shooner ・ 18 hours ago
  ・ 4 more
  
  What is worse: 10,000 agents running daily targeted queries on your site, or 1 query pulling 10,000 records to cache and post-process your content without unnecessarily burdening your service?
  
  PurpleRamen ・ 4 hours ago
  
  The single agent pulling regularly 10k records, which nobody will ever use, is worse than the 10k agents coming from the same source, and using the same cache, they fill when doing a targeted request. But even worse are 10k agents from 10k different sources, scraping 10k sites each, of which 9999 pages are not relevant for their request.
  At the end it's all about the impact on the servers, and those can be optimized, but this does not seem to happen at the moment at large. So in that regard, centralizing usage and honouring the rules is a good step, and the rest are details to figure out on the way.
  
  jakelazaroff ・ 17 hours ago
  ・ 2 more
  
  I apprehend that you want me to say the first one is worse, but it's impossible with so few details. Like: worse for whom? in what way? to what extent?
  If (for instance) my content changes often and I always want people to see an up-to-date version, the second option is clearly worse for me!
  
  1shooner ・ 17 hours ago
  
  No, I've been turning it over in my mind since this question started to emerge and I think it's complicated, I don't have an answer myself. After all, the first option is really just the correlate to today's web traffic, it's just no longer your traffic. You created the value, but you do not get the user attention.
  My apprehension is not with AI agents per se, it is the current, and likely future implementation: AI vendors selling the search and re-publication of other parties' content. In this relationship, neither option is great: either these providers are hammering your site on behalf of their subscribers' individual queries, or they are scraping and caching it, and reselling potentially stale information about you.
ugh123 ・ 20 hours ago

100%
observationist ・ 19 hours ago

Exactly. robots.txt with regards to AI is not a standard and should be treated like the performative, politicized, ideologically incoherent virtue signalling that it is.
There are technical improvements to web standards that can and should be made that doesn't favor adtech and exploitative commercial interests over the functionality, freedom, and technically sound operation of the internet

Diti ・ 20 hours ago

Pricing page is hidden behind a registration form. Why?

I also wanted to see how/if it handled semantic data (schema.org and Wikidata ontologies), but the hidden pricing threw me off.

MrTravisB ・ 20 hours ago

Thanks for the feedback. We are definitely not trying to hide it. We actually do have pricing listed in the API section regarding the different operations, but we could definitely work on making this clearer and easier to parse.
We are simply in an early stage and still finalizing our long-term subscription tiers. Currently, we use a simple credit model which is $1 per 10,000 credits. However, every account receives 50,000 credits for free every month ($5 value). We will have a dedicated public pricing page up as soon as our monthly plans are finalized.
Regarding semantic data, our JSON extraction endpoint is designed to extract any data on the page. That said, we would love to know your specific use cases for those ontologies to see if we can further improve our support for them.
undefined ・ 20 hours ago

[deleted]

srameshc ・ 20 hours ago

This looks good , but if Pay-as-you-go pricing can have some more information about what your actual are charges are per unit or whatever metrics, that would be helpful. I signed up but still can not find the actual pricing.

shanev ・ 15 hours ago

Congrats on the launch! It would be useful to have a matrix somewhere showing how this compares to Jina, Firecrawl, etc.

ushakov ・ 16 hours ago

> We don't spin up a full browser instance for every request (which is slow and expensive)

there's really no excuse for not spinning up a browser every request. a Firecracker VM boots ~50ms nowadays

> We respect robots.txt rules.

you might, but most companies in the market for your service don't want this

MrTravisB ・ 15 hours ago

Regarding the browser instances: While VM boot times have definitely improved, accessing a site through a full browser render isn't always the most efficient way to retrieve information. Our goal is to get the most up-to-date information as fast as possible.
For example, something we may consider for the future is balancing when to implement direct API access versus browser rendering. If a website offers the same information via an API, that would almost always be faster and lighter than spinning up a headless browser, regardless of how fast the VM boots. While we don't support that hybrid approach yet, it illustrates why we are optimizing for the best tool for the job rather than just defaulting to a full browser every time.
Regarding robots.txt: We agree. Not all potential customers are going to want a service that respects robots.txt or other content-owner-friendly policies. As I alluded to in another comment, we have a difficult task ahead of us to do our best by both the content owners and the developers trying to access that content.
As part of Mozilla, we have certain values that we work by and will remain true to. If that ultimately means some number of potential customers choose a competitor, that is a trade-off we are comfortable with.
- ushakov ・ 14 hours ago
  
  thank you so much, great to hear the thinking behind these considerations :)

shevy-java ・ 18 hours ago

Mozilla giving up on Firefox every day ...

intherdfield ・ 16 hours ago

https://hg-edge.mozilla.org/mozilla-central/shortlog
- crote ・ 8 hours ago
  
  Just because the engine is running doesn't mean the car is moving forwards.

Show HN: Tabstack – Browser infrastructure for AI agents (by Mozilla)