Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy, and, of course, semantic and agentic search capabilities.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keywords and dates, and gives you back raw images of the papers, too many of them, with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from over 600k pages (about 5TB) from the Chronicling America collection. The problems I had to deal with included an infinite variety of layouts, font sizes, scan qualities, resolutions, and aspect ratios, plus navigating around the images on each page. I also had to figure out how to get OCR nearly perfect so people wouldn't hate reading the extracts.

I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vLLM) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and added an agentic search tool on top that knows the API well and helps you write queries to find what you're looking for. Happy to discuss AWS architecture and scaling as well; that was tough!
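For readers curious what a layout -> segmentation -> classification flow looks like in miniature, here is a toy sketch. Everything in it (the `Region` type, the stage functions, the grouping heuristic) is a hypothetical stand-in I made up for illustration, not the actual SNEWPAPERS pipeline; real versions of `detect_layout` and `run_ocr` would wrap a layout model and an OCR engine.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # e.g. "headline" or "body"
    bbox: tuple    # (x, y, w, h) on the page image
    text: str = ""

def detect_layout(page):
    """Stand-in for a layout model: returns typed regions with boxes."""
    return [Region("headline", (0, 0, 100, 8)),
            Region("body", (0, 10, 100, 80))]

def run_ocr(page, region):
    """Stand-in for the OCR step; a real version would crop bbox and read it."""
    return page["ocr"][region.kind]

def extract_articles(page):
    """Layout -> segmentation -> classification, then group into articles."""
    regions = detect_layout(page)
    for r in regions:
        r.text = run_ocr(page, r)
    # Toy classification heuristic: a headline plus the body regions
    # that follow it form one article.
    articles = []
    current = None
    for r in regions:
        if r.kind == "headline":
            current = {"headline": r.text, "body": []}
            articles.append(current)
        elif current is not None:
            current["body"].append(r.text)
    return [{"headline": a["headline"], "body": " ".join(a["body"])}
            for a in articles]

# Fake "page" carrying pre-baked OCR text for the two stub regions.
page = {"ocr": {"headline": "GREAT FIRE SWEEPS CITY",
                "body": "The blaze began at dawn..."}}
print(extract_articles(page))
```

The resulting article dicts are what would get indexed into OpenSearch/Postgres for semantic search; the hard part in practice is that real pages have dozens of regions in wildly varying layouts, which is where the heuristics come in.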

If you have five minutes and just want to jump in and have your own personalized experience, here's what I'd suggest:

1. Before searching for anything, go to the Sleuth page and ask it about anything from 1736 to 1963, with maybe 1 or 2 follow-up questions.

2. Then go to the search page to see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever you're interested in.

If you think it's cool and want to learn more, there are about 10 minutes of video guides on the various capabilities under "Guide" in the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)

https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)

snewpapers.com

brettnbutter | 5 points | 2 hours ago | 3 comments

benwills 42 minutes ago

As someone who has done a lot of downloading/parsing, this is so awesome and impressive to see.

One thing to think about, which I also struggle with when it comes to large and complicated datasets, is the UI. Even being in the search industry for a long time, it's difficult for me to concretely see how I would use this.

I'd suggest taking a small sample of the dataset that might be reflective of how people would use it, then making that segment public and immediately searchable without registering. E.g., one year of articles related to the Olympics.

What I've found is that it's hard for a lot of people to imagine how they would use something without actually using it. So giving people the actual experience of searching the archive and interacting with the results would go a long way.

Again, congrats. This is really impressive work.

  • brettnbutter 21 minutes ago

    Thank you, I really appreciate it. I will see if I can figure out how to do that, or like "if you're authed, you can try the Sleuth or get x free searches a month"? The balance is trying to do that without (potentially) overwhelming the databases, more so than trying intentionally to gate people out from anything. I'll figure it out!

    I don't know if you looked at the "Label Specific" search, but I think I could fairly easily isolate that to a particular label and sub-type for people to search within without much risk to the backend. Any thoughts on a good category?