
Ask HN: What's the 2025 stack for a self-hosted photo library with local AI?

28 points, 1 hour ago

First of all, this is purely a personal learning project for me, aiming to combine three of my passions: photography, software engineering, and my family memories. I have a large collection of family photos and want to build an interactive experience to explore them, à la Google Photos or Apple Photos.

My goal is to create a system with smart search capabilities, and one of the most important requirements is that it must run entirely on my local hardware. Privacy is key, but the main driver is the challenge and joy of building it myself (and, obviously, learning along the way).

The key features I'm aiming for are:

Automatic identification and tagging of family members (local face recognition).

Generation of descriptive captions for each photo.

Natural language search (e.g., "Show me photos of us at the beach in Luquillo from last summer").

I've already prompted AI tools for a high-level project plan, and they provided a solid blueprint (e.g., Ollama with LLaVA, a vector DB like ChromaDB, you know the drill). Now I'm most interested in real-world human experience: advice, lessons learned, and the little details that only come from building something similar.

What tools, models, and best practices would you recommend for a project like this in 2025? Specifically, I'm curious about combining structured metadata (EXIF), face recognition data, and semantic vector search into a single, cohesive application.

Any and all advice would be deeply appreciated. Thanks!

crobibero 1 hour ago

I think Immich checks a lot of these

https://immich.app/

sz4kerto 45 minutes ago

This. It's a fascinating project; it is hard to believe that a FLOSS project can be so high quality. In my book it's on the level of Postgres (although it's a smaller project, probably).

denysvitali 35 minutes ago

Their frontend is amazing, their apps are not as performant, and the backend is (IMHO) the worst of them all.

No hate here, I'm really grateful for what they've achieved so far, but I think there's a lot of room for improvement (e.g., proper R/W query split, native S3 integration, faster endpoints, ...). I already mentioned it in their channel (they're a really welcoming community!) and I'm working on an alternative drop-in replacement backend (written in Go) [1] that will hopefully bring all the needed improvements.

TL;DR: It's definitely good, especially for an open-source project, and the team is very dedicated - but it's definitely not Postgres-good

[1]: https://github.com/denysvitali/immich-go-backend

mossTechnician 1 hour ago

This may not interest you, but Ente checks most of these boxes for me. It has face recognition and AI-based object search out of the box, and you can self-host their open-source server without any restrictions. The models they used might be of interest to you.

barbazoo 46 minutes ago

Their pricing page doesn't say anything about this as far as I can find, but do you still pay Ente if you self-host the server as well as the photos ("S3-compatible object storage")?

simonw 34 minutes ago

There are some spectacular local models for generating text descriptions of images now. I suggest starting with Mistral Small 3.2, Gemma 3 and Qwen 2.5VL - all available via Ollama.

I expect we will see a Qwen 3VL soon.
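As a rough sketch of what captioning through Ollama could look like: the snippet below posts a base64-encoded image to Ollama's local `/api/generate` endpoint and reads back the description. The model tag (`gemma3`), the prompt wording, and the helper names `build_caption_request` and `caption_image` are assumptions for illustration; swap in whichever vision model you have actually pulled.

```python
import base64
import json
import urllib.request

# Ollama's default local address; adjust if your instance listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_caption_request(image_bytes, model="gemma3",
                          prompt="Describe this photo in one sentence."):
    """Build the JSON body for a single, non-streaming caption request.

    The model tag and prompt are illustrative assumptions.
    """
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def caption_image(path, model="gemma3"):
    """Send one photo to a locally running Ollama and return its caption."""
    with open(path, "rb") as f:
        payload = build_caption_request(f.read(), model=model)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With "stream": False the generated text is in the "response" field.
        return json.loads(resp.read())["response"].strip()
```

Calling `caption_image("photo.jpg")` requires Ollama to be running locally with the chosen model already pulled; for a large library you would probably want to batch this and cache the results.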

SirFatty 35 minutes ago

It looks as if you are primarily using a phone to view and share? We often (visually) share via our living room TV (via an attached computer). Is that something you're looking to incorporate?

coffeecoders 57 minutes ago

I have been building something like this but for personal use.

As of now, I use a SentenceTransformer model to chunk files, BLIP for captioning (“Family vacation in Banff, February 2025”), and MTCNN with InsightFace for face detection. My index stores captions, face embeddings, and EXIF metadata (date, GPS) for queries like “show photos of us in Banff last winter.” I’m working on integrating ChromaDB for faster searches.

Eventually, I aim to store indexes as:

{
  "filename": "/Vacation/Banff/Wife.jpg",
  "chunk_id": 0,
  "text": "Family at Banff, February 2025",
  "caption_embedding": [0.1, 0.2, ...],
  "face_embeddings": [{"name": "NT", "embedding": [0.3, 0.4, ...]}, ...],
  "exif": {
    "DateTimeOriginal": "2025:02:15",
    "GPSCoordinates": "18.387, -65.992"
  }
}
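For illustration, here is a minimal sketch of how records shaped like the one above might be queried: filter on the structured fields first (EXIF date, tagged face name), then rank what is left by cosine similarity between a query embedding and `caption_embedding`. The function names `cosine`, `exif_date`, and `search` are hypothetical, the query embedding is assumed to come from the same model that embedded the captions, and faces are assumed to be already labeled by name. This is not the code from the linked repo.

```python
import math
from datetime import datetime


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exif_date(record):
    """Parse the date part of DateTimeOriginal, which uses ':' as separator."""
    raw = record["exif"]["DateTimeOriginal"]
    # Only the first 10 characters (YYYY:MM:DD) are used; any time part is ignored.
    return datetime.strptime(raw[:10], "%Y:%m:%d")


def search(records, query_embedding, person=None, start=None, end=None, top_k=5):
    """Filter by date range and tagged person, then rank by caption similarity."""
    hits = []
    for rec in records:
        taken = exif_date(rec)
        if start and taken < start:
            continue
        if end and taken > end:
            continue
        names = {face["name"] for face in rec["face_embeddings"]}
        if person and person not in names:
            continue
        score = cosine(query_embedding, rec["caption_embedding"])
        hits.append((score, rec["filename"]))
    # Highest similarity first.
    hits.sort(reverse=True)
    return hits[:top_k]
```

A query like "photos of NT last winter" would then become something like `search(records, embed("winter trip"), person="NT", start=datetime(2024, 12, 1), end=datetime(2025, 3, 1))`, where `embed` stands in for whatever embedding model you use. Filtering before ranking keeps the vector comparison cheap, and a vector DB such as ChromaDB could later take over the ranking step using its own metadata filters.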

I also built a UI (like Spotlight Search) to search through these indexes.

Code (in progress): https://github.com/neberej/smart-search

ProfessorZoom 55 minutes ago

I'm also curious about the best local, high-quality background removal, such as for graduation images where people are wearing tassels.

stormfather 1 hour ago

I would try the Qwen models before LLaVA.

Do you need the embeddings to be private? Or just the photos?

kreyenborgi 37 minutes ago

Have you tried all of these? How are they with very large photo collections?

owebmaster 48 minutes ago

The browser. Just pure JavaScript, HTML, CSS and WebGPU running in a bulletproof sandbox.