MUSER
Topographical embedding map of your images and text files.
Your image library and your notes are landscapes, not lists — clusters of similarity, valleys between them, occasional outliers. Muser embeds both into the same kind of map: density becomes elevation, semantic neighborhoods become regions, and you navigate them like terrain. Image embeddings come from CLIP; text from sentence-transformers; the rendering is a topographic map with a radial phylogeny tree underneath for time-ordered similarity.
The visual grammar comes from two places. First, the location stamps films use under an establishing shot — TANGIER, MOROCCO typed across the frame as the camera lands somewhere new (The Bourne Identity is the canonical late-90s example, though the grammar is older). Second, the way region names fade in over the map as you cross into a new territory in open-world games (Red Dead Redemption 2's region titling is the cleanest version of this — a name materializes at the edge of the screen, holds, fades out). Both are giving you a name for where you are in space. Muser's cluster labels are the same idea over an embedding space instead of a geographic one.
Stack
- CLIP (ViT-B/32)
- 512-dim image embeddings, pre-trained on 400M image-text pairs. Runs locally on MPS (sketched after this list). Easy to swap for ViT-L/14 if quality matters more than speed.
- sentence-transformers
- Separate text encoder — CLIP's text tower is weaker than dedicated sentence models for prose.
- UMAP (cosine)
- 2D reduction; preserves local structure better than t-SNE, faster at 500+ points, deterministic with random_state.
- HDBSCAN
- Density clustering with no k; outliers are labeled -1 instead of being forced into a cluster.
- D3.js (d3-contour, d3.stratify)
- Topographic isolines and the radial phylogeny tree.
- Vanilla JS + Vite
- Zero framework overhead; D3 and React fight over control of the DOM, and Vite's HMR makes tweaking visual parameters instant.
- Python pipeline → static JSON
- All embedding work offline; the frontend loads a JSON file. No backend server, no inference at view time.
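A minimal sketch of the image half of that stack, assuming the openai/CLIP package; open_clip or HuggingFace transformers would do the same job with slightly different calls. The folder glob and function name are illustrative.

```python
# Embed a folder of images with CLIP ViT-B/32 on Apple's MPS backend.
# Assumes the openai/CLIP package; the glob pattern is illustrative.
from pathlib import Path

import clip
import torch
from PIL import Image

device = "mps" if torch.backends.mps.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # swap in "ViT-L/14" for quality over speed

def embed_images(folder: str) -> dict[str, list[float]]:
    """Return L2-normalized 512-dim embeddings keyed by filename."""
    out = {}
    with torch.no_grad():
        for path in sorted(Path(folder).glob("*.jpg")):
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            vec = model.encode_image(image)
            vec = vec / vec.norm(dim=-1, keepdim=True)  # normalize so cosine similarity is a dot product
            out[path.name] = vec.squeeze(0).cpu().tolist()
    return out
```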
Process
The pipeline is one direction: images or text in, JSON out. CLIP gives 512-dim vectors; UMAP collapses to 2D with cosine distance to match the normalized embeddings; HDBSCAN groups them. Density is estimated by KDE over the 2D coordinates and rendered by d3-contour as elevation lines. Cluster labels land at the density peak of each cluster. The frontend is dumb on purpose — it just paints what the JSON says.
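Roughly what that stage looks like in Python, assuming umap-learn, hdbscan, and scipy; the JSON field names, random_state, and min_cluster_size are illustrative rather than the project's actual values.

```python
# Offline stage: 512-dim vectors in, flat JSON out for the frontend to paint.
import json

import hdbscan
import numpy as np
import umap
from scipy.stats import gaussian_kde

def build_map(embeddings: np.ndarray, names: list[str], out_path: str = "map.json") -> None:
    # 2D layout with cosine distance; random_state pins the map between runs
    xy = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(embeddings)

    # density clustering on the 2D layout; points that fit nowhere get label -1
    labels = hdbscan.HDBSCAN(min_cluster_size=8).fit_predict(xy)

    # KDE over the 2D coordinates; the density peak inside each cluster anchors its label
    density = gaussian_kde(xy.T)(xy.T)
    label_anchors = {}
    for c in set(labels) - {-1}:
        idx = np.where(labels == c)[0]
        peak = idx[np.argmax(density[idx])]
        label_anchors[int(c)] = xy[peak].tolist()

    payload = {
        "points": [
            {"name": n, "x": float(x), "y": float(y), "cluster": int(c)}
            for n, (x, y), c in zip(names, xy, labels)
        ],
        "label_anchors": label_anchors,
    }
    with open(out_path, "w") as f:
        json.dump(payload, f)
```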
Two view modes share the surface — IMG and TXT — and use different embedding pipelines underneath. The toggle isn't cosmetic; CLIP's space and the sentence-transformer space don't compose. Treating them as two separate maps over the same UI was cleaner than trying to unify them.
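For the TXT map only the encoder changes; a sketch assuming sentence-transformers with all-MiniLM-L6-v2 (this write-up doesn't name the actual model), feeding the same build_map step sketched above and writing a separate file for the toggle to switch between.

```python
# Text side of the toggle: sentence-transformers instead of CLIP, same downstream steps.
from sentence_transformers import SentenceTransformer

def build_text_map(docs: dict[str, str]) -> None:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
    names = list(docs)
    vecs = model.encode([docs[n] for n in names], normalize_embeddings=True)
    # build_map is the pipeline sketch above; IMG and TXT stay separate JSON files
    build_map(vecs, names, out_path="txt_map.json")
```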
Clicking a cluster drops into a detail view: image clusters expose actual thumbnails joined by lines weighted by similarity; text clusters render as a network graph with document titles, timestamps, and excerpts.
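One way the similarity-weighted lines in the detail view could be derived, per cluster; the 0.6 floor is an illustrative cutoff, not a value from the project.

```python
# Per-cluster edge list; edge weight drives line thickness in the detail view.
import numpy as np

def cluster_edges(embeddings: np.ndarray, members: list[int], floor: float = 0.6):
    """Return (i, j, cosine similarity) tuples between cluster members above the floor."""
    vecs = embeddings[members]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T
    edges = []
    for a in range(len(members)):
        for b in range(a + 1, len(members)):
            if sim[a, b] >= floor:
                edges.append((members[a], members[b], float(sim[a, b])))
    return edges
```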
The phylogeny tree is the part I spent the most time on. The naive version — connect every pair of points above some similarity threshold — produces a hairball. Instead: build a cosine similarity matrix, constrain edges to pairs that are either temporally close (< 30 days apart) or very similar (> 0.85 cosine), then take the minimum spanning tree. Earliest timestamp becomes root, so the tree flows past → present. You get genuine "viral spread" of an aesthetic without long-range nonsense edges.
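A sketch of that constrain-then-MST step, assuming scipy's csgraph solver; the 30-day and 0.85 thresholds are the ones quoted above.

```python
# Constrained minimum spanning tree over cosine similarity, rooted at the oldest item.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def phylogeny_edges(embeddings: np.ndarray, timestamps: np.ndarray):
    """timestamps: unix seconds, one per row of embeddings."""
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = vecs @ vecs.T

    days_apart = np.abs(timestamps[:, None] - timestamps[None, :]) / 86400
    allowed = (days_apart < 30) | (sim > 0.85)   # temporally close OR very similar
    np.fill_diagonal(allowed, False)

    # MST minimizes weight, so flip similarity into a distance; disallowed pairs
    # get 0, which the sparse solver reads as "no edge"
    dist = np.where(allowed, 1.0 - sim + 1e-9, 0.0)
    tree = minimum_spanning_tree(csr_matrix(dist)).tocoo()

    # the frontend (d3.stratify) hangs the tree off the earliest timestamp
    root = int(np.argmin(timestamps))
    return root, list(zip(tree.row.tolist(), tree.col.tolist()))
```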