
From Yahoo's Directory to Vector Databases: How Search Rewrote the Rules of Data

A 30-year arc from hand-curated links to billion-vector AI memory

S.J. Nam

Every search you've ever done — clicking a category on Yahoo in 1995, typing keywords into Google in 2001, pinning a coffee shop on Google Maps in 2008, or asking an AI chatbot to summarize a document today — was powered by a database underneath. But the kind of database that made each of those experiences possible changed dramatically over time.

This is the story of that evolution: from rows and columns, through maps and coordinates, all the way to the n-dimensional vector spaces that now power modern AI.


Chapter 1: The Library Card Catalog (1994–1997)

Yahoo and the Age of Human Curation

In January 1994, two PhD students at Stanford — Jerry Yang and David Filo — were struggling to keep track of websites they liked. Their solution was elegantly simple: a hand-organized list of links, sorted into hierarchical categories. "Jerry and David's Guide to the World Wide Web" was renamed Yahoo! by March 1994, and by January 1995, it had grown into a directory of 10,000 sites receiving more than 100,000 unique visitors a day.

Yahoo wasn't really a search engine. It was a directory — the digital equivalent of a library card catalog. You browsed it. You clicked "Computers," then "Software," then "Productivity," until you found what you wanted. The underlying data structure was a tree: a hierarchical taxonomy with categories, subcategories, and links.

The database behind it was essentially a relational database (RDBMS), the workhorse of business computing since the 1970s. Tables of records. Rows and columns. SQL queries like:

SELECT url, description
FROM sites
WHERE category_id = 42
ORDER BY added_date DESC;

Searching the Yahoo Directory meant matching a text string against a column of category names and descriptions. Fast, exact, and completely literal. If Yahoo's editors hadn't categorized your topic yet, it simply didn't exist in the database.
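As a rough illustration of browsing versus literal matching, here is a minimal Python sketch of a hand-curated directory. The nested dictionary, the site names, and the helper functions `browse` and `search_categories` are hypothetical stand-ins, not Yahoo's actual schema or code.

```python
# Hypothetical in-memory model of a hand-curated directory (not Yahoo's schema).
directory = {
    "Computers": {
        "Software": ["netscape.com", "adobe.com"],
        "Hardware": [],
    },
    "Entertainment": {},
    "Business": {},
}


def browse(tree, path):
    """Follow a list of category names down the tree, one click per level."""
    node = tree
    for category in path:
        node = node[category]  # KeyError if editors never created the category
    return node


def search_categories(tree, text, prefix=()):
    """Literal, case-insensitive substring match against category names."""
    matches = []
    for name, child in tree.items():
        path = prefix + (name,)
        if text.lower() in name.lower():
            matches.append("/".join(path))
        if isinstance(child, dict):
            matches.extend(search_categories(child, text, path))
    return matches


print(browse(directory, ["Computers", "Software"]))  # ['netscape.com', 'adobe.com']
print(search_categories(directory, "ware"))  # ['Computers/Software', 'Computers/Hardware']
print(search_categories(directory, "cooking"))  # [] because nobody filed that topic
```

The last call shows the limitation described above: a topic the editors never categorized returns nothing, however many pages about it exist.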

graph TD
    A[Yahoo Directory Root] --> B[Computers]
    A --> C[Entertainment]
    A --> D[Business]
    B --> E[Software]
    B --> F[Hardware]
    E --> G[site: netscape.com]
    E --> H[site: adobe.com]

This worked fine when the internet had tens of thousands of pages. It stopped working when it had millions.


Chapter 2: The Keyword Index (1998–2004)

Google and the Inverted Index

By 1998, the web had exploded past anything a team of human editors could categorize. Larry Page and Sergey Brin, also at Stanford, launched Google on September 4, 1998, with a fundamentally different idea: instead of organizing the web by topic, rank it by authority — specifically, by how many other pages linked to a given page, and how authoritative those pages were.

This was PageRank. But equally important was what sat underneath it: the inverted index.
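The idea of ranking by inbound links can be sketched with a simplified power iteration. This is a textbook-style approximation, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                # Each outgoing link passes on an equal share of this page's rank.
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "hub".
toy_web = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "c": ["hub"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best)  # hub
```

Page "a" also scores well here even though only two pages link to it, because one of those pages is the highly ranked "hub": authority flows along links.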

An inverted index flips the question around. Instead of asking "what's in this document?", it asks "which documents contain this word?" Every word on every crawled page gets mapped to a list of pages that contain it. When you search for "cheap flights," the database doesn't scan every webpage. It looks up "cheap" and "flights" in the index and intersects two lists — in milliseconds.
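A minimal Python sketch of that lookup, assuming a four-document toy corpus and whitespace tokenization (real engines add stemming, ranking, and compressed posting lists). The document IDs and the helper names `build_index` and `search` are illustrative.

```python
from collections import defaultdict

# Hypothetical mini-corpus for illustration only.
documents = {
    "doc_12": "cheap hotels in paris",
    "doc_47": "cheap flights to tokyo",
    "doc_103": "flights delayed by weather",
    "doc_891": "find cheap flights online",
}


def build_index(docs):
    """Map every word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return documents containing every query word (set intersection)."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(documents)
print(search(index, "cheap flights"))  # ['doc_47', 'doc_891']
```

The query never touches the document text: it reads two posting sets and intersects them, which is why the cost depends on the length of those lists rather than on the size of the web.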

graph LR
    W1["cheap"] --> D1["doc_12"]
    W1 --> D2["doc_47"]
    W1 --> D3["doc_891"]
    W2["flights"] --> D2["doc_47"]
    W2 --> D4["doc_103"]
    W2 --> D5["doc_891"]
    D2 --> R["Result: doc_47, doc_891"]
    D5 --> R

Mathematically, a keyword search is a one-dimensional lookup. The query is a string; the index is a sorted list; the match is exact. The word either appears in the document or it doesn't. Relevance is scored, but the fundamental operation is text matching.

This was a massive leap over Yahoo's tree structure. But it still had a core limitation: words don't know what they mean. Search for "bank" and you get financial institutions and riverbanks. Search for "jaguar" and you get both the car and the animal. The database has no concept of semantic context — only character sequences.


Chapter 3: The First Step Into Space (2001–2010)

Spatial Databases and the Two-Dimensional World

While Google was perfecting keyword search, a different class of database problem was being solved for maps.

The foundational data structure for spatial search is the R-tree, first described by Antonin Guttman of UC Berkeley at the ACM SIGMOD conference in 1984. An R-tree organizes spatial objects (points, rectangles, polygons) into a hierarchy of bounding boxes, allowing the database to quickly answer questions like "find all objects within this rectangle."

Think of it as a recursive zoom. The R-tree divides the world into large regions, then subdivides each into smaller ones, all the way down to individual points. To find a coffee shop within 500 meters, the database doesn't check every location on earth — it prunes huge branches of the tree that couldn't possibly be nearby.
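The pruning step can be sketched in a few lines. This is not a real R-tree (there is no insertion, node splitting, or multi-level hierarchy); it only shows how one bounding-box test can rule out every point inside a box. The flat kilometre grid, the cafe names, and the `leaves` layout are assumptions.

```python
import math

# Hypothetical leaf boxes: (min_x, min_y, max_x, max_y) in kilometres on a
# flat plane, each paired with the named points stored inside it.
leaves = [
    ((0.0, 0.0, 2.0, 2.0), {"Cafe A": (0.3, 0.0), "Cafe B": (0.8, 0.0), "Cafe C": (0.0, 0.4)}),
    ((50.0, 50.0, 52.0, 52.0), {"Cafe D": (50.5, 51.0)}),
]


def min_distance_to_box(point, box):
    """Shortest distance from a point to an axis-aligned rectangle (0 if inside)."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    dx = max(min_x - x, 0.0, x - max_x)
    dy = max(min_y - y, 0.0, y - max_y)
    return math.hypot(dx, dy)


def within_radius(query, radius):
    """Return (names in range, number of boxes actually opened)."""
    hits, boxes_opened = [], 0
    for box, points in leaves:
        if min_distance_to_box(query, box) > radius:
            continue  # prune: nothing inside this box can be close enough
        boxes_opened += 1
        for name, location in points.items():
            if math.dist(query, location) <= radius:
                hits.append(name)
    return sorted(hits), boxes_opened


print(within_radius((0.0, 0.0), 0.5))  # (['Cafe A', 'Cafe C'], 1)
```

Only one of the two boxes is opened. In a real index the same test is applied at every level of the tree, so entire regions are skipped without examining the points they contain.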

PostGIS, the spatial extension for PostgreSQL, brought this capability directly into the SQL world that developers already knew:

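SELECT name, address
FROM businesses
-- ST_MakePoint takes (longitude, latitude); the geography cast makes the radius metres
WHERE ST_DWithin(location, ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography, 500);

This query returns every business within 500 meters of a given longitude/latitude point, assuming the location column uses the geography type (with plain geometry in SRID 4326, the 500 would be read as degrees). It is a fundamentally two-dimensional search. The coordinates are no longer just data; they're the search key.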

When Google launched Google Maps on February 8, 2005, it made this capability visible to the entire world. Type "pizza near me," get pins on a map. The query was spatial before it was textual. Google Maps quietly assembled its platform through key acquisitions: Where 2 Technologies (the original C++ mapping program), Keyhole (geospatial visualization), and ZipDash (real-time traffic) — fused into what became the most widely used spatial search interface ever built.

graph TD
    Q["User Query: pizza near 37.77°N, 122.41°W"] --> RT[R-tree Index]
    RT --> B1["Bounding Box: SF North"]
    RT --> B2["Bounding Box: SF South ✓"]
    B2 --> L1["Location A: 0.3km ✓"]
    B2 --> L2["Location B: 0.8km ✗"]
    B2 --> L3["Location C: 0.4km ✓"]
    L1 --> R["Results"]
    L3 --> R

The key insight here: we had moved from one-dimensional keyword matching to two-dimensional spatial matching. The "similarity" between a query and a result was now a geometric distance in 2D space — Euclidean or geodesic.
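For the geodesic case, a common approximation is the haversine formula, which treats the Earth as a sphere. The sketch below assumes a mean radius of 6,371 km; the function name `haversine_km` and the two city coordinates are illustrative, and production systems often use ellipsoidal models instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


san_francisco = (37.7749, -122.4194)
los_angeles = (34.0522, -118.2437)
print(round(haversine_km(*san_francisco, *los_angeles)))  # about 559
```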

The question that followed naturally: what if you could do the same thing with more dimensions?


Chapter 4: Adding Dimensions (2010–2018)

From 2D Maps to High-Dimensional Spaces

A two-dimensional coordinate locates a point on a map. But a point in higher-dimensional space can represent something far more abstract.

Consider a 3D point cloud used in autonomous vehicles. LiDAR sensors capture millions of points per second, each with x, y, and z coordinates. To recognize a pedestrian, the system searches this 3D point cloud for clusters of points that match the spatial shape of a human body. The index structure, still a variant of the spatial tree, now operates in three dimensions.

Then consider a 4D database used in spatiotemporal analysis: latitude, longitude, altitude, and time. Air traffic control systems track aircraft in exactly this way. The query "find all aircraft within 10 km of this point in the next 5 minutes" is a 4D range query: a region around the point in space, extended five minutes along the time axis.

Each dimension added is conceptually identical to the previous one — we're just asking "what objects are close to this query point?" in a higher-dimensional space. The math generalizes cleanly. The Euclidean distance between two points in n-dimensional space is:

$$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$

Where each $x_i$ and $y_i$ is a coordinate along one dimension. In 2D, this is the familiar straight-line distance on a map. In 768D, the size of a typical text embedding, it's the same formula, applied across 768 axes simultaneously.
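The same formula translates directly into code that does not care how many dimensions it is given. A small sketch, with the function name `euclidean_distance` chosen for illustration:

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (2D)
print(euclidean_distance([1, 2, 3, 4], [1, 2, 3, 6]))  # 2.0 (4D)
print(euclidean_distance([0.0] * 768, [1.0] * 768))  # sqrt(768), about 27.71
```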

The critical conceptual leap: any measurable property of an object can become a dimension. Height, weight, color, sound frequency, word meaning — if you can quantify it, you can make it a coordinate axis. And once something is a point in space, you can search for it.

graph LR
    A["2D: latitude, longitude\n(Google Maps, 2005)"] --> B["3D: x, y, z\n(LiDAR, point clouds)"]
    B --> C["4D: x, y, z, time\n(Air traffic, spatiotemporal DBs)"]
    C --> D["nD: 768–1536 dimensions\n(Text embeddings, 2013+)"]

Chapter 5: Meaning Becomes a Coordinate (2013–2022)

Word2Vec and the Birth of Semantic Search

The breakthrough that connected spatial databases to language came from Google Research in 2013: Word2Vec.

Word2Vec is a technique that converts words into points in a high-dimensional vector space — typically 100 to 300 dimensions. The remarkable property is that words with similar meanings end up geometrically close to each other. The famous example:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

Meaning became arithmetic. Relationships between concepts became distances and directions in space.
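To make the arithmetic concrete, here is a toy version using hand-made three-dimensional vectors. The values are invented for illustration and are not output from Word2Vec, which learns its vectors from large text corpora; the `analogy` helper follows the usual convention of excluding the three input words from the candidates.

```python
import math

# Hand-made 3D vectors (royalty, maleness, femaleness), purely illustrative.
# Real Word2Vec vectors are learned from text and have 100 to 300 dimensions.
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "princess": [0.7, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```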

This was no longer just a curiosity. It was a new kind of database primitive. If words have coordinates, then finding similar words is a spatial search problem — exactly the same class of problem that R-trees were invented to solve in 1984.

The challenge was scale. Tree indexes such as the R-tree lose most of their pruning power as the number of dimensions grows into the hundreds, and finding the nearest neighbor in a 300-dimensional space across millions of vectors is computationally expensive if done naively. In 2017, Facebook AI Research released FAISS (Facebook AI Similarity Search) as an open-source library specifically designed to solve this: approximate nearest-neighbor (ANN) search across billions of high-dimensional vectors, optimized for both CPU and GPU.
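For reference, the naive approach looks like the sketch below: an exhaustive scan that computes one distance per stored vector. This is the exact baseline that ANN libraries approximate, not FAISS itself; the dataset sizes, the random data, and the function name `exact_nearest` are assumptions.

```python
import heapq
import math
import random


def exact_nearest(query, vectors, k=3):
    """Exhaustive k-nearest-neighbor scan: one distance computation per stored vector."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )


random.seed(0)  # fixed seed so the toy data is reproducible
dims, count = 300, 2000  # assumed sizes, tiny next to real workloads
stored = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(count)]

# Query with a slightly perturbed copy of vector 42; it should come back first.
query = [value + 0.01 for value in stored[42]]
nearest = exact_nearest(query, stored)
print(nearest[0])  # 42
```

The work grows with the number of vectors times the number of dimensions for every single query. ANN indexes accept a small chance of missing the true nearest neighbor in exchange for examining only a fraction of the stored vectors.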

As language models grew larger — from Word2Vec's 300 dimensions to BERT's 768 dimensions to modern models with 1,536 or more — the "points" became increasingly rich representations of meaning. A sentence embedding isn't just a word: it's a compressed summary of the context and intent of an entire passage.

graph TD
    T["Input: 'What restaurants are open late?'"] --> E["Embedding Model"]
    E --> V["768-dimensional vector\n[0.23, -0.71, 0.08, ..., 0.44]"]
    V --> ANN["Approximate Nearest Neighbor Search\nin Vector Database"]
    ANN --> R1["Result: doc about late-night dining"]
    ANN --> R2["Result: restaurant hours FAQ"]
    ANN --> R3["Result: 24-hour food options"]

Chapter 6: The Vector Database Era (2019–Present)

Dedicated Infrastructure for High-Dimensional Search

By 2019, it was clear that traditional relational databases — built around the assumption that you'd search for exact values in discrete columns — were not well suited for the new workload. Finding the 10 most semantically similar documents to a query isn't a WHERE clause problem. It's a nearest-neighbor problem in a space with hundreds of dimensions.

A new category of dedicated infrastructure emerged:

  • Pinecone (2019): A fully managed cloud service that abstracts all infrastructure. Send vectors via API, get results. Popular for teams that want to ship fast without managing servers.
  • Milvus (2019): Open-source, built for massive scale. Supports billions of vectors and diverse index types (IVF, HNSW, and GPU-accelerated variants). Strong in high-volume production systems.
  • Weaviate (2019): AI-native database with built-in support for generating embeddings, hybrid search (keyword + vector), and complex metadata filtering.
  • FAISS (Meta, open-source): Not a full database, but a similarity search library that some vector databases, including Milvus, build on.

Market forecasts reflect that demand. One estimate values the global vector database market at roughly $2.38 billion in 2025 and projects $18.86 billion by 2035, a compound annual growth rate of roughly 23%.
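The growth rate follows from the two endpoint figures. A quick check, assuming ten compounding years between 2025 and 2035:

```python
start_value = 2.38  # USD billions, 2025 (figure quoted above)
end_value = 18.86  # USD billions, 2035 projection (figure quoted above)
years = 2035 - 2025

# Compound annual growth rate: the constant yearly rate linking the endpoints.
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"{cagr:.1%}")  # 23.0%
```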

Traditional databases are adapting too. pgvector, an extension for PostgreSQL, adds vector search directly to one of the most widely used open-source relational databases. The worlds are converging. SQL developers can now write:

SELECT content
FROM documents
ORDER BY embedding <=> '[0.23, -0.71, 0.08, ...]'
LIMIT 5;

The <=> operator is cosine distance — a spatial proximity measure — running inside a database originally designed for inventory tables and customer records.
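Cosine distance is one minus the cosine of the angle between two vectors, so it depends on direction and ignores length. A small sketch of the calculation, with the function name `cosine_distance` chosen for illustration:

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 (same direction, different length)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
```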


Chapter 7: The Full Stack of Modern AI (2022–Present)

Why LLMs Need Vector Databases

Large language models like GPT-4 or Claude have a fundamental constraint: a finite context window. They can only "see" a limited amount of text at once — perhaps 128,000 tokens, which sounds large but is tiny compared to an organization's entire document library, codebase, or knowledge base.

Retrieval-Augmented Generation (RAG) is the standard solution. The architecture works in three phases:

flowchart LR
    subgraph Ingestion
        D["Documents"] --> EM1["Embedding Model"]
        EM1 --> VDB["Vector Database\n(Pinecone / Milvus / Weaviate)"]
    end
    subgraph Retrieval
        Q["User Query"] --> EM2["Embedding Model"]
        EM2 --> ANN2["ANN Search"]
        VDB --> ANN2
        ANN2 --> TOP["Top-k relevant chunks"]
    end
    subgraph Generation
        TOP --> CTX["Context + Query"]
        CTX --> LLM["LLM (GPT-4 / Claude)"]
        LLM --> ANS["Answer"]
    end

  1. Ingestion: Every document in your knowledge base is converted into an embedding (a high-dimensional vector) and stored in the vector database.

  2. Retrieval: When a user asks a question, the query is also converted into an embedding, and the vector database finds the most semantically similar document chunks.

  3. Generation: Those retrieved chunks are passed to the LLM as context, allowing it to answer questions about documents it was never trained on.
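The three phases can be sketched end to end in plain Python. Everything here is a stand-in: `embed` is a word-count vector rather than a trained embedding model (so it captures word overlap, not meaning), `store` is a list rather than a vector database, and `build_prompt` only assembles the text an LLM would receive without calling any model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: a word-count vector. Real systems use a trained model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# 1. Ingestion: embed every chunk and keep (vector, text) pairs in memory.
chunks = [
    "The kitchen is open until 2 am on Fridays and Saturdays.",
    "Our refund policy allows returns within 30 days.",
    "Parking is available behind the building.",
]
store = [(embed(chunk), chunk) for chunk in chunks]


# 2. Retrieval: embed the query and take the top-k most similar chunks.
def retrieve(query, k=1):
    query_vector = embed(query)
    ranked = sorted(
        store, key=lambda item: cosine_similarity(query_vector, item[0]), reverse=True
    )
    return [text for _, text in ranked[:k]]


# 3. Generation: build the prompt an LLM would receive (no model is called here).
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


print(build_prompt("Until when is the kitchen open on Saturdays?"))
```

Swapping `embed` for a real embedding model and `store` for a vector database changes the quality and scale of retrieval, but the shape of the pipeline stays the same.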

The vector database is the memory layer. The LLM is the reasoning layer. Neither is sufficient alone.


The 30-Year Arc

Here is the full progression, from a Stanford campus trailer in 1994 to billion-parameter AI systems in 2025:

timeline
    title The Evolution of Search and Databases
    1994 : Yahoo Directory launched
         : Hierarchical tree, human-curated
         : Relational DB with text columns
    1998 : Google launched
         : PageRank + inverted index
         : Keyword matching in 1D
    2005 : Google Maps launched
         : R-tree spatial indexing
         : 2D coordinate search
    2010 : 3D and 4D spatial DBs
         : LiDAR, spatiotemporal data
         : n-dimensional extensions
    2013 : Word2Vec by Google
         : Words as 300D vectors
         : Semantic meaning = geometry
    2019 : Pinecone, Milvus, Weaviate
         : Dedicated vector databases
         : ANN search at scale
    2022 : RAG + LLMs go mainstream
         : Vector DB as AI memory
         : 768–1536D embeddings

The thread that runs through all of it is a single question, asked with increasing sophistication:

What data is closest to this query?

In 1994, "closest" meant "in the same category." In 1998, it meant "contains the same keywords." In 2005, it meant "geographically near." Today, it means "semantically similar in a 1,536-dimensional space that encodes meaning, context, and intent."

The database didn't just evolve. It learned to understand what you meant, not just what you typed.


References

  1. Yahoo Inc. / Wikipedia (2026). Yahoo Directory. https://en.wikipedia.org/wiki/Yahoo_Directory

  2. Internet History Podcast (2015). On the 20th Anniversary – The History of Yahoo's Founding. https://www.internethistorypodcast.com/2015/03/on-the-20th-anniversary-the-history-of-yahoos-founding/

  3. Engineering and Technology History Wiki (2024). Milestones: PageRank and the Birth of Google, 1996–1998. https://ethw.org/Milestones:PageRank_and_the_Birth_of_Google,_1996-1998

  4. Sitecentre (2026). The History of Google Search — 1998 to 2026. https://www.sitecentre.com.au/blog/history-of-google-search

  5. Google / Fandom Wiki. Google Maps. https://google.fandom.com/wiki/Google_Maps

  6. EduPub (2025). A Brief History of Google Maps. https://www.edupub.org/2025/10/a-brief-history-of-google-maps.html

  7. Guttman, A. (1984). R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference, pp. 47–57. https://www.semanticscholar.org/paper/R-trees:-a-dynamic-index-structure-for-spatial-Guttman/c5847fb3899eea98d544cced63d49886ecb17d9b

  8. PostGIS Project. PostGIS Spatial and Geographic Objects for PostgreSQL. https://postgis.net/

  9. Mikolov, T. et al. / Medium summary (2024). Word2Vec and Vector Embeddings in RAG Applications. https://medium.com/thedeephub/vector-embeddings-in-rag-applications-9ea8043c172b

  10. Meta AI Research. FAISS: Facebook AI Similarity Search. Referenced via: https://medium.com/@sergiopr89/vector-databases-for-efficient-rag-a-first-look-at-faiss-and-redis-b18314cd30b6

  11. Fundamental Business Insights (2025). Vector Database Market Size & Share — America, Europe, & APAC Outlook 2026–2035. https://www.fundamentalbusinessinsights.com/industry-report/vector-database-market-13287

  12. SNS Insider / GlobeNewswire (2025). Vector Database Market to Reach USD 10.6 Billion by 2032. https://www.globenewswire.com/news-release/2025/03/07/3039040/0/en/Vector-Database-Market-to-Reach-USD-10-6-Billion-by-2032-SNS-Insider.html

  13. Emasterlabs (2026). Pinecone vs Milvus: The Ultimate Vector Database Comparison for 2026. https://emasterlabs.com/pinecone-vs-milvus

  14. Meta Intelligence (2025). Vector Databases: Pinecone vs Weaviate vs Milvus. https://www.meta-intelligence.tech/en/insight-vector-database