Old Millennials, LLMs, and the Internet

I have been thinking a lot about the internet (the web?) lately, especially vis-a-vis LLMs, and I keep trying and failing to write down my thoughts. So rather than keep attempting to wring a coherent argument out of myself, I’m going to take the coward’s way out and write up some loosely-connected bullet points.

  • I’m approximately 40, and nearly all of the decision makers driving the recent AI/LLM explosion are roughly my age or older. I think that if you experienced the explosion of the web from the late ’90s through the iPhone—if you were old enough to experience a need and then (still) flexible enough and curious enough to dive into meeting it with new technology—then LLMs can feel like the apotheosis of the internet. If you are old enough, then you have sat with friends wondering what actor was in a film, or what the name of a musician’s side project was, or how to make guacamole, and been entirely unable to answer the question in that moment. And as time went on you would have experienced asking Lycos or Yahoo or AltaVista or Google about more and subtler things, and been delighted as more and more of them had useful results. I think the magic of living through that transformation established a kind of trust in the internet as a communal body of knowledge (even if most of those now-40-year-olds were also posting nonsense on Something Awful or 4chan).
  • That content was coming from a wide variety of sources, too. There were central consolidators of information like Wikipedia and IMDb, but it was also very common to find your answer on someone’s personal webpage on Geocities or Angelfire, or on a dedicated community / fansite, or a phpBB forum, or one of seemingly innumerable university professors’ websites. My recollection is that simply making a website—any website—was fun, and novel, and seemed like a way to build a presumed-to-be useful skill, so people would set out primarily to learn HTML and only secondarily (at least at first) to share their knowledge about fish keeping or making sourdough or the lyrics of The Doors. The hosting was free, the time was an investment, and beyond “maybe put AdSense on it” there wasn’t much of a business case to be made and thus little in the way of an invisible hand forcing consolidation in the market to provide information about the South Jersey ska-punk scene.
  • Now much of that kind of personal content has moved within platforms. Some of those, like Reddit or Fandom, are within view of a Google search, but the most devoted “interest community” activity is now enclosed within Instagram, Facebook, Discord, Telegram, TikTok, etc, etc. Self-hosting an open community on the internet has a lot of challenges (technical, financial, moderation, etc) so this is pretty understandable, but it feels like a tragedy to me. To wit, film photography is seeing a resurgence, but if you Google for help with your parent’s old camera the best results will probably be on Flickr from more than a decade ago, because the relevant subreddits are more casual and there’s no easy way to find the right Discords and Facebook Groups when you are new to the hobby.
  • Personally, I think Google would be better off today if they had invested in solving those problems to enable people to keep such conversations happening out in the open. And not only for LLM purposes; search results are terrible now partly because there is so much keyword spam content, but also because there’s scant new hobbyist content (leaving only the 2010 Flickr threads about that hand-me-down Nikon). There is, as people have found, Reddit, but attempts to game Google through spam on Reddit are already testing the dedication of volunteer moderators.
  • I hope that LLM-mania will prompt the tech giants to encourage a return of the kind of self-consciously open knowledge sharing that flowered on the early web—and to be clear, I hope they can encourage that by creating financial support for that knowledge sharing in addition to relying on good UX and public-spiritedness. But Google has YouTube, Meta has Facebook and Instagram and Threads, and for the moment OpenAI has ambiguous laws on its side and an Uber-style habit of taking first and only asking for permission later if forced to. While it is true that advances in AI make it easier (or at least possible) to get value out of existing pools of data, I think there is something almost mercantilist in the way companies are focused solely on the value of their proprietary datasets. So I don’t think a Google or a Meta will feel that they have anything to gain from enabling the creation and sharing of data that is not exclusive to their own use.
  • Reporter James Surowiecki’s The Wisdom of Crowds turned 20 years old this May. I’ve never read it! But I did hear highly condensed accounts of it in both more and less formal presentations in the Bay Area in the years immediately following its 2004 release. Such crowds were at their wisest when averaging a number, as in guessing the weight of an ox at a county fair (the anecdote which invariably accompanied any mention of the book), or assessing the quality of a restaurant (there’s a toy sketch of that averaging after this list). So for a time people were very interested in things like tagging systems that might offer a way to make anything from LinkedIn resumes to Flickr photos roughly reducible to something that could be navigated and ranked with simple arithmetic.
  • I’m no machine learning PhD, but as I understand it the current explosion of LLMs is enabled by a combination of techniques going back to the likes of Word2Vec that can represent text (among other things) as many-dimensional grids of numbers, and the increasing power of graphics cards to process more of those matrices in much less time. So now an LLM can summarize a page of text, and that, too, is ultimately a matter of math. But at a product level, I think people (likely my age!) are mistakenly imagining that recipes can be aggregated with the same kind of deterministic Excel math used to summarize movie review scores. One prank pizza recipe can cause subtler havoc than one prank product rating.
  • For now, there are broadly two ways that an LLM can answer your question.
    • One is for the LLM to answer the question “itself”, drawing from its base dataset (which we can shorthand as “the entire web”) and modulated by whatever training or fine-tuning it received (shorthand: showing it example question/answer pairs, or people giving proposed answers a thumbs up or thumbs down). This is fun and impressive, because it seems like the internet has become sentient, it feels like talking to an all-knowing demigod, and at its best these answers appear to distill the entirety of human knowledge about a subject into a concise response. On the other hand, because the answer is being sieved out of such an enormous soup of data, these responses sometimes exhibit inaccuracies that weren’t present in the source material.
    • The other way (called Retrieval-Augmented Generation, or RAG) is for a process to search a body of content for text likely to have the answer, and then have the LLM answer the question drawing only from the text of those sources (there’s a rough sketch of this flow after this list). This tends to create responses that hew closer to the source text, but it’s a bit less magical because reading a Wikipedia page and summarizing the first paragraph is something most people could just do themselves, and seeing the sources cited reminds people of how much they may not trust Wikipedia, or a given newspaper, or Reddit, or a blog post.
  • Something that I imagine is tantalizing to founders about the Wisdom of Crowds framing is that each individual contribution is presumed to be almost worthless—so Yelp, for example, is providing more value by aggregating than you do by writing your individual review. I like the idea of a universe where there can be millions of individual pages and conversations about (say) film photography, but I think that (perhaps outdated!) view of the internet may play into the idea that none of those non-professional sources are particularly valuable individually, and that an LLM trained on them summarizing an answer out of so much photography text soup is what is really providing value. Meanwhile, if there really are only a handful of sources to draw from, it becomes clearer that they are the actual resource, that they are providing the value—and that they should probably be paid.
  • I will say that I think both answering strategies are and will continue to be useful in practice, and that different people may want different kinds of answers to the same questions—I may want the zeitgeist-thru-LLM’s quick answer to what temperature to bake salmon at, but lots of context around what is the best laser printer; you may want the opposite. But if the authentic human conversations remain in closed communities (maybe even more so in flight from AI), and the display-ad websites disappear along with their Google traffic, and if everywhere on the internet is flooded with okay-ish AI-generated empty content, then I don’t know where new source material is going to come from. It’s genuinely exciting that we’ve achieved the early 2000s goal of creating the greatest interactive encyclopedia ever (boy did computer nerds love talking about the Primer in Neal Stephenson’s The Diamond Age). But the information ecosystem of the internet will have to evolve if that encyclopedia is going to continue to be updated into the 2030s.
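
To make the “averaging a number” point concrete, here is a toy sketch of the ox-guessing anecdote. Everything in it is invented for illustration (the weight, the number of fairgoers, and the size of their errors are my own made-up parameters, not figures from the book); the point is just that independent numeric errors cancel out when you take a mean.

```python
import random

# Toy version of "guess the weight of the ox": many independent, noisy
# guesses average out to something close to the true value.
# All numbers here are invented for illustration.
random.seed(0)
true_weight = 1198  # pounds; an arbitrary "true" weight for the demo

# Each fairgoer guesses the true weight plus some personal error.
guesses = [true_weight + random.gauss(0, 150) for _ in range(800)]

crowd_estimate = sum(guesses) / len(guesses)
print(f"true weight:    {true_weight}")
print(f"crowd estimate: {crowd_estimate:.0f}")
```

Individual guesses are off by roughly 150 pounds on average, but the mean of the crowd usually lands within a few pounds, because the errors are numbers and numbers cancel. That cancellation is exactly the property a prank pizza recipe does not have.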
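And here is a very rough sketch of the retrieve-then-answer (RAG) flow described above, and of the “text becomes numbers you can do math on” idea. The embed() function below is a crude stand-in that just counts words; real systems use learned embeddings (Word2Vec’s descendants and friends). The corpus, question, and prompt wording are all made up, and the final call to an actual LLM is omitted.

```python
from collections import Counter
from math import sqrt

# Stand-in for a real embedding model: represent text as a word-count
# vector. Real systems use learned, dense embeddings, but the idea of
# turning text into numbers you can compare with math is the same.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A tiny made-up "body of content" -- imagine hobbyist forum posts.
corpus = [
    "Load 35mm film in subdued light and advance it until the counter reads 1.",
    "The best ska-punk shows in South Jersey were at the VFW hall.",
    "For a hand-me-down Nikon, check the light seals before shooting film.",
]

question = "How do I load film into my parent's old Nikon camera?"

# Retrieval step: rank the sources by similarity to the question...
ranked = sorted(corpus, key=lambda doc: cosine(embed(question), embed(doc)), reverse=True)
sources = ranked[:2]

# ...then build a prompt asking the model to answer only from those
# sources. (The call to an LLM to generate the answer is omitted here.)
prompt = (
    "Answer the question using only the sources below, and cite them.\n\n"
    + "\n".join(f"Source {i + 1}: {s}" for i, s in enumerate(sources))
    + f"\n\nQuestion: {question}"
)
print(prompt)
```

The part that matters for the argument is the corpus: if the retrieval step surfaces only a handful of real sources, it is hard to pretend those sources aren’t the ones providing the value.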
