You may have heard the tongue-in-cheek depiction of a statistician: someone who can sit on a large block of ice with her hair on fire and, when asked how she feels, calmly reply, “you know, on average, I feel pretty good.”
Genealogy Gophers is unique in several ways. One of the most interesting is how it searches its large library of genealogy publications in order to zero in on the people you’re trying to find. The great results our users are getting have led to lots of questions along the lines of “what’s your search engine’s secret sauce?” (Yes, it has to do with a lot of ice and fire.)
In answering that secret sauce question we’ll try not to be too techno-geeky. OK, we’ll make some statistics references and throw in some cool-sounding words like “algorithm”, “precision”, and “indexing”. But we’ll use a simple example to illustrate how GenGophers.com’s unique technology works relative to typical book search engines. Even with that, if you’re still squirming with the details, don’t feel guilty about bailing out at any time and waiting for the next blog post. Or, find your nearest freezing/overheated statistician if you need some help.
Here’s our example situation: You’re trying to find more information about your ancestors, William David Smith and Harriett Susan Smith, who were married in Michigan in 1895. With tens of thousands of family and regional history books on the web, you know that some of them can add some interesting new facts to what you already know. But how do you dig through all those books (and all the pages in all those books) to find your Bill and Hattie?!?
Your typical book search engine would have already done a few things to make hunting for them easier for you:
- It would have searched through every digitized page of every book in its library, genealogy and non-genealogy books alike
- As it did that, it would have created an index of every word it found in every book
- The indexing process would have attached to each found word the name of the book and the page where the word appeared (the word could be a person’s name or anything else; the search engine doesn’t know or care)
- That resulting index becomes a handy and fast searchable “word” look-up table for the search engine
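For the techno-geeky among you, here is a minimal Python sketch of that kind of word index. It is only an illustration of the general idea, not any particular search engine's code. The function name `build_index`, the book titles, and the page texts are all made up for this example.

```python
import re
from collections import defaultdict

def build_index(library):
    """Map each lowercased word to the set of (book, page) places it appears."""
    index = defaultdict(set)
    for book, pages in library.items():
        for page_number, text in pages.items():
            # Split the page into words; the index neither knows nor cares
            # whether a word is a person's name or anything else.
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add((book, page_number))
    return index

# Hypothetical two-book library: {book title: {page number: page text}}
library = {
    "Sample County History": {
        12: "William Smith married in Michigan in 1895.",
        13: "Martin Smith farmed near the river. William Jones kept the mill.",
    },
    "Sample Bird Guide": {
        4: "The smith hammered while a robin sang.",
    },
}

index = build_index(library)
print(sorted(index["smith"]))
```

Notice that looking up "smith" also returns the bird guide, because a plain word index treats the blacksmith and the surname as the same word.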
To start your search for Bill and Hattie you enter the name William Smith (without quotes) into the search form. The engine will search its index for those two words and return every page from every book having both the word William and the word Smith. It could be that you find your relative William Smith as shown in Example A below. Bingo!
Unfortunately (as you probably already know), it’s more likely that the given name William it finds is not associated with the surname Smith it finds. Smith may have appeared, say, at the top of a page with a different person’s given name, as in Martin Smith, while William appeared at the bottom of the page with a different person’s surname, say, William Jones. Ugh. That page isn’t showing your William Smith. It’s shown below as Example B:
The common solution most of us use to solve this problem – the search engine not correctly associating the names William and Smith together as one person – is to use quotes around the name “William Smith” when initiating the search. Forcing the association in that way could again create a “Bingo!” moment as shown in Example A above.
Although the additional precision in a search using the term “William Smith” (in quotes) can help in some situations, it can unfortunately introduce a different set of frustrating quality-of-search problems. That can happen, for example, when your relative William Smith is actually included in a book, but isn’t found by the search engine because:
- his name is displayed only when linked together with his wife’s, Harriett (shown in Example C)
- the publication used his full name, William David Smith (Example D)
- his given name is abbreviated as Wm. (Example E), not unusual in older books and publications
In each of these common cases the search engine will miss returning a hit on your relative (even though this is actually him!) because the words it has indexed don’t match the precise term you supplied in quotes, “William Smith”. Some search engines, such as Google’s, provide search features (wildcards, for example) that may return matches in some of these cases, but not all.
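To see both problems side by side, here is a small Python sketch that runs an unquoted search and a quoted search over five pages. The page texts are invented stand-ins for Examples A through E, and the helper names `matches_all_words` and `matches_phrase` are our own for this illustration. Real search engines are far more elaborate.

```python
import re

def words(text):
    """Lowercase a text and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())

def matches_all_words(query, text):
    """Unquoted search: every query word appears somewhere on the page."""
    page_words = set(words(text))
    return all(word in page_words for word in words(query))

def matches_phrase(query, text):
    """Quoted search: the query words appear side by side, in order."""
    q, p = words(query), words(text)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))

# Invented stand-ins for the pages in Examples A through E
pages = {
    "A": "William Smith was born in Michigan.",
    "B": "Martin Smith owned the farm. Later it passed to William Jones.",
    "C": "William and Harriett Smith were married in 1895.",
    "D": "William David Smith settled in Michigan.",
    "E": "Wm. Smith served as town clerk.",
}

for label, text in pages.items():
    unquoted = matches_all_words("William Smith", text)
    quoted = matches_phrase("William Smith", text)
    print(label, "unquoted:", unquoted, "quoted:", quoted)
```

With these stand-in pages, the unquoted search returns page B (the wrong people), and the quoted search returns only page A, missing your relative on pages C, D, and E.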
Genealogy Gophers’ search engine uses a different and unique indexing approach when adding new publications to its library. To increase the probability of finding William and Harriett and solve some of these search technology problems, the GenGophers.com process looks something like this:
- Like other search engines, GenGophers.com will start by searching through every digitized page of every book in its library, although its library contains only genealogy publications (resulting in search results that are only genealogy related)
- Unlike other search engines, GenGophers.com:
  - begins its indexing process by identifying and indexing only those words likely to be the name of a person, a date, or a geographical place (e.g., William, Harriett, Smith, 1895, Michigan)
  - then uses statistical algorithms (just think ice and fire) to try and associate the names, dates, and places it’s found, and recognize combinations of them that are likely to represent real people. Using our William Smith search instance:
    - in Example B it would associate the names “Martin” and “Smith” as a person and “William” and “Jones” as a different person, and recognize that neither person is “William Smith”
    - in Example C it would recognize that the surname “Smith” is associated with the given name “William” (also with “Harriett”) that precedes it, and predict that this is likely your “William Smith”
    - in Example D it would recognize that the names “William”, “David”, and “Smith” are associated and make a match with your “William Smith” search
    - in Example E it would know that “Wm.” is a common abbreviation for “William” and then associate that given name (even as an abbreviation here) with the following surname “Smith”, again matching your “William Smith” search
- although not described in these examples, the Genealogy Gophers search engine uses other statistical and machine learning tools in similar ways to associate dates and places with the person names it finds, and return to users the best possible genealogy-only search results
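If you'd like a feel for what "associating" names means, here is a toy Python sketch. To be clear, this is not the Genealogy Gophers engine: it swaps the statistical algorithms for a few hand-written rules, and the name lists, the abbreviation table, and the functions `find_people` and `has_person` are all assumptions made up for this illustration. The page texts are invented stand-ins for Examples B through E.

```python
import re

# Tiny hand-made lookup tables (assumptions for this sketch only)
GIVEN_NAMES = {"William", "Harriett", "Martin", "David"}
SURNAMES = {"Smith", "Jones"}
ABBREVIATIONS = {"Wm": "William"}

def find_people(text):
    """Return (given_names, surname) pairs such as (["William", "David"], "Smith")."""
    people = []
    pending = []     # given-name lists still waiting for a surname
    previous = None  # kind of the previous token: "given", "and", or None
    for token in re.findall(r"[A-Za-z]+", text):
        token = ABBREVIATIONS.get(token, token)  # "Wm." becomes "William"
        if token in GIVEN_NAMES:
            if previous == "given":
                pending[-1].append(token)        # middle name: William David
            else:
                pending.append([token])          # a new person starts here
            previous = "given"
        elif token == "and" and previous == "given":
            previous = "and"                     # William and Harriett share a surname
        elif token in SURNAMES and previous == "given":
            people.extend((given, token) for given in pending)
            pending, previous = [], None
        else:
            pending, previous = [], None         # anything else breaks the chain
    return people

def has_person(text, given, surname):
    """True if some person on the page has this first given name and surname."""
    return any(g[0] == given and s == surname for g, s in find_people(text))

# Invented stand-ins for the pages in Examples B through E
pages = {
    "B": "Martin Smith owned the farm. Later it passed to William Jones.",
    "C": "William and Harriett Smith were married in 1895.",
    "D": "William David Smith settled in Michigan.",
    "E": "Wm. Smith served as town clerk.",
}

for label, text in pages.items():
    print(label, has_person(text, "William", "Smith"))
```

Even these crude rules reject page B and accept pages C, D, and E for a William Smith search. A statistical approach replaces the fixed rules with likelihoods, which is what lets it cope with the messy variety found in real books.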
This is probably enough ice and fire for one blog post. But we’ll mention in closing that the Genealogy Gophers search technology also attaches statistical probabilities to each of the name, date, and place associations it makes. It then uses those probabilities to rank the results it returns on each search you make. We’ll save that exciting story for a future blog post! We know the statisticians among you can’t wait.
— Your friends at Genealogy Gophers