July 20, 2009...7:21 am

Population, Signals, Technology and Output of Automated Services

This page is a work in progress. For every service that uses data-mining techniques to produce a media product, I try to categorize four aspects (population, signals, technology and output) in small case studies:

1) Population: Which population of elements does the service rank?
2) Signals: Which data are drawn on and processed as signals for the ranking?
3) Technology: What does the provider officially call the technology, and how does it work?
4) Output: By which criteria is the service's output sorted, and what is the official terminology?

Contents

1. News Aggregators
1.1. Google News
1.2. Techmeme
1.3. Hashtags.org
1.4. Bing News Aggregator
1.5. Tweetmeme

2. Media Planning and Ad Space Marketing
2.1. Google AdWords
2.2. Yahoo Search Engine Marketing / Yahoo Sponsored Links
2.3. Google AdSense
2.4. Google AdPlanner
2.5. DoubleClick MediaVisor
2.6. Microsoft Search Advertising
2.7. Ask Sponsored Links
2.8. Adelio
2.9. Facebook

3. Reputation Systems
3.1. eBay Feedback
3.2. Everything2
3.3. Slashdot
3.4. Technorati
3.5. Postrank
3.6. Tweet Rank

4. Social Bookmark Aggregators
4.1. Delicious

5. Social News Aggregators
5.1. Digg

6. Recommender Systems
6.1. Amazon Recommendations
6.2. LastFM
6.3. Apple Genius
6.4. Pandora
6.5. StumbleUpon
6.6. Netflix
6.7. YouTube Recommender System (Related Videos)
6.8. Google Reader Recommender System
6.9. Facebook Friend Suggestions

7. Search Engines
7.1. Google Search
7.2. Yahoo Search
7.3. Bing Search
7.4. Ask Search
7.5. Baidu Search
7.6. DeepDyve Search
7.7. Twitter Search

8. Citation Indexing
8.1. Google Scholar
8.2. CiteSeerX

9. Comparison Systems
9.1. Bing Air Travel Airfare Price Prediction
9.2. Zillow.com Real Estate Price Prediction
9.3. Google Product Search

10. Semantic Web Apps
10.1. Powerset Search

11. Location-based Systems
11.1. Citysense

12. Other
12.1. Flickr Interestingness
12.2. Google Trends
12.3. Bing xRank
12.4. Hunch.com

1.1. Google News

Population

  • News stories from the 25,000 most important online news media worldwide. (Source)
  • German edition: more than 700 German-language news sources (Source)

Signals

  • Frequency with which a news story is picked up
  • Medium that picks up a story
  • freshness
  • location
  • relevance
  • diversity
  • personalized interests (Source)
  • Web History, used for personalization
  • News the user clicks most often (for personalization)
  • News other users click most often (the “most popular” section)
  • Language
  • Region
  • Topic (Source)

Technology

  • Computer-generated news site that aggregates headlines from news sources worldwide, ranked by computers that evaluate. (Source)
  • Computer-generated news website (Source)
  • Grouping technology, clustering algorithms (Source)
  • “Auswahl und Anordnung der Artikel auf dieser Seite werden durch ein Computerprogramm automatisch bestimmt.” (disclaimer on the home page of the German edition)
  • “The selection and placement of stories on this page were determined automatically by a computer program” (disclaimer on the home page of the English edition)

Output

  • Sorted by section
  • Relevance measure (computed from the signals)
  • Grouping of stories

1.2. Techmeme

Population

  • “the must-read stories in technology …. across hundreds of news sites” (Quelle)
  • Online Tech-News, englischsprachig

Signal

  • importance
  • number of links to the story’s web page
  • freshness
  • “Anti-gaming”: a high number of links in a short period of time, or by a small number of people
    (Wikipedia)
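To make the interplay of these signals concrete, here is a minimal sketch of a link-based story score. It is not Techmeme's actual algorithm, which is not public. The function name `story_score`, the 12-hour half-life and the rule "each linking source counts once" are assumptions chosen purely for illustration.

```python
from datetime import datetime, timedelta

def story_score(links, now, half_life_hours=12.0, max_per_linker=1):
    """Score one story from (linker, timestamp) pairs.

    Assumed rules (illustrative only):
    - each linker counts at most `max_per_linker` times (crude anti-gaming)
    - each counted link is discounted by its age (freshness)
    """
    counted = {}
    score = 0.0
    # Process links oldest first so the earliest link of a source is the one kept.
    for linker, ts in sorted(links, key=lambda pair: pair[1]):
        if counted.get(linker, 0) >= max_per_linker:
            continue  # repeated links from the same source add nothing
        counted[linker] = counted.get(linker, 0) + 1
        age_hours = (now - ts).total_seconds() / 3600.0
        score += 0.5 ** (age_hours / half_life_hours)
    return score
```

With these assumptions, five links from one blog score no higher than a single link, while three links from three different blogs score three times as high.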

Technology

  • “computer algorithm extended with direct human editorial input” (Techmeme, http://techmeme.com/)
  • “Techmeme arranges all of these links into a single, easy-to-scan page. Story selection is accomplished via computer algorithm extended with direct human editorial input.” (Techmeme)
  • Technology news aggregator (Wikipedia)
  • “The full set of sites it monitors is constructed automatically, and even changes in real time based on linking. A small “seeding” list I construct manually is used to help the system build the complete list.” (Wired)

Output

  1. Relevance of the story (computed by the algorithm)
  2. Within a story, by relevance of the discussion sources

1.3. Hashtags.org

hashtags.org

Population

All hashtags on Twitter

Signal

  • Frequency of hashtags over the last six hours

Technology

  • Official name of the technology: hashtag aggregator

Output

List of hashtags by frequency
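A frequency ranking over a six-hour window can be sketched in a few lines. This is an illustration of the idea, not the code behind hashtags.org. The input format (a list of `(timestamp, text)` pairs) and the function name `top_hashtags` are assumptions.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(tweets, now, window_hours=6, n=10):
    """Count hashtags in tweets from the last `window_hours` and rank them by frequency."""
    cutoff = now - timedelta(hours=window_hours)
    counts = Counter()
    for ts, text in tweets:
        if cutoff <= ts <= now:
            # Lower-case so that #Python and #python count as the same tag.
            counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)
```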

1.4. Bing News Aggregator

1.5. Tweetmeme

Population

Links on Twitter

Signals

  • Frequency of the links
  • Time
  • Topic

Technology

  • “service which aggregates all the popular links on twitter”
  • “categorize these links into categories and subcategories”

Output

The most popular links on Twitter, ordered by category

2. Media Planning and Ad Space Marketing

2.1. Google AdWords

Population

All Google AdWords ads

Signal

  • Language
  • Location
  • Landing Page Quality Score (relevance, originality, transparency, navigability) (Source)
  • Keyword Quality Score (historical clickthrough rate, account history, historical clickthrough rate of the display URL, Landing Page Quality Score, relevance of the keyword to the ads, relevance of the keyword and the matched ad to the search query, the account's performance in the geographical region where the ad will be shown, other relevance factors) (Source)
  • Bidding Price
  • “The bids themselves are only a part of what ultimately determines the auction winners. The other major determinant is something called the quality score. This metric strives to ensure that the ads Google shows on its results page are true, high-caliber matches for what users are querying. If they aren’t, the whole system suffers and Google makes less money. Google determines quality scores by calculating multiple factors, including the relevance of the ad to the specific keyword or keywords, the quality of the landing page the ad is linked to, and, above all, the percentage of times users actually click on a given ad when it appears on a results page. (Other factors, Google won’t even discuss.) There’s also a penalty invoked when the ad quality is too low—in such cases, the company slaps a minimum bid on the advertiser.” (Wired)
  • “To determine whether an ad is relevant to a particular query, this system weighs an advertiser’s willingness to pay for prominence in the ad listings (cost-per-click or cost-per-impression bid) and interest from users in the ad as measured by the click through rate and other factors … Also assign minimum bids to advertisers keywords based on the Quality scores of those keywords. Quality score is determined by an advertiser’s keyword click-through rate, the relevance of the ad text, historical keyword performance, the quality of the ad’s landing page” (Source: Google Inc. Annual Report 2008)

Technology

  • “pay-per-click (PPC) advertising, and site-targeted advertising” (Wikipedia)
  • “AdWords, Google’s unique method for selling online advertising. AdWords analyzes every Google search to determine which advertisers get each of up to 11 “sponsored links” on every results page. It’s the world’s biggest, fastest auction” (Wired)
  • “It was called a two-sided matching market. ‘The mathematical structure of the Google auction,’ Varian says, ‘is the same as those two-sided matching markets.’” (Herman Leonard) (Wired)
  • The Keyword Pricing Index (Wired)
  • AdWords Ranking: “Ads are ranked for display in AdWords based on a combination of the maximum cost-per-click pricing set by the advertisers and click through rates and other factors” (Source: Google Inc. Annual Report 2008)
  • AdWords Discounter: “automatically lowers the amount advertisers actually pay to the minimum needed to maintain their ad position” (Source: Google Inc. Annual Report 2008)
  • Site Targeting: Based on third-party opt-in panel data. (Source: Google Inc. Annual Report 2008)
  • Google AdWords Auction System: auction-based system, automated execution of an auction. (Source: Google Inc. Annual Report 2008)
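The ranking and discounter descriptions above can be sketched as a simplified quality-weighted second-price auction. This is a textbook-style illustration, not Google's implementation: the single `quality` number, the `reserve` price and the function name `run_auction` are assumptions, and the real quality score combines factors Google does not disclose.

```python
def run_auction(ads, slots=3, reserve=0.05):
    """ads: list of (name, max_cpc_bid, quality_score) with quality_score > 0.

    Ranking: by bid * quality ("ad rank").
    Price: the smallest bid that would still beat the next ad's rank
    (the "discounter" idea), never more than the advertiser's own bid.
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    results = []
    for i, (name, bid, quality) in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            next_bid, next_quality = ranked[i + 1][1], ranked[i + 1][2]
            price = next_bid * next_quality / quality
        else:
            price = reserve  # assumed floor price for the last ad
        price = min(bid, max(price, reserve))
        results.append((name, round(price, 2)))
    return results

# Example: B wins despite a lower bid than C because its quality is higher.
print(run_auction([("A", 2.00, 0.5), ("B", 1.50, 0.8), ("C", 3.00, 0.2)]))
```

The design point this illustrates is the one the Wired quote makes: a high bid alone does not win the top position, and a winner pays less than its maximum bid when the next competitor is weak.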

Output

Ads matching the keyword of the search query, sorted by relevance and price paid.

2.2. Yahoo Search Engine Marketing / Yahoo Sponsored Links

Population

All ads placed by advertisers

Signal

  • Location
  • Language
  • Auction bids
  • Quality index (relevance of the ad compared with competitors, click-through rate) (Source 1, Source 2)
  • Landing page quality
  • algorithmic page placement, algorithmic relevance (Source)

Technology

  • Official name of the technology: Yahoo! Sponsored Search
  • paid placement, contextual advertising (Wikipedia Search Engine Marketing)
  • Computational Advertising
  • Computational advertising is a new scientific sub-discipline, at the intersection of information retrieval, machine learning, optimization, and microeconomics. Its central challenge is to find the best ad to present to a user engaged in a given context, such as querying a search engine (“sponsored search”), reading a web page (“content match”), watching a movie, and IM-ing. Yahoo Research.

Output

Ads matching the keyword of the search query, sorted by quality and price paid.

2.3. Google AdSense

Population

All ads booked through AdSense

Signal

  • Language
  • Keyword analysis
  • Word frequency
  • Font size
  • Link structure of the web (AdSense Help)
  • Placement targeting: advertisers select specific ad placements or subsections of publisher websites. (AdSense Help)

Technology

  • Content Targeting: Contextual advertising option (Google Inc. Annual Report 2008)
  • AdSense for Content: “automated technology to analyze the meaning of the content on the web site and serve relevant ads based on the meaning of the content” (Google Inc. Annual Report 2008)
  • AdSense Contextual Advertising Technology:”…techniques that consider factors such as keyword analysis, word frequency and the overall link structure of the web to analyze the content of individual web pages to match ads to them almost instantaneously….automatically serve contextually relevant ads….we employ similar techniques for matching advertisements to other forms of textual content such as email messages and Google Groups postings” Google Inc. Annual Report 2008
  • Content targeting: this technology uses factors such as language, keyword analysis, word frequency, font size and the overall link structure of the web to determine the topic of a web page and to find precisely matching Google ads for each page. (AdSense Help, “Was ist AdSense für Suchergebnisseiten”)

Output

List with a predefined number of ads that are relevant to the content of a page, sorted by relevance and price paid.

2.4. Google AdPlanner

Population

All websites that have a certain traffic volume, do not block Google via robots.txt, and comply with the quality guidelines (Ad Planner Help)

Signal

  • aggregated Google search data
  • opt-in anonymous Google Analytics data
  • opt-in external consumer panel data
  • other third-party market research (Ad Planner Help)

Technology

  • Google Ad Planner: “research and media planning tool that allows agencies and advertisers to identify the web sites their target customers are likely to visit. MediaVisor improves media buying by replacing formerly manual tasks in the media buying process.” (Google Inc. Annual Report 2008)
  • “computer algorithms”: Google Ad Planner combines information from a variety of sources, such as aggregated Google search data, opt-in anonymous Google Analytics data, opt-in external consumer panel data, and other third-party market research. The data is aggregated over millions of users and powered by computer algorithms; it doesn’t contain personally-identifiable information. (Ad Planner Help)
  • automated analysis of millions of search queries and site visits (Ad Planner Help)
  • algorithms that improve the demographic estimates (Ad Planner Help)

Output

List of websites, sorted by relevance to a target audience

2.5. DoubleClick MediaVisor

Google Ad Planner and MediaVisor
“research and media planning tool that allows agencies and advertisers to identify the web sites their target customers are likely to visit. MediaVisor improves media buying by replacing formerly manual tasks in the media buying process.” (Google Inc. Annual Report 2008)

2.6. Microsoft Search Advertising

Population

All Search Ads

Signal

  • Bids
  • Relevance

Technology

Microsoft Search Advertising

Output

Search ads ordered by relevance to the search query

2.7. Ask Sponsored Links

http://sponsoredlistings.ask.com/

2.8. Adelio

No technical information available

2.9. Facebook

No technical information available

3. Reputation Systems

3.1. eBay Feedback

Population

All eBay users

Signal

  • User feedback
  • Ratings

Technology

Feedback system (Wikipedia)

Output

A rating of users' trustworthiness, expressed as a percentage (Wikipedia)
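A percentage rating of this kind can be sketched as the share of positive ratings among all positive and negative ratings. The rule that neutral ratings are ignored, the string labels and the function name `positive_feedback_percent` are assumptions for illustration. eBay's exact rules (for example the time window considered) are not described in these notes.

```python
def positive_feedback_percent(ratings):
    """ratings: iterable of 'positive', 'neutral' or 'negative' strings.

    Returns the positive share among positive and negative ratings,
    or None if there is nothing to rate (assumed convention).
    """
    ratings = list(ratings)
    positives = sum(1 for r in ratings if r == "positive")
    negatives = sum(1 for r in ratings if r == "negative")
    rated = positives + negatives
    if rated == 0:
        return None
    return round(100.0 * positives / rated, 1)
```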

3.2. Everything2

Population

All user-generated articles on http://everything2.com

Signal

Technology

Everything2 Voting/Experience System (Everything2)

Output

Article ratings

3.3. Slashdot

Population

All commenters on Slashdot

Signal

  • Ratings by users
  • “+1 point” or “-1 point”
  • predefined labels, such as Flamebait or Informative (Wikipedia)

Technology

Karma

Output

List of comments, filtered by the reputation of the comment and the reputation of the comment's author

3.4. Technorati

Population

All weblogs and weblog posts that meet the quality guidelines. No forums, social networks or aggregation sites. (Technorati)

Signal

Number of unique blogs that link to a website (Technorati)

Technology

Technorati Authority: “Technorati Authority is the number of blogs linking to a website in the last six months. The higher the number, the more Technorati Authority the blog has.” (Technorati Blog, Technorati Support)
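The definition quoted above translates almost directly into code. The sketch below counts distinct linking blogs within a 180-day window. The link-record format `(source_blog, target_site, timestamp)`, the approximation of six months as 180 days and the exclusion of self-links are assumptions.

```python
from datetime import datetime, timedelta

def authority(links, target, now, window_days=180):
    """Number of distinct blogs that linked to `target` within the window."""
    cutoff = now - timedelta(days=window_days)
    blogs = {
        src
        for src, dst, ts in links
        if dst == target and ts >= cutoff and src != target  # ignore self-links (assumed)
    }
    return len(blogs)
```

Because the count is over unique blogs, a single blog linking ten times contributes exactly as much as a blog linking once.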

Output

Search over blog posts, weighted by the relevance of the respective blogs or sorted by recency
Blog ranking

3.5. Postrank

Population

All kinds of online content (Postrank What)

Signal

  • social engagement,
  • writing a blog post in response to someone else,
  • bookmarking an article
  • leaving a comment on a blog
  • clicking a link to read a news item.
  • frequency of an audience’s interaction with online content.
  • analysis of the “5 Cs” of engagement: creating, critiquing, chatting, collecting, and clicking.
    (Postrank What)

Technology

PostRank: scoring system…to rank any kind of online content, such as RSS feed items, blog posts, articles, or news stories. (Postrank)

Output

List of web documents weighted by activity

3.6. Tweet Rank

This product has not launched yet, but TechCrunch mentioned it in connection with the published secret Twitter strategy documents:

“next gen search results page” and a (much-needed) reputation system which internally is being called “Tweet rank.”

4. Social Bookmark Aggregators

4.1. Delicious

Input

Population

All web page bookmarks saved by users

Signal

Popularity (number of users who have saved a web page as a bookmark)
Recency (time of bookmarking)

Technology

social bookmarking service (Delicious About)

Output

List of web documents ordered by relevance, where relevance means how many people have added a given web page to their personal bookmarks.

5. Social News Aggregators

5.1. Digg

Population

All links to stories submitted by users

Signal

  • Number of diggs (positive ratings)
  • Number of buries (negative ratings)
  • Recency

Technology

“Digg promotes what users like best” (Digg About)

Output

List of current news stories ordered by relevance, as reflected in the aggregated ratings of the Digg community.

6. Recommender Systems

6.1. Amazon Recommendations

Population

All books/products in Amazon's catalog

Signal

  • customer’s interests
  • Purchased items
  • User needs
  • Purchase history
  • Personalized data
  • Similarity measure between books

Technology

  • “Recommendation Algorithm”
  • “item-to-item collaborative filtering”
  • “Rather than matching the user to similar customers, item-to-item collaborative filtering matches each of the user’s purchased and rated items to similar items, then combines those similar items into a recommendation list”
    (Linden et al. 2003)
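The quoted description of item-to-item collaborative filtering can be illustrated with a small sketch. It is not Amazon's production code. The cosine similarity over purchase sets, the summing of similarities and the names `item_similarities` and `recommend` are assumptions, and the sketch ignores ratings and the offline precomputation a real system would need.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    """purchases: dict customer -> set of items.

    Returns dict (item_a, item_b) -> cosine similarity of their buyer sets.
    """
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = {}
    for a in buyers:
        for b in buyers:
            if a != b:
                common = len(buyers[a] & buyers[b])
                if common:
                    sims[(a, b)] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims

def recommend(purchases, customer, n=3):
    """Match each owned item to similar items and combine them into one list."""
    sims = item_similarities(purchases)
    owned = purchases[customer]
    scores = defaultdict(float)
    for (a, b), s in sims.items():
        if a in owned and b not in owned:
            scores[b] += s
    # Highest combined similarity first; item name breaks ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]
```

Note that the user is never compared with other customers directly. Only items are compared, which is the distinction the quote draws.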

Output

list of recommended items

6.2. LastFM

Population

All music tracks in the Last.fm database

Signal

  • Recommendations
  • Favorite songs
  • Comments on a song
    (Source)

Technology

  • “Music service that learns what you like”
  • “Audioscrobbler” (predecessor technology) (Source)
  • “Scrobbling” (to scrobble) is the transmission of track titles to the Last.fm database
  • “Tasteometer”: compares one user's music taste with another's (Source: API documentation, http://www.last.fm/api/show?service=258)

Output

List of songs that match the user's own taste.

Wikipedia English

6.3. Apple Genius

Population

All songs in iTunes and in the iTunes Store database

Signal

  • Playlists
  • Song

Technology

“makes playlists from songs in your iTunes library that go great together” (Apple Support)
Collaborative Filtering (Wikipedia)

Output

Automates the creation of playlists

6.4. Pandora

Population

Signal

  • Music analysis
  • hundreds of musical details on every song (About)
  • Music Genome Project (About)
  • 400 distinct musical characteristics analyzed by a trained music analyst

Technology

Output

Music Recommendations

6.5. StumbleUpon

Population

All web documents

Signal

  • User Ratings
  • Peer Ratings
  • Similar Users
  • Page Recommendations
  • Selected Interests

Technology

  • “Personalized Recommendation”
  • “Clustering Engine”
  • “Classification Engine”
  • “Recommendation Engine”
  • “emergent content referral system”
  • “patent-pending Toolbar system automates the collection, distribution and review of web content within an intuitive social framework”
    (Source: SU Technology)

Output

List of personalized recommendations for web pages.

6.6. Netflix

Population

All movies in the Netflix database

Signal

Technology

Output

Movie Recommendations

Netflix on Wikipedia

6.7. YouTube Recommender System (Related Videos)

Population

All videos on YouTube

Signal?

Technology

  • “Ähnliche Videos – Wenn du dir ein Video auf YouTube ansiehst, wird rechts unter dem Video eine Liste mit ähnlichen Videos angezeigt. Die Videos in dieser Liste haben eventuell ein ähnliches Thema wie das aktuelle Video und machen es dir leichter, andere Videos zu demselben oder einem ähnlichen Thema zu finden.” (YouTube Help, German edition)
  • “Related Videos – When you watch a video on the YouTube, a list of related videos will show up to the bottom-right of the video. This list contains many videos which might be related to the video by subject matter, so that you may find it easier to search out other videos based on the same or similar subject.” (YouTube Help: Related Videos)

Output

List of videos that are similar to the video currently being watched

6.8. Google Reader Recommender System

Population

All feeds that can be subscribed to with Google Reader

Signal

Technology

  • “Recommendations”:
  • “Your recommendations list is automatically generated. It takes into account the feeds you’re already subscribed to, as well as information from your Web History, including your location. Aggregated across many users, this information can indicate which feeds are popular among people with similar interests. For instance, if a lot of people subscribe to feeds about both peanut butter and jelly, and you only subscribe to feeds about peanut butter, Reader will recommend that you try some jelly. This process is completely automated and anonymous; your personal information will be protected in accordance with our privacy policy.”
    (Source: Google Reader Help)
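The peanut-butter-and-jelly example in the quote describes a co-subscription pattern, which can be sketched as follows. This illustrates the idea only, not Google Reader's implementation. It ignores Web History and location, and the overlap-weighted scoring and the name `recommend_feeds` are assumptions.

```python
from collections import Counter

def recommend_feeds(subscriptions, user, n=3):
    """subscriptions: dict user -> set of feed names.

    Feeds are scored by how many feeds each co-subscriber shares with `user`.
    """
    mine = subscriptions[user]
    scores = Counter()
    for other, feeds in subscriptions.items():
        if other == user:
            continue
        overlap = len(mine & feeds)
        if overlap:
            for feed in feeds - mine:
                scores[feed] += overlap
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feed for feed, _ in ranked[:n]]
```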

Output

List of recommended feeds

6.9. Facebook Friend Suggestions

Population

All Facebook users

Signal

  • Network you are part of
  • mutual friends
  • work and education information
  • contacts imported
    (Facebook Help Center)

Technology

Friends: Suggestions: Suggestions is a feature that helps you connect with people and Pages you are likely to know. Facebook calculates Suggestions based on the networks you are a part of, mutual friends, work and education information, contacts imported using the Friend Finder, and many other factors. (Facebook Help Center)
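Of the factors named above, mutual friends are the easiest to illustrate. The sketch below ranks non-friends by their number of mutual friends. It covers only this one signal. Networks, work and education information, imported contacts and the "many other factors" are left out, and the name `suggest_friends` is hypothetical.

```python
from collections import Counter

def suggest_friends(friends, user, n=3):
    """friends: dict user -> set of that user's friends (symmetric graph assumed).

    Returns up to n (candidate, mutual_friend_count) pairs.
    """
    direct = friends[user]
    mutual = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            if candidate != user and candidate not in direct:
                mutual[candidate] += 1
    return sorted(mutual.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```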

Output

List of Facebook users one might know.

7. Search Engines

7.1. Google Search

Population

All web documents

Signal

Brin in Founders Letter 2008

  • Personalization
  • PageRank
  • Text-Matching Techniques

Google Keeps Tweaking Its Search Engine – New York Times.

  • about a half-dozen major and minor changes a week to the ranking algorithm
  • search engine has many thousands of interlocking equations
  • Freshness describes how many recently created or changed pages are included in a search result
  • QDF, “query deserves freshness”: a mathematical model that tries to determine when users want new information and when they don’t. The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information.
  • system for ranking pages with 200 types of information
  • PageRank is one of those 200 signals
  • Some are drawn from the history of how pages have changed over time
  • Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years
  • Increasingly, Google is using signals that come from its history of what individual users have searched for in the past
  • Google feeds the “signals” into formulas it calls “classifiers”
  • “topicality”: a measure of how the topic of a page relates to the broad category of the user’s query
  • diversity: a measure intended to ensure different pages among the ten most relevant results

200 signals: Hal Varian on the corporate blog.

Google Blog: Introduction to Google Ranking

Webmaster Guidelines

  • Guidelines on design and content
  • Clear structure and text links
  • Useful, informative website
  • Use the search terms users would type
  • TITLE elements and ALT attributes should be descriptive
  • Few and short URL parameters
  • At most 100 links per page
  • No manipulative practices (e.g., deceiving users by registering deliberately misspelled names of well-known websites)
  • Spam report
  • Make pages primarily for users, not for search engines
  • Do not serve content to search engines that you do not use for your visitors; this is called “cloaking”
  • Avoid tricks intended to improve search engine rankings
  • No link exchange programs
  • No links to web spammers or “bad neighborhoods” on the web
  • No hidden text or hidden links
  • No cloaking or sneaky redirects
  • No automated queries to Google
  • Do not load pages with irrelevant keywords
  • No duplicate pages, subdomains or domains with essentially the same content
  • No “doorway pages” created specifically for search engines

From the book “Webseiten Ranking”

  • 1. Number and quality of inbound links
  • 2. Page title, page description, headings; avoid too many topics per page
  • 3. Domain names, file names, folders, subdomains
  • 4. Frequency, position and emphasis of key terms (keyword density)
  • 5. Avoid typical mistakes (duplicate content across different domains, images instead of text, frames, Flash, scripts, gimmicks, search engine spam, forbidden tricks)

Technology

  • Automated Search Technology
  • PageRank
  • Text-Matching Techniques (Google Inc. Annual Report 2008)
  • Hypertext-Matching Analysis
    “analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user’s query” (Technology Overview)
  • Ranking Algorithm
    “The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it’s not just the language, it’s how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing)”. (Udi Manber, VP of engineering, Google Blog)
    “a collection of algorithms used to find the most relevant documents for a user query” (Amit Singhal, Google Blog)
  • Ranking Algorithm = the formulas that decide which Web pages best answer each user’s question (New York Times)
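PageRank is the one component named here whose basic form is widely documented, so a short sketch may help. This is the textbook power-iteration version on a toy link graph, not Google's production system, in which PageRank is only one of many signals. The damping factor of 0.85, the fixed iteration count and the handling of pages without outgoing links are conventional assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns dict page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank

# Toy graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In the toy graph, c ranks highest because two pages link to it, a comes second because the highly ranked c links to it, and b, which nobody links to, comes last.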

Output

List of web documents ordered by relevance to a search query

7.2. Yahoo Search

Population

Signal

  • Unique Content
  • Pages designed for humans
  • interesting links
  • accurate metadata
  • Good Webdesign
  • No Pages that harm accuracy, diversity or relevance
  • No doorway pages
  • No Double Content
  • No affiliate content
  • No numerous, unnecessary virtual hostnames
  • No Pages in great quantity, automatically generated or of little value (cookie-cutter pages)
  • No Pages using methods to artificially inflate search engine ranking
  • No use of text or links hidden from the user
  • No Pages that give the search engine different content than what the end user sees (cloaking)
  • No Sites cross-linked excessively with other sites to inflate a site’s apparent popularity (link schemes)
  • No Pages built primarily for the search engines or pages with excessive or irrelevant keywords
  • No Misuse or inaccurate use of competitor or brand names
  • No Sites that use excessive pop-ups, install malware (i.e. spyware, viruses, trojans), or interfere with user navigation
  • No Pages that seem deceptive, fraudulent, or provide a poor user experience

Yahoo Search Content Quality Guidelines

Technology

  • crawling, indexing, and ranking algorithms (here, here)
  • ranking model (here, here)
  • “Search Ranking” (Yahoo Help)
  • Yahoo Search Technology (YST) (Yahoo Help)
  • Machine Learning: The Machine Learning group is a team of experts in computer science, statistics, mathematical optimization, and automatic control. We focus on making computers learn abstractions, patterns, conditional probability distributions, and policies from web scale data with the goal to improve the online experience for Yahoo users, partner publishers, and advertisers. Link
  • Search & Web Mining. Search technologies are an international team of experts in search, algorithms, data mining, natural language, and data processing. Together, we build systems and algorithms to analyze user needs, then synthesize and deliver the right responses from data sources around the globe. Link Yahoo Research
  • Microeconomics & Social Systems (Algorithmic Game Theory, Network analysis) Yahoo Research.

Five Years of Yahoo Search
Yahoo Search on Wikipedia

Output

List of web documents ordered by relevance to a given search term

7.3. Bing Search

Population

All web documents

Signal

  • web page content,
  • the number and quality of websites that link to your pages,
  • relevance of your website’s content to keywords.
  • Use unique tags on each page
  • Use unique description tags on each page
  • Use H1 tags
  • Use text navigation links
  • Create content for your human visitors, not the Bing web crawler
  • Incorporate keywords into URL strings

Technology

  • “Decision Engine, designed to help people cut through Internet clutter to make smarter, more informed decisions” (Bing Press Kit)
  • “Decision Engine” (Press Release)
  • “Best match, sets up the best result in a distinct, easily accessible space at the top of the page” (Bing Blog)
  • Complex Tasks and Decision Making: “Perhaps the most interesting insight is that people are turning to search engines not only for information, but to help them complete complex tasks and make decisions. They no longer want to just find a web page; they want to learn, shop, be entertained, accomplish tasks and make important decisions. This requires better conceptual organization, a unified experience, deeper task specific content selection and support for longer sessions.” (Bing Blog: User Needs, Features and the Science behind Bing)
  • “algorithmic Web results”
  • “intent-specific ranking”
  • “technologies and algorithms for extracting structure from unstructured data and applying organizational taxonomies. Even though the organization is not nearly as intuitive as one done by a human editorial process, we are able to achieve it in a fully automated way allowing for fast scaling and reacting to changes.”
    “real time indexing and ranking”
    “RankNet (our ranking system)”
    “technologies in HTML parsing, core Natural Language Processing, entity extraction, and document classification.”
    (Bing Blog: User Needs, Features and the Science behind Bing)
  • Bing website ranking
    Bing website ranking is completely automated. The Bing ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. Site ranks change as we review the factors that make up the ranking. Although you can’t directly change your website’s ranking, you can optimize its design and technical implementation to enable appropriate ranking by most search engines. For information about improving your website’s ranking, we suggest you check out the Webmaster Center tools. We highly recommend that you use Webmaster Center’s Crawl Issues tool to determine if your site is being penalized for issues such as malware. Also review our post on what to do if you are not ranking. (Crawling and Ranking FAQs)

Output

Web documents ordered by relevance to a search query

7.4. Ask Search

Population

All web documents

Signal

  • ExpertRank
  • Clustering

Technology

  • ExpertRank algorithm
    “Our ExpertRank algorithm provides relevant search results by identifying the most authoritative sites on the Web. With Ask search technology, it’s not just about who’s biggest: it’s about who’s best. Our ExpertRank algorithm goes beyond mere link popularity (which ranks pages based on the sheer volume of links pointing to a particular page) to determine popularity among pages considered to be experts on the topic of your search. This is known as subject-specific popularity. Identifying topics (also known as “clusters”), the experts on those topics, and the popularity of millions of pages amongst those experts — at the exact moment your search query is conducted — requires many additional calculations that other search engines do not perform. The result is world-class relevance that often offers a unique editorial flavor compared to other search engines.” (Ask Search Technology)
  • Teoma algorithm, now known as ExpertRank (Webmaster Help)
  • click popularity search technology (Webmaster Help)
  • “clustering concept of subject-specific popularity”, “subject-specific popularity. Identifying topics (also known as “clusters”)” (Webmaster Help)

Output

List of web documents ordered by relevance

7.5. Baidu Search

7.6. DeepDyve Search

Population

Expert Sources. Our expert sources include the world’s most trusted publishers, scholarly societies, universities, and government agencies. In addition, DeepDyve monitors over 30,000 thousand trade, industry, and specialist sources to create specialized collections of current news and information in life sciences, clean energy, and technology. (DeepDyve: Expert Sources)

Signal

  • Patterns & Symbols
  • Expert Sources

Technology

  • research engine (About)
  • KeyPhrase algorithm
    “KeyPhrase algorithm, applies indexing techniques from the field of genomics. The algorithm matches patterns and symbols on a scale that traditional search engines cannot match, and it is perfectly suited for complex data found on the Deep Web.” (About)
  • KeyPhrase technology
    “KeyPhrase technology extracts substantially more information from documents than typical keywords. It indexes every word, as well as every phrase in each document, and weighs their informational impact using advanced statistical computation” (Deep Dyve Research Engine)
  • dynamic grouping technology.
    “dynamic grouping technology. Dynamic grouping allows you to quickly skim the entire universe of related topics and understand their relationships with each other.” (Deep Dyve Research Engine)
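
The KeyPhrase description (index every word and every phrase, then weigh their informational impact statistically) can be sketched roughly as follows. DeepDyve does not say which statistic it uses, so TF-IDF is swapped in here purely as a familiar stand-in; the phrase length limit, the function names, and the sample documents are assumptions as well.

```python
import math
from collections import Counter

def phrases(text, max_len=3):
    """Return every word and every phrase of up to max_len consecutive words."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]

def weigh(documents, max_len=3):
    """Weigh each phrase per document with TF-IDF (an assumed stand-in statistic)."""
    counts = [Counter(phrases(doc, max_len)) for doc in documents]
    # Number of documents in which each phrase occurs at least once.
    doc_freq = Counter(p for c in counts for p in c)
    n_docs = len(documents)
    return [
        {p: tf * math.log(n_docs / doc_freq[p]) for p, tf in c.items()}
        for c in counts
    ]

docs = ["gene expression in yeast", "gene therapy trials"]
weights = weigh(docs)
```

In this toy corpus the word "gene" occurs in both documents and gets weight zero, while the phrase "gene expression" occurs in only one and gets a positive weight.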

Output

Research search results ordered by relevance

7.7. Twitter Search

Population

All tweets

Signal

Real-time information

Technology

Twitter Search helps you filter all the real-time information coursing through our service. (About)

Output

Real-time trending topics

8. Citation Indexing

8.1. Google Scholar

Population

Scholarly literature: peer-reviewed papers, master’s, diploma, and doctoral theses, books, abstracts, and articles from sources such as academic publishers, professional societies, preprint repositories, universities, and other educational institutions. (About DE)

Signal

  • Full text of an article
  • Author
  • Publication in which the article appeared
  • Number of citations in scholarly literature
    (About DE)

Technology

  • “Ranking technology”
    “Google Scholar orders your search results by relevance. As with Google web search, the most useful references appear at the top of the page. Google’s ranking technology takes into account the full text of an article, the author, where the article was published, and how often it has been cited in scholarly literature.” (About DE)
  • “weighing the full text of each article, the author, the publication in which the article appears, and how often the piece has been cited in other scholarly literature.” (About EN, http://scholar.google.com/intl/en/scholar/about.html)

Output

List of academic publications sorted by relevance

8.2. CiteSeerX

Population

Scientific Literature

Signal

  • Citation statistics
  • Reference linking
  • Citation context
  • Awareness and tracking
  • Full-text indexing
  • Query-sensitive summaries
    (About)
  • “Autonomous Citation Indexing (ACI)”: ACI autonomously creates citation indices similar to the Science Citation Index®. (Lawrence et al. 1999)

Technology

  • “algorithms, data, metadata, services, techniques, and software that can be used to promote other digital libraries.” (About)
  • “Autonomous Citation Indexing (ACI)”: ACI autonomously creates citation indices similar to the Science Citation Index®. (Lawrence et al. 1999)
  • “automated citation indexing and citation linking using the method of autonomous citation indexing”. (About)
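
What a citation index is can be shown with a small sketch. It skips the hard part of ACI, which is extracting and matching citations from full text automatically, and assumes the reference lists are already available as clean identifiers. The paper names and both function names are hypothetical.

```python
from collections import defaultdict

def build_citation_index(reference_lists):
    """Invert 'paper -> papers it cites' into 'paper -> papers citing it'."""
    cited_by = defaultdict(set)
    for citing, cited_papers in reference_lists.items():
        for cited in cited_papers:
            cited_by[cited].add(citing)
    return cited_by

def rank_by_citations(cited_by):
    """Order papers by number of distinct citing papers, most cited first."""
    return sorted(cited_by, key=lambda paper: (-len(cited_by[paper]), paper))

# Hypothetical reference lists, assumed to be already extracted and matched.
reference_lists = {
    "paper_a": ["paper_c", "paper_d"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_e": ["paper_c"],
}
index = build_citation_index(reference_lists)
ranking = rank_by_citations(index)
```

The inverted index supports both reference linking (who cites this paper) and sorting by citation count.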

Output

List of scientific articles ordered by relevance (recency or number of citations)

9. Comparison Systems

9.1. Bing Airtravel Airfare Price Prediction

http://www.bing.com/travel/

Population

All airfare offers

Signal

  • Prices
  • Time until departure
  • Historical data on the departure
    (Ayres 2007)
  • Historical data
    (Bing Travel: Technology and Data)

Technology

  • “smart travel search site”
  • “Price Predictor”, “Our Price Predictor shows if fares are rising or dropping. Based on the prediction, we provide a recommendation to buy now or buy later.”
  • “hotel Rate Indicator”, “Our Rate Indicator indicates whether or not today’s rate for a specific hotel is a deal. It compares an individual hotel’s current rate found to its observed historical rates.”
    (About)
  • “predictive technology”
  • “predictive, statistical and data management algorithms”
  • “Data Aggregation and Analysis”
  • “statistics, data mining and machine learning”
  • “algorithms that can identify patterns and conditions from our history of accumulated airfare data”
  • “predictive models”

(Bing Travel: Technology and Data)
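
The two indicators quoted above can be approximated with a deliberately naive sketch. Bing Travel describes predictive models trained on accumulated airfare data; none of that is reproduced here. The median threshold, the half-versus-half trend comparison, the function names, and the sample numbers are all assumptions chosen only to make the input and output of such a service concrete.

```python
from statistics import mean, median

def is_deal(current_rate, historical_rates):
    """Rate Indicator sketch: a deal if today's rate is below the historical median."""
    return current_rate < median(historical_rates)

def fare_trend(fares):
    """Naive trend: compare the average of the later half with the earlier half.

    fares -- chronological list of observed fares (at least two values)
    """
    half = len(fares) // 2
    return "rising" if mean(fares[half:]) > mean(fares[:half]) else "dropping"

def recommend(fares):
    """Price Predictor sketch: buy now if fares are rising, otherwise buy later."""
    return "buy now" if fare_trend(fares) == "rising" else "buy later"

hotel_is_deal = is_deal(120, [150, 160, 140, 155])
advice_rising = recommend([200, 210, 230, 260])
advice_dropping = recommend([300, 280, 250, 240])
```

A real predictor would estimate future fares from many routes and seasons; the sketch only extrapolates the observed direction.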

Output

Purchase recommendation for airline tickets

9.2. Zillow.com Real Estate Price Prediction

Population

All real estate in the USA

Signal

  • “make zillions of data points for homes accessible to everyone” (About)
  • publicly available information such as comparables and tax information (Zillow: The Big Idea)

Technology

“Zestimate”, “Zindex”: “Zillow calculates a Zestimate® home valuation as a starting point for anyone to see — for free — for most homes in the U.S. So, using the Zestimate as the foundation, we built a Web page for each home and then filled it with data and maps and layered it with publicly available information such as comparables and tax information.” (Zillow: The Big Idea)

Output

Estimates of real estate values

9.3. Google Product Search

Population

All product offers: the sources are sellers who submit their product web pages, plus all product web pages crawled by Google

Signal

?

Technology

  • “ranking software”
  • “Google’s product search results are automatically generated by our ranking software. Google does not accept payment for inclusion of products in our search results, nor do we place sellers’ sites higher in our results if they’re advertisers or offer to pay for that placement.”
  • “search engine”
    (About)

Output

“Our job is to find the product you want and point you to the store that sells it based on our assessment of what’s most relevant to your search” (About)

10. Semantic Web Apps

10.1. Powerset Suche

Now Bing.com

Population

Wikipedia articles

Signal

Factz (Meanings of Sentences)

Technology

  • “natural language processing”
  • “Powerset is first applying its natural language processing to search, aiming to improve the way we find information by unlocking the meaning encoded in ordinary human language.”
  • “enabling computers to understand our language”


“Factz”
“Factz are concise representations of information extracted from sentences. They are represented in three parts: the subject, relation and object (e.g. Oswald shot JFK). Factz are one way that Powerset represents the meaning of a sentence.” (FAQ)
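
The three-part representation described in the FAQ (subject, relation, object) maps naturally onto a small data structure. The sketch below only covers storing and querying such triples; extracting them from sentences is the natural language processing part and is not attempted. The class name, the helper function, and the second example fact are illustrative additions, not Powerset's format.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """One fact in three parts, as in the FAQ example 'Oswald shot JFK'."""
    subject: str
    relation: str
    obj: str

def match(facts, subject=None, relation=None, obj=None):
    """Return the facts that agree with every part that was specified."""
    return [
        f for f in facts
        if (subject is None or f.subject == subject)
        and (relation is None or f.relation == relation)
        and (obj is None or f.obj == obj)
    ]

facts = [
    Fact("Oswald", "shot", "JFK"),
    Fact("Ruby", "shot", "Oswald"),  # illustrative second fact
]

# "Who shot JFK?" becomes a query with the subject left open.
who_shot_jfk = [f.subject for f in match(facts, relation="shot", obj="JFK")]
```

Leaving one part open turns a question into a pattern over the stored facts, which is roughly how a triple representation supports answering questions about meaning rather than keywords.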

Output

Web documents ordered by relevance

11. Location-based Systems

11.1. Citysense

Population

Night Life Hotspots in San Francisco

Signal

  • “billion points of GPS and WiFi positioning data – plus real-time feeds” (CitySense More info)
  • History and preferences (in the next version)
  • Mobile location data

Technology

  • Sense Networks Macrosense platform
    Sense Networks Macrosense platform, which analyzes massive amounts of aggregate, anonymous location data in real-time. Macrosense is already being used by business people for things like selecting store locations and understanding retail demand.
  • “MacroSense”
    “Relevant Recommendation”
    “Personalization”
    “Discovery from Mobile Location Data”
    “Real-Time Activity Analysis”
    (Sensenetworks)
  • “Location Tracking” (Techcrunch)

Output

Map of the real-time hotspots in a city

12. Other

12.1. Flickr Interestingness

Population

All images on Flickr

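Signal

  • Where clickthroughs come from
  • Who comments on a photo and when
  • Who marks it as a favorite
  • Tags
    (About Interestingness)

Technology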

  • “Interestingness”
    “There are lots of elements that make something ‘interesting’ (or not) on Flickr. Where the clickthroughs are coming from; who comments on it and when; who marks it as a favorite; its tags and many more things which are constantly changing. Interestingness changes over time, as more and more fantastic content and stories are added to Flickr.” (About Interestingness)
  • “interestingness is a ranking algorithm based on user behavior around the photos” (Flickr Blog: Ten new things)
  • “Geotagging” (Flickr Help Map)

Output

Overview page with the most interesting images (Last 7 days)

12.2. Google Trends

Population

All search queries on Google

Signal

  • Search Volume
  • News-Stories

Technology

  • “Enter up to five topics and see how often they’ve been searched on Google over time. Google Trends also shows how frequently your topics have appeared in Google News stories, and in which geographic regions people have searched for them most.”
  • “Search Volume Index”
    “News reference volume”, “the number of times your topic appeared in Google News stories”
    (About Google Trends).
  • “hot trend”: “The top 100 fastest-rising search queries right now (U.S. only). Updates throughout the day.” (Hot Trends)
  • Google Zeitgeist: “Zeitgeist” means “the spirit of the times”, and Google reveals this spirit through the aggregation of millions of search queries we receive every day. We have several tools that give insight into global, regional, past and present search trends. These tools are available for you to play with, explore, and learn from. Use them for everything from business research to trivia answers. (Google Zeitgeist)

Output

Comparison of sites/search terms by relevance, measured in search volume

12.3. Bing xRank

Population

notable people

Signal

Web Searches

Technology

xRank keeps track of notable people and puts them in order for you. We count Bing web searches for movie stars, musicians, and other famous people. Then, we compile our findings into an insightful ranking formula that tells you who the world is searching for most. The result is a cultural snapshot of who’s hot and who’s not! (Bing xRank, What is xRank?)

Output

Artist ranking “a cultural snapshot of who’s hot and who’s not”

12.4. Hunch.com

Population

Questions and decisions

Signal

  • Questions
  • answers
  • personal answering history
  • what you’ve already been asked
  • how you’ve answered
    (Hunch.com: How Hunch works)

Technology

  • “The Hunch algorithm”
  • “question selection algorithm”
  • “machine learning”
  • question selection algorithm: “In choosing what to ask you, Hunch’s question selection algorithm tries to do two things. First, it tries to find a question which will discriminate well among the remaining possible decision outcomes for you – thus filtering the remaining choices from “many” to “fewer”. Second, the algorithm looks for a question which can help optimize and rank the remaining decision results to present you with the ones you’ll like the most. It’s trying to ensure that you’ll like outcome #1 better than outcome #5.”
  • “machine learning based on statistical inferences” “The academic name for this sort of algorithm is machine learning”
  • “decision making algorithm”
    (Hunch.com: How Hunch works)
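
The first goal of the question selection algorithm, finding a question that discriminates well among the remaining outcomes, can be sketched with a standard heuristic: ask the question whose answers split the remaining outcomes most evenly, measured by entropy. Hunch does not state that it uses this measure, so it is an assumption, as are the sample outcomes, the attribute names, and the function names. The second goal (ranking results by how much you will like them) is not covered.

```python
import math
from collections import Counter

def split_entropy(question, outcomes):
    """Entropy in bits of the answer distribution over the remaining outcomes.

    outcomes -- dict mapping outcome name to {question: expected answer}
    """
    counts = Counter(attrs[question] for attrs in outcomes.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_question(questions, outcomes):
    """Pick the question whose answers split the outcomes most evenly."""
    return max(questions, key=lambda q: split_entropy(q, outcomes))

# Hypothetical decision: four remaining outcomes, two candidate questions.
outcomes = {
    "laptop_a": {"budget": "low", "portable": "yes"},
    "laptop_b": {"budget": "low", "portable": "no"},
    "laptop_c": {"budget": "low", "portable": "yes"},
    "laptop_d": {"budget": "high", "portable": "no"},
}
next_question = select_question(["budget", "portable"], outcomes)
```

Here "portable" splits the four outcomes two against two, while "budget" splits them three against one, so asking about portability narrows the choices from "many" to "fewer" more reliably.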

Output

Answers

