This page is a work in progress. For every service that uses data-mining techniques to produce a media product, I try to categorize four aspects (population, signals, technology, and output) in short case studies:
1) Population: Which base set of elements does the service put into a ranked order?
2) Signals: Which data are drawn on and processed as signals for the ranking?
3) Technology: What does the provider officially call the technology, and how does it work?
4) Output: By which criteria is the service’s output sorted, and what is the official terminology?
Contents
1. News Aggregators
1.1. Google News
1.2. Techmeme
1.3. Hashtags.org
1.4. Bing News Aggregator
1.5. Tweetmeme
2. Media Planning and Ad Space Marketing
2.1. Google AdWords
2.2. Yahoo Search Engine Marketing / Yahoo Sponsored Links
2.3. Google AdSense
2.4. Google AdPlanner
2.5. Double Click Media Visor
2.6. Microsoft Search Advertising
2.7. Ask Sponsored Links
2.8. Adelio
2.9. Facebook
3. Reputation Systems
3.1. Ebay Feedback
3.2. Everything2
3.3. Slashdot
3.4. Technorati
3.5. Postrank
3.6. Tweet Rank
4. Social Bookmark Aggregators
4.1. Delicious
5. Social News Aggregators
5.1. Digg
6. Recommender Systems
6.1. Amazon Recommendations
6.2. LastFM
6.3. Apple Genius
6.4. Pandora
6.5. Stumble Upon
6.6. Netflix
6.7. Youtube Recommender System (Related Videos)
6.8. Google Reader Recommender System
6.9. Facebook Friend Suggestions
7. Search Engines
7.1. Google Search
7.2. Yahoo Search
7.3. Bing Search
7.4. Ask Search
7.5. Baidu Search
7.6. DeepDyve Search
7.7. Twitter Search
8. Citation Indexing
8.1. Google Scholar
8.2. CiteSeerX
9. Comparison Systems
9.1. Bing Airtravel Flight Ticket Price Prediction
9.2. Zillow.com Real Estate Price Prediction
9.3. Google Product Search
10. Semantic Web Apps
10.1. Powerset Search
11. Location-Based Systems
11.1. Citysense
12. Other
12.1. Flickr Interestingness
12.2. Google Trends
12.3. Bing xRank
12.4. Hunch.com
1.1. Google News
Population
- News stories from the 25,000 most important online news outlets worldwide. (Source)
- German edition: more than 700 German-language news sources (Source)
Signals
- Frequency with which a news story is picked up
- Outlet that picks up a story
- freshness
- location
- relevance
- diversity
- personalized interests (Quelle)
- Web history for personalization
- News most frequently clicked by the user (for personalization)
- News most frequently clicked by others (“most popular” section)
- Language
- Region
- Topic (Source)
Technology
- Computer-generated Newssite that aggregates headlines from news sources worldwide, ranked by computers that evaluate. (Quelle)
- Computer-generated news website (Source)
- Grouping technology, clustering algorithms (Source)
- “Auswahl und Anordnung der Artikel auf dieser Seite werden durch ein Computerprogramm automatisch bestimmt.” (disclaimer on the German home page)
- “The selection and placement of stories on this page were determined automatically by a computer program” (disclaimer on the home page)
Output
- Sorted by section
- Relevance measure (computed from the signals)
- Grouping of stories
1.2. Techmeme
Population
- “the must-read stories in technology …. across hundreds of news sites” (Quelle)
- Online Tech-News, englischsprachig
Signal
- importance
- number of links to the story’s web page
- freshness.
- “Anti-gaming”: a high number of links in a short period of time, or by a small number of people (Wikipedia)
Technology
- “computer algorithm extended with direct human editorial input” (Techmeme, http://techmeme.com/)
- “Techmeme arranges all of these links into a single, easy-to-scan page. Story selection is accomplished via computer algorithm extended with direct human editorial input.” (Techmeme)
- Technology news aggregator. (Wikipedia)
- “The full set of sites it monitors is constructed automatically, and even changes in real time based on linking. A small “seeding” list I construct manually is used to help the system build the complete list.” (Wired)
Output
- Relevance of the story (computed by the algorithm)
- Within a story, by relevance of the discussion sources
1.3. Hashtags.org
Population
All hashtags on Twitter
Signal
- Frequency of the hashtags over the last six hours
Technology
- Official name of the technology: hashtag aggregator
Output
List of hashtags by frequency
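The frequency signal described above can be illustrated with a small sketch. It is not Hashtags.org's actual implementation; the input format (timestamp and text pairs), the function name `rank_hashtags`, and the sample tweets are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def rank_hashtags(tweets, now, window=timedelta(hours=6)):
    """Return (hashtag, count) pairs, most frequent first.

    `tweets` is an iterable of (timestamp, text) pairs; this input
    format is an assumption made for the sketch.
    """
    counts = Counter()
    for timestamp, text in tweets:
        # Only tweets inside the six-hour window contribute.
        if now - window <= timestamp <= now:
            for word in text.split():
                if word.startswith("#") and len(word) > 1:
                    counts[word.lower()] += 1
    return counts.most_common()

# Hypothetical sample data.
now = datetime(2009, 8, 1, 12, 0)
tweets = [
    (now - timedelta(hours=1), "Reading about #datamining and #ranking"),
    (now - timedelta(hours=2), "More on #DataMining today"),
    (now - timedelta(hours=7), "Old tweet about #ranking"),
]
ranking = rank_hashtags(tweets, now)
```

In this toy data the seven-hour-old tweet falls outside the window, so only the two recent tweets are counted.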
1.4. Bing News Aggregator
1.5. Tweetmeme
Population:
links on twitter
Signals
- Frequency of the links
- Time
- Topic
Technology
- “service which aggregates all the popular links on twitter”
- “categorize these links into categories and subcategories”
Output
The most popular links on Twitter, ordered by category
2. Media Planning and Ad Space Marketing
2.1. Google AdWords
Population
All Google AdWords ads
Signal
- Language
- Location
- Landing Page Quality Score (relevance, originality, transparency, navigability; Source)
- Keyword Quality Score (historical clickthrough rate, account history, historical clickthrough rate of the display URL, Landing Page Quality Score, relevance of the keyword to the ads, relevance of the keyword and the matched ad to the search query, account’s performance in the geographical region where the ad will be shown, other relevance factors; Source)
- Bidding Price
- “The bids themselves are only a part of what ultimately determines the auction winners. The other major determinant is something called the quality score. This metric strives to ensure that the ads Google shows on its results page are true, high-caliber matches for what users are querying. If they aren’t, the whole system suffers and Google makes less money. Google determines quality scores by calculating multiple factors, including the relevance of the ad to the specific keyword or keywords, the quality of the landing page the ad is linked to, and, above all, the percentage of times users actually click on a given ad when it appears on a results page. (Other factors, Google won’t even discuss.) There’s also a penalty invoked when the ad quality is too low—in such cases, the company slaps a minimum bid on the advertiser.” (Wired)
- “To determine whether an ad is relevant to a particular query, this system weighs an advertiser’s willingness to pay for prominence in the ad listings (cost-per-click or cost-per-impression bid) and interest from users in the ad as measured by the click through rate and other factors….Also assign minimum bids to advertisers keywords based on the Quality scores of those keywords. Quality score is determined by an advertiser’s keyword click-through rate, the relevance of the ad text, historical keyword performance, the quality of the ad’s landing page” (Source: Google Inc. Annual Report 2008)
Technology
- “pay-per-click (PPC) advertising, and site-targeted advertising” (Wikipedia)
- “AdWords, Google’s unique method for selling online advertising. AdWords analyzes every Google search to determine which advertisers get each of up to 11 “sponsored links” on every results page. It’s the world’s biggest, fastest auction” (Wired)
- “It was called a two-sided matching market. “The mathematical structure of the Google auction,” Varian says, “is the same as those two-sided matching markets.” (Herman Leonard)” (Wired)
- The Keyword Pricing Index (Wired)
- AdWords Ranking: “Ads are ranked for display in AdWords based on a combination of the maximum cost-per-click pricing set by the advertisers and click through rates and other factors” (Source: Google Inc. Annual Report 2008)
- AdWords Discounter: “automatically lowers the amount advertisers actually pay to the minimum needed to maintain their ad position” (Source: Google Inc. Annual Report 2008)
- Site Targeting: Based on third-party opt-in panel data. (Source: Google Inc. Annual Report 2008)
- Google AdWord Auction System: Auction Based System, Automated execution of an auction. (Source: Google Inc. Annual Report 2008)
Output
Ads matching the keyword of the search query, sorted by relevance and price paid.
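The interplay of bid, quality score, and the AdWords Discounter quoted above can be sketched as a simplified textbook second-price-style auction. This is a sketch under assumptions, not Google's actual formula: the rank score is assumed to be bid times quality score, the "other factors" and minimum bids are ignored, quality scores are assumed to be positive, and the function name `run_auction` and the sample ads are hypothetical.

```python
def run_auction(ads, slots):
    """Rank ads by bid * quality score and price each slot.

    `ads` is a list of (name, max_cpc_bid, quality_score) tuples.
    Returns a list of (name, price_per_click) for the winning ads.
    Assumption: rank score = bid * quality (a simplification).
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    results = []
    for position, (name, bid, quality) in enumerate(ranked[:slots]):
        if position + 1 < len(ranked):
            _, next_bid, next_quality = ranked[position + 1]
            # Discounter idea: the smallest bid that still matches
            # the rank score of the ad directly below.
            price = next_bid * next_quality / quality
        else:
            # No competitor below; a real system would apply a minimum bid.
            price = 0.0
        results.append((name, round(min(price, bid), 2)))
    return results

# Hypothetical ads: (name, maximum cost-per-click bid, quality score).
ads = [("A", 2.00, 0.5), ("B", 1.00, 0.8), ("C", 1.50, 0.4)]
winners = run_auction(ads, slots=2)
```

In this example ad A wins the top slot with a rank score of 1.0 but pays only 1.60 per click instead of its 2.00 maximum, because 1.60 is enough to stay ahead of ad B.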
2.2. Yahoo Search Engine Marketing / Yahoo Sponsored Links
Population
All ads of the advertisers
Signal
- Location
- Language
- Auction bids
- Quality index (relevance of the ad compared with competitors, click-through rate) (Source 1, Source 2)
- Landing page quality
- Algorithmic page placement, algorithmic relevance (Source)
Technology
- Official name of the technology: Yahoo! Sponsored Search
- paid placement, contextual advertising (Wikipedia Search Engine Marketing)
- Computational Advertising
- Computational advertising is a new scientific sub-discipline, at the intersection of information retrieval, machine learning, optimization, and microeconomics. Its central challenge is to find the best ad to present to a user engaged in a given context, such as querying a search engine (“sponsored search”), reading a web page (“content match”), watching a movie, and IM-ing. Yahoo Research.
Output
Ads matching the keyword of the search query, sorted by quality and price paid.
2.3. Google AdSense
Population
All ads booked through AdSense
Signal
- keyword analysis
- word frequency
- overall link structure of the web (Google Inc. Annual Report 2008)
- Language
- Keyword analysis
- Word frequency
- Font size
- Link structure of the web (AdSense Help)
- Placement targeting: with placement targeting, advertisers choose specific ad placements or subsections of publisher websites. (AdSense Help)
Technology
- Content Targeting: Contextual advertising option (Google Inc. Annual Report 2008)
- AdSense for Content: “automated technology to analyzes the meaning of the content on the web site and serve relevant ads based on the meaning of the content” Google Inc. Annual Report 2008
- AdSense Contextual Advertising Technology:”…techniques that consider factors such as keyword analysis, word frequency and the overall link structure of the web to analyze the content of individual web pages to match ads to them almost instantaneously….automatically serve contextually relevant ads….we employ similar techniques for matching advertisements to other forms of textual content such as email messages and Google Groups postings” Google Inc. Annual Report 2008
- Content targeting: This technology uses factors such as language, keyword analysis, word frequency, font size, and the overall link structure of the web to determine the topic of a web page and to find precisely matching Google ads for each page. (AdSense Help: “What is AdSense for search results pages?”)
Output
List with a preset number of ads that are relevant to the content of a page, sorted by relevance and price paid.
2.4. Google AdPlanner
Population
All websites that have a certain traffic volume, do not block Google via robots.txt, and comply with the quality guidelines (Ad Planner Help)
Signal
- aggregated Google search data
- opt-in anonymous Google Analytics data
- opt-in external consumer panel data
- other third-party market research (Ad Planner Help)
Technology
- Google Ad Planner: “research and media planning tool that allows agencies and advertisers to identify the web sites their target customers are likely to visit. MediaVisor improves media buying by replacing formerly manual tasks in the media buying process.” (Google Inc. Annual Report 2008)
- “computer algorithms”: Google Ad Planner combines information from a variety of sources, such as aggregated Google search data, opt-in anonymous Google Analytics data, opt-in external consumer panel data, and other third-party market research. The data is aggregated over millions of users and powered by computer algorithms; it doesn’t contain personally-identifiable information. (Ad Planner Help)
- automated analysis of millions of search queries and site visits (Ad Planner Help)
- algorithms that improve the demographic estimates (Ad Planner Help)
Output
List of websites, sorted by relevance for a target audience
2.5. Double Click Media Visor
Google Ad Planner and MediaVisor
“research and media planning tool that allows agencies and advertisers to identify the web sites their target customers are likely to visit. MediaVisor improves media buying by replacing formerly manual tasks in the media buying process.” (Google Inc. Annual Report 2008)
2.6. Microsoft Search Advertising
Population
All Search Ads
Signal
- Bids
- Relevance
Technology
Output
Search ads ordered by relevance to the search query
2.7. Ask Sponsored Links
http://sponsoredlistings.ask.com/
2.8. Adelio
No technical information available
2.9. Facebook
No technical information available
3. Reputation Systems
3.1. Ebay Feedback
Population
All eBay users
Signal
- User feedback
- Ratings
Technology
Feedback system (Wikipedia)
Output
A rating of users’ trustworthiness as a percentage (Wikipedia)
3.2. Everything2
Population
All user-generated articles on http://everything2.com
Signal
- Votes
- C (Everything2)
Technology
Everything2 Voting/Experience System (Everything2)
Output
Article ratings
3.3. Slashdot
Population
All commenters on Slashdot
Signal
- Ratings by users
- “+1 point” or “-1 point”
- predefined labels, such as Flamebait or Informative (Wikipedia)
Technology
Karma
Output
List of comments, filtered by the reputation of the comment and the reputation of the comment’s author
3.4. Technorati
Population
All weblogs and weblog posts that meet the quality guidelines. No forums, social networks, or aggregation sites. (Technorati)
Signal
Number of Unique Blogs that Link to a website (Technorati)
Technology
Technorati Authority: Technorati Authority is the number of blogs linking to a website in the last six months. The higher the number, the more Technorati Authority the blog has.(Technorati Blog, Technorati Support)
Output
Search over blog posts, weighted by the relevance of the respective blogs or sorted by recency
Blog ranking
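The definition of Technorati Authority quoted above (number of blogs linking to a website in the last six months) amounts to counting distinct linking blogs within a time window. The following is a minimal sketch, not Technorati's code: the link-record format, the function name `authority`, and the approximation of six months as 182 days are assumptions.

```python
from datetime import date, timedelta

def authority(links, target, today, window_days=182):
    """Count distinct blogs that linked to `target` within the window.

    `links` is an iterable of (source_blog, target_site, link_date)
    tuples. Assumption: six months is approximated as 182 days.
    """
    cutoff = today - timedelta(days=window_days)
    # A set ensures each linking blog is counted only once.
    return len({src for src, dst, day in links
                if dst == target and day >= cutoff})

# Hypothetical link records.
today = date(2009, 8, 1)
links = [
    ("blog-a", "example.org", date(2009, 7, 1)),
    ("blog-a", "example.org", date(2009, 7, 15)),  # same blog, counted once
    ("blog-b", "example.org", date(2009, 3, 10)),
    ("blog-c", "example.org", date(2008, 12, 1)),  # older than six months
    ("blog-b", "other.org", date(2009, 7, 20)),
]
score = authority(links, "example.org", today)
```

Counting unique blogs rather than raw links means a single blog linking repeatedly does not inflate the score.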
3.5. Postrank
Population
All kinds of online content (Postrank What)
Signal
- social engagement,
- writing a blog post in response to someone else,
- bookmarking an article
- leaving a comment on a blog
- clicking a link to read a news item.
- frequency of an audience’s interaction with online content.
- analysis of the “5 Cs” of engagement: creating, critiquing, chatting, collecting, and clicking.
(Postrank What)
Technology
PostRank: scoring system…to rank any kind of online content, such as RSS feed items, blog posts, articles, or news stories. (Postrank)
Output
List of web documents weighted by activity
3.6. Tweet Rank
This product has not launched yet, but TechCrunch mentioned it in connection with the published secret Twitter strategy documents:
“next gen search results page” and a (much-needed) reputation system which internally is being called “Tweet rank.”
4. Social Bookmark Aggregators
4.1. Delicious
Input
Population
All website bookmarks saved by users
Signal
Popularity (number of users who have saved a web page as a bookmark)
Recency (time of bookmarking)
Technology
social bookmarking service (Delicious About)
Output
List of web documents ordered by relevance, where relevance means how many people have added a given web page to their personal bookmarks.
5. Social News Aggregators
5.1. Digg
Population
All links to stories submitted by users
Signal
- Number of diggs (positive ratings)
- Number of buries (negative ratings)
- Recency
Technology
“Digg promotes what users like best” (Digg About)
Output
List of current news stories ordered by relevance, as determined by the aggregated ratings of the Digg community.
6. Recommender Systems
6.1. Amazon Recommendations
Population
All books/products in Amazon’s catalog
Signal
- customer’s interests
- Purchased items
- User needs
- Purchase history
- Personalized data
- Similarity measure between books
Technology
- “Recommendation Algorithm”
- “item-to-item collaborative filtering”
- “Rather than matching the user to similar customers, item-to-item collaborative filtering matches each of the user’s purchased and rated items to similar items, then combines those similar items into a recommendation list”
(Linden et al. 2003)
Output
list of recommended items
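The item-to-item idea quoted above (match each of the user's items to similar items, then combine those into a recommendation list) can be sketched as follows. This is a toy illustration, not Amazon's production algorithm: it uses purchases only (no ratings), cosine similarity over sets of buyers as the similarity measure, and hypothetical names (`item_similarities`, `recommend`) and sample data.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    """Cosine similarity between items, based on shared buyers."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a != b:
                common = len(buyers[a] & buyers[b])
                if common:
                    sims[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims

def recommend(purchases, customer, top_n=3):
    """Combine items similar to the customer's purchases into one list."""
    sims = item_similarities(purchases)
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, sim in sims[item].items():
            if other not in owned:
                scores[other] += sim
    # Highest combined similarity first; ties broken by item name.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical purchase histories.
purchases = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "cat": {"book_a", "book_c"},
    "dan": {"book_a"},
    "eve": {"book_c", "book_d"},
}
recommendations = recommend(purchases, "dan")
```

Note that the similarity table depends only on the catalog and purchase data, not on the individual user, which is why it could in principle be computed ahead of time.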
6.2. LastFM
Population
All music tracks in the Last.fm database
Signal
- Recommendations
- Favorite songs
- Comments on a song
(Source)
- “musical neighbors” (Source: Wikipedia)
Technology
- “Music service that learns what you like”
- “Audioscrobbler” (predecessor technology) (Source)
- “Scrobbling” (to scrobble) is the transmission of music titles to our database
- “Tasteometer”: compares one’s music taste with another’s (Source: API Documentation, http://www.last.fm/api/show?service=258)
Output
List of songs that match one’s own taste.
6.3. Apple Genius
Population
All songs in iTunes and in the iTunes Store database
Signal
- Playlists
- Song
Technology
“makes playlists from songs in your iTunes library that go great together” (Apple Support)
Collaborative Filtering (Wikipedia)
Output
Automates the creation of playlists
6.4. Pandora
Population
Signal
- Music analysis
- hundreds of musical details on every song (About)
- Music Genome Project (About)
- 400 distinct musical characteristics analyzed by a trained music analyst
Technology
- Music Genome Project (Wikipedia: Music Genome Project)
- Music Recommendation (Wikipedia)
Output
Music Recommendations
6.5. Stumble Upon
Population
All web documents
Signal
- User Ratings
- Peer Ratings
- Similar Users
- Page Recommendations
- Selected Interests.
Technology
- “Personalized Recommendation”
- “Clustering Engine”
- “Classification Engine”
- “Recommendation Engine”
- “emergent content referral system”
- “patent-pending Toolbar system automates the collection, distribution and review of web content within an intuitive social framework”
(Source: SU Technology)
Output
List of personalized recommendations for web pages.
6.6. Netflix
Population
All movies in the Netflix database
Signal
- user
- movie
- date of grade
- grade
(Wikipedia: Netflix Prize)
Technology
- “Cinematch Algorithm”
- “straightforward statistical linear models with a lot of data conditioning” (FAQ)
- “personalized video-recommendation system” (Netflix on Wikipedia)
- “video-recommendation algorithm” (Netflix on Wikipedia)
- “collaborative filtering algorithm” (Wikipedia: Netflix Prize, Official Netflix-Prize Page)
Output
Movie Recommendations
6.7. Youtube Recommender System (Related Videos)
Population
All videos on YouTube
Signal?
Technology
- German help text: “Ähnliche Videos – Wenn du dir ein Video auf YouTube ansiehst, wird rechts unter dem Video eine Liste mit ähnlichen Videos angezeigt. Die Videos in dieser Liste haben eventuell ein ähnliches Thema wie das aktuelle Video und machen es dir leichter, andere Videos zu demselben oder einem ähnlichen Thema zu finden.” (YouTube Help)
- “Related Videos – When you watch a video on the YouTube, a list of related videos will show up to the bottom-right of the video. This list contains many videos which might be related to the video by subject matter, so that you may find it easier to search out other videos based on the same or similar subject.” (Related Videos)
Output
List of videos similar to the one currently being watched
6.8. Google Reader Recommender System
Population
All feeds that can be subscribed to with Google Reader
Signal
- Subscribed Feeds
- Web History
- Location
- Aggregated data across many users
(Source: Google Reader Help)
Technology
- “Recommendations”:
- “Your recommendations list is automatically generated. It takes into account the feeds you’re already subscribed to, as well as information from your Web History, including your location. Aggregated across many users, this information can indicate which feeds are popular among people with similar interests. For instance, if a lot of people subscribe to feeds about both peanut butter and jelly, and you only subscribe to feeds about peanut butter, Reader will recommend that you try some jelly. This process is completely automated and anonymous; your personal information will be protected in accordance with our privacy policy.”
(Source: Google Reader Help)
Output
List of recommended feeds
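The peanut butter and jelly example from the help text can be expressed as a simple co-subscription count. This is only a sketch of that one idea, not Google Reader's implementation: it ignores Web History and location, and the function name `recommend_feeds` and the sample users are hypothetical.

```python
from collections import Counter

def recommend_feeds(subscriptions, user, top_n=3):
    """Recommend feeds that co-occur with the user's own subscriptions.

    `subscriptions` maps each user to a set of feed names.
    """
    mine = subscriptions[user]
    counts = Counter()
    for other, feeds in subscriptions.items():
        # Only users who share at least one feed count as similar.
        if other != user and mine & feeds:
            counts.update(feeds - mine)
    return [feed for feed, _ in counts.most_common(top_n)]

# Hypothetical subscription data.
subscriptions = {
    "you": {"peanut butter"},
    "u1": {"peanut butter", "jelly"},
    "u2": {"peanut butter", "jelly"},
    "u3": {"peanut butter", "bread"},
    "u4": {"gardening"},
}
suggested = recommend_feeds(subscriptions, "you")
```

Here "jelly" ranks first because two users with a shared interest subscribe to it, while "gardening" is never suggested because its only subscriber shares no feed with "you".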
6.9. Facebook Friend Suggestions
Population
All Facebook users
Signal
- Network you are part of
- mutual friends
- work and education information
- contacts imported
(Facebook Help Center)
Technology
Friends: Suggestions: Suggestions is a feature that helps you connect with people and Pages you are likely to know. Facebook calculates Suggestions based on the networks you are a part of, mutual friends, work and education information, contacts imported using the Friend Finder, and many other factors. (Facebook Help Center)
Output
List of Facebook users one might know.
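Of the signals listed above, mutual friends is the easiest to illustrate. The sketch below ranks friends-of-friends by the number of mutual friends; it is not Facebook's method, which by its own description also uses networks, work and education information, imported contacts, and other factors. The function name `suggest_friends` and the sample graph are assumptions.

```python
from collections import Counter

def suggest_friends(friends, user, top_n=3):
    """Rank non-friends by the number of friends shared with `user`.

    `friends` maps each user to the set of their friends.
    """
    counts = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                counts[candidate] += 1
    return counts.most_common(top_n)

# Hypothetical friendship graph (symmetric).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
suggestions = suggest_friends(friends, "ann")
```

For "ann" the top suggestion is "dan" with two mutual friends, followed by "eve" with one.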
7. Search Engines
7.1. Google Search
Population
All web documents
Signal
- Personalization
- PageRank
- Text-Matching Techniques
Google Keeps Tweaking Its Search Engine – New York Times.
- about a half-dozen major and minor changes a week to the ranking algorithm
- search engine has many thousands of interlocking equations
- Freshness describes how many recently created or changed pages are included in a search result
- QDF “query deserves freshness.” = mathematical model that tries to determine when users want new information and when they don’t. The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information.
- system for ranking pages with 200 types of information
- PageRank is one of those 200 signals
- Some are drawn from the history of how pages have changed over time
- Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years
- Increasingly, Google is using signals that come from its history of what individual users have searched for in the past
- Google feeds the “signals” into formulas it calls “classifiers”
- “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query
- diversity – a measurement to have different pages in the ten most relevant results
200 Signals Hal Varian in the corporate Blog.
Google Blog: Introduction to Google Ranking
- Design and content guidelines
- Clearly structured layout and text links
- Useful, informative website
- Use search terms
- TITLE elements and ALT attributes should be descriptive
- Few and short parameters
- At most 100 links per page
- No manipulative practices (e.g. deceiving users by registering deliberately misspelled names of well-known websites)
- Spam report
- Pages primarily for users, not for search engines
- Do not provide content to search engines that you do not use for your visitors. This is called “cloaking”.
- Avoid tricks intended to improve search engine ranking
- No link exchange programs
- No links to web spammers or “bad neighborhoods” on the web
- No hidden text or hidden links
- No cloaking or deceptive redirects
- No automated queries to Google
- No pages loaded with irrelevant search terms
- No duplicate pages, subdomains, or domains with essentially the same content
- No “doorway pages” created specifically for search engines
From the book “Webseiten Ranking”
- 1. Number and quality of inbound links
- 2. Page title, page description, headings; avoid too many topics per page
- 3. Domain names, file names, folders, subdomains
- 4. Frequency, position, and emphasis of key terms (keyword density)
- 5. Avoid typical mistakes (duplicate content across different domains, images instead of text, frames, Flash, scripts, gimmicks, search engine spam, forbidden tricks)
Technology
- Automated Search Technology
- PageRank
- Text-Matching Techniques (Google Inc. Annual Report 2008)
- Hypertext-Matching Analysis
- “analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user’s query” (Technology Overview)
- Ranking Algorithm
“The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it’s not just the language, it’s how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing)”. (Udi Manber, VP of engineering, Google Blog)
- “a collection of algorithms used to find the most relevant documents for a user query” (Amit Singhal, Google Blog)
- Ranking Algorithm = the formulas that decide which Web pages best answer each user’s question (New York Times)
Output
List of web documents ordered by relevance to a search query
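PageRank is the one signal in this list whose basic form is publicly documented, so it can be sketched. The following is the textbook power-iteration version on a tiny link graph, not Google's production system; the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the sample graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a small link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" receives the most inbound links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this graph page "c" ends up with the highest rank and page "d", which nobody links to, with the lowest.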
7.2. Yahoo Search
Population
- All web documents
- “billions of web documents” (Source: Yahoo Help)
Signal
- link analysis (Source)
- PageRank (Yahoo Research)
- web page text
- title
- description accuracy
- source
- associated links
- other unique document characteristics.
(Source: Yahoo Help)
- Unique Content
- Pages designed for humans
- interesting links
- accurate metadata
- Good Webdesign
- No Pages that harm accuracy, diversity or relevance
- No doorway pages
- No Double Content
- No affiliate content
- No numerous, unnecessary virtual hostnames
- No Pages in great quantity, automatically generated or of little value (cookie-cutter pages)
- No Pages using methods to artificially inflate search engine ranking
- No use of text or links hidden from the user
- No Pages that give the search engine different content than what the end user sees (cloaking)
- No Sites cross-linked excessively with other sites to inflate a site’s apparent popularity (link schemes)
- No Pages built primarily for the search engines or pages with excessive or irrelevant keywords
- No Misuse or inaccurate use of competitor or brand names
- No Sites that use excessive pop-ups, install malware (i.e. spyware, viruses, trojans), or interfere with user navigation
- No Pages that seem deceptive, fraudulent, or provide a poor user experience
(Yahoo Search Content Quality Guidelines)
Technology
- crawling, indexing, and ranking algorithms (here, here)
- ranking model (here, here)
- “Search Ranking” (Yahoo Help)
- Search Technology (YST) (Yahoo Help)
- Machine Learning: The Machine Learning group is a team of experts in computer science, statistics, mathematical optimization, and automatic control. We focus on making computers learn abstractions, patterns, conditional probability distributions, and policies from web scale data with the goal to improve the online experience for Yahoo users, partner publishers, and advertisers. Link
- Search & Web Mining. Search technologies are an international team of experts in search, algorithms, data mining, natural language, and data processing. Together, we build systems and algorithms to analyze user needs, then synthesize and deliver the right responses from data sources around the globe. Link Yahoo Research
- Microeconomics & Social Systems (Algorithmic Game Theory, Network analysis) Yahoo Research.
Five Years of Yahoo Search
Yahoo Search on Wikipedia
Output
List of web documents ordered by relevance to a given search term
7.3. Bing Search
Population
All web documents
Signal
- web page content,
- the number and quality of websites that link to your pages,
- relevance of your website’s content to keywords.
- Use unique tags on each page
- Use unique description tags on each page
- Use H1 tags
- Use text navigation links
- Create content for your human visitors, not the Bing web crawler
- Incorporate keywords into URL strings
- domain score (Webmaster Help)
- RankNet (Bing Blog: User Needs, Features and the Science behind Bing)
- Cashback (How Cashback works)
- Freshness: “Superfresh” (Bing Blog: User Needs, Features and the Science behind Bing)
- RankNet
Technology
- “Decision Engine, designed to help people cut through Internet clutter to make smarter, more informed decisions” (Bing Press Kit)
- “Decision Engine” (Press Release)
- “Best match, sets up the best result in a distinct, easily accessible space at the top of the page” (Bing Blog)
- Complex Tasks and Decision Making: Perhaps the most interesting insight is that people are turning to search engines not only for information, but to help them complete complex tasks and make decisions. They no longer want to just find a web page; they want to learn, shop, be entertained, accomplish tasks and make important decisions. This requires better conceptual organization, a unified experience, deeper task specific content selection and support for longer sessions. (Bing Blog: User Needs, Features and the Science behind Bing)
- “algorithmic Web results”
- “intent-specific ranking”
- “technologies and algorithms for extracting structure from unstructured data and applying organizational taxonomies. Even though the organization is not nearly as intuitive as one done by a human editorial process, we are able to achieve it in a fully automated way allowing for fast scaling and reacting to changes.”
- “real time indexing and ranking”
- “RankNet (our ranking system)”
- “technologies in HTML parsing, core Natural Language Processing, entity extraction, and document classification.”
(Bing Blog: User Needs, Features and the Science behind Bing)
- Bing website ranking
Bing website ranking is completely automated. The Bing ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. Site ranks change as we review the factors that make up the ranking. Although you can’t directly change your website’s ranking, you can optimize its design and technical implementation to enable appropriate ranking by most search engines. For information about improving your website’s ranking, we suggest you check out the Webmaster Center tools. We highly recommend that you use Webmaster Center’s Crawl Issues tool to determine if your site is being penalized for issues such as malware. Also review our post on what to do if you are not ranking. (Crawling and Ranking FAQs)
Output
Web documents ordered by relevance to a search query
7.4. Ask Search
Population
All web documents
Signal
- ExpertRank
- Clustering
Technology
- ExpertRank algorithm
“Our ExpertRank algorithm provides relevant search results by identifying the most authoritative sites on the Web. With Ask search technology, it’s not just about who’s biggest: it’s about who’s best. Our ExpertRank algorithm goes beyond mere link popularity (which ranks pages based on the sheer volume of links pointing to a particular page) to determine popularity among pages considered to be experts on the topic of your search. This is known as subject-specific popularity. Identifying topics (also known as “clusters”), the experts on those topics, and the popularity of millions of pages amongst those experts — at the exact moment your search query is conducted — requires many additional calculations that other search engines do not perform. The result is world-class relevance that often offers a unique editorial flavor compared to other search engines.” (Ask Search Technology)
- Teoma algorithm, now known as ExpertRank (Webmaster Help)
- click popularity search technology (Webmaster Help)
- “clustering concept of subject-specific popularity”, “subject-specific popularity. Identifying topics (also known as “clusters”)” (Webmaster Help)
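To illustrate the difference between plain link popularity and the subject-specific popularity that Ask describes, here is a minimal sketch. It is not Ask's actual ExpertRank implementation: the link graph, the `topic_experts` set and both function names are assumptions invented for this example, and how Ask really identifies clusters and experts is not documented in the quotes above.

```python
from collections import Counter

def link_popularity(links):
    """Count all inbound links per page (plain link popularity)."""
    return Counter(target for _source, target in links)

def subject_specific_popularity(links, topic_experts):
    """Count only inbound links that come from pages treated as topic experts."""
    return Counter(target for source, target in links if source in topic_experts)

# Hypothetical link graph as (source, target) pairs.
links = [
    ("blog-a", "big-portal"), ("blog-b", "big-portal"), ("blog-c", "big-portal"),
    ("expert-1", "niche-site"), ("expert-2", "niche-site"), ("expert-1", "big-portal"),
]
# Assumed to be the expert pages of the topic cluster that matches the query.
topic_experts = {"expert-1", "expert-2"}

by_volume = link_popularity(links).most_common()
by_experts = subject_specific_popularity(links, topic_experts).most_common()
```

With this toy data the portal wins on sheer link volume, while the niche site wins once only links from the assumed experts are counted.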
Output
List of web documents ordered by relevance
7.5. Baidu Search
7.6. DeepDyve Search
Population
Expert Sources. Our expert sources include the world’s most trusted publishers, scholarly societies, universities, and government agencies. In addition, DeepDyve monitors over 30,000 thousand trade, industry, and specialist sources to create specialized collections of current news and information in life sciences, clean energy, and technology. (DeepDyve: Expert Sources)
Signal
- Patterns & Symbols
- Expert Sources
Technology
- research engine: (About)
- KeyPhrase algorithm
“KeyPhrase algorithm, applies indexing techniques from the field of genomics. The algorithm matches patterns and symbols on a scale that traditional search engines cannot match, and it is perfectly suited for complex data found on the Deep Web.” (About)
- KeyPhrase technology
“KeyPhrase technology extracts substantially more information from documents than typical keywords. It indexes every word, as well as every phrase in each document, and weighs their informational impact using advanced statistical computation” (Deep Dyve Research Engine)
- dynamic grouping technology
“dynamic grouping technology. Dynamic grouping allows you to quickly skim the entire universe of related topics and understand their relationships with each other.” (Deep Dyve Research Engine)
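The idea of indexing every word and every phrase and then weighing their informational impact can be sketched roughly as follows. This is only an illustration, not DeepDyve's KeyPhrase algorithm: the phrase length limit `max_len`, the sample documents and the use of inverse document frequency as the statistical weight are all assumptions of mine.

```python
import math
from collections import defaultdict

def phrases(text, max_len=3):
    """Yield every word and every phrase of up to max_len consecutive words."""
    words = text.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_index(docs, max_len=3):
    """Map each word or phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrases(text, max_len):
            index[phrase].add(doc_id)
    return index

def weight(index, phrase, n_docs):
    """Inverse document frequency: rarer phrases carry more information."""
    return math.log(n_docs / len(index[phrase])) if phrase in index else 0.0

# Hypothetical mini collection.
docs = {
    "d1": "gene expression in yeast",
    "d2": "gene therapy trials",
    "d3": "solar cell efficiency",
}
index = build_index(docs)
```

In this toy collection the phrase "gene expression" occurs in one document and therefore gets a higher weight than the single word "gene", which occurs in two.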
Output
Research search results ordered by relevance
7.7. Twitter Search
Population
All tweets
Signal
Real-time information
Technology
Twitter Search helps you filter all the real-time information coursing through our service. (About)
Output
Real-time trending topics
8. Citation Indexing
8.1. Google Scholar
Population
Scholarly literature … peer-reviewed papers, master’s, diploma and doctoral theses, books, abstracts and articles from sources such as academic publishers, professional societies, preprint repositories, universities and other educational institutions. (About DE)
Signal
- full text of an article
- author
- where the article was published
- number of citations in scholarly literature
(About DE)
Technology
- “Ranking technology”
“Google Scholar orders your search results by relevance. As with Google web search, the most useful references appear at the top of the page. Google’s ranking technology takes into account the full text of an article, the author, where the article was published, and how often it has been cited in the scholarly literature.” (About DE)
- “weighing the full text of each article, the author, the publication in which the article appears, and how often the piece has been cited in other scholarly literature.” (About EN: http://scholar.google.com/intl/en/scholar/about.html)
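Google only names the four signals, not how they are combined. The following sketch shows one conceivable way to weigh them into a single relevance score. The weights, the 0..1 scores for text match, author and venue, and the log damping of citation counts are all hypothetical choices for illustration and say nothing about the real ranking.

```python
import math
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    text_match: float    # 0..1, how well the full text matches the query (assumed)
    author_score: float  # 0..1, hypothetical author signal
    venue_score: float   # 0..1, hypothetical publication signal
    citations: int       # times cited in other scholarly literature

# Illustrative weights only; the real weighting is not published.
WEIGHTS = {"text": 0.4, "author": 0.1, "venue": 0.2, "citations": 0.3}

def score(a: Article) -> float:
    # Log-damp citation counts so heavily cited papers do not swamp the other signals.
    cite = min(math.log1p(a.citations) / math.log1p(10_000), 1.0)
    return (WEIGHTS["text"] * a.text_match + WEIGHTS["author"] * a.author_score
            + WEIGHTS["venue"] * a.venue_score + WEIGHTS["citations"] * cite)

articles = [
    Article("Rarely cited match", 0.9, 0.5, 0.5, 3),
    Article("Well cited match", 0.8, 0.5, 0.5, 2500),
]
ranked = sorted(articles, key=score, reverse=True)
```

Under these assumed weights the well cited article ranks first even though its text match is slightly weaker.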
Output
List of academic publications sorted by relevance
8.2. CiteSeerX
Population
Scientific Literature
Signal
- Citation statistics
- Reference linking
- Citation context
- Awareness and tracking
- Full-text indexing
- Query-sensitive summaries
(About)
Technology
- “algorithms, data, metadata, services, techniques, and software that can be used to promote other digital libraries.” (About)
- “Autonomous Citation Indexing (ACI)”: ACI autonomously creates citation indices similar to the Science Citation Index®. (Lawrence et al. 1999)
- “automated citation indexing and citation linking using the method of autonomous citation indexing”. (About)
Output
List of scientific articles ordered by relevance (recency or number of citations)
9. Comparison Systems
9.1. Bing Airtravel Flight Ticket Price Prediction
http://www.bing.com/travel/
Population
All flight ticket offers
Signal
- Prices
- Time until departure
- Historical data on the departure
(Ayres 2007)
- historical data
(Bing Travel: Technology and Data)
Technology
- “smart travel search site”
- “Price Predictor”, “Our Price Predictor shows if fares are rising or dropping. Based on the prediction, we provide a recommendation to buy now or buy later.”
- “hotel Rate Indicator”, “Our Rate Indicator indicates whether or not today’s rate for a specific hotel is a deal. It compares an individual hotel’s current rate found to its observed historical rates.”
(About)
- “predictive technology”
- “predictive, statistical and data management algorithms”
- “Data Aggregation and Analysis”
- “statistics, data mining and machine learning”
- “algorithms that can identify patterns and conditions from our history of accumulated airfare data”
- “predictive models”
(Bing Travel: Technology and Data)
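The Rate Indicator is the easiest of these to sketch, because its description is concrete: today's rate is compared with the rates observed in the past. The version below is a minimal illustration under my own assumptions, namely the median as the reference value, a `tolerance` band of 5 percent and invented sample rates. The Price Predictor with its predictive models is not covered here.

```python
from statistics import median

def rate_indicator(current_rate, historical_rates, tolerance=0.05):
    """Label today's rate relative to the median of observed historical rates."""
    typical = median(historical_rates)
    if current_rate < typical * (1 - tolerance):
        return "deal"
    if current_rate > typical * (1 + tolerance):
        return "above typical"
    return "typical"

# Hypothetical nightly rates observed for one hotel.
history = [120, 135, 128, 140, 131]
```

For example, `rate_indicator(110, history)` labels the rate a deal, because 110 lies more than 5 percent below the historical median of 131.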
Output
Purchase recommendation for flight tickets
9.2. Zillow.com Real Estate Price Prediction
Population
All real estate in the USA
Signal
- “make zillions of data points for homes accessible to everyone” (About)
- publicly available information such as comparables and tax information (Zillow: The Big Idea)
Technology
“Zestimate”, “Zindex” “Zillow calculates a Zestimate® home valuation as a starting point for anyone to see — for free — for most homes in the U.S. So, using the Zestimate as the foundation, we built a Web page for each home and then filled it with data and maps and layered it with publicly available information such as comparables and tax information.” (Zillow: The Big Idea)
Output
Estimates of the value of real estate
9.3. Google Product Search
Population
All product offers: sources are sellers who submit their product web pages, plus all product web pages crawled by Google
Signal
?
Technology
- “ranking software”
- “Google’s product search results are automatically generated by our ranking software. Google does not accept payment for inclusion of products in our search results, nor do we place sellers’ sites higher in our results if they’re advertisers or offer to pay for that placement.”
- “search engine”
(About)
Output
“Our job is to find the product you want and point you to the store that sells it based on our assessment of what’s most relevant to your search” (About)
10. Semantic Web Apps
10.1. Powerset Suche
Now Bing.com
Population
Wikipedia articles
Signal
Factz (Meanings of Sentences)
Technology
- “natural language processing”
- “Powerset is first applying its natural language processing to search, aiming to improve the way we find information by unlocking the meaning encoded in ordinary human language.”
- “enabling computers to understand our language”
- “Factz”
“Factz are concise representations of information extracted from sentences. They are represented in three parts: the subject, relation and object (e.g. Oswald shot JFK). Factz are one way that Powerset represents the meaning of a sentence.” (FAQ)
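The three-part representation from the FAQ maps naturally onto a small data structure. The sketch below is only an illustration of subject, relation and object triples. The `Fact` type, the `find` helper and the second example fact are my own additions, and extracting such triples from real sentences (the actual natural language processing) is not shown.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str

facts = [
    Fact("Oswald", "shot", "JFK"),   # example from the Powerset FAQ
    Fact("Ruby", "shot", "Oswald"),  # added for illustration
]

def find(facts, subject=None, relation=None, obj=None):
    """Return facts matching every part that is given; None acts as a wildcard."""
    return [
        f for f in facts
        if (subject is None or f.subject == subject)
        and (relation is None or f.relation == relation)
        and (obj is None or f.obj == obj)
    ]
```

A question such as "who shot JFK?" then becomes `find(facts, relation="shot", obj="JFK")`, which matches on meaning parts rather than on keywords.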
Output
Web documents ordered by relevance
11. Location-based Systems
11.1. Citysense
Population
Nightlife hotspots in San Francisco
Signal
- “billion points of GPS and WiFi positioning data – plus real-time feeds” (CitySense More info)
- history and preferences (in the next version)
- Mobile Location Data
Technology
- Sense Networks Macrosense platform
Sense Networks Macrosense platform, which analyzes massive amounts of aggregate, anonymous location data in real-time. Macrosense is already being used by business people for things like selecting store locations and understanding retail demand.
- “MacroSense”
“Relevant Recommendation”
“Personalization”
“Discovery from Mobile Location Data”
“Real-Time Activity Analysis”
(Sensenetworks)
- “Location Tracking” (Techcrunch)
Output
Map of the real-time hotspots in a city
12. Other
12.1. Flickr Interestingness
Population
All photos on Flickr
Signal
- Clickthroughs
- who marks it as a favorite
- when it is marked as a favorite
- Tags
- Change over time
(About Interestingness)
- relationship between the person who uploaded the photo and the people who are commenting (Flickr Blog: Ten new things).
Technology
- “Interestingness”
“There are lots of elements that make something ‘interesting’ (or not) on Flickr. Where the clickthroughs are coming from; who comments on it and when; who marks it as a favorite; its tags and many more things which are constantly changing. Interestingness changes over time, as more and more fantastic content and stories are added to Flickr.” (About Interestingness)
- “interestingness is a ranking algorithm based on user behavior around the photos” (Flickr Blog: Ten new things)
- “Geotagging” (Flickr Help Map)
Output
Overview page with the most interesting photos (Last 7 days)
12.2. Google Trends
Population
All search queries on Google
Signal
- Search Volume
- News-Stories
Technology
- “Enter up to five topics and see how often they’ve been searched on Google over time. Google Trends also shows how frequently your topics have appeared in Google News stories, and in which geographic regions people have searched for them most.”
- “Search Volume Index”
- “News reference volume”, “the number of times your topic appeared in Google News stories”
(About Google Trends)
- “hot trend”: “The top 100 fastest-rising search queries right now (U.S. only). Updates throughout the day.” (Hot Trends)
- Google Zeitgeist: “Zeitgeist” means “the spirit of the times”, and Google reveals this spirit through the aggregation of millions of search queries we receive every day. We have several tools that give insight into global, regional, past and present search trends. These tools are available for you to play with, explore, and learn from. Use them for everything from business research to trivia answers. (Google Zeitgeist)
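What "fastest-rising" could mean in practice can be sketched by comparing query counts from two time windows. Google does not document the Hot Trends computation in the quotes above, so the growth ratio, the `smoothing` constant for queries that are new in the current window, and the sample counts are assumptions for illustration only.

```python
def fastest_rising(previous, current, top_n=3, smoothing=1):
    """Rank queries by the ratio of current to previous search counts."""
    growth = {
        query: (count + smoothing) / (previous.get(query, 0) + smoothing)
        for query, count in current.items()
    }
    return sorted(growth, key=growth.get, reverse=True)[:top_n]

# Hypothetical search counts for two consecutive time windows.
previous = {"weather": 1000, "eclipse": 50, "phone": 400}
current = {"weather": 1050, "eclipse": 5000, "phone": 800, "new album": 90}

hot = fastest_rising(previous, current)
```

Note that "weather" has the second-highest volume but does not make the list, because its volume barely changed. Rising fast is a different criterion from being searched often.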
Output
Comparison between sites/search terms by relevance, measured in search volume
12.3. Bing xRank
Population
notable people
Signal
Web Searches
Technology
xRank keeps track of notable people and puts them in order for you. We count Bing web searches for movie stars, musicians, and other famous people. Then, we compile our findings into an insightful ranking formula that tells you who the world is searching for most. The result is a cultural snapshot of who’s hot and who’s not! (Bing xRank, What is xRank?)
Output
Artist ranking “a cultural snapshot of who’s hot and who’s not”
12.4. Hunch.com
Population
Questions and decisions
Signal
- Questions
- answers
- personal answering history
- what you’ve already been asked
- how you’ve answered
(Hunch.com: How Hunch works)
Technology
- “The Hunch algorithm”
- “question selection algorithm”
- “machine learning”
- question selection algorithm: “In choosing what to ask you, Hunch’s question selection algorithm tries to do two things. First, it tries to find a question which will discriminate well among the remaining possible decision outcomes for you – thus filtering the remaining choices from “many” to “fewer”. Second, the algorithm looks for a question which can help optimize and rank the remaining decision results to present you with the ones you’ll like the most. It’s trying to ensure that you’ll like outcome #1 better than outcome #5.”
- “machine learning based on statistical inferences” “The academic name for this sort of algorithm is machine learning”
- “decision making algorithm”
(Hunch.com: How Hunch works)
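The first goal of the question selection algorithm, finding a question that discriminates well among the remaining outcomes, can be sketched as follows. This is not Hunch's algorithm: I assume that every outcome has exactly one known answer per question and that all outcomes are equally likely, and the decision data is invented. The second goal, ranking the remaining results by how much you will like them, is left out.

```python
from collections import Counter

def expected_remaining(question, outcomes, answers):
    """Expected number of outcomes left after asking, with equally likely outcomes."""
    groups = Counter(answers[o][question] for o in outcomes)
    total = len(outcomes)
    return sum(size * size for size in groups.values()) / total

def pick_question(questions, outcomes, answers):
    """Choose the question that narrows the remaining outcomes the most."""
    return min(questions, key=lambda q: expected_remaining(q, outcomes, answers))

# Hypothetical data: the answer that leads to each outcome, per question.
answers = {
    "beach":    {"likes_sun": "yes", "budget": "low"},
    "ski trip": {"likes_sun": "no",  "budget": "high"},
    "city":     {"likes_sun": "yes", "budget": "high"},
    "cruise":   {"likes_sun": "yes", "budget": "low"},
}
outcomes = list(answers)
questions = ["likes_sun", "budget"]

best = pick_question(questions, outcomes, answers)
```

Here the budget question splits the four outcomes evenly into two groups of two, so it filters from "many" to "fewer" more reliably than the sun question, which leaves three outcomes for most users.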
Output
Answers