https://pixelastic.github.io/marvel/

Marvel Super-Search!

Tim Carry

1. Getting the data

2. Tips and tricks

3. Building the UI

Getting the data

Wikipedia

Wikipedia logo

List of all 2056 characters

Scraping with x-ray

X-Ray

import xray from 'x-ray';
const x = xray();

const targetUrl = 'https://en.wikipedia.org/wiki/Category:Marvel_Comics_superheroes';
// Wrapping the selector in an array asks x-ray for every match, not only the first
const selector = ['#mw-pages .mw-category-group li a@href'];

// x-ray callbacks are error-first
x(targetUrl, selector)((err, urlList) => {
  // urlList is an array of all `href` values
});

Wikipedia API

https://en.wikipedia.org/w/api.php?titles=Captain_America&format=json&action=query&prop=revisions&rvprop=content

Raw dump

Custom markup

No way!

{{Other uses}} {{About|Steve Rogers|the subsequent versions of the character|List of incarnations of Captain America}} {{pp-vandalism|expiry=03:36, 23 May 2016|small=yes}} {{Infobox comics character  | image = CaptainAmerica109.jpg | converted = y | caption = ''Captain America'' #109 (Jan. 1969).Cover art by [[Jack Kirby]] and [[Syd Shores]]. | alt = Captain America bursting […]
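
The query URL on this slide can be assembled from its parameters rather than typed by hand. A minimal sketch, assuming a runtime with the standard `URL` and `URLSearchParams` globals; the helper name `buildWikipediaUrl` is hypothetical and not part of the original project.

```javascript
// Hypothetical helper: builds the raw-content query URL shown on the slide.
function buildWikipediaUrl(title) {
  const url = new URL('https://en.wikipedia.org/w/api.php');
  url.search = new URLSearchParams({
    titles: title,
    format: 'json',
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
  }).toString();
  return url.toString();
}

console.log(buildWikipediaUrl('Captain_America'));
```

The response still contains the raw wiki markup shown above, so this only helps with fetching, not with parsing.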

DBPedia

http://dbpedia.org/data/Thor_(Marvel_Comics).json

DBPedia logo

Unofficial project

Community effort

structured data

Frozen in time (2015/08)

{
  "abstract": [
   {
     "lang": "en",
     "value": "Thor is a fictional character, a superhero […]"
   }
  ],
  "aliases": [
   {
     "lang": "en",
     "value": "Dr. Donald Blake, Jake Olson, Sigurd Jarlson, Eric Masterson"
   }
  ]
}

Infobox

Contains all the info

Mostly structured

npm module wiki-infobox

wiki-infobox

import infobox from 'wiki-infobox';

infobox('Hulk_(comics)', 'en', (err, data) => {
 // {
 //   "character_name": {
 //     "type": "text",
 //     "value": "The Incredible Hulk"
 //   },
 //   "aliases": [
 //     {
 //       "type": "text",
 //         "value": "<br>Green Scar<br>World-Breaker<br>Jade Giant"
 //     },
 //     […]
 //   ],
 //   […]
 // }
});
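
"Mostly structured" means values such as `aliases` still carry markup. A minimal sketch of cleaning one such value, assuming aliases are separated by `<br>` tags as in the output above; `splitAliases` is a hypothetical name.

```javascript
// Hypothetical helper: turns an infobox alias value into a clean list.
function splitAliases(rawValue) {
  return rawValue
    .split(/<br\s*\/?>/i)            // break on <br>, <br/> or <br />
    .map((alias) => alias.trim())
    .filter((alias) => alias.length > 0); // drop the empty leading entry
}

console.log(splitAliases('<br>Green Scar<br>World-Breaker<br>Jade Giant'));
```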

Wikidata

https://www.wikidata.org/w/api.php?titles=Spider-Man&action=wbgetentities&sites=enwiki&format=json

Wikidata logo

Only metadata

Just for aliases

npm module wikidata-sdk

{
  "pageid": 81585,
  "aliases": [
    {
      "value": "Peter Parker"
    },
    {
      "value": "Webhead"
    },
    {
      "value": "Spidey"
    }
  ],
  […]
}
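
Since Wikidata is used "just for aliases", the only processing needed is pulling the `value` fields out. A minimal sketch, assuming the simplified shape shown on the slide; `getAliases` is a hypothetical name.

```javascript
// Hypothetical helper: extracts alias strings from the simplified shape above.
function getAliases(entity) {
  const aliases = Array.isArray(entity.aliases) ? entity.aliases : [];
  return aliases.map((alias) => alias.value);
}

const spiderMan = {
  pageid: 81585,
  aliases: [{ value: 'Peter Parker' }, { value: 'Webhead' }, { value: 'Spidey' }],
};

console.log(getAliases(spiderMan));
```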

stats.grok.se

http://stats.grok.se/json/en/latest90/Iron_Man

Pageviews count

Personal project

Dead after 2016/01

Biased by Netflix

{
  "title": "Iron_Man",
  "rank": 2111,
  "daily_views": {
    "2016-01-20": 2476,
    "2016-01-19": 2394,
    "2016-01-18": 2359,
    "2016-01-17": 2196,
    "2016-01-16": 2563,
    "2016-01-15": 2661,
    "2016-01-14": 2393,
    […]
  }
}
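
The daily counts can be reduced to a single popularity figure per character. A minimal sketch, assuming the total over the period is the number of interest; the slide does not say how the counts were aggregated, and `totalPageviews` is a hypothetical name.

```javascript
// Hypothetical helper: sums every daily count in a stats.grok.se response.
function totalPageviews(stats) {
  const dailyViews = stats.daily_views || {};
  return Object.values(dailyViews).reduce((sum, views) => sum + views, 0);
}

const ironMan = {
  title: 'Iron_Man',
  daily_views: {
    '2016-01-20': 2476,
    '2016-01-19': 2394,
    '2016-01-18': 2359,
  },
};

console.log(totalPageviews(ironMan));
```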

Old-school scraping

Wolverine illustration

No image in any API

Manual scraping

x-ray again

x(
  'https://wikipedia.org/Wolverine',
  '.infobox a.image img@src'
)((err, imageUrl) => {
  // x-ray callbacks are error-first
  console.log(imageUrl);
});

Marvel API

http://gateway.marvel.com/v1/public/characters/1009262

2 years old

Unreliable:

  • Timeouts
  • Infinite loops
  • Empty results
  • Slow
  • Rate limit

retryUntilItWorks()
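
The deck names `retryUntilItWorks()` without showing it. One way such a helper could look is sketched below; the signature, the attempt limit and the delay are all assumptions, and the task is expected to reject on timeouts or empty results.

```javascript
// Hypothetical sketch: the deck only names retryUntilItWorks(), it does not show it.
// `task` must return a promise that rejects when the call should be retried.
function retryUntilItWorks(task, maxAttempts = 5, delayMs = 500) {
  function attempt(remaining) {
    return Promise.resolve()
      .then(task)
      .catch((err) => {
        if (remaining <= 1) {
          throw err; // out of attempts: surface the last error
        }
        // Wait a little before trying again, to stay under the rate limit
        return new Promise((resolve) => setTimeout(resolve, delayMs))
          .then(() => attempt(remaining - 1));
      });
  }
  return attempt(maxAttempts);
}

// Usage with a fake task that fails twice before succeeding
let calls = 0;
const flakyTask = () => {
  calls += 1;
  return calls < 3
    ? Promise.reject(new Error('timeout'))
    : Promise.resolve({ id: 1009262 });
};

retryUntilItWorks(flakyTask, 5, 10).then((data) => console.log(calls, data));
```

Capping the number of attempts keeps a permanently failing endpoint from turning the retry into another infinite loop.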

Marvel logo

Daredevil thumbnail

Abandoned by his mother, Matt Murdock was raised by his father, boxer "Battling Jack" Murdock […]

827 comics, 1326 stories

Marvel Website

http://marvel.com/characters/11/daredevil

Different from the API

Awesome for design

Manual scraping

Daredevil comic panels

Tips & tricks

Isolated scripts

Various npm run scripts

One per source

Run in isolation

Temporary save on disk

Scraping is easy and slow

Parsing is hard and fast

$ npm run dbpedia
./download/dbpedia
  ├── 8-Ball_(comics).json
  ├── Abdul_Alhazred_(comics).json
  ├── Abigail_Brand.json
  […]
  ├── Zombie_(comics).json
  ├── Zom.json
  └── Zzzax.json

Consolidate

Merge all sources

Define fallbacks

Committed in git

Ordered keys

$ npm run consolidate
./download
  ├── dbpedia
  ├── images
  ├── infobox
  ├── marvel
  │   ├── api
  │   └── website
  ├── pageviews
  ├── urls
  └── wikidata
  
./records
  ├── 8-Ball_(comics).json
  […]
  └── Zzzax.json
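
The consolidate step (merge all sources, define fallbacks, ordered keys) can be sketched as follows. The field names and the source priority are hypothetical, chosen only for illustration. Sorting the keys is presumably what keeps the committed JSON files stable from one run to the next.

```javascript
// Returns the first candidate that is neither undefined, null, nor empty.
function firstDefined(...candidates) {
  return candidates.find(
    (value) => value !== undefined && value !== null && value !== ''
  );
}

// Rebuilds the object with alphabetically sorted keys.
function orderKeys(record) {
  const ordered = {};
  Object.keys(record).sort().forEach((key) => {
    ordered[key] = record[key];
  });
  return ordered;
}

// Hypothetical field names and fallback order, for illustration only.
function consolidate(sources) {
  const { dbpedia = {}, infobox = {}, marvel = {} } = sources;
  return orderKeys({
    name: firstDefined(infobox.name, dbpedia.name, marvel.name),
    description: firstDefined(marvel.description, dbpedia.abstract),
    image: firstDefined(marvel.image, infobox.image),
  });
}

const record = consolidate({
  dbpedia: { name: 'Thor', abstract: 'Thor is a fictional character' },
  marvel: { description: '' },
});

// Fields with no value in any source are dropped by JSON.stringify
console.log(JSON.stringify(record, null, 2));
```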
  

Asynchronous code

Various patterns

Callbacks and promises

Make it chainable

f(input, (err, data) => {});
f(input)((err, data) => {});
HelperPath.createDir(infoboxDir)
    .then(getUrls)
    .then(getInfoboxes)
    .then(saveToDisk)
    .then(teardown);

Bluebird

import Promise from 'bluebird';

// infobox(url, (err, data) => {});
function infoboxAsPromise(url) {
  return Promise.promisify(infobox)(url);
}

// x(url, context, selector)((err, data) => {});
function xrayAsPromise(url, context, selector) {
  // x-ray returns a function that takes the callback, so wrap it by hand
  return new Promise((resolve, reject) => {
    x(url, context, selector)((err, data) => {
      if (err) {
        return reject(err);
      }
      resolve(data);
    });
  });
}
  

TDD saved my life

    malformed data

+ untrusted data

----------------

= unit testing!

$ npm run test

  HelperDBPedia
    isHero
       should be true if hero
    getPowers
       should split on new lines
       should split on commas
       should work on arrays
       should remove comments
      […]

    319 passing (598ms)
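
The test names above hint at what a helper like `getPowers` has to cope with. The sketch below is a guess at such a function, not the project's actual code; in particular it assumes that "comments" means HTML comments.

```javascript
// Hypothetical sketch of a getPowers-style parser, inferred from the test names.
function getPowers(raw) {
  const chunks = Array.isArray(raw) ? raw : [raw];
  return chunks
    // Assumption: "comments" are HTML comments such as <!-- note -->
    .map((chunk) => String(chunk).replace(/<!--[\s\S]*?-->/g, ''))
    .flatMap((chunk) => chunk.split(/[\n,]/)) // split on new lines and commas
    .map((power) => power.trim())
    .filter((power) => power.length > 0);
}

console.log(getPowers('Flight,\nSuper strength <!-- citation needed -->'));
console.log(getPowers(['Flight', 'Healing, Claws']));
```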

Building the UI

Building blocks

https://community.algolia.com/instantsearch.js/

library of UI widgets

fully customizable

eat your own dog food

10k records free

Facet screenshot

<h3>Teams</h3>
<div id="teams"></div>
[…].refinementList({
  container: '#teams',
  attributeName: 'teams',
  operator: 'and',
  limit: 10
})
.ais-refinement-list--label {
  cursor: pointer;
  font-weight: normal;
}

Cloudinary

http://res.cloudinary.com/pixelastic-marvel/image/fetch/w_450,q_90,e_colorize:40,co_rgb:3F0606/http://i.marvel.com/i/03/537ba78541492.gif

Image CDN

Resize and compress

Tons of effects

7.5k operations free
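
The fetch URL on this slide has three parts: the account prefix, a comma-separated list of transformations, and the remote image URL. A minimal sketch of assembling it; `cloudinaryFetchUrl` is a hypothetical helper, and a remote URL containing special characters such as `?` would probably need to be URL-encoded first.

```javascript
// Hypothetical helper: assembles a Cloudinary fetch URL from its three parts.
function cloudinaryFetchUrl(cloudName, transformations, imageUrl) {
  const prefix = `http://res.cloudinary.com/${cloudName}/image/fetch`;
  return `${prefix}/${transformations.join(',')}/${imageUrl}`;
}

const tinted = cloudinaryFetchUrl(
  'pixelastic-marvel',
  ['w_450', 'q_90', 'e_colorize:40', 'co_rgb:3F0606'], // resize, compress, tint
  'http://i.marvel.com/i/03/537ba78541492.gif'
);

console.log(tinted);
```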

Deadpool comic panels

Tainted Deadpool comic panels

Let's build stuff!

    Free data

+ Free software

+ Free hosting

+ Free search

-------------

= Awesome

Data everywhere meme

tim@
.com