Posts Tagged ‘puppeteer’

Finding, fetching, and rendering favicons with puppeteer

I’ve been working a bit with fetching favicons and noted some of the complexity I encountered:

  • The original way of adding a favicon to a site, placing a /favicon.ico file in the root directory, is alive and well; browsers will make an HTTP GET request to try and fetch this file.
  • Within the HTML document, <link rel="icon" is the correct way to specify the icon. However, a link tag with <link rel="shortcut icon" is also valid and acceptable, but “shortcut” is redundant and has no meaning (of course, if you’re trying to parse or query the DOM, it’s a case you need to consider).
  • Like other web content, the path in a <link> tag can be an absolute URL, which may or may not declare a protocol, or a relative URL.
  • While there is really good support for PNG favicons, ICO files are still common, even on popular sites (as of this writing, GitHub, Twitter, and Gmail all use ICO favicons).
  • When ICO files aren’t used, there are usually multiple <link> tags with different values for the sizes attribute, declaring different resolutions of the same icon (ICO is a container format, so all the different resolutions are packaged together in one file).
  • The correct MIME type for ICO files is image/vnd.microsoft.icon, but the non-standard image/x-icon is much more common.
  • Despite the popularity of ICOs and PNGs, there’s a bunch of other formats with varying degrees of support across browsers: GIF (animated/non-animated), JPEG, APNG, SVG. Of particular note is SVG, as it’s the only non-bitmap format on this list, and it is increasingly supported.
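To make the rel and sizes points above concrete, here is a minimal sketch of how link tags could be filtered and ranked once their attributes have been read out of the document. The helper names (isIconRel, largestDeclaredSize, pickLargestIcon) and the shape of the link objects are assumptions made for illustration; the code later in this post does not rank by size.

```javascript
// Hypothetical helper: decide whether a rel value declares a favicon.
// rel is a space-separated, case-insensitive token list, so "icon",
// "shortcut icon" and "ICON" all qualify, while "apple-touch-icon" does not.
const isIconRel = function(rel) {
    return String(rel || "").toLowerCase().split(/\s+/).includes("icon");
};

// Hypothetical helper: largest edge length declared in a sizes attribute,
// e.g. "16x16 32x32" gives 32. "any" (scalable icons) maps to Infinity and a
// missing or unparsable value maps to 0.
const largestDeclaredSize = function(sizes) {
    let largest = 0;
    for (const token of String(sizes || "").toLowerCase().split(/\s+/)) {
        if (token === "any") {
            return Infinity;
        }
        const match = /^(\d+)x(\d+)$/.exec(token);
        if (match) {
            largest = Math.max(largest, Number(match[1]), Number(match[2]));
        }
    }
    return largest;
};

// Pick the icon link with the largest declared size; ties keep document order.
// links is assumed to be an array of { rel, href, sizes } objects.
const pickLargestIcon = function(links) {
    let best = null;
    for (const link of links.filter((l) => isIconRel(l.rel))) {
        if (best === null || largestDeclaredSize(link.sizes) > largestDeclaredSize(best.sizes)) {
            best = link;
        }
    }
    return best;
};
```

Splitting rel into tokens, rather than comparing the whole string, is what lets "shortcut icon" and "icon" be handled by the same check.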

The goal was to generate simple site previews for ScratchGraph, like this:

ScratchGraph Site Preview

Finding the favicon URL was one concern. My other concern was rendering the icon to a common format. While this isn’t technically necessary, it does lower the complexity if I want to do something with the icon other than just rendering it within the browser.

Finding the favicon URL

I wrote the following code to try and find the URL of the “best” favicon using Puppeteer (Page is the puppeteer Page class):

/**
 *
 * @param {Page} page
 * @param {String} pageUrl
 * @returns {Promise<String>}
 */
const findBestFaviconURL = async function(page, pageUrl) {
    const parsedPageUrl = new URL(pageUrl);
    const rootUrl = parsedPageUrl.protocol + "//" + parsedPageUrl.host;
    const selectorsToTry = [
        `link[rel="icon"]`,
        `link[rel="shortcut icon"]`
    ];

    let faviconUrlFromDocument = null;
    for(let i=0; i<selectorsToTry.length; i++) {
        const href = await getDOMElementHRef(page, selectorsToTry[i]);
        if(typeof href === 'undefined' || href === null || href.length === 0) {
            continue;
        }

        faviconUrlFromDocument = href;
        break;
    }

    if(faviconUrlFromDocument === null) {
        // No favicon link found in document, best URL is likely favicon.ico at root
        return rootUrl + "/favicon.ico";
    }

    if(faviconUrlFromDocument.startsWith("http")) {
        // absolute url
        return faviconUrlFromDocument;
    } else if(faviconUrlFromDocument.startsWith("//")) {
        // protocol-relative url: reuse the page's protocol so the result is navigable
        return parsedPageUrl.protocol + faviconUrlFromDocument;
    } else if(faviconUrlFromDocument.startsWith("/")) {
        // favicon relative to root
        return (rootUrl + faviconUrlFromDocument);
    } else {
        // favicon relative to current (pageUrl) URL; the URL constructor resolves
        // against the page's directory rather than appending to the full path
        return (new URL(faviconUrlFromDocument, pageUrl)).href;
    }
};

This will try to get a favicon URL via:

  • Try to get the icon URL referenced in the first link[rel="icon"] tag
  • Try to get the icon URL referenced in the first link[rel="shortcut icon"] tag
  • Assume that if we don’t find an icon URL in the document, there’s a favicon.ico relative to the site’s root URL

Getting different sizes of the icon or trying to get a specific size is not supported. Also, for URLs pulled from the document via link[rel=… tags, there’s some additional code to see if the URL is absolute, relative to the site/document root, or relative to the current URL and, if necessary, construct and return an absolute URL.
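As an alternative to checking each case by hand, the standard URL constructor accepts a base URL and performs this resolution itself. The sketch below assumes a hypothetical helper named resolveFaviconUrl, which is not part of the code above; it is one possible way to cover the absolute, protocol-relative, root-relative, and document-relative cases in a single call.

```javascript
// Sketch: resolve a favicon href against the page URL with the WHATWG URL
// constructor. resolveFaviconUrl is a hypothetical helper, not from the post.
const resolveFaviconUrl = function(href, pageUrl) {
    if (typeof href !== "string" || href.trim().length === 0) {
        // nothing usable in the document: fall back to /favicon.ico at the root
        return new URL("/favicon.ico", pageUrl).href;
    }
    return new URL(href.trim(), pageUrl).href;
};

const examplePage = "https://example.com/blog/post.html";
console.log(resolveFaviconUrl("icons/fav.png", examplePage));           // https://example.com/blog/icons/fav.png
console.log(resolveFaviconUrl("/favicon.png", examplePage));            // https://example.com/favicon.png
console.log(resolveFaviconUrl("//cdn.example.com/f.ico", examplePage)); // https://cdn.example.com/f.ico
console.log(resolveFaviconUrl("", examplePage));                        // https://example.com/favicon.ico
```

Note that a document-relative path resolves against the page's directory (/blog/ here), not against the full page path.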

The getDOMElementHRef function to query the href attribute is as follows:

/**
 *
 * @param {Page} page
 * @param {String} query
 * @returns {Promise<String>}
 */
const getDOMElementHRef = async function(page, query) {
    return await page.evaluate((q) => {
        const elem = document.querySelector(q);
        if(elem) {
            return (elem.getAttribute('href') || '');
        } else {
            return "";
        }
    }, query);
};
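If more than the first href is needed, a single query can return every icon link along with its attributes. This is only a sketch: getIconLinks is a hypothetical helper, and it relies on Puppeteer's page.$$eval plus the CSS ~= attribute selector (whole-token match) with the case-insensitive i flag, so one selector covers both rel="icon" and rel="shortcut icon".

```javascript
// Hypothetical helper: list every <link> whose rel token list contains "icon".
// page is expected to be a Puppeteer Page (or anything exposing $$eval).
const getIconLinks = async function(page) {
    return await page.$$eval('link[rel~="icon" i]', (elems) => elems.map((elem) => ({
        "rel": elem.getAttribute("rel") || "",
        "href": elem.getAttribute("href") || "",
        "sizes": elem.getAttribute("sizes") || "",
        "type": elem.getAttribute("type") || ""
    })));
};
```

The mapping callback runs inside the page, so it returns plain objects that can be serialized back to Node.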

Fetching & rendering to PNG

Puppeteer really shines at being able to load and render the favicon, and providing the mechanisms to save it out as a screenshot. You could attempt to read the favicon image data directly, but there is significant complexity here given the number of different image formats you may encounter.
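To give a sense of what reading the image data directly involves, here is a rough sketch of just the first step, identifying the format from the leading bytes. sniffImageFormat is a hypothetical helper and bytes is assumed to be a Uint8Array holding the downloaded file; each format would still need its own decoder afterwards, which is the work the browser does for free.

```javascript
// Hypothetical helper: guess a favicon's format from its first bytes.
const sniffImageFormat = function(bytes) {
    const startsWith = (signature) => signature.every((b, i) => bytes[i] === b);

    // APNG shares the PNG signature, so it is reported as "png" here
    if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) { return "png"; }
    if (startsWith([0x47, 0x49, 0x46, 0x38])) { return "gif"; }   // "GIF8"
    if (startsWith([0xff, 0xd8, 0xff])) { return "jpeg"; }
    if (startsWith([0x00, 0x00, 0x01, 0x00])) { return "ico"; }

    // SVG is text, so look for an <svg tag near the start instead
    const head = new TextDecoder().decode(bytes.slice(0, 512));
    if (/<svg[\s>]/i.test(head)) { return "svg"; }

    return "unknown";
};
```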

Rendering the favicon is relatively straightforward:

  • Render the favicon onto the page by having the Page goto the favicon URL
  • Query the img element on the page
  • Make the Page’s document.body background transparent (to capture any transparency in the icon when we take the screenshot)
  • Take a screenshot of that img element, such that a binary PNG is rendered

Here is the code to render the favicon onto the page:

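/**
 * Note: src (the URL of the page being previewed) and reqId (an id used in the
 * log messages) are not parameters; they are assumed to come from the enclosing scope.
 *
 * @param {Page} page
 * @returns {Promise<ElementHandle|null>}
 */
const renderFavicon = async function(page) {
    let faviconUrl = await findBestFaviconURL(page, src);

    try {
        console.info(`R${reqId}: Loading favicon from ${faviconUrl}`);
        await page.goto(faviconUrl, {"waitUntil" : "networkidle0"});
    } catch(err) {
        console.error(`R${reqId}: failed to get favicon`);
        // the favicon was not loaded, so don't query whatever the page currently shows for an img
        return null;
    }

    const renderedFaviconElement = await page.$('img') || await page.$('svg');
    return renderedFaviconElement;
};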

Finally, here’s the snippet to render the favicon to a PNG:

if(renderedFaviconElement) {
    const renderedFaviconElementTagName = await (await renderedFaviconElement.getProperty('tagName')).jsonValue();

    if(renderedFaviconElementTagName === 'IMG') {
        await page.evaluate(() => document.body.style.background = 'transparent');
    }

    const faviconPngBinary = await renderedFaviconElement.screenshot(
        {
            "type":"png",
            "encoding": "binary",
            "omitBackground": true
        }
    );
}

EDIT 4/7/2020: Updated code snippets to correctly handle SVG favicons. With SVGs, an <svg> element will be rendered on the page (instead of an <img> element). Also, there is no <body> element, as the SVG is rendered directly and not embedded within an HTML document, and hence no need to set the document’s body background to transparent.

Rendering HTML to images with SVG foreignObject

Motivation

For applications that allow users to create visual content, being able to generate images of their work can be important in a number of scenarios: preview/opengraph images, allowing users to display content elsewhere, etc. This popped up as a need for ScratchGraph and led me to research a few possible solutions. Using the SVG <foreignObject> element was one of the more interesting solutions I came across, as all rendering and image creation is done client-side.

<foreignObject> to Image

<foreignObject> is a somewhat strange element. Essentially, it allows you to load and render arbitrary HTML content within SVG. This in and of itself isn’t helpful for generating an image, but we can take advantage of two other aspects of modern browsers to make this a reality:

  • SVG markup can be dynamically loaded into an Image by transforming the markup into a data URL
  • Data URL length limits are no longer a concern. We no longer have the kilobyte-scale limits we were dealing with a few years ago

Sketching it out, the process looks something like this (contentHtml is a string with the HTML content we want to render):

The code for this is pretty straightforward:

// build SVG string
const svg = `
<svg xmlns='http://www.w3.org/2000/svg' width='${width}' height='${height}'>
<foreignObject x='0' y='0' width='${width}' height='${height}'>
${contentHtml}
</foreignObject>
</svg>`;

// convert SVG to data-uri
const dataUri = `data:image/svg+xml;base64,${window.btoa(svg)}`;

Here I’m assuming contentHtml is valid and can be trusted. If that’s not the case, you’ll likely need some pre-processing steps before sticking it into a string like this.
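The step from data URL to Image can be sketched as follows. The helper names (buildForeignObjectSvg, svgToDataUri, loadImage) are hypothetical, and two choices here are assumptions rather than what the post's code does: the content is wrapped in a div carrying the XHTML namespace, since an SVG loaded as an image is parsed as XML, and the data URL is percent-encoded instead of base64-encoded, because btoa() throws on characters outside Latin-1.

```javascript
// Hypothetical helper: wrap HTML in an SVG <foreignObject>. contentHtml is
// assumed to be trusted, well-formed XHTML (e.g. <br/> rather than <br>).
const buildForeignObjectSvg = function(contentHtml, width, height) {
    return `<svg xmlns='http://www.w3.org/2000/svg' width='${width}' height='${height}'>` +
        `<foreignObject x='0' y='0' width='${width}' height='${height}'>` +
        `<div xmlns='http://www.w3.org/1999/xhtml'>${contentHtml}</div>` +
        `</foreignObject></svg>`;
};

// Percent-encode the markup so any Unicode content survives the conversion
const svgToDataUri = function(svg) {
    return `data:image/svg+xml;charset=utf-8,${encodeURIComponent(svg)}`;
};

// Browser-only: resolves with an HTMLImageElement once the SVG has rendered
const loadImage = function(dataUri) {
    return new Promise((resolve, reject) => {
        const img = new Image();
        img.onload = () => resolve(img);
        img.onerror = () => reject(new Error("Failed to load SVG data URI"));
        img.src = dataUri;
    });
};
```

In a browser, the three would be chained as loadImage(svgToDataUri(buildForeignObjectSvg(contentHtml, width, height))); an error from loadImage usually means the markup was not well-formed XML.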

The code above works, to a degree; there’s a few key limitations to be aware of:

  • Cross-origin images served without CORS headers won’t load within <foreignObject>
  • Styles declared via stylesheets do not pass through to the contents of <foreignObject>
  • External resources (images, fonts, etc.) won’t be in the generated Image, as the browser doesn’t wait for these resources to be loaded before rendering out the image

The cross-origin issue may be annoying and unexpected (as the browser does load these images), but it’s a valid security measure and CORS provides the mechanism around it.

Handling stylesheets and external resources are more important concerns, and addressing them allows for a much more robust process.

Handling stylesheets

This isn’t anything too fancy. Here are the steps involved:

  • Copy all the style rules, from all the stylesheets, in the parent document
  • Wrap all those rules in a <style> tag
  • Prepend that string to the contentHtml string

The code for this precursor step looks something like this:

const styleSheets = document.styleSheets;
let cssStyles = "";
let urlsFoundInCss = [];

for (let i=0; i<styleSheets.length; i++) {
    for(let j=0; j<styleSheets[i].cssRules.length; j++) {
        const cssRuleStr = styleSheets[i].cssRules[j].cssText;
        cssStyles += cssRuleStr;
    }
}

const styleElem = document.createElement("style");
styleElem.innerHTML = cssStyles;
const styleElemString = new XMLSerializer().serializeToString(styleElem);

...

contentHtml = styleElemString + contentHtml;

...
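One thing to watch for when copying rules: reading cssRules on a cross-origin stylesheet served without CORS headers throws a SecurityError, which would abort the loop. A minimal sketch of a more defensive version is below; collectCssText is a hypothetical helper, and it takes the stylesheet list as a parameter (normally document.styleSheets) so it can be exercised outside a browser.

```javascript
// Hypothetical helper: concatenate the cssText of every readable rule.
// styleSheets is anything shaped like document.styleSheets.
const collectCssText = function(styleSheets) {
    let cssText = "";
    for (const sheet of Array.from(styleSheets)) {
        let rules;
        try {
            // throws a SecurityError for cross-origin sheets served without CORS
            rules = sheet.cssRules;
        } catch (err) {
            continue; // skip sheets we are not allowed to read
        }
        for (const rule of Array.from(rules || [])) {
            cssText += rule.cssText + "\n";
        }
    }
    return cssText;
};
```

Styles from skipped sheets will be missing from the rendered image, so this trades completeness for not failing outright.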

Handling external resources

My solution here is somewhat crude, but it’s functional.

  • Find url values in the CSS code or src attribute values in the HTML code
  • Make XHR requests to get these resources
  • Encode the resources as Base64 and construct data URLs
  • Replace the original URLs (in the CSS url or HTML src) with the new base64 data URLs

The following shows how this is done for the HTML markup (the process is only slightly different for CSS).

const escapeRegExp = function(string) {
    return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
};

let urlsFoundInHtml = getImageUrlsFromFromHtml(contentHtml);
const fetchedResources = await getMultipleResourcesAsBase64(urlsFoundInHtml);
for(let i=0; i<fetchedResources.length; i++) {
    const r = fetchedResources[i];
    contentHtml = contentHtml.replace(
        new RegExp(escapeRegExp(r.resourceUrl),"g"), r.resourceBase64);
}
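The CSS side is not shown in this post. As a rough sketch of what finding the url(...) values could look like, the hypothetical helper below (getUrlsFromCss is an assumed name) handles double-quoted, single-quoted, and unquoted values and skips anything that is already a data URL. The URLs it returns could then go through the same fetch-and-replace loop as the HTML.

```javascript
// Hypothetical helper: collect the distinct URLs referenced via url(...) in CSS,
// skipping values that are already data URLs.
const getUrlsFromCss = function(cssText) {
    const urlsFound = new Set();
    const urlPattern = /url\(\s*(?:"([^"]*)"|'([^']*)'|([^)\s]*))\s*\)/gi;
    for (const match of cssText.matchAll(urlPattern)) {
        // exactly one of the three capture groups is set per match
        const url = match[1] ?? match[2] ?? match[3];
        if (url && !url.startsWith("data:")) {
            urlsFound.add(url);
        }
    }
    return Array.from(urlsFound);
};
```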

The getImageUrlsFromFromHtml() and parseValue() functions extract the value of src attributes from elements:

/**
 *
 * @param {String} str
 * @param {Number} startIndex
 * @param {String} prefixToken
 * @param {String[]} suffixTokens
 *
 * @returns {{foundAtIndex: Number, value: String}|null}
 */
const parseValue = function(str, startIndex, prefixToken, suffixTokens) {
    const idx = str.indexOf(prefixToken, startIndex);
    if(idx === -1) {
        return null;
    }

    let val = '';
    for(let i=idx+prefixToken.length; i<str.length; i++) {
        if(suffixTokens.indexOf(str[i]) !== -1) {
            break;
        }

        val += str[i];
    }

    return {
        "foundAtIndex": idx,
        "value": val
    }
};

/**
 *
 * @param {String} str
 *
 * @returns {String}
 */
const removeQuotes = function(str) {
    return str.replace(/["']/g, "");
};

/**
 *
 * @param {String} html
 *
 * @returns {String[]}
 */
const getImageUrlsFromFromHtml = function(html) {
    const prefixToken = 'src=';
    const urlsFound = [];
    let searchStartIndex = 0;

    while(true) {
        const url = parseValue(html, searchStartIndex, prefixToken, [' ', '>', '\t']);
        if(url === null) {
            break;
        }

        // move past the prefix as well, otherwise an empty value would be found again forever
        searchStartIndex = url.foundAtIndex + prefixToken.length + url.value.length;
        urlsFound.push(removeQuotes(url.value));
    }

    return urlsFound;
};

The getMultipleResourcesAsBase64() and getResourceAsBase64() functions are responsible for fetching resources:

/**
 *
 * @param {String} url
 *
 * @returns {Promise}
 */
const getResourceAsBase64 = function(url) {
    return new Promise(function(resolve, reject) {
        const xhr = new XMLHttpRequest();
        xhr.open("GET", url);
        xhr.responseType = 'blob';

        xhr.onreadystatechange = async function() {
            if(xhr.readyState !== 4) {
                return;
            }

            if(xhr.status !== 200) {
                // settle the promise on failure so Promise.all() doesn't wait forever
                reject(new Error(`Failed to fetch ${url} (status ${xhr.status})`));
                return;
            }

            const resBase64 = await binaryStringToBase64(xhr.response);
            resolve(
                {
                    "resourceUrl": url,
                    "resourceBase64": resBase64
                }
            );
        };

        xhr.send(null);
    });
};

/**
 *
 * @param {String[]} urls
 *
 * @returns {Promise}
 */
const getMultipleResourcesAsBase64 = function(urls) {
    const promises = [];
    for(let i=0; i<urls.length; i++) {
        promises.push( getResourceAsBase64(urls[i]) );
    }
    return Promise.all(promises);
};
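The binaryStringToBase64() helper is not shown in this post. Since its result replaces the original URL directly, it presumably returns a complete data URL. The sketch below is one possible stand-in, not the post's implementation: blobToDataUrl, fetchResourceAsDataUrl, and fetchResourcesAsDataUrls are hypothetical names, fetch() is used in place of XHR, and Promise.allSettled() is used so that one failed resource doesn't discard the rest.

```javascript
// Hypothetical stand-in for binaryStringToBase64(): turn a Blob into a base64
// data URL that can replace the original resource URL.
const blobToDataUrl = async function(blob) {
    const bytes = new Uint8Array(await blob.arrayBuffer());
    let binary = "";
    for (let i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    const mimeType = blob.type || "application/octet-stream";
    return `data:${mimeType};base64,${btoa(binary)}`;
};

// fetch()-based variant of getResourceAsBase64(); fetchImpl is injectable for testing
const fetchResourceAsDataUrl = async function(url, fetchImpl = fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        throw new Error(`Failed to fetch ${url} (status ${response.status})`);
    }
    return {
        "resourceUrl": url,
        "resourceBase64": await blobToDataUrl(await response.blob())
    };
};

// Fetch everything, keeping only the resources that loaded successfully
const fetchResourcesAsDataUrls = async function(urls, fetchImpl = fetch) {
    const results = await Promise.allSettled(urls.map((u) => fetchResourceAsDataUrl(u, fetchImpl)));
    return results.filter((r) => r.status === "fulfilled").map((r) => r.value);
};
```

The returned objects have the same resourceUrl and resourceBase64 fields as above, so they would slot into the same replacement loop; resources that failed to load simply keep their original URLs.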

More code

The code for this experiment is up on GitHub. Most functionality is encapsulated within the ForeignHtmlRenderer method, which contains the code shown in this post.

Other Approaches

  • Similar (same?) approach with dom-to-image
    This library also uses the <foreignObject> element and an approach similar to what I described in this post. I played around with it briefly and remember running into a few issues, but I didn’t keep the test code around and don’t remember what the errors were.
  • Server-side/headless rendering with puppeteer
    This seems to be the de facto solution and, honestly, it’s a pretty good solution. It’s not too difficult to get it up and running as a service, though there will be an infrastructure cost. Also, I’d be willing to bet this is what services like URL2PNG use on their backend.
  • Client-side rendering with html2canvas
    This is a really cool project that will actually parse the DOM tree + CSS and render the page (it’s a rendering engine written in client-side JavaScript). Unfortunately, only a subset of CSS is supported and SVG is not supported.