These days I am migrating one of my sites to a new platform, and I needed to download some data from a MySQL database. My host provider gave me cPanel access to the server, so I could go there, open the phpMyAdmin page, and manually pick the tables and data I want to download, but, …
I want to automate this task so that I can perform it again later (in case more data is collected on my old server).
Installing necessary libraries
To start, I needed to install puppeteer and coffeescript. Installing puppeteer involves downloading the Chromium browser, so I prefer to install it globally once rather than reinstalling it every time I need to automate some browser-based task.
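Assuming npm is available, a single command installs both globally:

npm install -g puppeteer coffeescript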
Now, to import the globally installed puppeteer, we need to find the path to its location. This can be achieved with the following snippet:
# globally installed modules live next to the node executable,
# e.g. /usr/local/bin/node -> /usr/local/lib/node_modules (on Unix-like systems)
path = require('path')
puppeteer_path = path.normalize(path.join process.execPath,
  '../../lib/node_modules/puppeteer/')
puppeteer = require(puppeteer_path)
Accessing cPanel
Using puppeteer, we first instantiate a browser and then log in to cPanel using the following function:
log_to_cpanel = (cpanel_url, cpanel_user, cpanel_pass) ->
  browser = await puppeteer.launch headless: false
  page = await browser.newPage()
  await page.goto(cpanel_url)
  # this function runs inside the page and fills the login form
  loginfunc = (user, pwd) ->
    document.getElementById('user').value = user
    document.getElementById('pass').value = pwd
    document.getElementById('login_submit').click()
  await page.evaluate(loginfunc, cpanel_user, cpanel_pass)
  await page.waitForNavigation()
  {browser, page}
It takes three string arguments:
- the cPanel URL,
- the user name,
- the password.
It opens the URL, waits until the page is loaded, fills the input fields with the given user name and password, and finally clicks the login button. These actions are performed by the loginfunc function, evaluated inside the web page context using the page.evaluate method. Finally, we wait for the page to finish loading.
At the end, page contains the main cPanel page, with the given user logged in.
Getting phpMyAdmin url and token
On the main cPanel page, we can open the phpMyAdmin page using the relative URL sql/PhpMyAdmin.html. But after this page loads, the browser is redirected to a dynamically generated URL which depends on the current session and a randomly generated token. Here is the function to extract those two values:
pma_token = (page) ->
  u = partition(page.url(), '?')[0]
  u = rpartition(u, '/')[0] + '/sql/PhpMyAdmin.html'
  await page.goto u
  await page.waitForNavigation()
  # the phpMyAdmin url contains session hashes and a token in its query;
  # to find its exact value we use the href attribute of the link
  # for accessing the list of available databases
  u = await page.evaluate ->
    document.querySelector('a[href*="server_databases.php"]').href
  # all pma urls have '/phpMyAdmin/' in them,
  # so we extract everything up to the end of '/phpMyAdmin/'
  pmaurl = partition(u, '/phpMyAdmin/').slice(0, 2).join('')
  # we also need to extract the token value
  token = /\btoken=([0-9a-f]{32})/.exec(u)[1]
  {pmaurl, token}
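After calling this function, the extracted values look something like this (hypothetical session values):

{pmaurl, token} = await pma_token(page)
# pmaurl: 'https://example.com:2083/cpsess0123456789/3rdparty/phpMyAdmin/'
# token:  '0123456789abcdef0123456789abcdef'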
This function uses the helper functions partition and rpartition, written following their Python counterparts. Both take two strings as input: a string to be partitioned and a separator. They return exactly three strings which, when concatenated, reproduce the original input: the part of the input before the first occurrence of the separator, the separator itself (empty if not found), and the part after it. rpartition works the same way, except that it splits around the last occurrence of the separator.
partition = (s, sep) ->
  i = s.indexOf(sep)
  return [s, '', ''] if i == -1
  f = s.slice(0, i)
  r = s.slice(i + sep.length)
  [f, sep, r]

rpartition = (s, sep) ->
  i = s.lastIndexOf(sep)
  # like Python's rpartition, the whole string goes last when sep is missing
  return ['', '', s] if i == -1
  f = s.slice(0, i)
  r = s.slice(i + sep.length)
  [f, sep, r]
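For example, here is how the helpers split strings (results shown as comments):

partition('a=b=c', '=')     # ['a', '=', 'b=c']
rpartition('a=b=c', '=')    # ['a=b', '=', 'c']
partition('abc', '-')       # ['abc', '', '']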
Downloading table rows
Using the found phpMyAdmin URL and token values, we can now open the export page. The relative URL is tbl_export.php and it needs some query parameters:
- db - the database name
- table - the table name
- token - the security token
Using all these values, we construct the URL, direct our page there, and finally wait for the page to load.
export_table = (page, dbname, tablename, pmaurl, token) ->
  u = new URL(pmaurl + 'tbl_export.php')
  u.searchParams.set('token', token)
  u.searchParams.set('db', dbname)
  u.searchParams.set('table', tablename)
  # page.goto expects a string, so pass the serialized url
  await page.goto u.href
  # fill the export form and submit it
  await page.evaluate ->
    el = (i) -> document.getElementById(i)
    el('radio_custom_export').click()
    el('plugins').value = 'json'
    el('radio_allrows_1').click()
    el('radio_dump_asfile').click()
    el('compression').value = 'gzip'
    el('buttonGo').click()
After the page has loaded, we fill some input fields to select the export options (gzip compression, all rows from the table, and JSON format). If you need other options, change this function according to your needs.
At the end, the Go button is clicked to start the download.
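For instance, to get an uncompressed SQL dump instead of gzipped JSON, the evaluated function could be tweaked like this (the 'sql' and 'none' option values are assumptions based on typical phpMyAdmin versions; verify them against the actual select elements on your install):

await page.evaluate ->
  el = (i) -> document.getElementById(i)
  el('radio_custom_export').click()
  el('plugins').value = 'sql'       # assumed option value
  el('radio_allrows_1').click()
  el('radio_dump_asfile').click()
  el('compression').value = 'none'  # assumed option value
  el('buttonGo').click()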
Tables one by one
If we have a list of tables we want to download, we have to process them one by one. If we just looped over all table names and called our export_table function each time, we would change the table name parameter before the previous download had started, so we would end up with only the last table downloaded. We must make sure not to touch the page object again before a download finishes. We need to process the table names one by one, waiting for each download to finish before processing the next one. One way to do this is to take a single table name, remove it from the list, and once the download finishes start a new iteration, repeating until the list of table names is empty.
In order to use the await keyword, this code needs to be inside a function, so we create one and immediately execute it using CoffeeScript's do ->.
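The driver below assumes the connection details and the list of tables are already defined, for example (placeholder values, substitute your own):

cpanel_url  = 'https://example.com:2083'
cpanel_user = 'myuser'
cpanel_pass = 'mypassword'
database    = 'mydb'
table_names = ['users', 'posts', 'comments']

With those in place, here is the code to achieve this: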
do ->
  {browser, page} = await log_to_cpanel(cpanel_url, cpanel_user, cpanel_pass)
  {pmaurl, token} = await pma_token(page)
  while table_names.length
    tname = table_names.shift()
    # start the export, then let the page settle before the next table
    export_table page, database, tname, pmaurl, token
    await page.waitForNavigation(waitUntil: 'networkidle0')
    console.log tname, 'ok'
  # give the last download some extra time to finish
  await new Promise (f) ->
    setTimeout f, 1000
  browser.close()
It is also necessary to wait a little longer at the end for the final download to complete. Here we wait one second (1000 ms), but you can adjust this waiting time to your needs.
I hope this can help someone looking for a way to automate a sequence of actions in a browser. For other tasks that don't involve downloads, the browser can be launched in headless: true mode, but for some reason downloading files doesn't work properly in headless mode.
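If headless downloads are a must, one possible workaround (a sketch, not something this script relies on) is to explicitly allow downloads through the DevTools protocol; Page.setDownloadBehavior is an experimental CDP command, so it may or may not work with your Chromium build:

client = await page.target().createCDPSession()
await client.send 'Page.setDownloadBehavior',
  behavior: 'allow'
  downloadPath: './downloads'   # assumed target directory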