Surviving the Apocalypse with an Offline Wikipedia Server

For years, we’ve taken for granted the ability to settle any trivial dispute with a quick Wikipedia search.

That could change. I’m not really trying to fear-monger, but it’s always possible that the internet might go out and stay out. And like hell I’m going to sit in quarantine with my partner and not be able to settle up with Wikipedia!

So, I have started this project to replicate Wikipedia on a server on my network to protect humankind’s knowledge long after the end of modern civilization (mostly joking).

Wiki Backups

Once or twice a month, Wikipedia creates a full database backup, which is distributed through their network of mirrors as well as archive.org.

These files are distributed through https://dumps.wikimedia.org/backup-index.html in a variety of formats, of which the XML multistream dump is the recommended one. They can total nearly 70 gigabytes compressed, and several terabytes uncompressed.

Of course, there are ways to reduce the size of these backups. For example, excluding the page history greatly reduces the storage required.

There are also ‘simple’ versions with limits on page size and the number of pages. I will be using these in the demonstration to save storage space.
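
As a rough sketch of where these files live, the helper below builds the URL of the latest multistream dump for a given wiki and asks the server for its size without downloading anything. The path layout under dumps.wikimedia.org and both function names (dump_url, dump_size_bytes) are my own assumptions for illustration, so check them against the backup index before relying on them.

```shell
# Build the URL of the latest multistream dump for one wiki.
# ASSUMPTION: dumps follow the layout <wiki>/latest/<wiki>-latest-pages-articles-multistream.xml.bz2
dump_url() {
    local wiki="$1"   # database name, e.g. simplewiki or enwiki
    printf 'https://dumps.wikimedia.org/%s/latest/%s-latest-pages-articles-multistream.xml.bz2\n' "$wiki" "$wiki"
}

# Print the compressed size in bytes as reported by the server's
# Content-Length header (HEAD request only, nothing is downloaded).
dump_size_bytes() {
    curl -sIL "$(dump_url "$1")" \
        | tr -d '\r' \
        | awk 'tolower($1) == "content-length:" { size = $2 } END { print size }'
}

# Hypothetical usage (needs network access and curl):
#   dump_size_bytes simplewiki
```

Checking the size first is a cheap way to find out whether a dump will fit on the disk before committing to the download.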

XOWA

XOWA is an open source application for managing Wikipedia database exports. It is fully cross-platform, runs on practically any computer, and can automatically manage Wikipedia versions and languages.

Typically it is run for a single user, but it can also function as a web server. I will be securing the app service and adding a cache to speed it up.

Setting it Up

For a proof of concept system, I will be running a simple Debian virtual machine on KVM. This config will work fine with Simple Wikipedia with images, but will absolutely not be able to handle a full wikipedia.org database.

Install Debian

Operating System: Debian 10.3 (Buster)
Platform: Libvirt + Qemu + KVM
CPU: 2 vCPUs
Memory: 2 GiB
Storage: 20 GiB

Install Prereqs

XOWA is built on Java, so we will install the current OpenJDK runtime environment.

sudo apt install openjdk-11-jre openjdk-11-jre-headless

Setup Service Accounts

As a security best practice, the service will run under an unprivileged user.

sudo addgroup --system wiki
sudo adduser --system \
	--disabled-login \
	--ingroup wiki \
	--no-create-home wiki

Create Install Directory

The application will be installed to a directory in /opt/

sudo mkdir -p /opt/wiki
sudo chown -R wiki:wiki /opt/wiki

Install XOWA

Next, the application binary is downloaded to the install directory. The latest version can be found at the official GitHub releases page:

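cd /opt/wiki
sudo wget https://github.com/gnosygnu/xowa/releases/download/v4.6.5.1911/xowa_app_linux_64_v4.6.5.1911.zip

sudo unzip xowa_app_linux_64_v4.6.5.1911.zip
# wget and unzip ran as root, so hand the extracted files back to the service account
sudo chown -R wiki:wiki /opt/wiki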
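
Because the service runs as the unprivileged wiki account, it may be worth confirming that nothing in the install directory is still owned by root after extracting the archive. The helper below is a minimal sketch of that check. The function name not_owned_by is hypothetical, and I am assuming the service needs to own its whole directory tree.

```shell
# Print every file under directory $1 that is NOT owned by user $2
# (a user name or a numeric UID). No output means the whole tree
# belongs to that user.
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Hypothetical usage on the server (path and account from this post):
#   not_owned_by /opt/wiki wiki
```

If the command prints any paths, running the chown from the previous step again should fix them.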

More information about installing can be found here:

Set up Service

Next, we create a systemd unit file to control the wiki service. Since the application is distributed as a .jar executable, we can just point the JRE there. Note that the service runs as the wiki service account.

/etc/systemd/system/wiki.service

[Unit]
Description=Offline Wikipedia

[Service]
ExecStart=/usr/bin/java -jar /opt/wiki/xowa_linux_64.jar --app_mode http_server
WorkingDirectory=/opt/wiki
User=wiki
Type=simple
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now wiki.service

The service should now be running, and listening on tcp/8080 for HTTP requests.
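
Java applications can take a few seconds to start answering, so a small polling loop is a handy way to confirm this. The sketch below retries an HTTP request until it gets any response. The function name wait_for_http is hypothetical, and it assumes curl is installed, which a minimal Debian install may not have.

```shell
# Poll URL $1 once per second until it answers, at most $2 times (default 10).
# Returns 0 as soon as any HTTP response arrives, 1 if every attempt fails.
wait_for_http() {
    local url="$1"
    local tries="${2:-10}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -s -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Hypothetical usage on the server (port taken from this post):
#   wait_for_http http://localhost:8080/ 30 && echo "XOWA is up"
```

If the loop times out, sudo journalctl -u wiki should show why the service failed to start.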

Nginx Reverse Proxy

For performance and security, I will set up Nginx as a caching reverse proxy in front of the XOWA service.

First, set up the cache directory.

sudo mkdir -p /var/www/cache

Then, a tmpfs device is added to /etc/fstab:

tmpfs   /var/www/cache  tmpfs   rw,nodev,nosuid,noexec,size=100M   0  0

Mount the tmpfs device.

sudo mount -a
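
A typo in /etc/fstab can leave the cache directory on the regular disk without any obvious error, so a quick check helps here. This is a minimal sketch using findmnt from util-linux. The function name is_tmpfs_mount is hypothetical.

```shell
# Succeed only if path $1 is itself a mount point of type tmpfs.
is_tmpfs_mount() {
    [ "$(findmnt -n -o FSTYPE --mountpoint "$1" 2>/dev/null)" = "tmpfs" ]
}

# Hypothetical usage on the server (path from this post):
#   is_tmpfs_mount /var/www/cache && echo "cache is in RAM"
```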

Next, nginx is installed on the server.

sudo apt install nginx nginx-common ssl-cert

I configure Nginx as a reverse proxy using a cache directory backed by tmpfs. This should improve performance considerably for XOWA, since it has to look inside gzipped files to retrieve static assets. The proxy also exposes HTTPS on the standard TCP port (443).

proxy_cache_path  /var/www/cache levels=1:2 keys_zone=static-cache:8m max_size=99m inactive=600m;
proxy_temp_path /var/www/cache/tmp; 

server {
    listen 80;
    listen [::]:80;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    location / {
        proxy_http_version 1.1;
        proxy_pass http://localhost:8080;
        proxy_cache static-cache;
        proxy_cache_valid  200 302  300s;
        proxy_cache_valid  404      60s;
    }

    ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
}
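
One rough way to see whether the cache is doing anything is to time the same request twice: the first has to go through to XOWA, while the second should be answered from the tmpfs cache. The sketch below assumes curl is installed and that the configuration above has been loaded. The function name time_request is hypothetical.

```shell
# Print the total time in seconds that curl needed to fetch URL $1.
# -k skips certificate verification, which the self-signed snakeoil
# certificate requires. The exit status is curl's own.
time_request() {
    curl -sk -o /dev/null -w '%{time_total}\n' "$1"
}

# Hypothetical usage (address from this post). Validate and reload the
# configuration first, then compare a cold and a cached request:
#   sudo nginx -t && sudo systemctl reload nginx
#   time_request https://10.99.0.100/
#   time_request https://10.99.0.100/
```

A single pair of timings is only an indication, so it is worth repeating the comparison on a few different pages.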

Set up the server

Once the service is running, open a browser and navigate to the IPv4 address of the server, for example https://10.99.0.100. The main page has a brief introduction to the server software. Next, navigate to the setup section: https://10.99.0.100/home/wiki/Main_Page#Build_a_wiki_while_online

Click the link, “Set up Simple Wikipedia”. This will immediately start downloading about 200 MB of compressed text and set up the Simple Wikipedia application.

After about five minutes, the first wiki is ready to go. Navigate to the main page: https://10.99.0.100/simple.wikipedia.org/ and check out some of the articles.

Next, images can be downloaded by going to the download center, https://10.99.0.100/home/wiki/Special:XowaDownloadCentral, and selecting Simple English Wikipedia - Images. This will download about 2 GiB of compressed image files.

The Full Wikipedia

Now, obviously, if we want the entirety of humankind’s knowledge to survive, we will need to have a full archive. For this to work, we’ll need to make some changes to the server setup.

  1. 20 GiB is fine for the OS, but we will need a large storage system to hold the terabytes of database files. This could be a NAS or SAN, or a large local RAID array.
  2. 100 MB will likely not be sufficient for the cache, so a much larger tmpfs will be required.
  3. A very robust internet connection will be needed to download the ~70 GB of backup files.
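
To put the third point in perspective, here is a back-of-the-envelope estimate of the download time. It is only a sketch: the function name download_minutes is hypothetical, the 100 Mbit/s link speed is an example figure, and it assumes the link sustains that speed for the whole transfer, which real connections rarely do.

```shell
# Estimate download time in whole minutes (integer arithmetic, rounded down).
#   $1: size in gigabytes (decimal, so 1 GB = 8000 megabits)
#   $2: sustained link speed in megabits per second
download_minutes() {
    echo $(( $1 * 8 * 1000 / $2 / 60 ))
}

download_minutes 70 100   # prints 93, about an hour and a half
```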

Naturally, this will present a significant engineering challenge. Sounds like fun, doesn’t it?
