Surviving the Apocalypse with an Offline Wikipedia Server
For many years we’ve taken for granted the ability to settle any trivial argument with a quick Wikipedia text search.
That could change. I’m not really trying to fear-monger, but it’s always possible that the internet might go out and stay out. And like hell I’m going to sit in quarantine with my partner and not be able to settle up with Wikipedia!
So, I have started this project to replicate Wikipedia on a server on my network to protect humankind’s knowledge long after the end of modern civilization (mostly joking).
Wiki Backups
Once or twice a month, Wikipedia creates a full database backup, which is distributed through their network of mirrors as well as archive.org.
These files are distributed through https://dumps.wikimedia.org/backup-index.html in a variety of formats; the XML multistream backup is the generally recommended one. A full dump can total nearly 70 gigabytes compressed and several terabytes uncompressed.
Of course, there are ways to reduce the size of these backups. For example, excluding the page history can reduce storage very significantly.
There is also the much smaller Simple English Wikipedia, with shorter articles and far fewer pages. I will be using it in this demonstration to save storage space.
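XOWA handles dump downloads itself, but for reference, a dump can also be fetched directly. A minimal sketch, assuming the standard naming convention on dumps.wikimedia.org for the Simple English Wikipedia:
# Latest Simple English Wikipedia articles dump (multistream XML, bzip2-compressed)
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles-multistream.xml.bz2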
XOWA
XOWA is an open source application for managing Wikipedia database exports. It is fully cross-platform, runs on practically any computer, and can automatically manage Wikipedia versions and languages.
Typically it’s run for a single user to access, but it can also function as a web server. I will be securing the app service and adding a cache to speed it up.
Setting it Up
For a proof-of-concept system, I will be running a simple Debian virtual machine on KVM. This configuration will work fine for Simple English Wikipedia with images, but it will absolutely not be able to handle a full wikipedia.org database.
Install Debian
| Operating System | Debian 10.3 (Buster) |
|---|---|
| Platform | Libvirt + Qemu + KVM |
| CPU | 2 vCPUs |
| Memory | 2 GiB |
| Storage | 20 GiB |
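For reference, a libvirt guest matching this spec could be created with virt-install. This is only a sketch; the VM name, network, and installer ISO path are assumptions that will differ on your host:
# Create a 2 vCPU / 2 GiB / 20 GiB Debian 10 guest (ISO path is a placeholder)
virt-install \
  --name wiki \
  --memory 2048 \
  --vcpus 2 \
  --disk size=20 \
  --os-variant debian10 \
  --network network=default \
  --cdrom /var/lib/libvirt/images/debian-10.3.0-amd64-netinst.iso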
Install Prereqs
XOWA is built on Java, so we will install the OpenJDK 11 runtime environment.
sudo apt install openjdk-11-jre openjdk-11-jre-headless
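A quick check that the runtime is installed:
# Should report an OpenJDK 11 runtime
java -version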
Set up a Service Account
As a security best practice, the service will run under an unprivileged user.
sudo addgroup --system wiki
sudo adduser --system \
--disabled-login \
--ingroup wiki \
--no-create-home wiki
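The new account can be verified before moving on:
# Confirm the unprivileged user and group exist
getent passwd wiki
getent group wiki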
Create Install Directory
The application will be installed to a directory under /opt/.
sudo mkdir -p /opt/wiki
sudo chown -R wiki:wiki /opt/wiki
Install XOWA
Next, the application binary is downloaded to the install directory. The latest version can be found at the official GitHub releases page:
cd /opt/wiki
sudo wget https://github.com/gnosygnu/xowa/releases/download/v4.6.5.1911/xowa_app_linux_64_v4.6.5.1911.zip
sudo unzip xowa_app_linux_64_v4.6.5.1911.zip
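Since the archive was downloaded and extracted as root, it is worth re-applying ownership to the service account and confirming that the application jar referenced later by the service unit is in place:
sudo chown -R wiki:wiki /opt/wiki
ls -l /opt/wiki/xowa_linux_64.jar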
More information about installation can be found in the XOWA project documentation.
Set up Service
Next, we create a systemd unit file to control the wiki service. Since the application is distributed as an executable .jar, we can simply point the JRE at it. Note that the service runs as the wiki service account.
/etc/systemd/system/wiki.service
[Unit]
Description=Offline Wikipedia
[Service]
ExecStart=/usr/bin/java -jar /opt/wiki/xowa_linux_64.jar --app_mode http_server
WorkingDirectory=/opt/wiki
User=wiki
Type=simple
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Start the service:
sudo systemctl daemon-reload
sudo systemctl start wiki
The service should now be running and listening on tcp/8080 for HTTP requests.
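To make the service start at boot and to sanity-check the listener, something like the following should work (curl may need to be installed with apt first):
sudo systemctl enable wiki
sudo systemctl status wiki
# Expect an HTTP 200 from the XOWA server's main page
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/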
Nginx Reverse Proxy
For performance and security, I will set up Nginx as a caching reverse proxy in front of the XOWA service.
First, set up the cache directory.
sudo mkdir -p /var/www/cache
Then, a tmpfs device is added to /etc/fstab:
tmpfs /var/www/cache tmpfs rw,nodev,nosuid,noexec,size=100M 0 0
Mount the tmpfs device.
sudo mount -a
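Verify that the tmpfs is mounted before pointing Nginx at it:
# Should show a ~100M tmpfs mounted at /var/www/cache
findmnt /var/www/cache
df -h /var/www/cache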
Next, nginx is installed on the server.
sudo apt install nginx nginx-common ssl-cert
I configure Nginx as a reverse proxy with a cache directory backed by tmpfs. Caching gives a large performance improvement for XOWA, which otherwise has to look inside gzipped files to retrieve static assets. The proxy also exposes HTTPS on the standard TCP port.
# Cache for static assets, backed by the tmpfs mount
proxy_cache_path /var/www/cache levels=1:2 keys_zone=static-cache:8m max_size=99m inactive=600m;
proxy_temp_path /var/www/cache/tmp;

# Redirect all plain-HTTP requests to HTTPS
server {
    listen 80;
    listen [::]:80;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    location / {
        # Proxy to the XOWA HTTP server and cache its responses
        proxy_http_version 1.1;
        proxy_pass http://localhost:8080;
        proxy_cache static-cache;
        proxy_cache_valid 200 302 300s;
        proxy_cache_valid 404 60s;
    }

    # Self-signed "snakeoil" certificate provided by the ssl-cert package
    ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
}
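The configuration can be placed in an Nginx site file (for example, a hypothetical /etc/nginx/conf.d/wiki.conf). After saving it, test the syntax and reload Nginx:
# Validate the configuration, then reload nginx to apply it
sudo nginx -t
sudo systemctl reload nginx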
Set up the Wiki
Once the service is running, open a browser and navigate to the server’s IPv4 address, for example https://10.99.0.100 (the self-signed certificate will trigger a browser warning). The main page has a brief introduction to the server software. Next, navigate to the setup section, https://10.99.0.100/home/wiki/Main_Page#Build_a_wiki_while_online
Click the link “Set up Simple Wikipedia”. This will immediately start downloading about 200 MB of compressed text and set up the Simple English Wikipedia wiki.
After about five minutes, the first wiki is ready to go. Navigate to the main page: https://10.99.0.100/simple.wikipedia.org/ and check out some of the articles.
Next, images can be downloaded by going to the download center, https://10.99.0.100/home/wiki/Special:XowaDownloadCentral, and selecting Simple English Wikipedia - Images. This will download about 2 GiB of compressed image files.
The Full Wikipedia
Now, obviously, if we want the entirety of humankind’s knowledge to survive, we will need a full archive. For that to work, we’ll need to make some changes to the server setup.
- 20 GiB is fine for the OS, but we will need a much larger storage system to hold the terabytes of database files. This could be a NAS or SAN, or a large local RAID array (see the sketch after this list).
- 100 MB will likely not be sufficient for the cache, so a much larger tmpfs will be required.
- A very robust internet connection will be needed to download the ~70 GB of backup files.
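As a rough sketch of what the storage side of those changes might look like in /etc/fstab, assuming a hypothetical NFS share for the database files and a larger cache (the hostname, export path, mount point, and sizes are all placeholders):
# Hypothetical NAS export holding the multi-terabyte XOWA wiki databases
nas.example.lan:/export/wiki /opt/wiki/wiki nfs defaults,_netdev 0 0
# Much larger tmpfs cache for nginx
tmpfs /var/www/cache tmpfs rw,nodev,nosuid,noexec,size=8G 0 0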
Naturally, this will present a significant engineering challenge. Sounds like fun, doesn’t it!