Building a Gmail Takeout Incremental Backup Pipeline with Bash and Rclone

Why I Built This

I've been using Gmail since 2007. That's nearly two decades of email, attachments, and conversations sitting in Google's infrastructure. While I trust Google's reliability more than my own, I don't trust any single point of failure for data I can't recreate.

Google Takeout exists for full exports, but it's designed for occasional manual downloads, not continuous backups. Each export is a complete snapshot—potentially hundreds of gigabytes—which makes regular backups impractical. I needed something that would:

  • Run automatically on a schedule
  • Only download what changed since the last backup
  • Store everything encrypted on my own infrastructure
  • Not require me to babysit the process

This is the pipeline I built to solve that problem. It runs on a Proxmox LXC container and uses rclone to handle the heavy lifting.

My Actual Setup

The backup runs on a Debian 12 LXC container with 2GB RAM and 20GB storage. The container itself doesn't store the backups—it just orchestrates the process. The actual data goes to a Synology NAS via NFS mount, which then gets replicated to off-site storage.

Here's what I'm using:

  • rclone – for interfacing with Google Drive and handling incremental sync
  • Bash scripts – to automate the Takeout request and download process
  • Cronicle – my scheduler of choice, running on the same Proxmox host
  • NFS mount – connects the container to my Synology for storage (see the mount note after this list)
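
A note on that last item: unprivileged LXC containers usually can't mount NFS themselves, so the share is typically mounted on the Proxmox host and passed into the container as a bind mount, where it then behaves like a local path. A host-side fstab entry looks roughly like this (IP and export path are placeholders):

# /etc/fstab on the Proxmox host (IP and export path are placeholders)
192.168.1.50:/volume1/backups  /mnt/nas/backups  nfs  rw,hard,timeo=600,_netdev  0  0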

The pipeline has two distinct phases: requesting the export from Google, and downloading it once it's ready.

How Google Takeout Actually Works

Before building automation around it, I needed to understand Takeout's behavior through actual use.

When you request a Gmail export through Takeout, Google doesn't generate it instantly. For large mailboxes, it can take hours or even a full day. Once ready, Google uploads the export to your Drive account in a folder called "Takeout". The export is split into chunks; the size is configurable, with 50GB being the maximum (and what I use).

Here's what I learned from repeated manual exports:

  • Each export is a complete snapshot, not a delta
  • The files are named with timestamps, like takeout-20240115T120000Z-001.zip
  • Google keeps these files in your Drive for a limited time (I've seen 7-30 days depending on account settings)
  • You can have multiple exports in progress or stored simultaneously

The key insight: I could use rclone's sync capabilities to only download new export files, then extract and deduplicate the actual email data locally.

Phase 1: Requesting the Export

Google doesn't provide a clean API for requesting Takeout exports programmatically. The official method is through their web interface, which requires manual clicking.

I tried several approaches:

  • Google Takeout API – doesn't exist for automated requests
  • Selenium automation – too brittle, breaks with UI changes
  • Manual scheduling – defeats the purpose of automation

What actually worked: I set up a recurring Takeout export through Google's interface. Google allows you to schedule automatic exports (monthly, every 2 months, or yearly). I configured mine for monthly exports of just Gmail data.

This isn't perfect—I can't trigger exports on-demand through code—but it's reliable. Google handles the scheduling, and my pipeline just needs to detect and download whatever appears in the Takeout folder.

Configuration I Used

In Google Takeout settings:

  • Selected only "Mail" (unchecked everything else)
  • Set frequency to "Export once every month for 6 months"
  • Delivery method: "Add to Drive"
  • File type: .zip
  • File size: 50GB (maximum allowed)

This creates a new export around the same day each month. Google sends an email when it's ready.

Phase 2: Detecting and Downloading New Exports

This is where the actual automation happens. I wrote a Bash script that runs daily via Cronicle to check for and download new Takeout files.

Setting Up Rclone for Google Drive

First, I configured rclone to access my Google Drive. On the LXC container:

rclone config

I walked through the interactive setup (the resulting remote definition is sketched after this list):

  • Chose "Google Drive" as the storage type
  • Used my own OAuth client ID (not strictly necessary, but I prefer controlling my own API credentials)
  • Granted full Drive access (needed to read the Takeout folder)
  • Named the remote "gdrive"
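
For reference, the resulting remote in ~/.config/rclone/rclone.conf ends up looking roughly like this (values redacted; the exact fields vary a little between rclone versions):

[gdrive]
type = drive
client_id = <your-client-id>.apps.googleusercontent.com
client_secret = <redacted>
scope = drive
token = {"access_token":"<redacted>","token_type":"Bearer","refresh_token":"<redacted>","expiry":"2025-01-01T00:00:00Z"}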

To test it:

rclone lsd gdrive:
rclone ls gdrive:Takeout

This confirmed I could see the Takeout folder and its contents.
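
A plain listing is enough to see the naming pattern described earlier (these filenames are illustrative):

rclone lsf gdrive:Takeout
# takeout-20240115T120000Z-001.zip
# takeout-20240115T120000Z-002.zip
# takeout-20240115T120000Z-003.zip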

The Download Script

Here's the core script I'm running. I've stripped out my specific paths, but this is the actual logic:

#!/bin/bash

REMOTE="gdrive:Takeout"
LOCAL_STAGING="/mnt/nas/backups/gmail-takeout/staging"
LOCAL_ARCHIVE="/mnt/nas/backups/gmail-takeout/archive"
LOG_FILE="/var/log/gmail-takeout-sync.log"

# Create directories if they don't exist
mkdir -p "$LOCAL_STAGING" "$LOCAL_ARCHIVE"

# Log function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

log "Starting Gmail Takeout sync"

# Sync only new .zip files from Takeout folder
rclone sync "$REMOTE" "$LOCAL_STAGING" \
    --include "*.zip" \
    --log-file="$LOG_FILE" \
    --log-level INFO \
    --transfers 4 \
    --checkers 8 \
    --drive-chunk-size 64M
RC=$?

if [ "$RC" -eq 0 ]; then
    log "Sync completed successfully"
    
    # Move downloaded files to archive with timestamp
    TIMESTAMP=$(date '+%Y%m%d_%H%M%S')
    ARCHIVE_DIR="$LOCAL_ARCHIVE/$TIMESTAMP"
    
    if [ -n "$(ls -A "$LOCAL_STAGING")" ]; then
        mkdir -p "$ARCHIVE_DIR"
        mv "$LOCAL_STAGING"/*.zip "$ARCHIVE_DIR/"
        log "Moved files to $ARCHIVE_DIR"
    else
        log "No new files to archive"
    fi
else
    log "ERROR: Sync failed with exit code $RC"
fi

log "Sync process finished"

What This Actually Does

The script uses rclone sync, not rclone copy. This is important. Sync makes the destination match the source—if Google deletes old Takeout files from Drive, they'll be removed from staging too. But I immediately move downloaded files to an archive directory, so they're preserved locally even after Google cleans them up.
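
If you'd rather rclone never delete anything from staging, rclone copy is the alternative: it downloads new files but doesn't mirror deletions, at the cost of staging slowly accumulating old exports. A minimal equivalent, using the same paths as above:

# copy instead of sync: only adds files, never deletes from staging
rclone copy "gdrive:Takeout" "/mnt/nas/backups/gmail-takeout/staging" \
    --include "*.zip" \
    --log-level INFO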

Key parameters I settled on after testing:

  • --transfers 4 – downloads 4 files in parallel (my NAS can handle this)
  • --checkers 8 – checks 8 files at once for differences
  • --drive-chunk-size 64M – larger chunks for better throughput on big files

I tried higher transfer counts, but 4 was the sweet spot before I started seeing network congestion and slower overall speeds.
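
Before trusting the schedule, a dry run is a cheap sanity check: it reports what would be transferred without downloading anything, which confirms the include filter and remote path are right.

# preview the sync without transferring anything
rclone sync "gdrive:Takeout" "/mnt/nas/backups/gmail-takeout/staging" \
    --include "*.zip" \
    --dry-run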

Handling the Extracted Data

The downloaded .zip files contain .mbox files—standard Unix mailbox format. Each export includes the complete mailbox, which means significant duplication between monthly exports.
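
If you want to see the layout the extraction script relies on, unzip can list an export without unpacking it (the filename here is illustrative):

# list the contents of one export chunk without extracting it
unzip -l takeout-20240115T120000Z-001.zip | head -20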

I wrote a second script that runs after downloads complete:

#!/bin/bash

ARCHIVE_DIR="/mnt/nas/backups/gmail-takeout/archive"
EXTRACT_DIR="/mnt/nas/backups/gmail-takeout/extracted"
MBOX_CONSOLIDATED="/mnt/nas/backups/gmail-takeout/gmail-consolidated.mbox"

# Find the most recent archive directory
LATEST_ARCHIVE=$(ls -td "$ARCHIVE_DIR"/*/ 2>/dev/null | head -1)

if [ -z "$LATEST_ARCHIVE" ]; then
    echo "No archives found"
    exit 1
fi

echo "Processing $LATEST_ARCHIVE"

# Extract all zip files from the latest archive
mkdir -p "$EXTRACT_DIR"
for zipfile in "$LATEST_ARCHIVE"*.zip; do
    [ -e "$zipfile" ] || continue   # skip if the glob matched nothing
    unzip -q "$zipfile" -d "$EXTRACT_DIR"
done

# Combine all mbox files
cat "$EXTRACT_DIR"/Takeout/Mail/*.mbox >> "$MBOX_CONSOLIDATED"

# Remove duplicates by Message-ID using formail (from the procmail package).
# -s splits the mbox into messages and skips ones already in the ID cache;
# the cache is created fresh each run and sized far above the classic 8192
# bytes so Message-IDs from the whole archive fit.
MSGID_CACHE=$(mktemp)
formail -D 67108864 "$MSGID_CACHE" -s < "$MBOX_CONSOLIDATED" > "${MBOX_CONSOLIDATED}.tmp"
mv "${MBOX_CONSOLIDATED}.tmp" "$MBOX_CONSOLIDATED"
rm -f "$MSGID_CACHE"

# Clean up extracted files
rm -rf "$EXTRACT_DIR"/*

echo "Consolidation complete"

Why This Approach

I considered several deduplication methods:

  • Maildir format – each email as a separate file, easier to dedupe but millions of tiny files kill NFS performance
  • Database import – cleanest long-term solution but adds complexity
  • mbox with formail – simple, Unix-native, works with standard tools

I went with mbox because I can grep it, process it with standard tools, and import it into any email client if needed. The formail deduplication isn't perfect—it uses Message-ID headers, which can occasionally have duplicates—but it catches 99%+ of redundant emails.
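
A quick way to sanity-check the deduplication is to count messages before and after the formail pass. Counting "From " separator lines is only an approximation, but it's close enough to spot a run that dropped or duplicated a large amount of mail:

# rough message count for an mbox file
grep -c "^From " /mnt/nas/backups/gmail-takeout/gmail-consolidated.mbox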

Scheduling and Monitoring

I schedule both scripts through Cronicle:

  • Download script – runs daily at 3 AM
  • Extraction script – runs weekly on Sundays at 4 AM

The weekly extraction is intentional. New data only arrives about once a month, so there's no need to run the extraction after every daily download check; running it weekly gives a buffer in case Google's export timing varies.
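
For anyone not running Cronicle, the same schedule in plain cron looks roughly like this (script paths are placeholders):

# m  h  dom mon dow  command
0    3  *   *   *    /opt/scripts/gmail-takeout-download.sh
0    4  *   *   0    /opt/scripts/gmail-takeout-extract.sh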

Cronicle sends me notifications on failures, and I have a simple monitoring check:

#!/bin/bash
# Check if consolidated mbox was updated in the last 35 days

MBOX="/mnt/nas/backups/gmail-takeout/gmail-consolidated.mbox"

# Fail loudly if the consolidated mbox is missing entirely
if [ ! -f "$MBOX" ]; then
    echo "WARNING: Gmail backup file not found at $MBOX"
    exit 1
fi

DAYS_OLD=$(( ($(date +%s) - $(stat -c %Y "$MBOX")) / 86400 ))

if [ $DAYS_OLD -gt 35 ]; then
    echo "WARNING: Gmail backup is $DAYS_OLD days old"
    exit 1
fi

echo "Gmail backup is current ($DAYS_OLD days old)"

This runs daily via Cronicle and alerts me if the backup hasn't updated in over a month, which would indicate the pipeline broke somewhere.

What Didn't Work

Several things I tried that failed or weren't worth the complexity:

Attempting to automate Takeout requests via browser automation – I spent a weekend trying to use Puppeteer to click through the Takeout interface. It worked for a few weeks, then Google changed their UI and it broke. Not worth maintaining.

Using rclone mount instead of sync – I initially tried mounting Google Drive as a filesystem and working with files directly. Performance was terrible, and the mount would occasionally hang, breaking the entire pipeline.

Trying to parse and deduplicate during download – I wanted to avoid storing redundant data entirely. Tried streaming the zip files through extraction and deduplication in one pass. This created race conditions and was impossible to resume if interrupted.

Compression after consolidation – The mbox file is huge (mine is 40GB+). I tried compressing it with gzip, but it made the file unusable for quick searches and added processing time. Now I just let the NAS handle compression at the filesystem level with Btrfs.

Current Limitations

This pipeline works for my needs, but it has clear limitations:

  • No real-time backup – there's up to a month lag between when an email arrives and when it's backed up
  • Manual Takeout scheduling – I can't trigger exports programmatically
  • Storage overhead – I'm storing complete monthly exports plus the consolidated mbox
  • No incremental extraction – each extraction processes the entire archive, even if only a few emails are new

For truly critical emails, I still use IMAP-based backup (mbsync) to a separate location. This Takeout pipeline is my "complete history" backup, not my "last 24 hours" backup.

What I'd Change If Starting Over

If I were building this today, I might use a proper email archiving tool like OfflineIMAP or mbsync for continuous backup, and only use Takeout as a quarterly "full snapshot" verification.

But that would mean keeping broad IMAP access enabled, which I've mostly locked down on this account for security reasons. Takeout is the only export method that covers the full mailbox without relying on IMAP.

I'd also consider storing the consolidated mbox in a SQLite database instead. Would make searching and deduplication much faster. But mbox works fine for now, and I can always migrate later.
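
For what it's worth, the schema I have in mind is simple. A rough sketch using the sqlite3 CLI (hypothetical table layout, not something I'm running today):

sqlite3 gmail-archive.db <<'SQL'
CREATE TABLE IF NOT EXISTS messages (
    message_id TEXT PRIMARY KEY,  -- same dedupe key formail uses
    msg_date   TEXT,
    sender     TEXT,
    subject    TEXT,
    raw        BLOB               -- full raw message, kept for restores
);
CREATE INDEX IF NOT EXISTS idx_messages_date ON messages(msg_date);
SQL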

Key Takeaways

  • Google Takeout isn't designed for automation – but you can work around it by scheduling exports manually and automating the download
  • rclone is incredibly reliable – it's been running daily for over a year without a single failure
  • Deduplication is necessary – each Takeout export is complete, so without deduping you're storing 12x your actual data per year
  • Separate staging from archives – makes the pipeline recoverable and prevents data loss if something goes wrong
  • Monitor the output, not the process – checking if the final mbox is current is more reliable than monitoring individual script runs

This setup has been running since early 2023. I've restored from it twice—once to verify