bash: Get last line of a huge gzipped file
In a database backup process I generate a text dump file. As the database is quite huge, the dump file is huge too, so I compress it with gzip. Compression is done inline while the dump is generated (thanks, Unix pipes!).
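The dump step looks roughly like this (connection options omitted; myDatabase is a placeholder):
mysqldump myDatabase | gzip > "${PATHSAVE}/dumpFull.sql.gz"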
At the end of the process, I check the dump file's validity by looking at the last line and checking for the presence of the "Dump completed" string. In my script I do it by extracting the last line into a variable:
str=$(zcat "${PATHSAVE}/dumpFull.sql.gz" | tail -n 1)
As the database dump file is huge (currently more than 200 GB), this end-of-process check takes a long time to run (currently more than 180 minutes).
I'm searching for a quicker way to extract the last line of my .gz file... any ideas?
Note 1: For context, the database is MySQL Community, the backup tool is mysqldump, and the generated dump file is a plain text file. The OS is CentOS. The backup script is a Bash shell script.
Note 2: I'm aware of Percona XtraBackup, but in this case I want to use mysqldump for this specific backup job. The time needed for restoration is not an issue.
Thanks
3 Answers
This is a job for a fifo (a named pipe) and the tee command. Use this when making your backup.
mkfifo mypipe
tail -n 1 mypipe > lastline.txt & mysqldump whatever | tee mypipe | gzip > dump.gz
rm mypipe
What's going on?
mkfifo mypipe
puts a fifo object into your current working directory. It looks like a file that you can write to, and read from, at the same time.
tail -n 1 mypipe > lastline.txt
uses tail to read whatever you write to mypipe and save the last line in a file.
mysqldump whatever | tee mypipe | gzip > dump.gz
does your dump operation and pipes the output to the tee command. tee writes the output to mypipe and pipes it along to gzip.
The & between the two parts of the command causes both parts to run at the same time.
rm mypipe
gets rid of the fifo object.
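Putting the pieces together with the question's "Dump completed" check, a complete sketch might look like this (the mysqldump arguments and the PATHSAVE path are placeholders from the question):
#!/bin/bash
# Sketch only: dump, compress, and capture the last line in a single pass.
mkfifo mypipe
tail -n 1 mypipe > lastline.txt &      # background reader saves the last line
tailpid=$!
mysqldump whatever | tee mypipe | gzip > "${PATHSAVE}/dumpFull.sql.gz"
wait "$tailpid"                        # make sure lastline.txt is fully written
rm mypipe

if grep -q "Dump completed" lastline.txt; then
    echo "backup OK"
else
    echo "backup FAILED" >&2
fi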
Charles Duffy pointed out that some shells (including bash) have process substitution, so your command can be simpler if you're using one of those shells.
mysqldump whatever | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz
In this case the shell creates its own pipe for you.
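A usage sketch of this form, assuming bash 4.4 or later so the shell can be told to wait for the process substitution to finish writing:
mysqldump whatever | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz
wait $!    # bash 4.4+: $! holds the pid of the last process substitution
grep -q "Dump completed" lastline.txt && echo "backup OK"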
Credit: Pipe output to two different commands
This is bad: it fails to "check dump file validity" which is the whole point to start with.
– kmkaplan
Jun 30 at 6:06
You can easily determine the last line that was processed within the pipeline itself, rather than from disk, by using your existing verification (with zcat and tail) but reading from stdin rather than from the written file:
mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n1
This will give you the last processed line, unzipped, which you can check against the Dump completed string for success.
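For example, adapted to the names in the question (the mysqldump arguments are placeholders), a sketch of the whole check might be:
last=$(mysqldump whatever | gzip - | tee "${PATHSAVE}/dumpFull.sql.gz" | zcat - | tail -n 1)
if [[ $last == *"Dump completed"* ]]; then
    echo "backup OK"
fi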
This is pretty simple but is not quite 100% bullet-proof, as there is a small chance that tee fails to finish writing to disk but completes writing to stdout:
If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero.
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
To protect against this possible failure we can simply embellish this pipeline a little:
mysqldump args | gzip - | { tee fullDump.gz || echo "fail" | gzip -; } | zcat - | tail -n1
Such that we emit fail as the last line in the event that tee exits non-0. If you are using bash as your shell, you could alternatively use pipefail (set -euxo pipefail), which would cause this pipeline to exit non-0 if any command failed (rather than just the last one).
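A minimal sketch of the pipefail variant, with placeholder names:
set -o pipefail
if last=$(mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n 1) \
   && [[ $last == *"Dump completed"* ]]; then
    echo "backup OK"
else
    echo "backup FAILED" >&2   # non-zero exit from any stage, or wrong last line
fi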
This is bad: it fails to "check dump file validity" which is the whole point to start with.
– kmkaplan
Jun 30 at 7:36
fair, fixed. Went off the most popular answer rather than the question @kmkaplan
– Matthew Story
Jun 30 at 7:52
Now I'm puzzled. The tee command could write to file before stdout... Or the opposite. Is it specified?
– kmkaplan
Jun 30 at 7:58
@kmkaplan we have a bit more work to do: If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero. ~ pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
– Matthew Story
Jun 30 at 8:11
@kmkaplan this edge-case is now covered
– Matthew Story
Jun 30 at 8:19
If you really do want to check the integrity of the compressed file, you could switch away from gzip and do something like:
# strings is used to clean up the output if it gets messy
dd if=dump.compressed.not-gz bs=1M skip=195000 | compress-tool -df | strings | grep 'Dump complete'
edit:
You could also debug gzip while compressing a dump into which you've injected the 'Dump completed' string, and discover its compressed signature. Database dumps are similar, so perhaps it will be the same in all dumps. If so, just grep for it as above, but without the decompress and strings commands.
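Assuming a format that does allow mid-stream decompression, a sketch of skipping most of the file might look like this; compress-tool is the hypothetical decompressor from above, and the size of the retained tail is arbitrary:
size=$(stat -c %s dump.compressed.not-gz)      # file size in bytes (GNU stat, as on CentOS)
skip=$(( size / (1024*1024) - 1024 ))          # skip all but the last ~1 GiB, in 1 MiB blocks
(( skip < 0 )) && skip=0
dd if=dump.compressed.not-gz bs=1M skip="$skip" | compress-tool -df | strings | grep 'Dump complete'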
You could look for the line during the dump process. Anyway, you'd better use more sophisticated ways to verify backups.
– Igor
Jun 29 at 19:20
Can you change the way the gzip file is created? There's a subset of the gzip format that's referred to as "rsyncable" gzip that resets the compression table every so often; for those, it's feasible to start reading partway through (discarding only information up to the next reset point). There's a penalty in compression ratio, of course, but that can be tuned by changing the frequency of such restarts.
– Charles Duffy
Jun 29 at 15:03
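For what it's worth, on gzip builds that support the flag (it has long shipped as a distribution patch and is in newer upstream gzip releases), enabling this is a one-word change:
# --rsyncable periodically resets the compressor state, at a small cost in ratio
mysqldump whatever | gzip --rsyncable > dump.gz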