bash: Get last line of a huge gzipped file
In a database backup process I generate a text dump file. As the database is quite huge, the dump file is huge too, so I compress it with gzip. Compression is done inline while the dump is generated (thanks, Unix pipes!).
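The dump step looks roughly like this (connection options omitted; myDatabase is a placeholder):
mysqldump myDatabase | gzip > "${PATHSAVE}/dumpFull.sql.gz"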
At the end of the process, I check the dump file's validity by looking at the last line and checking for the presence of the "Dump completed" string. In my script I do it by extracting the last line into a variable:
str=$(zcat "${PATHSAVE}/dumpFull.sql.gz" | tail -n 1)
As the database dump file is huge (currently more than 200 GB), this end-of-process check takes a long time to run (currently more than 180 minutes).
I'm searching for a quicker way to extract the last line of my .gz file... any ideas?
Note 1: For context, the database is MySQL Community, the backup tool is mysqldump, and the generated dump file is a plain text file. The OS is CentOS. The backup script is a Bash shell script.
Note 2: I'm aware of Percona XtraBackup, but in this case I want to use mysqldump for this specific backup job. The time needed for restoration is not an issue.
Thanks
3 Answers
This is a job for a fifo (a named pipe) and the tee command. Use this when making your backup.
mkfifo mypipe
tail -n 1 mypipe > lastline.txt & mysqldump whatever | tee mypipe | gzip > dump.gz
rm mypipe
What's going on?
mkfifo mypipe
puts a fifo object into your current working directory. It looks like a file that you can write to, and read from, at the same time.
tail -n 1 mypipe > lastline.txt
uses tail to read whatever you write to mypipe and save the last line in a file.
mysqldump whatever | tee mypipe | gzip > dump.gz
does your dump operation and pipes the output to the tee command. tee writes the output to mypipe and pipes it along to gzip.
The & between the two parts of the command causes both parts to run at the same time.
rm mypipe
gets rid of the fifo object.
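Putting the pieces together with the question's "Dump completed" check, a complete sketch might look like this (the mysqldump arguments and the PATHSAVE path are placeholders from the question):
#!/bin/bash
# Sketch only: dump, compress, and capture the last line in a single pass.
mkfifo mypipe
tail -n 1 mypipe > lastline.txt &      # background reader saves the last line
tailpid=$!
mysqldump whatever | tee mypipe | gzip > "${PATHSAVE}/dumpFull.sql.gz"
wait "$tailpid"                        # make sure lastline.txt is fully written
rm mypipe

if grep -q "Dump completed" lastline.txt; then
    echo "backup OK"
else
    echo "backup FAILED" >&2
fi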
Charles Duffy pointed out that some shells (including bash) have process substitution, so your command can be simpler if you're using one of those shells.
mysqldump whatever | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz
In this case the shell creates its own pipe for you.
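A usage sketch of this form, assuming bash 4.4 or later so the shell can be told to wait for the process substitution to finish writing:
mysqldump whatever | tee >(tail -n 1 > lastline.txt) | gzip > dump.gz
wait $!    # bash 4.4+: $! holds the pid of the last process substitution
grep -q "Dump completed" lastline.txt && echo "backup OK"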
Credit: Pipe output to two different commands
This is bad: it fails to "check dump file validity" which is the whole point to start with.
– kmkaplan
Jun 30 at 6:06
You can easily determine the last line that was processed within the pipeline itself, rather than from disk, by using your existing verification (with zcat and tail) but reading from stdin rather than from the written file:
mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n1
This will give you the last processed line, unzipped, which you can check against the Dump completed string for success.
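For example, adapted to the names in the question (the mysqldump arguments are placeholders), a sketch of the whole check might be:
last=$(mysqldump whatever | gzip - | tee "${PATHSAVE}/dumpFull.sql.gz" | zcat - | tail -n 1)
if [[ $last == *"Dump completed"* ]]; then
    echo "backup OK"
fi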
This is pretty simple but is not quite 100% bullet-proof, as there is a small chance that tee fails to finish writing to disk but completes writing to stdout:
If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero.
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
To protect against this possible failure we can simply embellish this pipeline a little:
mysqldump args | gzip - | { tee fullDump.gz || echo "fail" | gzip -; } | zcat - | tail -n1
Such that we emit fail as the last line in the event that tee exits non-0. If you are using bash as your shell, you could alternatively use pipefail (set -euxo pipefail), which would cause this pipeline to exit non-0 if any command failed (rather than just the last one).
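A minimal sketch of the pipefail variant, with placeholder names:
set -o pipefail
if last=$(mysqldump args | gzip - | tee fullDump.gz | zcat - | tail -n 1) \
   && [[ $last == *"Dump completed"* ]]; then
    echo "backup OK"
else
    echo "backup FAILED" >&2   # non-zero exit from any stage, or wrong last line
fi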
This is bad: it fails to "check dump file validity" which is the whole point to start with.
– kmkaplan
Jun 30 at 7:36
fair, fixed. Went off the most popular answer rather than the question @kmkaplan
– Matthew Story
Jun 30 at 7:52
Now I'm puzzled. The tee command could write to file before stdout... Or the opposite. Is it specified?
– kmkaplan
Jun 30 at 7:58
@kmkaplan we have a bit more work to do: If a write to any successfully opened file operand fails, writes to other successfully opened file operands and standard output shall continue, but the exit status shall be non-zero. ~ pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
– Matthew Story
Jun 30 at 8:11
@kmkaplan this edge-case is now covered
– Matthew Story
Jun 30 at 8:19
If you really do want to check the integrity of the compressed file, you could switch away from gzip and do something like:
# strings is used to clean up the output if it gets messy
dd if=dump.compressed.not-gz bs=1M skip=195000 | compress-tool -df | strings | grep 'Dump complete'
edit:
You could also debug gzip while compressing a dump into which you've injected the 'Dump completed' string, and discover its compressed signature. Database dumps are similar, so perhaps it will be the same in all dumps. If so, just grep for it as above, but without the decompress and strings commands.
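Assuming a format that does allow mid-stream decompression, a sketch of skipping most of the file might look like this; compress-tool is the hypothetical decompressor from above, and the size of the retained tail is arbitrary:
size=$(stat -c %s dump.compressed.not-gz)      # file size in bytes (GNU stat, as on CentOS)
skip=$(( size / (1024*1024) - 1024 ))          # skip all but the last ~1 GiB, in 1 MiB blocks
(( skip < 0 )) && skip=0
dd if=dump.compressed.not-gz bs=1M skip="$skip" | compress-tool -df | strings | grep 'Dump complete'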
You could look for the line during the dump process. Anyway, you'd better use more sophisticated ways to verify backups.
– Igor
Jun 29 at 19:20
Can you change the way the gzip file is created? There's a subset of the gzip format that's referred to as "rsyncable" gzip that resets the compression table every so often; for those, it's feasible to start reading partway through (discarding only information up to the next reset point). There's a penalty in compression ratio, of course, but that can be tuned by changing the frequency of such restarts.
– Charles Duffy
Jun 29 at 15:03
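For what it's worth, on gzip builds that support the flag (it has long shipped as a distribution patch and is in newer upstream gzip releases), enabling this is a one-word change:
# --rsyncable periodically resets the compressor state, at a small cost in ratio
mysqldump whatever | gzip --rsyncable > dump.gz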