lucianoes

iconv-chunks convert latin1 to utf8

Dec 5th, 2011
#!/usr/bin/perl

our $CHUNK_SIZE = 1024 * 1024 * 100; # 100 MiB; chunks end on a line boundary

=head1 NAME

iconv-chunks - Process huge files with iconv

=head1 SYNOPSIS

  iconv-chunks <filename> [iconv-options]

=head1 DESCRIPTION

The standard iconv program reads the entire input file into
memory, which doesn't work for large files (such as database exports).

This script is a wrapper that feeds the input to iconv in manageable
chunks (about 100 MiB each, split at line boundaries) and writes the
converted text to standard output.

The first argument is the input filename (use - to specify standard input).
Anything else is passed through to iconv.

The real iconv needs to be somewhere in your PATH.

=head1 EXAMPLES

  # Convert latin1 to utf-8:
  ./iconv-chunks database.txt -f latin1 -t utf-8 > out.txt

  # Input filename of - means standard input:
  ./iconv-chunks - -f iso8859-1 -t utf8 < database.txt > out.txt

  # More complex example, using compressed input/output to minimize disk use:
  zcat database.txt.gz | ./iconv-chunks - -f iso8859-1 -t utf8 | \
  gzip - > database-utf.dump.gz

=head1 AUTHOR

Maurice Aubrey <[email protected]>

=cut

# $Id: iconv-chunks 6 2007-08-20 21:14:55Z mla $

use strict;
use warnings;
use bytes;
use IO::Handle;
use File::Temp qw/ tempfile /;

# iconv errors:
#   iconv: unable to allocate buffer for input: Cannot allocate memory
#   iconv: cannot open input file `database.txt': File too large

@ARGV >= 1 or die "Usage: $0 <inputfile> [iconv-options]\n";
my @options = splice @ARGV, 1; # leaves only the input filename for <>

# UNLINK removes the temp file at exit (CLEANUP is a tempdir() option)
my($oh, $tmp) = tempfile(UNLINK => 1);
# warn "Tempfile: $tmp\n";

# List form bypasses the shell, so options and path need no quoting
my @iconv = ('iconv', @options, $tmp);
sub iconv {
  # make sure the whole chunk is on disk before iconv reads it
  $oh->flush or die "flush of '$tmp' failed: $!";
  system(@iconv) == 0
    or die "command '@iconv' failed: " . ($? == -1 ? $! : "wait status $?") . "\n";
}

my $size = 0;
# Read by line so a chunk never ends inside a multi-byte character.
# This holds for encodings such as latin1 and UTF-8, where the newline
# byte is never part of another character; it does not hold for
# UTF-16 or UTF-32 input.
while (<>) {
  $size += length $_;
  print $oh $_ or die "write to '$tmp' failed: $!";
  if ($size >= $CHUNK_SIZE) {
    iconv;
    truncate $oh, 0 or die "truncate '$tmp' failed: $!";
    seek $oh, 0, 0 or die "seek on '$tmp' failed: $!";
    $size = 0;
  }
}
iconv if $size > 0;
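
The read loop in the script can also be tried out in-process, which makes the line-boundary chunking easy to check on small inputs. The following is a minimal sketch under stated assumptions, not part of iconv-chunks: it swaps the external iconv binary for Perl's core Encode module, and the helper name convert_in_chunks and its parameters are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not in iconv-chunks): copy $in to $out, converting
# from encoding $from to encoding $to. Lines are buffered and the buffer is
# converted once it holds at least $chunk_size bytes, so every chunk ends
# on a line boundary. Returns the number of chunks converted.
sub convert_in_chunks {
    my ($in, $out, $from, $to, $chunk_size) = @_;
    my ($buf, $chunks) = ('', 0);

    my $flush = sub {
        print {$out} encode($to, decode($from, $buf))
            or die "write failed: $!";
        $buf = '';
        $chunks++;
    };

    while (my $line = <$in>) {
        $buf .= $line;
        $flush->() if length($buf) >= $chunk_size;
    }
    $flush->() if length $buf;    # convert the final partial chunk

    return $chunks;
}

# Usage with in-memory filehandles and a tiny chunk size for illustration.
my $latin1 = "caf\xE9\nna\xEFve\nplain\n";    # latin1 bytes
my $utf8   = '';
open my $in,  '<', \$latin1 or die "open input: $!";
open my $out, '>', \$utf8   or die "open output: $!";
my $chunks = convert_in_chunks($in, $out, 'ISO-8859-1', 'UTF-8', 8);
close $out or die "close output: $!";

printf "%d chunks, %d bytes written\n", $chunks, length $utf8;
```

Because each chunk is converted independently, the result only matches a whole-file conversion when no character straddles a chunk boundary, which is why both this sketch and the script split on newlines rather than at fixed byte offsets.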