New doc about reproducible archives

* doc/tar.texi (Reproducibility): New section.
Spruce some other sections related to timestamps etc.
Paul Eggert, 1 year ago
parent
commit
d1ca333391
2 files changed, with 176 additions and 70 deletions
  1. 7 2
      NEWS
  2. 169 68
      doc/tar.texi

+ 7 - 2
NEWS

@@ -1,5 +1,10 @@
-GNU tar NEWS - User visible changes. 2023-07-18
+GNU tar NEWS - User visible changes. 2023-07-24
 Please send GNU tar bug reports to <bug-tar@gnu.org>
+
+version TBD
+
+* New manual section "Reproducibility", for reproducible tarballs.
+
 
 version 1.35 - Sergey Poznyakoff, 2023-07-18
 
@@ -14,7 +19,7 @@ version 1.35 - Sergey Poznyakoff, 2023-07-18
 ** Fix interaction of --update with --wildcards.
 
 ** When extracting archives into an empty directory, do not create
-   hard links to files outside that directory. 
+   hard links to files outside that directory.
 
 ** Handle partial reads from regular files.
 

+ 169 - 68
doc/tar.texi

@@ -346,6 +346,7 @@ Controlling the Archive Format
 * Compression::                 Using Less Space through Compression
 * Attributes::                  Handling File Attributes
 * Portability::                 Making @command{tar} Archives More Portable
+* Reproducibility::             Making @command{tar} Archives More Reproducible
 * cpio::                        Comparison of @command{tar} and @command{cpio}
 
 Using Less Space through Compression
@@ -2806,7 +2807,7 @@ numeric fields.
 Creates a @acronym{POSIX.1-1988} compatible archive.
 
 @item posix
-Creates a @acronym{POSIX.1-2001 archive}.
+Creates a @acronym{POSIX.1-2001} archive.
 
 @end table
 
@@ -3048,8 +3049,8 @@ latter case, the modification time of that file is used. @xref{override}.
 
 When @command{--clamp-mtime} is also specified, files with
 modification times earlier than @var{date} will retain their actual
-modification times, and @var{date} will only be used for files whose
-modification times are later than @var{date}.
+modification times, and @var{date} will be used only for files with
+modification times later than @var{date}.
 
 @opsummary{multi-volume}
 @item --multi-volume
@@ -3525,7 +3526,7 @@ No directory sorting is performed. This is the default.
 @item name
 Sort the directory entries on name. The operating system may deliver
 directory entries in a more or less random order, and sorting them
-makes archive creation reproducible.
+makes archive creation more reproducible.  @xref{Reproducibility}.
 
 @item inode
 Sort the directory entries on inode number. Sorting directories on
@@ -5592,28 +5593,27 @@ $ @kbd{tar -c -f archive.tar --mode='a+rw' .}
 @item --mtime=@var{date}
 @opindex mtime
 
-When adding files to an archive, @command{tar} will use @var{date} as
+When adding files to an archive, @command{tar} uses @var{date} as
 the modification time of members when creating archives, instead of
 their actual modification times.  The argument @var{date} can be
 either a textual date representation in almost arbitrary format
 (@pxref{Date input formats}) or a name of an existing file, starting
 with @samp{/} or @samp{.}.  In the latter case, the modification time
-of that file will be used.
+of that file is used.
 
-The following example will set the modification date to 00:00:00,
+The following example sets the modification date to 00:00:00 @sc{utc} on
 January 1, 1970:
 
 @smallexample
-$ @kbd{tar -c -f archive.tar --mtime='1970-01-01' .}
+$ @kbd{tar -c -f archive.tar --mtime='@@0' .}
 @end smallexample
 
 @noindent
 When used with @option{--verbose} (@pxref{verbose tutorial}) @GNUTAR{}
-will try to convert the specified date back to its textual
-representation and compare it with the one given with
-@option{--mtime} options.  If the two dates differ, @command{tar} will
-print a warning saying what date it will use.  This is to help user
-ensure he is using the right date.
+converts the specified date back to a textual form and compares it
+with the one given with @option{--mtime}.
+If the two forms differ, @command{tar} prints both forms in a message,
+to help the user check that the right date is being used.
 
 For example:
 
@@ -5625,14 +5625,15 @@ tar: Option --mtime: Treating date 'yesterday' as 2006-06-20
 @end smallexample
 
 @noindent
-When used with @option{--clamp-mtime} @GNUTAR{} will only set the
-modification date to @var{date} on files whose actual modification
-date is later than @var{date}.  This is to make it easy to build
+When used with @option{--clamp-mtime} @GNUTAR{} sets the
+modification date to @var{date} only on files whose actual modification
+date is later than @var{date}.  This makes it easier to build
 reproducible archives given a common timestamp for generated files
 while still retaining the original timestamps of untouched files.
+@xref{Reproducibility}.
 
 @smallexample
-$ @kbd{tar -c -f archive.tar --clamp-mtime --mtime=@@$SOURCE_DATE_EPOCH .}
+$ @kbd{tar -c -f archive.tar --clamp-mtime --mtime="$SOURCE_EPOCH" .}
 @end smallexample
 
 @item --owner=@var{user}
@@ -8123,7 +8124,7 @@ Contains shell globbing-patterns and regular expressions (if prefixed
 with @samp{RE:}@footnote{According to the Bazaar docs,
 globbing-patterns are Korn-shell style and regular expressions are
 perl-style.  As of @GNUTAR{} version @value{VERSION}, these are
-treated as shell-style globs and posix extended regexps.  This will be
+treated as shell-style globs and POSIX extended regexps.  This will be
 fixed in future releases.}.  Patterns affect the directory and all its
 subdirectories.
 
@@ -8131,7 +8132,7 @@ Any line beginning with a @samp{#} is a comment.
 
 @findex .hgignore
 @item .hgignore
-Contains posix regular expressions@footnote{Support for perl-style
+Contains POSIX regular expressions@footnote{Support for perl-style
 regexps will appear in future releases.}.  The line @samp{syntax:
 glob} switches to shell globbing patterns.  The line @samp{syntax:
 regexp} switches back.  Comments begin with a @samp{#}.  Patterns
@@ -9163,7 +9164,7 @@ to an archive, the archive will only include new files.  If you use
 @option{--after-date} when extracting an archive, @command{tar} will
 only extract files newer than the @var{date} you specify.
 
-If you only want @command{tar} to make the date comparison based on
+If you want @command{tar} to make the date comparison based only on
 modification of the file's data (rather than status
 changes), then use the @option{--newer-mtime=@var{date}} option.
 
@@ -9190,7 +9191,7 @@ name; the data modification time of that file is used as the date.
 
 @opindex newer-mtime
 @item --newer-mtime=@var{date}
-Acts like @option{--after-date}, but only looks at data modification times.
+Act like @option{--after-date}, but look only at data modification times.
 @end table
 
 These options limit @command{tar} to operate only on files which have
@@ -9209,8 +9210,8 @@ field.
 
 To be precise, @option{--after-date} checks @emph{both} @code{mtime} and
 @code{ctime} and processes the file if either one is more recent than
-@var{date}, while @option{--newer-mtime} only checks @code{mtime} and
-disregards @code{ctime}.  Neither does it use @code{atime} (the last time the
+@var{date}, while @option{--newer-mtime} checks only @code{mtime} and
+disregards @code{ctime}.  Neither option uses @code{atime} (the last time the
 contents of the file were looked at).
 
 Date specifiers can have embedded spaces.  Because of this, you may need
@@ -9223,11 +9224,11 @@ $ @kbd{tar -cf foo.tar --newer-mtime '2 days ago'}
 @end smallexample
 
 When any of these options is used with the option @option{--verbose}
-(@pxref{verbose tutorial}) @GNUTAR{} will try to convert the specified
-date back to its textual representation and compare that with the
-one given with the option.  If the two dates differ, @command{tar} will
-print a warning saying what date it will use.  This is to help user
-ensure he is using the right date.  For example:
+(@pxref{verbose tutorial}) @GNUTAR{} converts the specified
+date back to a textual form and compares that with the
+one given with the option.  If the two forms differ, @command{tar}
+prints both forms in a message, to help the user check that the right
+date is being used.  For example:
 
 @smallexample
 @group
@@ -9596,56 +9597,61 @@ format imposes a number of limitations.  The most important of them
 are:
 
 @enumerate
-@item The maximum length of a file name is limited to 99 characters.
-@item The maximum length of a symbolic link is limited to 99 characters.
-@item It is impossible to store special files (block and character
+@item
+File names and symbolic links can contain at most 100 bytes.
+@item
+File sizes must be less than 8 GiB (@math{2^33} bytes = 8,589,934,592 bytes).
+@item
+It is impossible to store special files (block and character
 devices, fifos etc.)
-@item Maximum value of user or group @acronym{ID} is limited to 2097151 (7777777
-octal)
-@item V7 archives do not contain symbolic ownership information (user
+@item
+UIDs and GIDs must be less than @math{2^21} (2,097,152).
+@item
+V7 archives do not contain symbolic ownership information (user
 and group name of the file owner).
 @end enumerate
 
 This format has traditionally been used by Automake when producing
 Makefiles.  This practice will change in the future, in the meantime,
-however this means that projects containing file names more than 99
-characters long will not be able to use @GNUTAR{} @value{VERSION} and
+however this means that projects containing file names more than 100
+bytes long will not be able to use @GNUTAR{} @value{VERSION} and
 Automake prior to 1.9.
 
 @item ustar
-Archive format defined by @acronym{POSIX.1-1988} specification.  It stores
+Archive format defined by @acronym{POSIX.1-1988} and later.  It stores
 symbolic ownership information.  It is also able to store
 special files.  However, it imposes several restrictions as well:
 
 @enumerate
-@item The maximum length of a file name is limited to 256 characters,
-provided that the file name can be split at a directory separator in
-two parts, first of them being at most 155 bytes long.  So, in most
-cases the maximum file name length will be shorter than 256
-characters.
-@item The maximum length of a symbolic link name is limited to
-100 characters.
-@item Maximum size of a file the archive is able to accommodate
-is 8GB
-@item Maximum value of UID/GID is 2097151.
-@item Maximum number of bits in device major and minor numbers is 21.
+@item
+File names can contain at most 255 bytes.
+@item
+File names longer than 100 bytes must be split at a directory separator in
+two parts, the first being at most 155 bytes long.
+So, in most cases file names must be a bit shorter than 255 bytes.
+@item
+Symbolic links can contain at most 100 bytes.
+@item
+Files can contain at most 8 GiB (@math{2^33} bytes = 8,589,934,592 bytes).
+@item
+UIDs, GIDs, device major numbers, and device minor numbers
+must be less than @math{2^21} (2,097,152).
 @end enumerate
 
 @item star
-Format used by J@"org Schilling @command{star}
+The format used by the late J@"org Schilling's @command{star}
 implementation.  @GNUTAR{} is able to read @samp{star} archives but
 currently does not produce them.
 
 @item posix
-Archive format defined by @acronym{POSIX.1-2001} specification.  This is the
-most flexible and feature-rich format.  It does not impose any
-restrictions on file sizes or file name lengths.  This format is quite
-recent, so not all tar implementations are able to handle it properly.
-However, this format is designed in such a way that any tar
-implementation able to read @samp{ustar} archives will be able to read
-most @samp{posix} archives as well, with the only exception that any
-additional information (such as long file names etc.)@: will in such
-case be extracted as plain text files along with the files it refers to.
+The format defined by @acronym{POSIX.1-2001} and later.  This is the
+most flexible and feature-rich format.  It does not impose arbitrary
+restrictions on file sizes or file name lengths.  This format is more
+recent, so some @command{tar} implementations cannot handle it properly.
+However, any @command{tar} implementation able to read @samp{ustar}
+archives should be able to read most @samp{posix} archives as well,
+except that it will extract any additional information (such as long
+file names) as extra plain text files.
 
 This archive format will be the default format for future versions
 of @GNUTAR{}.
@@ -9659,21 +9665,22 @@ formats:
 @headitem Format @tab UID @tab File Size @tab File Name @tab Devn
 @item gnu    @tab 1.8e19 @tab Unlimited @tab Unlimited @tab 63
 @item oldgnu @tab 1.8e19 @tab Unlimited @tab Unlimited @tab 63
-@item v7     @tab 2097151 @tab 8GB @tab 99 @tab n/a
-@item ustar  @tab 2097151 @tab 8GB @tab 256 @tab 21
+@item v7     @tab 2097151 @tab 8 GiB @minus{} 1 @tab 99 @tab n/a
+@item ustar  @tab 2097151 @tab 8 GiB @minus{} 1 @tab 255 @tab 21
 @item posix  @tab Unlimited @tab Unlimited @tab Unlimited @tab Unlimited
 @end multitable
 
 The default format for @GNUTAR{} is defined at compilation
 time.  You may check it by running @command{tar --help}, and examining
 the last lines of its output.  Usually, @GNUTAR{} is configured
-to create archives in @samp{gnu} format, however, future version will
+to create archives in @samp{gnu} format, however, a future version will
 switch to @samp{posix}.
 
 @menu
 * Compression::                 Using Less Space through Compression
 * Attributes::                  Handling File Attributes
 * Portability::                 Making @command{tar} Archives More Portable
+* Reproducibility::             Making @command{tar} Archives More Reproducible
 * cpio::                        Comparison of @command{tar} and @command{cpio}
 @end menu
 
@@ -10610,8 +10617,8 @@ will use the following default value:
 %d/PaxHeaders/%f
 @end smallexample
 
-This default is selected to ensure the reproducibility of the
-archive. @acronym{POSIX} standard recommends to use
+This default helps make the archive more reproducible.
+@xref{Reproducibility}.  @acronym{POSIX} recommends using
 @samp{%d/PaxHeaders.%p/%f} instead, which means the two archives
 created with the same set of options and containing the same set
 of files will be byte-to-byte different. This default will be used
@@ -10712,9 +10719,8 @@ use the following option:
 
 @cindex archives, binary equivalent
 @cindex binary equivalent archives, creating
-As another example, here is the option that ensures that any two
-archives created using it, will be binary equivalent if they have the
-same contents:
+As another example, the following option helps make the archive
+more reproducible.  @xref{Reproducibility}.
 
 @smallexample
 --pax-option delete=atime
@@ -10800,7 +10806,7 @@ file.  You will than have to switch to a format that is able to
 handle such values.  The format summary table (@pxref{Formats}) will
 help you to do so.
 
-In particular, when trying to archive files larger than 8GB or with
+In particular, when trying to archive files 8 GiB or larger, or with
 timestamps not in the range 1970-01-01 00:00:00 through 2242-03-16
 12:56:31 @sc{utc}, you will have to chose between @acronym{GNU} and
 @acronym{POSIX} archive formats.  When considering which format to
@@ -10816,7 +10822,9 @@ representations.
 
 On the other hand, @acronym{POSIX} archives, generally speaking, can
 be extracted by any tar implementation that understands older
-@acronym{ustar} format.  The only exception are files larger than 8GB.
+@acronym{ustar} format.  The exceptions are files 8 GiB or larger,
+or files dated before 1970-01-01 00:00:00 or after 2242-03-16
+12:56:31 @sc{utc}.
 
 @FIXME{Describe how @acronym{POSIX} archives are extracted by non
 POSIX-aware tars.}
@@ -11171,6 +11179,99 @@ Done
 @end group
 @end smallexample
 
+@node Reproducibility
+@section Making @command{tar} Archives More Reproducible
+
+Sometimes it is important for an archive to be reproducible,
+so that one can easily verify it to have been derived solely from its input.
+However, two archives created by @GNUTAR{} from two sets of input
+files normally might differ even if the input files have the same
+contents and @GNUTAR{} was invoked the same way on both sets of input.
+This can happen if the inputs have different modification dates or
+other metadata, or if the input directories' entries are in different orders.
+
+To avoid this problem when creating an archive, and thus make the
+archive reproducible, you can run @GNUTAR{} in the C locale with
+some or all of the following options:
+
+@table @option
+@item --sort=name
+Omit irrelevant information about directory entry order.
+
+@item --format=posix
+Avoid problems with large files or files with unusual timestamps.
+This also enables @option{--pax-option} options mentioned below.
+
+@item --pax-option='exthdr.name=%d/PaxHeaders/%f'
+Omit the process ID of @command{tar}.
+This option is needed only if @env{POSIXLY_CORRECT} is set in the environment.
+
+@item --pax-option='delete=atime,delete=ctime'
+Omit irrelevant information about file access or status change time.
+
+@item --clamp-mtime --mtime="$SOURCE_EPOCH"
+Omit irrelevant information about file timestamps after
+@samp{$SOURCE_EPOCH}, which should be a time no less than any
+timestamp of any source file.
+
+@item --numeric-owner
+Omit irrelevant information about user and group names.
+
+@item --owner=0
+@itemx --group=0
+Omit irrelevant information about file ownership and group.
+
+@item --mode='go+u,go-w'
+Omit irrelevant information about file permissions.
+@end table
+
+When creating a reproducible archive from version-controlled source files,
+it can be useful to set each file's modification time
+to be that of its last commit, so that the timestamps
+are reproducible from the version-control repository.
+If these timestamps are all on integer second boundaries, and if you use
+@option{--format=posix --pax-option='delete=atime,delete=ctime'
+--clamp-mtime --mtime="$SOURCE_EPOCH"}
+where @code{$SOURCE_EPOCH} is the time of the most recent commit,
+and if all non-source files have timestamps greater than @code{$SOURCE_EPOCH},
+then @GNUTAR{} should generate an archive in @acronym{ustar} format,
+since no POSIX features will be needed and the archive will be in the
+@acronym{ustar} subset of @acronym{posix} format.
+
+Also, if compressing, use a reproducible compression format; e.g.,
+with @command{gzip} you should use the @option{--no-name} (@option{-n}) option.
+
+Here is an example set of shell commands to produce a reproducible
+tarball with @command{git} and @command{gzip}, which you can tailor to
+your project's needs.
+
+@example
+function get_commit_time() @{
+  TZ=UTC0 git log -1 \
+    --format=tformat:%cd \
+    --date=format:%Y-%m-%dT%H:%M:%SZ \
+    "$@@"
+@}
+SOURCE_EPOCH=$(get_commit_time)
+git ls-files | while read -r file; do
+  commit_time=$(get_commit_time -- "$file") &&
+  touch -cmd "$commit_time" -- "$file"
+done
+TARFLAGS="
+  --sort=name --format=posix
+  --pax-option=exthdr.name=%d/PaxHeaders/%f
+  --pax-option=delete=atime,delete=ctime
+  --clamp-mtime --mtime=$SOURCE_EPOCH
+  --numeric-owner --owner=0 --group=0
+  --mode=go+u,go-w
+"
+GZIPFLAGS="
+  --no-name --best
+"
+LC_ALL=C tar $TARFLAGS -cf - FILES |
+  gzip $GZIPFLAGS > ARCHIVE.tgz
+@end example
+
 @node cpio
 @section Comparison of @command{tar} and @command{cpio}
 @UNREVISED{}
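
The recipe in the new Reproducibility section can be sanity-checked with a small script. The following is a minimal sketch and is not part of the commit. It assumes GNU tar 1.29 or later, gzip, and a touch that accepts ISO 8601 dates. The tree layout, the file names, and the fixed SOURCE_EPOCH value are hypothetical, chosen only for illustration. It builds two trees with the same contents but different timestamps, archives each with the options listed in the section, and compares the two results byte for byte.

```shell
#!/bin/sh
# Sketch: two trees with identical contents but different timestamps
# should yield byte-identical archives under the documented options.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical fixed timestamp standing in for $SOURCE_EPOCH.
SOURCE_EPOCH='2023-07-24T00:00:00Z'

# make_tree DIR DATE: create a small tree whose entries all carry DATE.
make_tree() {
  mkdir -p "$1/src"
  printf 'hello\n' > "$1/src/a.txt"
  printf 'world\n' > "$1/src/b.txt"
  # DATE is later than SOURCE_EPOCH, so --clamp-mtime replaces it.
  touch -d "$2" "$1/src/a.txt" "$1/src/b.txt" "$1/src"
}

# make_archive DIR: write a compressed archive of DIR/src to stdout.
make_archive() {
  (cd "$1" &&
   LC_ALL=C tar --sort=name --format=posix \
     --pax-option='exthdr.name=%d/PaxHeaders/%f' \
     --pax-option='delete=atime,delete=ctime' \
     --clamp-mtime --mtime="$SOURCE_EPOCH" \
     --numeric-owner --owner=0 --group=0 \
     --mode='go+u,go-w' \
     -cf - src | gzip --no-name --best)
}

make_tree "$workdir/one" '2024-01-01T00:00:00Z'
make_tree "$workdir/two" '2024-06-01T12:34:56Z'
make_archive "$workdir/one" > "$workdir/one.tgz"
make_archive "$workdir/two" > "$workdir/two.tgz"

# cmp -s exits 0 only when the two files are byte-for-byte equal.
if cmp -s "$workdir/one.tgz" "$workdir/two.tgz"; then
  result=identical
else
  result=different
fi
echo "$result"
```

If the script reports a difference, dropping individual options one at a time and comparing the uncompressed archives with cmp is a reasonable way to see which piece of metadata varies.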