
New doc about reproducible archives

* doc/tar.texi (Reproducibility): New section.
Spruce some other sections related to timestamps etc.
Paul Eggert 1 year ago
commit d1ca333391
2 changed files with 176 additions and 70 deletions
  1. NEWS (+7 -2)
  2. doc/tar.texi (+169 -68)

+ 7 - 2
NEWS

@@ -1,5 +1,10 @@
-GNU tar NEWS - User visible changes. 2023-07-18
+GNU tar NEWS - User visible changes. 2023-07-24
 Please send GNU tar bug reports to <bug-tar@gnu.org>
+
+version TBD
+
+* New manual section "Reproducibility", for reproducible tarballs.
+
 
 version 1.35 - Sergey Poznyakoff, 2023-07-18
 
@@ -14,7 +19,7 @@ version 1.35 - Sergey Poznyakoff, 2023-07-18
 ** Fix interaction of --update with --wildcards.
 
 ** When extracting archives into an empty directory, do not create
-   hard links to files outside that directory. 
+   hard links to files outside that directory.
 
 ** Handle partial reads from regular files.
 

+ 169 - 68
doc/tar.texi

@@ -346,6 +346,7 @@ Controlling the Archive Format
 * Compression::                 Using Less Space through Compression
 * Attributes::                  Handling File Attributes
 * Portability::                 Making @command{tar} Archives More Portable
+* Reproducibility::             Making @command{tar} Archives More Reproducible
 * cpio::                        Comparison of @command{tar} and @command{cpio}
 
 Using Less Space through Compression
@@ -2806,7 +2807,7 @@ numeric fields.
 Creates a @acronym{POSIX.1-1988} compatible archive.
 
 @item posix
-Creates a @acronym{POSIX.1-2001 archive}.
+Creates a @acronym{POSIX.1-2001} archive.
 
 @end table
 
@@ -3048,8 +3049,8 @@ latter case, the modification time of that file is used. @xref{override}.
 
 When @command{--clamp-mtime} is also specified, files with
 modification times earlier than @var{date} will retain their actual
-modification times, and @var{date} will only be used for files whose
-modification times are later than @var{date}.
+modification times, and @var{date} will be used only for files with
+modification times later than @var{date}.
 
 @opsummary{multi-volume}
 @item --multi-volume
@@ -3525,7 +3526,7 @@ No directory sorting is performed. This is the default.
 @item name
 Sort the directory entries on name. The operating system may deliver
 directory entries in a more or less random order, and sorting them
-makes archive creation reproducible.
+makes archive creation more reproducible.  @xref{Reproducibility}.
 
 @item inode
 Sort the directory entries on inode number. Sorting directories on
@@ -5592,28 +5593,27 @@ $ @kbd{tar -c -f archive.tar --mode='a+rw' .}
 @item --mtime=@var{date}
 @opindex mtime
 
-When adding files to an archive, @command{tar} will use @var{date} as
+When adding files to an archive, @command{tar} uses @var{date} as
 the modification time of members when creating archives, instead of
 their actual modification times.  The argument @var{date} can be
 either a textual date representation in almost arbitrary format
 (@pxref{Date input formats}) or a name of an existing file, starting
 with @samp{/} or @samp{.}.  In the latter case, the modification time
-of that file will be used.
+of that file is used.
 
-The following example will set the modification date to 00:00:00,
+The following example sets the modification date to 00:00:00 @sc{utc} on
 January 1, 1970:
 
 @smallexample
-$ @kbd{tar -c -f archive.tar --mtime='1970-01-01' .}
+$ @kbd{tar -c -f archive.tar --mtime='@@0' .}
 @end smallexample
 
 @noindent
 When used with @option{--verbose} (@pxref{verbose tutorial}) @GNUTAR{}
-will try to convert the specified date back to its textual
-representation and compare it with the one given with
-@option{--mtime} options.  If the two dates differ, @command{tar} will
-print a warning saying what date it will use.  This is to help user
-ensure he is using the right date.
+converts the specified date back to a textual form and compares it
+with the one given with @option{--mtime}.
+If the two forms differ, @command{tar} prints both forms in a message,
+to help the user check that the right date is being used.
 
 For example:
 
@@ -5625,14 +5625,15 @@ tar: Option --mtime: Treating date 'yesterday' as 2006-06-20
 @end smallexample
 
 @noindent
-When used with @option{--clamp-mtime} @GNUTAR{} will only set the
-modification date to @var{date} on files whose actual modification
-date is later than @var{date}.  This is to make it easy to build
+When used with @option{--clamp-mtime} @GNUTAR{} sets the
+modification date to @var{date} only on files whose actual modification
+date is later than @var{date}.  This makes it easier to build
 reproducible archives given a common timestamp for generated files
 while still retaining the original timestamps of untouched files.
+@xref{Reproducibility}.
 
 @smallexample
-$ @kbd{tar -c -f archive.tar --clamp-mtime --mtime=@@$SOURCE_DATE_EPOCH .}
+$ @kbd{tar -c -f archive.tar --clamp-mtime --mtime="$SOURCE_EPOCH" .}
 @end smallexample
 
 @item --owner=@var{user}
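
Reviewer note on the hunk above: the clamping rule is easy to check by hand. The following is a minimal sketch, assuming GNU tar 1.29 or later (for --clamp-mtime) and GNU touch; the file names old and new and the 2020 cutoff are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Sketch: --clamp-mtime replaces only timestamps that are later than --mtime.
# Assumes GNU tar >= 1.29 and GNU touch; names and dates are illustrative.
set -eu
export TZ=UTC0 LC_ALL=C

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

mkdir "$work/src"
: > "$work/src/old"
: > "$work/src/new"
touch -d 2001-01-01T00:00:00Z "$work/src/old"   # earlier than the cutoff
touch -d 2030-01-01T00:00:00Z "$work/src/new"   # later than the cutoff

SOURCE_EPOCH='2020-01-01 00:00:00 UTC'          # hypothetical cutoff

tar -C "$work/src" -cf "$work/a.tar" \
    --clamp-mtime --mtime="$SOURCE_EPOCH" old new

# List the archive with full timestamps: "old" should keep its 2001 date,
# while "new" should show the cutoff instead of 2030.
listing=$(tar -tvf "$work/a.tar" --full-time)
printf '%s\n' "$listing"
```

Without --clamp-mtime, both members would be stamped with the cutoff, so the untouched file would lose its original date.
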
@@ -8123,7 +8124,7 @@ Contains shell globbing-patterns and regular expressions (if prefixed
 with @samp{RE:}@footnote{According to the Bazaar docs,
 globbing-patterns are Korn-shell style and regular expressions are
 perl-style.  As of @GNUTAR{} version @value{VERSION}, these are
-treated as shell-style globs and posix extended regexps.  This will be
+treated as shell-style globs and POSIX extended regexps.  This will be
 fixed in future releases.}.  Patterns affect the directory and all its
 subdirectories.
 
@@ -8131,7 +8132,7 @@ Any line beginning with a @samp{#} is a comment.
 
 @findex .hgignore
 @item .hgignore
-Contains posix regular expressions@footnote{Support for perl-style
+Contains POSIX regular expressions@footnote{Support for perl-style
 regexps will appear in future releases.}.  The line @samp{syntax:
 glob} switches to shell globbing patterns.  The line @samp{syntax:
 regexp} switches back.  Comments begin with a @samp{#}.  Patterns
@@ -9163,7 +9164,7 @@ to an archive, the archive will only include new files.  If you use
 @option{--after-date} when extracting an archive, @command{tar} will
 only extract files newer than the @var{date} you specify.
 
-If you only want @command{tar} to make the date comparison based on
+If you want @command{tar} to make the date comparison based only on
 modification of the file's data (rather than status
 changes), then use the @option{--newer-mtime=@var{date}} option.
 
@@ -9190,7 +9191,7 @@ name; the data modification time of that file is used as the date.
 
 @opindex newer-mtime
 @item --newer-mtime=@var{date}
-Acts like @option{--after-date}, but only looks at data modification times.
+Act like @option{--after-date}, but look only at data modification times.
 @end table
 
 These options limit @command{tar} to operate only on files which have
@@ -9209,8 +9210,8 @@ field.
 
 To be precise, @option{--after-date} checks @emph{both} @code{mtime} and
 @code{ctime} and processes the file if either one is more recent than
-@var{date}, while @option{--newer-mtime} only checks @code{mtime} and
-disregards @code{ctime}.  Neither does it use @code{atime} (the last time the
+@var{date}, while @option{--newer-mtime} checks only @code{mtime} and
+disregards @code{ctime}.  Neither option uses @code{atime} (the last time the
 contents of the file were looked at).
 
 Date specifiers can have embedded spaces.  Because of this, you may need
@@ -9223,11 +9224,11 @@ $ @kbd{tar -cf foo.tar --newer-mtime '2 days ago'}
 @end smallexample
 
 When any of these options is used with the option @option{--verbose}
-(@pxref{verbose tutorial}) @GNUTAR{} will try to convert the specified
-date back to its textual representation and compare that with the
-one given with the option.  If the two dates differ, @command{tar} will
-print a warning saying what date it will use.  This is to help user
-ensure he is using the right date.  For example:
+(@pxref{verbose tutorial}) @GNUTAR{} converts the specified
+date back to a textual form and compares that with the
+one given with the option.  If the two forms differ, @command{tar}
+prints both forms in a message, to help the user check that the right
+date is being used.  For example:
 
 @smallexample
 @group
@@ -9596,56 +9597,61 @@ format imposes a number of limitations.  The most important of them
 are:
 
 @enumerate
-@item The maximum length of a file name is limited to 99 characters.
-@item The maximum length of a symbolic link is limited to 99 characters.
-@item It is impossible to store special files (block and character
+@item
+File names and symbolic links can contain at most 100 bytes.
+@item
+File sizes must be less than 8 GiB (@math{2^33} bytes = 8,589,934,592 bytes).
+@item
+It is impossible to store special files (block and character
 devices, fifos etc.)
-@item Maximum value of user or group @acronym{ID} is limited to 2097151 (7777777
-octal)
-@item V7 archives do not contain symbolic ownership information (user
+@item
+UIDs and GIDs must be less than @math{2^21} (2,097,152).
+@item
+V7 archives do not contain symbolic ownership information (user
 and group name of the file owner).
 @end enumerate
 
 This format has traditionally been used by Automake when producing
 Makefiles.  This practice will change in the future, in the meantime,
-however this means that projects containing file names more than 99
-characters long will not be able to use @GNUTAR{} @value{VERSION} and
+however this means that projects containing file names more than 100
+bytes long will not be able to use @GNUTAR{} @value{VERSION} and
 Automake prior to 1.9.
 
 @item ustar
-Archive format defined by @acronym{POSIX.1-1988} specification.  It stores
+Archive format defined by @acronym{POSIX.1-1988} and later.  It stores
 symbolic ownership information.  It is also able to store
 special files.  However, it imposes several restrictions as well:
 
 @enumerate
-@item The maximum length of a file name is limited to 256 characters,
-provided that the file name can be split at a directory separator in
-two parts, first of them being at most 155 bytes long.  So, in most
-cases the maximum file name length will be shorter than 256
-characters.
-@item The maximum length of a symbolic link name is limited to
-100 characters.
-@item Maximum size of a file the archive is able to accommodate
-is 8GB
-@item Maximum value of UID/GID is 2097151.
-@item Maximum number of bits in device major and minor numbers is 21.
+@item
+File names can contain at most 255 bytes.
+@item
+File names longer than 100 bytes must be split at a directory separator in
+two parts, the first being at most 155 bytes long.
+So, in most cases file names must be a bit shorter than 255 bytes.
+@item
+Symbolic links can contain at most 100 bytes.
+@item
+Files can contain at most 8 GiB (@math{2^33} bytes = 8,589,934,592 bytes).
+@item
+UIDs, GIDs, device major numbers, and device minor numbers
+must be less than @math{2^21} (2,097,152).
 @end enumerate
 
 @item star
-Format used by J@"org Schilling @command{star}
+The format used by the late J@"org Schilling's @command{star}
 implementation.  @GNUTAR{} is able to read @samp{star} archives but
 currently does not produce them.
 
 @item posix
-Archive format defined by @acronym{POSIX.1-2001} specification.  This is the
-most flexible and feature-rich format.  It does not impose any
-restrictions on file sizes or file name lengths.  This format is quite
-recent, so not all tar implementations are able to handle it properly.
-However, this format is designed in such a way that any tar
-implementation able to read @samp{ustar} archives will be able to read
-most @samp{posix} archives as well, with the only exception that any
-additional information (such as long file names etc.)@: will in such
-case be extracted as plain text files along with the files it refers to.
+The format defined by @acronym{POSIX.1-2001} and later.  This is the
+most flexible and feature-rich format.  It does not impose arbitrary
+restrictions on file sizes or file name lengths.  This format is more
+recent, so some @command{tar} implementations cannot handle it properly.
+However, any @command{tar} implementation able to read @samp{ustar}
+archives should be able to read most @samp{posix} archives as well,
+except that it will extract any additional information (such as long
+file names) as extra plain text files.
 
 This archive format will be the default format for future versions
 of @GNUTAR{}.
@@ -9659,21 +9665,22 @@ formats:
 @headitem Format @tab UID @tab File Size @tab File Name @tab Devn
 @item gnu    @tab 1.8e19 @tab Unlimited @tab Unlimited @tab 63
 @item oldgnu @tab 1.8e19 @tab Unlimited @tab Unlimited @tab 63
-@item v7     @tab 2097151 @tab 8GB @tab 99 @tab n/a
-@item ustar  @tab 2097151 @tab 8GB @tab 256 @tab 21
+@item v7     @tab 2097151 @tab 8 GiB @minus{} 1 @tab 99 @tab n/a
+@item ustar  @tab 2097151 @tab 8 GiB @minus{} 1 @tab 255 @tab 21
 @item posix  @tab Unlimited @tab Unlimited @tab Unlimited @tab Unlimited
 @end multitable
 
 The default format for @GNUTAR{} is defined at compilation
 time.  You may check it by running @command{tar --help}, and examining
 the last lines of its output.  Usually, @GNUTAR{} is configured
-to create archives in @samp{gnu} format, however, future version will
+to create archives in @samp{gnu} format, however, a future version will
 switch to @samp{posix}.
 
 @menu
 * Compression::                 Using Less Space through Compression
 * Attributes::                  Handling File Attributes
 * Portability::                 Making @command{tar} Archives More Portable
+* Reproducibility::             Making @command{tar} Archives More Reproducible
 * cpio::                        Comparison of @command{tar} and @command{cpio}
 @end menu
 
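
Reviewer note on the limits in the two hunks above: the v7 and ustar figures appear to follow from the header layout, where numeric fields are stored as octal digit strings (7 digits for UID, GID and device numbers, 11 digits for size and mtime). The sketch below only redoes that arithmetic; it assumes a shell with 64-bit arithmetic and GNU date for the final conversion, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the ustar/v7 numeric limits from the octal field widths.
# Assumes 64-bit shell arithmetic; the last step assumes GNU date (-d @SECONDS).
set -eu

max_id=$((07777777))            # 7 octal digits: UID, GID, device numbers
max_size=$((077777777777))      # 11 octal digits: file size and mtime

echo "largest UID/GID:   $max_id (2^21 - 1; 2^21 = $((1 << 21)))"
echo "largest file size: $max_size bytes (2^33 - 1; 2^33 = $((1 << 33)))"

# The same 11-digit field holds the timestamp, which gives the upper end
# of the date range mentioned in the manual text.
last_time=$(date -u -d "@$max_size" +%Y-%m-%dT%H:%M:%SZ)
echo "latest timestamp:  $last_time"
```

This matches the "must be less than" wording in the new text: the largest storable value is one below the power of two.
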
@@ -10610,8 +10617,8 @@ will use the following default value:
 %d/PaxHeaders/%f
 @end smallexample
 
-This default is selected to ensure the reproducibility of the
-archive. @acronym{POSIX} standard recommends to use
+This default helps make the archive more reproducible.
+@xref{Reproducibility}.  @acronym{POSIX} recommends using
 @samp{%d/PaxHeaders.%p/%f} instead, which means the two archives
 created with the same set of options and containing the same set
 of files will be byte-to-byte different. This default will be used
@@ -10712,9 +10719,8 @@ use the following option:
 
 @cindex archives, binary equivalent
 @cindex binary equivalent archives, creating
-As another example, here is the option that ensures that any two
-archives created using it, will be binary equivalent if they have the
-same contents:
+As another example, the following option helps make the archive
+more reproducible.  @xref{Reproducibility}.
 
 @smallexample
 --pax-option delete=atime
@@ -10800,7 +10806,7 @@ file.  You will than have to switch to a format that is able to
 handle such values.  The format summary table (@pxref{Formats}) will
 help you to do so.
 
-In particular, when trying to archive files larger than 8GB or with
+In particular, when trying to archive files 8 GiB or larger, or with
 timestamps not in the range 1970-01-01 00:00:00 through 2242-03-16
 12:56:31 @sc{utc}, you will have to chose between @acronym{GNU} and
 @acronym{POSIX} archive formats.  When considering which format to
@@ -10816,7 +10822,9 @@ representations.
 
 On the other hand, @acronym{POSIX} archives, generally speaking, can
 be extracted by any tar implementation that understands older
-@acronym{ustar} format.  The only exception are files larger than 8GB.
+@acronym{ustar} format.  The exceptions are files 8 GiB or larger,
+or files dated before 1970-01-01 00:00:00 or after 2242-03-16
+12:56:31 @sc{utc}.
 
 @FIXME{Describe how @acronym{POSIX} archives are extracted by non
 POSIX-aware tars.}
@@ -11171,6 +11179,99 @@ Done
 @end group
 @end smallexample
 
+@node Reproducibility
+@section Making @command{tar} Archives More Reproducible
+
+Sometimes it is important for an archive to be reproducible,
+so that one can easily verify that it was derived solely from its input.
+However, two archives created by @GNUTAR{} from two sets of input
+files normally might differ even if the input files have the same
+contents and @GNUTAR{} was invoked the same way on both sets of input.
+This can happen if the inputs have different modification dates or
+other metadata, or if the input directories' entries are in different orders.
+
+To avoid this problem when creating an archive, and thus make the
+archive reproducible, you can run @GNUTAR{} in the C locale with
+some or all of the following options:
+
+@table @option
+@item --sort=name
+Omit irrelevant information about directory entry order.
+
+@item --format=posix
+Avoid problems with large files or files with unusual timestamps.
+This also enables @option{--pax-option} options mentioned below.
+
+@item --pax-option='exthdr.name=%d/PaxHeaders/%f'
+Omit the process ID of @command{tar}.
+This option is needed only if @env{POSIXLY_CORRECT} is set in the environment.
+
+@item --pax-option='delete=atime,delete=ctime'
+Omit irrelevant information about file access or status change time.
+
+@item --clamp-mtime --mtime="$SOURCE_EPOCH"
+Omit irrelevant information about file timestamps after
+@samp{$SOURCE_EPOCH}, which should be a time no less than any
+timestamp of any source file.
+
+@item --numeric-owner
+Omit irrelevant information about user and group names.
+
+@item --owner=0
+@itemx --group=0
+Omit irrelevant information about file ownership and group.
+
+@item --mode='go+u,go-w'
+Omit irrelevant information about file permissions.
+@end table
+
+When creating a reproducible archive from version-controlled source files,
+it can be useful to set each file's modification time
+to be that of its last commit, so that the timestamps
+are reproducible from the version-control repository.
+If these timestamps are all on integer second boundaries, and if you use
+@option{--format=posix --pax-option='delete=atime,delete=ctime'
+--clamp-mtime --mtime="$SOURCE_EPOCH"}
+where @code{$SOURCE_EPOCH} is the time of the most recent commit,
+and if all non-source files have timestamps greater than @code{$SOURCE_EPOCH},
+then @GNUTAR{} should generate an archive in @acronym{ustar} format,
+since no POSIX features will be needed and the archive will be in the
+@acronym{ustar} subset of @acronym{posix} format.
+
+Also, if compressing, use a reproducible compression format; e.g.,
+with @command{gzip} you should use the @option{--no-name} (@option{-n}) option.
+
+Here is an example set of shell commands to produce a reproducible
+tarball with @command{git} and @command{gzip}, which you can tailor to
+your project's needs.
+
+@example
+function get_commit_time() @{
+  TZ=UTC0 git log -1 \
+    --format=tformat:%cd \
+    --date=format:%Y-%m-%dT%H:%M:%SZ \
+    "$@@"
+@}
+SOURCE_EPOCH=$(get_commit_time)
+git ls-files | while read -r file; do
+  commit_time=$(get_commit_time -- "$file") &&
+  touch -cmd $commit_time -- "$file"
+done
+TARFLAGS="
+  --sort=name --format=posix
+  --pax-option=exthdr.name=%d/PaxHeaders/%f
+  --pax-option=delete=atime,delete=ctime
+  --clamp-mtime --mtime=$SOURCE_EPOCH
+  --numeric-owner --owner=0 --group=0
+  --mode=go+u,go-w
+"
+GZIPFLAGS="
+  --no-name --best
+"
+LC_ALL=C tar $TARFLAGS -cf - FILES |
+  gzip $GZIPFLAGS > ARCHIVE.tgz
+@end example
+
 @node cpio
 @section Comparison of @command{tar} and @command{cpio}
 @UNREVISED{}
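
Reviewer note on the new Reproducibility section: one way to sanity-check the recommended options is to archive two trees that have the same contents but different timestamps and creation order, then compare the results byte for byte. This is a minimal sketch, assuming GNU tar 1.29 or later, GNU touch and gzip; the make_archive helper, the directory names a and b, and the 2020 cutoff are hypothetical and not part of the commit.

```shell
#!/bin/sh
# Sketch: two trees with equal contents should yield identical tarballs
# when the options from the Reproducibility section are used.
# Assumes GNU tar >= 1.29, GNU touch and gzip; all names are illustrative.
set -eu
export LC_ALL=C TZ=UTC0

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Same file contents, different creation order and modification times.
mkdir "$work/a" "$work/b"
printf 'hello\n' > "$work/a/one"
printf 'world\n' > "$work/a/two"
printf 'world\n' > "$work/b/two"
printf 'hello\n' > "$work/b/one"
touch -d 2031-01-01T00:00:00Z "$work/a/one" "$work/a/two"
touch -d 2032-06-01T00:00:00Z "$work/b/one" "$work/b/two"

SOURCE_EPOCH='2020-01-01 00:00:00 UTC'   # hypothetical cutoff, earlier than every mtime above

# Hypothetical helper: archive directory $1 with the options from the manual.
make_archive() {
  tar -C "$1" \
      --sort=name --format=posix \
      --pax-option='exthdr.name=%d/PaxHeaders/%f' \
      --pax-option='delete=atime,delete=ctime' \
      --clamp-mtime --mtime="$SOURCE_EPOCH" \
      --numeric-owner --owner=0 --group=0 \
      --mode='go+u,go-w' \
      -cf - . |
    gzip --no-name --best
}

make_archive "$work/a" > "$work/a.tgz"
make_archive "$work/b" > "$work/b.tgz"

if cmp -s "$work/a.tgz" "$work/b.tgz"; then
  result=identical
else
  result=different
fi
echo "archives are $result"
```

Dropping --clamp-mtime and --mtime from the helper should be enough to make the comparison fail here, since the two trees carry different modification times.
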