@c This is part of the paxutils manual.
@c Copyright (C) 2006 Free Software Foundation, Inc.
@c This file is distributed under GFDL 1.1 or any later version
@c published by the Free Software Foundation.

The notion of a sparse file, and the ways of handling it from the point
of view of a @GNUTAR{} user, are described in detail in
@ref{sparse}.  This chapter describes the internal formats @GNUTAR{}
uses to store such files.

The support for sparse files in @GNUTAR{} has a long history.  The
earliest version featuring this support that I was able to find was 1.09,
released in November, 1990.  The format introduced back then is called
the @dfn{old GNU} sparse format.  Although its design contained many
flaws, it was the only format @GNUTAR{} supported until version 1.14
(May, 2004), which introduced initial support for sparse files in
@acronym{PAX} archives (@pxref{posix}).  This format was not free from
design flaws either, and it was subsequently improved in versions
1.15.2 (November, 2005) and 1.15.92 (June, 2006).

In addition to the GNU sparse formats, @GNUTAR{} is able to read and
extract sparse files archived by @command{star}.

The following subsections describe each of the GNU formats in detail.

@menu
* Old GNU Format::
* PAX 0::                PAX Format, Versions 0.0 and 0.1
* PAX 1::                PAX Format, Version 1.0
@end menu

@node Old GNU Format
@appendixsubsec Old GNU Format

This format was introduced some time around 1990 (version 1.09).  It was
designed on top of standard @code{ustar} headers in such an
unfortunate way that some of its fields overwrote fields required by
POSIX.

An old GNU sparse header is designated by type @samp{S}
(@code{GNUTYPE_SPARSE}) and has the following layout:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name @tab Data type @tab Contents
@item 0 @tab 345 @tab @tab N/A @tab Not used.
@item 345 @tab 12 @tab atime @tab Number @tab @code{atime} of the file.
@item 357 @tab 12 @tab ctime @tab Number @tab @code{ctime} of the file.
@item 369 @tab 12 @tab offset @tab Number @tab For
multivolume archives: the offset of the start of this volume.
@item 381 @tab 4 @tab @tab N/A @tab Not used.
@item 385 @tab 1 @tab @tab N/A @tab Not used.
@item 386 @tab 96 @tab sp @tab @code{sparse_header} @tab (4 entries) File map.
@item 482 @tab 1 @tab isextended @tab Bool @tab @code{1} if an
extension sparse header follows, @code{0} otherwise.
@item 483 @tab 12 @tab realsize @tab Number @tab Real size of the file.
@end multitable

Each @code{sparse_header} object at offset 386 describes a single
data chunk.  It has the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.60
@headitem Offset @tab Size @tab Data type @tab Contents
@item 0 @tab 12 @tab Number @tab Offset of the
beginning of the chunk.
@item 12 @tab 12 @tab Number @tab Size of the chunk.
@end multitable
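
As an illustration only, the two layouts above could be decoded along
the following lines.  This is a minimal Python sketch, not code taken
from @GNUTAR{}.  It assumes that every @samp{Number} field is a NUL- or
space-terminated octal ASCII string, as in plain @code{ustar} headers
(any other encoding is not handled), and that unused map slots are
NUL-filled.  The names @code{parse_number} and
@code{parse_old_gnu_header} are hypothetical.

```python
BLOCK_SIZE = 512
SPARSE_OFFSET = 386        # start of the 4-entry file map
ENTRY_SIZE = 24            # 12-byte offset followed by 12-byte size
ISEXTENDED_OFFSET = 482
REALSIZE_OFFSET = 483


def parse_number(field):
    """Decode an octal ASCII field (assumed encoding); empty means 0."""
    text = field.split(b"\0", 1)[0].strip()
    return int(text.decode("ascii"), 8) if text else 0


def parse_old_gnu_header(block):
    """Return (chunks, isextended, realsize) for a type 'S' header block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a tar header block is 512 bytes long")
    chunks = []
    for i in range(4):
        start = SPARSE_OFFSET + i * ENTRY_SIZE
        offset_field = block[start:start + 12]
        size_field = block[start + 12:start + ENTRY_SIZE]
        if size_field[:1] == b"\0":
            break              # assumption: a NUL-filled slot ends the map
        chunks.append((parse_number(offset_field), parse_number(size_field)))
    # Accept both a binary 0 and an ASCII "0" as "no extension follows".
    flag = block[ISEXTENDED_OFFSET:ISEXTENDED_OFFSET + 1]
    isextended = flag not in (b"\0", b"0")
    realsize = parse_number(block[REALSIZE_OFFSET:REALSIZE_OFFSET + 12])
    return chunks, isextended, realsize
```

Given the 512-byte header block of a sparse member, the function
returns the (at most four) chunks recorded in it, a flag telling
whether more map entries follow, and the real size of the file.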

If the member contains more than four chunks, the @code{isextended}
field of the header has the value @code{1} and the main header is
followed by one or more @dfn{extension headers}.  Each such header has
the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name @tab Data type @tab Contents
@item 0 @tab 504 @tab sp @tab @code{sparse_header} @tab
(21 entries) File map.
@item 504 @tab 1 @tab isextended @tab Bool @tab @code{1} if an
extension sparse header follows, or @code{0} otherwise.
@end multitable

A header with @code{isextended=0} ends the map.
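
The chain of extension headers can be walked with a simple loop.  The
following Python sketch is an illustration under the same assumptions
as the field descriptions above suggest: octal ASCII numbers, 24-byte
map entries, and NUL-filled unused slots.  The function names are
hypothetical, and the caller is expected to supply the 512-byte blocks
that follow the main header.

```python
ENTRIES_PER_EXTENSION = 21
ENTRY_SIZE = 24                 # 12-byte offset followed by 12-byte size
EXT_ISEXTENDED_OFFSET = 504     # 21 entries * 24 bytes


def octal_field(field):
    """Decode an octal ASCII field (assumed encoding); empty means 0."""
    text = field.split(b"\0", 1)[0].strip()
    return int(text.decode("ascii"), 8) if text else 0


def read_extension_headers(blocks):
    """Collect (offset, size) pairs from consecutive extension headers.

    Reading stops after the first block whose isextended flag is clear.
    """
    chunks = []
    for block in blocks:
        for i in range(ENTRIES_PER_EXTENSION):
            start = i * ENTRY_SIZE
            size_field = block[start + 12:start + ENTRY_SIZE]
            if size_field[:1] == b"\0":
                break          # assumption: a NUL-filled slot ends the map
            offset = octal_field(block[start:start + 12])
            chunks.append((offset, octal_field(size_field)))
        flag = block[EXT_ISEXTENDED_OFFSET:EXT_ISEXTENDED_OFFSET + 1]
        if flag in (b"\0", b"0"):
            break              # this header ends the map
    return chunks
```

The resulting list is meant to be appended to the entries found in the
main header.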

@node PAX 0
@appendixsubsec PAX Format, Versions 0.0 and 0.1

@UNREVISED{}

There are two formats available in this branch.  Version @code{0.0}
is the initial version of the sparse format, used by @command{tar}
versions 1.14--1.15.1.  The sparse file map is kept in extended
(@code{x}) PAX header variables:

@table @code
@item GNU.sparse.size
Real size of the stored file.

@item GNU.sparse.numblocks
Number of blocks in the sparse map.

@item GNU.sparse.offset
Offset of the data block.

@item GNU.sparse.numbytes
Size of the data block.
@end table

The latter two variables repeat for each data block, so the overall
structure is like this:

@smallexample
@group
GNU.sparse.size=@var{size}
GNU.sparse.numblocks=@var{numblocks}
repeat @var{numblocks} times
  GNU.sparse.offset=@var{offset}
  GNU.sparse.numbytes=@var{numbytes}
end repeat
@end group
@end smallexample
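
Because the offset and size variables repeat, a reader has to process
the extended header records in the order in which they appear rather
than load them into a keyword-to-value table.  The Python sketch below
illustrates this.  It assumes that the records have already been
decoded into a list of (keyword, value) string pairs and that the
values are decimal numbers; the function name is hypothetical.

```python
def sparse_map_0_0(records):
    """Build (size, chunks) from ordered (keyword, value) pairs."""
    size = None
    numblocks = None
    pending_offset = None
    chunks = []
    for keyword, value in records:
        if keyword == "GNU.sparse.size":
            size = int(value)
        elif keyword == "GNU.sparse.numblocks":
            numblocks = int(value)
        elif keyword == "GNU.sparse.offset":
            pending_offset = int(value)
        elif keyword == "GNU.sparse.numbytes":
            if pending_offset is None:
                raise ValueError("numbytes without a preceding offset")
            chunks.append((pending_offset, int(value)))
            pending_offset = None
    if numblocks is not None and numblocks != len(chunks):
        raise ValueError("map length does not match GNU.sparse.numblocks")
    return size, chunks
```

Feeding the same records into an ordinary dictionary would keep only
the last offset and size pair.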

This format presented the following two problems:

@enumerate 1
@item
Whereas the POSIX specification allows a variable to appear multiple
times in a header, it requires that only the last occurrence be
meaningful.  Thus, multiple occurrences of @code{GNU.sparse.offset} and
@code{GNU.sparse.numbytes} conflict with the POSIX specification.

@item
Attempting to extract such archives using third-party @command{tar}s
results in extraction of sparse files in @emph{condensed form}.  If
the @command{tar} implementation in question does not support the POSIX
format, it will also extract a file containing extension header
attributes.  This file can be used to expand the file to its original
state.  However, POSIX-aware @command{tar}s will usually ignore the
unknown variables, which makes restoring the file much more
difficult@FIXME-xref{how to extract sparse file using third-party @command{tar}s}.
@end enumerate

@GNUTAR{} 1.15.2 introduced sparse format version @code{0.1}, which
attempted to solve these problems.  Like its predecessor, this format
stores the sparse map in the extended POSIX header.  It retains the
@code{GNU.sparse.size} and @code{GNU.sparse.numblocks} variables, but
instead of @code{GNU.sparse.offset}/@code{GNU.sparse.numbytes} pairs
it uses a single variable:

@table @code
@item GNU.sparse.map
Map of non-null data chunks.  It is a string consisting of
comma-separated values "@var{offset},@var{size}[,@var{offset-1},@var{size-1}...]"
@end table
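
Decoding such a value amounts to splitting it on commas and pairing up
the numbers.  A minimal Python sketch follows; it assumes that the
numbers are decimal, and the function name is hypothetical.

```python
def parse_sparse_map(value):
    """Turn "offset,size[,offset,size...]" into a list of pairs."""
    numbers = [int(item) for item in value.split(",")] if value else []
    if len(numbers) % 2 != 0:
        raise ValueError("GNU.sparse.map must hold offset,size pairs")
    return list(zip(numbers[0::2], numbers[1::2]))
```

For example, the value @samp{0,512,4096,1024} describes two chunks: 512
bytes at offset 0 and 1024 bytes at offset 4096.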

To address the second problem, the @code{name} field in the @code{ustar}
header is replaced with a special name, constructed using the following
pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample

The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}.  Thus, those @command{tar} implementations
that are not aware of GNU extensions will at least extract the files
into separate directories, giving the user a possibility to expand them
afterwards @FIXME-ref{how to extract sparse file using third-party
@command{tar}s}.
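
The pattern above is not spelled out further in this section.  The
sketch below assumes the meaning these placeholders have in other
@GNUTAR{} name templates: @samp{%d} is the directory part of the member
name, @samp{%p} the process ID of the archiving @command{tar}, and
@samp{%f} the base name.  The function name is hypothetical, and the
process ID is passed in explicitly so that the result is reproducible.

```python
import posixpath


def sparse_member_name(path, pid):
    """Expand %d/GNUSparseFile.%p/%f for the given member path."""
    dirname, basename = posixpath.split(path)
    # Assumption: a name without a directory part yields "." for %d.
    return "%s/GNUSparseFile.%d/%s" % (dirname or ".", pid, basename)
```

With these assumptions, the member @file{var/log/big.img} archived by
process 1234 would be stored as
@file{var/log/GNUSparseFile.1234/big.img}.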

The resulting @code{GNU.sparse.map} string can be @emph{very} long.
Although POSIX does not impose any limit on the length of an @code{x}
header variable, such a long value can confuse some @command{tar}
implementations.

@node PAX 1
@appendixsubsec PAX Format, Version 1.0

@UNREVISED{}

Version @code{1.0} of the sparse format was introduced with @GNUTAR{}
1.15.92.  Its main objective was to make the resulting file
extractable with little effort even by non-POSIX-aware @command{tar}
implementations.  Starting from this version, the extended header
preceding a sparse member always contains the following variables that
identify the format being used:

@table @code
@item GNU.sparse.major
Major version.

@item GNU.sparse.minor
Minor version.
@end table

The @code{name} field in the @code{ustar} header contains a special name,
constructed using the following pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample

The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}.  The real size of the file is stored in the
variable @code{GNU.sparse.realsize}.

The sparse map itself is stored in the file data block, preceding the actual
file data.  It consists of a series of decimal numbers of arbitrary length,
delimited by newlines.  The map is padded with nulls to the nearest block
boundary.

The first number gives the number of entries in the map.  Following are map
entries, each one consisting of two numbers giving the offset and size of the
data block it describes.
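
A reader of this format therefore has to consume the map first and then
skip to the next block boundary to find the file data.  The following
Python sketch illustrates this on an in-memory copy of the member data.
It assumes 512-byte blocks; the numeric base is left as a parameter,
with a default of 10, which is what current @GNUTAR{} releases write as
far as I can tell.  The function name is hypothetical.

```python
BLOCK_SIZE = 512


def parse_map_1_0(data, base=10):
    """Return (chunks, data_start) for the stored data of a 1.0 member."""
    pos = 0

    def next_number():
        nonlocal pos
        end = data.index(b"\n", pos)    # ValueError if the map is truncated
        value = int(data[pos:end].decode("ascii"), base)
        pos = end + 1
        return value

    count = next_number()
    chunks = []
    for _ in range(count):
        offset = next_number()
        size = next_number()
        chunks.append((offset, size))
    # The map is NUL-padded, so file data starts at the next block boundary.
    data_start = -(-pos // BLOCK_SIZE) * BLOCK_SIZE
    return chunks, data_start
```

The returned @code{data_start} is the position within the member data
at which the contents of the first chunk begin.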

The format is designed in such a way that non-POSIX-aware @command{tar}s
and @command{tar}s not supporting @code{GNU.sparse.*} keywords will extract
each sparse file in its condensed form with the file map prepended and will
place it into a separate directory.  Then, using a simple program, it would be
possible to expand the file to its original form even without @GNUTAR{}.
@FIXME-xref{how to extract sparse file using third-party
@command{tar}s}.  @FIXME{Write the program and give its URL here}.
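
To give an idea of how simple such a program could be, here is a
hypothetical Python sketch.  It is not the program referred to above.
It assumes that the condensed file starts with the map as described in
this section, that blocks are 512 bytes long, and that the numbers are
decimal unless another base is given.  Since a third-party
@command{tar} may not preserve @code{GNU.sparse.realsize}, the real
size is an optional argument; without it, the end of the last map entry
is used.

```python
import io

BLOCK_SIZE = 512


def expand_sparse(src, dst, realsize=None, base=10):
    """Expand a condensed 1.0 member from src into dst.

    Both arguments are seekable binary file objects; src must be
    positioned at its beginning, where the map is stored.
    """
    def read_number():
        digits = bytearray()
        while True:
            ch = src.read(1)
            if not ch:
                raise ValueError("truncated sparse map")
            if ch == b"\n":
                return int(digits.decode("ascii"), base)
            digits += ch

    count = read_number()
    chunks = [(read_number(), read_number()) for _ in range(count)]
    # Skip the NUL padding that follows the map.
    src.seek(-(-src.tell() // BLOCK_SIZE) * BLOCK_SIZE)
    end = 0
    for offset, size in chunks:
        data = src.read(size)
        if len(data) != size:
            raise ValueError("truncated file data")
        dst.seek(offset)            # seeking past the end leaves a hole
        dst.write(data)
        end = max(end, offset + size)
    final_size = end if realsize is None else realsize
    dst.seek(0, io.SEEK_END)
    if dst.tell() < final_size:     # the file ends with a hole
        dst.seek(final_size - 1)
        dst.write(b"\0")
```

Opening the condensed file and a new output file in binary mode and
passing both to @code{expand_sparse} would recreate the original
contents; whether the holes actually stay unallocated depends on the
file system.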