@c This is part of the paxutils manual.
@c Copyright (C) 2006--2023 Free Software Foundation, Inc.
@c This file is distributed under GFDL 1.1 or any later version
@c published by the Free Software Foundation.

@cindex sparse formats
@cindex sparse versions
The notion of a sparse file, and the ways of handling such files from
the point of view of a @GNUTAR{} user, are described in detail in
@ref{sparse}. This chapter describes the internal formats @GNUTAR{}
uses to store such files.

The support for sparse files in @GNUTAR{} has a long history. The
earliest version featuring this support that I was able to find was 1.09,
released in November, 1990. The format introduced back then is called
the @dfn{old GNU} sparse format and, in spite of the fact that its design
contained many flaws, it was the only format @GNUTAR{} supported
until version 1.14 (May, 2004), which introduced initial support for
sparse files in @acronym{PAX} archives (@pxref{posix}). This
format was not free from design flaws, either, and it was subsequently
improved in versions 1.15.2 (November, 2005) and 1.15.92 (June,
2006).

In addition to the GNU sparse formats, @GNUTAR{} is able to read and
extract sparse files archived by @command{star}.

The following subsections describe each format in detail.

@menu
* Old GNU Format::
* PAX 0::                       PAX Format, Versions 0.0 and 0.1
* PAX 1::                       PAX Format, Version 1.0
@end menu

@node Old GNU Format
@appendixsubsec Old GNU Format

@cindex sparse formats, Old GNU
@cindex Old GNU sparse format
The format introduced in November 1990 (v. 1.09) was
designed on top of standard @code{ustar} headers in such an
unfortunate way that some of its fields overwrote fields required by
POSIX.

An old GNU sparse header is designated by type @samp{S}
(@code{GNUTYPE_SPARSE}) and has the following layout:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name @tab Data type @tab Contents
@item 0 @tab 345 @tab @tab N/A @tab Not used.
@item 345 @tab 12 @tab atime @tab Number @tab @code{atime} of the file.
@item 357 @tab 12 @tab ctime @tab Number @tab @code{ctime} of the file.
@item 369 @tab 12 @tab offset @tab Number @tab For
multivolume archives: the offset of the start of this volume.
@item 381 @tab 4 @tab @tab N/A @tab Not used.
@item 385 @tab 1 @tab @tab N/A @tab Not used.
@item 386 @tab 96 @tab sp @tab @code{sparse_header} @tab (4 entries) File map.
@item 482 @tab 1 @tab isextended @tab Bool @tab @code{1} if an
extension sparse header follows, @code{0} otherwise.
@item 483 @tab 12 @tab realsize @tab Number @tab Real size of the file.
@end multitable

Each @code{sparse_header} object at offset 386 describes a single
data chunk. It has the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.60
@headitem Offset @tab Size @tab Data type @tab Contents
@item 0 @tab 12 @tab Number @tab Offset of the
beginning of the chunk.
@item 12 @tab 12 @tab Number @tab Size of the chunk.
@end multitable

If the member contains more than four chunks, the @code{isextended}
field of the header has the value @code{1} and the main header is
followed by one or more @dfn{extension headers}. Each such header has
the following structure:

@multitable @columnfractions 0.10 0.10 0.20 0.20 0.40
@headitem Offset @tab Size @tab Name @tab Data type @tab Contents
@item 0 @tab 504 @tab sp @tab @code{sparse_header} @tab
(21 entries) File map.
@item 504 @tab 1 @tab isextended @tab Bool @tab @code{1} if an
extension sparse header follows, or @code{0} otherwise.
@end multitable

A header with @code{isextended=0} ends the map.
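
The layout described in this subsection can be walked with a few lines of
code. The following Python sketch is an illustration only, not the
code @GNUTAR{} uses. The function names are hypothetical. It assumes
512-byte header blocks, that every @samp{Number} field is a NUL- or
space-padded octal ASCII string as in @code{ustar}, that unused map
slots are filled with NULs, and that a true @code{isextended} is stored
either as the byte value 1 or as the character @samp{1}.

```python
ENTRY = 24  # one sparse_header entry: 12-byte offset + 12-byte size


def octal(field):
    """Decode a NUL/space-padded octal ASCII field (assumed encoding)."""
    text = bytes(field).strip(b" \0").decode("ascii")
    return int(text, 8) if text else 0


def is_true(flag):
    """Interpret the one-byte isextended field (both spellings accepted)."""
    return bytes(flag) in (b"\x01", b"1")


def read_entries(block, start, count):
    """Return (offset, size) pairs from `count` consecutive map entries."""
    pairs = []
    for i in range(count):
        pos = start + i * ENTRY
        raw_offset = bytes(block[pos:pos + 12])
        raw_size = bytes(block[pos + 12:pos + 24])
        if not raw_offset.strip(b" \0") and not raw_size.strip(b" \0"):
            break  # assumption: an empty slot means no more entries here
        pairs.append((octal(raw_offset), octal(raw_size)))
    return pairs


def old_gnu_sparse_map(blocks):
    """blocks[0] is the main header, blocks[1:] are extension headers.

    Returns (realsize, list of (offset, size) chunks).
    """
    main = blocks[0]
    chunks = read_entries(main, 386, 4)      # sp: 4 entries at offset 386
    realsize = octal(main[483:495])          # realsize: 12 bytes at 483
    extended = is_true(main[482:483])        # isextended: 1 byte at 482
    index = 1
    while extended:
        ext = blocks[index]
        chunks += read_entries(ext, 0, 21)   # sp: 21 entries at offset 0
        extended = is_true(ext[504:505])     # isextended: 1 byte at 504
        index += 1
    return realsize, chunks
```

The caller is expected to have read the main header block and the
blocks that follow it. The loop stops at the first header whose
@code{isextended} field is not set.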

@node PAX 0
@appendixsubsec PAX Format, Versions 0.0 and 0.1

@cindex sparse formats, v.0.0
There are two formats available in this branch. Version @code{0.0}
is the initial version of the sparse format, used by @command{tar}
versions 1.14--1.15.1. The sparse file map is kept in extended
(@code{x}) PAX header variables:

@table @code
@vrindex GNU.sparse.size, extended header variable
@item GNU.sparse.size
Real size of the stored file;

@item GNU.sparse.numblocks
@vrindex GNU.sparse.numblocks, extended header variable
Number of blocks in the sparse map;

@item GNU.sparse.offset
@vrindex GNU.sparse.offset, extended header variable
Offset of the data block;

@item GNU.sparse.numbytes
@vrindex GNU.sparse.numbytes, extended header variable
Size of the data block.
@end table

The latter two variables repeat for each data block, so the overall
structure is like this:

@smallexample
@group
GNU.sparse.size=@var{size}
GNU.sparse.numblocks=@var{numblocks}
repeat @var{numblocks} times
  GNU.sparse.offset=@var{offset}
  GNU.sparse.numbytes=@var{numbytes}
end repeat
@end group
@end smallexample
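
The repeat structure above can be decoded by walking the header records
in the order they appear. The following Python sketch is a minimal
illustration, not the code @GNUTAR{} uses. The function name
@code{sparse_map_v00} and the representation of the extended header as
a list of (keyword, value) pairs are assumptions made for the example,
and the values are assumed to be decimal strings.

```python
def sparse_map_v00(records):
    """Collect the v.0.0 sparse map from ordered (keyword, value) pairs.

    Returns (realsize, list of (offset, numbytes) chunks).
    """
    realsize = None
    numblocks = None
    pending_offset = None
    chunks = []
    for keyword, value in records:
        if keyword == "GNU.sparse.size":
            realsize = int(value)
        elif keyword == "GNU.sparse.numblocks":
            numblocks = int(value)
        elif keyword == "GNU.sparse.offset":
            pending_offset = int(value)
        elif keyword == "GNU.sparse.numbytes":
            if pending_offset is None:
                raise ValueError("numbytes without a preceding offset")
            chunks.append((pending_offset, int(value)))
            pending_offset = None
        # any other keyword is not part of the sparse map and is skipped
    if numblocks is not None and numblocks != len(chunks):
        raise ValueError("numblocks does not match the number of pairs")
    return realsize, chunks
```

The records must be kept in order and with their repetitions. A
keyword-to-value dictionary would retain only the last
offset/numbytes pair.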

This format presented the following two problems:

@enumerate 1
@item
Whereas the POSIX specification allows a variable to appear multiple
times in a header, it requires that only the last occurrence be
meaningful. Thus, multiple occurrences of @code{GNU.sparse.offset} and
@code{GNU.sparse.numbytes} conflict with the POSIX specification.

@item
Attempting to extract such archives using a third-party @command{tar}
results in extraction of sparse files in @emph{condensed form}. If
the @command{tar} implementation in question does not support POSIX
format, it will also extract a file containing extension header
attributes. This file can be used to expand the file to its original
state. However, POSIX-aware @command{tar}s will usually ignore the
unknown variables, which makes restoring the file more
difficult. @xref{extracting sparse v0x, Extraction of sparse
members in v.0.0 format}, for a detailed description of how to
restore such members using non-GNU @command{tar}s.
@end enumerate

@cindex sparse formats, v.0.1
@GNUTAR{} 1.15.2 introduced sparse format version @code{0.1}, which
attempted to solve these problems. Like its predecessor, this format
stores the sparse map in the extended POSIX header. It retains the
@code{GNU.sparse.size} and @code{GNU.sparse.numblocks} variables, but
instead of @code{GNU.sparse.offset}/@code{GNU.sparse.numbytes} pairs
it uses a single variable:

@table @code
@item GNU.sparse.map
@vrindex GNU.sparse.map, extended header variable
Map of non-null data chunks. It is a string consisting of
comma-separated values "@var{offset},@var{size}[,@var{offset-1},@var{size-1}...]".
@end table
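
As an illustration of this encoding, the following Python sketch splits
a @code{GNU.sparse.map} value into (offset, size) pairs. The function
name @code{parse_sparse_map} is hypothetical, and the numbers are
assumed to be decimal.

```python
def parse_sparse_map(value):
    """Split "offset,size,offset,size,..." into (offset, size) pairs."""
    if not value:
        return []  # an empty map describes no data chunks
    numbers = [int(item) for item in value.split(",")]
    if len(numbers) % 2:
        raise ValueError("map must contain an even number of values")
    # Even positions hold offsets, odd positions hold sizes.
    return list(zip(numbers[0::2], numbers[1::2]))
```

For example, the value @samp{0,512,4096,1024} describes two chunks: 512
bytes at offset 0 and 1024 bytes at offset 4096.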

To address the second problem, the @code{name} field in the
@code{ustar} header is replaced with a special name, constructed using
the following pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample

@vrindex GNU.sparse.name, extended header variable
The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}. Thus, those @command{tar} implementations
that are not aware of GNU extensions will at least extract the files
into separate directories, giving the user the possibility to expand them
afterwards. @xref{extracting sparse v0x, Extraction of sparse
members in v.0.1 format}, for a detailed description of how to
restore such members using non-GNU @command{tar}s.

The resulting @code{GNU.sparse.map} string can be @emph{very} long.
Although POSIX does not impose any limit on the length of an @code{x}
header variable, such a long value may confuse some @command{tar}s.

@node PAX 1
@appendixsubsec PAX Format, Version 1.0

@cindex sparse formats, v.1.0
Version @code{1.0} of the sparse format was introduced with @GNUTAR{}
1.15.92. Its main objective was to make the resulting file
extractable with little effort even by non-POSIX-aware @command{tar}
implementations. Starting from this version, the extended header
preceding a sparse member always contains the following variables that
identify the format being used:

@table @code
@item GNU.sparse.major
@vrindex GNU.sparse.major, extended header variable
Major version

@item GNU.sparse.minor
@vrindex GNU.sparse.minor, extended header variable
Minor version
@end table

The @code{name} field in the @code{ustar} header contains a special name,
constructed using the following pattern:

@smallexample
%d/GNUSparseFile.%p/%f
@end smallexample

@vrindex GNU.sparse.name, extended header variable, in v.1.0
@vrindex GNU.sparse.realsize, extended header variable
The real name of the sparse file is stored in the variable
@code{GNU.sparse.name}. The real size of the file is stored in the
variable @code{GNU.sparse.realsize}.

The sparse map itself is stored in the file data block, preceding the actual
file data. It consists of a series of decimal numbers delimited
by newlines. The map is padded with nulls to the nearest block boundary.

The first number gives the number of entries in the map. Following are
map entries, each one consisting of two numbers giving the offset and
size of the data block it describes.
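
Reading such a map back is straightforward. The following Python sketch
is an illustration under stated assumptions, not the code @GNUTAR{}
uses: the function name @code{read_sparse_map_v10} is hypothetical,
the member data is assumed to be available as one bytes object, and the
block size is assumed to be 512 bytes.

```python
BLOCK = 512  # assumed tar block size


def read_sparse_map_v10(data):
    """Parse the map stored at the start of a v.1.0 member's data.

    Returns (chunks, data_start): the list of (offset, size) pairs and
    the position in `data` where the condensed file contents begin.
    """
    pos = 0

    def next_number():
        nonlocal pos
        end = data.index(b"\n", pos)   # numbers are newline-delimited
        number = int(data[pos:end])    # decimal ASCII digits
        pos = end + 1
        return number

    count = next_number()              # first number: entries in the map
    chunks = []
    for _ in range(count):
        offset = next_number()
        size = next_number()
        chunks.append((offset, size))

    # Skip the null padding: round up to the next block boundary.
    data_start = -(-pos // BLOCK) * BLOCK
    return chunks, data_start
```

The returned @code{data_start} is the position of the first byte after
the padded map, which is where the condensed data chunks are stored one
after another.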

The format is designed in such a way that non-POSIX-aware @command{tar}s
and @command{tar}s not
supporting @code{GNU.sparse.*} keywords will extract each sparse file
in its condensed form with the file map prepended and will place it
into a separate directory. Then, using a simple program, it is
possible to expand the file to its original form even without @GNUTAR{}.
@xref{Sparse Recovery}, for detailed information on how to extract
sparse members without @GNUTAR{}.
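
To give an idea of how simple such a program can be, here is a minimal
Python sketch of the expansion step. It is not the procedure described
in @ref{Sparse Recovery}. The function name @code{expand_sparse} and its
parameters are hypothetical: the map is assumed to be already parsed
into (offset, size) pairs, @code{condensed} is assumed to be a binary
file object positioned at the first data chunk, and @code{realsize} is
assumed to be the value of @code{GNU.sparse.realsize}.

```python
import io


def expand_sparse(chunks, condensed, realsize, out):
    """Write the expanded file to `out`, a seekable binary file object."""
    for offset, size in chunks:
        piece = condensed.read(size)
        if len(piece) != size:
            raise ValueError("condensed data is shorter than the map claims")
        out.seek(offset)      # skipping forward leaves a hole of zeros
        out.write(piece)
    out.truncate(realsize)    # drop anything beyond the real size
    # Some file objects do not grow on truncate(); extend them explicitly
    # so that a trailing hole is preserved.
    end = out.seek(0, io.SEEK_END)
    if end < realsize:
        out.seek(realsize - 1)
        out.write(b"\0")
```

For instance, with the map @code{[(0, 5), (10, 3)]}, eight bytes of
condensed data and a real size of 16, the output is 16 bytes long, with
zeros at offsets 5 through 9 and 13 through 15.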