This presentation describes the internals and on-disk layout of the ext3 and ext4 filesystems. It was given to a technical audience and hence consists mostly of low-level technical details.
1.
ext3/ext4 Filesystems
Kalpak Shah
Clogeny Technologies Pvt. Ltd.
2.
AGENDA
Layout of EXT3/4
Essential on-disk data structures
New features in ext4
• Extents, uninit_bg, nanosecond timestamps, 48-
bit support, preallocation, mballoc, flex_bg,
journal checksums
• Their effects on the on-disk layout
Crash recovery
Latest filesystem design layouts
3.
Basic layout of EXT2/3/4 partition
All block groups are of the same size and are stored sequentially.
Superblock and group descriptors are duplicated in
multiple block groups as per SPARSE_SUPER feature.
Block sizes from 1KB up to the system page size (4KB on
x86) are supported.
4.
Creating an ext3 FS
mkfs.ext3 -b 4096 -I 512 -i 8192 -J size=256 /dev/sda1
• Blocksize consideration
• Number of inodes and inode sizes
• Journal size
For example, consider an 8GB ext3 filesystem with a 4KB
block size. In this case, each 4KB block bitmap
describes 32K data blocks, that is, 128MB. Therefore
64 block groups are present in this filesystem.
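The arithmetic in this example can be sketched in C as follows. This is only an illustration under simple assumptions: one bitmap block per group, one bit per block, and no reserved or first-data-block adjustments. The function names are hypothetical and are not part of e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per block in the group. */
uint64_t blocks_per_group(uint32_t block_size)
{
    return (uint64_t)block_size * 8;
}

/* Number of block groups for a filesystem of fs_bytes,
 * rounding up so that a partial last group is counted. */
uint64_t block_group_count(uint64_t fs_bytes, uint32_t block_size)
{
    assert(block_size != 0);
    uint64_t total_blocks = fs_bytes / block_size;
    uint64_t per_group = blocks_per_group(block_size);
    return (total_blocks + per_group - 1) / per_group;
}
```

With a 4KB block size, blocks_per_group() gives 32768 blocks (128MB), and an 8GB filesystem yields 64 groups.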
5.
Ext3 Superblock
The ext3 superblock is stored in an ext3_super_block
structure. Some important fields are listed here:
• s_inodes_count, s_blocks_count,
s_free_blocks_count, s_free_inodes_count,
s_inode_size
• s_blocks_per_group, s_inodes_per_group
• s_mnt_count, s_max_mnt_count
• s_feature_{compat, incompat, rocompat}
• s_uuid, s_volume_name
• s_journal_inum, s_journal_dev
• s_state, s_errors
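As a rough illustration of how some of these fields are read, the sketch below decodes a few of them from a raw superblock buffer. It assumes the usual ext2/3 layout: the superblock starts 1024 bytes into the partition, fields are little-endian, and the byte offsets used here (0, 4, 24, 32, 40, 52, 56, 58) match the kernel's structure. The sb_summary struct and parse_superblock() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_MAGIC 0xEF53u /* expected s_magic value for ext2/3/4 */

/* Hypothetical host-side summary; not the kernel's ext3_super_block. */
struct sb_summary {
    uint32_t inodes_count;     /* s_inodes_count */
    uint32_t blocks_count;     /* s_blocks_count */
    uint32_t block_size;       /* derived from s_log_block_size */
    uint32_t blocks_per_group; /* s_blocks_per_group */
    uint32_t inodes_per_group; /* s_inodes_per_group */
    uint16_t mnt_count;        /* s_mnt_count */
    uint16_t state;            /* s_state */
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short or not an ext2/3/4 superblock. */
int parse_superblock(const uint8_t *sb, size_t len, struct sb_summary *out)
{
    assert(out != NULL);
    if (sb == NULL || len < 60)
        return -1;
    if (le16(sb + 56) != SB_MAGIC)
        return -1;
    uint32_t log_bs = le32(sb + 24);
    if (log_bs > 6) /* block size would exceed 64KB */
        return -1;
    out->inodes_count = le32(sb + 0);
    out->blocks_count = le32(sb + 4);
    out->block_size = 1024u << log_bs;
    out->blocks_per_group = le32(sb + 32);
    out->inodes_per_group = le32(sb + 40);
    out->mnt_count = le16(sb + 52);
    out->state = le16(sb + 58);
    return 0;
}
```

A caller would read 1024 bytes at offset 1024 of the device into a buffer and pass it to parse_superblock().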
6.
Group Descriptors
Each block group has its own group descriptor,
represented by ext3_group_desc structure, which
has these fields:
• bg_block_bitmap
• bg_inode_bitmap
• bg_inode_table
• bg_free_{blocks,inodes}_count
• bg_used_dirs_count
Most of these fields are used by the inode and block allocators.
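To show how bg_inode_table is used together with the superblock's inodes-per-group value, here is a minimal sketch of locating an inode on disk. It assumes inode numbers start at 1 and that the inode table of a group is a contiguous run of blocks; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Block group that holds inode number ino (inode numbers start at 1). */
uint32_t inode_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) / inodes_per_group;
}

/* Index of the inode within its group's inode table. */
uint32_t inode_index(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    return (ino - 1) % inodes_per_group;
}

/* Filesystem block that holds the inode, given the group's bg_inode_table. */
uint64_t inode_block(uint64_t bg_inode_table, uint32_t index,
                     uint32_t inode_size, uint32_t block_size)
{
    assert(block_size > 0);
    return bg_inode_table + ((uint64_t)index * inode_size) / block_size;
}

/* Byte offset of the inode inside that block. */
uint32_t inode_offset(uint32_t index, uint32_t inode_size, uint32_t block_size)
{
    assert(block_size > 0);
    return (uint32_t)(((uint64_t)index * inode_size) % block_size);
}
```

For example, with 512-byte inodes and 4KB blocks, the inode at index 9 lives one block past bg_inode_table, at byte offset 512.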
8.
Directory layout
EXT3/4 implements directories using a special kind
of file whose data blocks store filenames along with
corresponding inode numbers. Such data blocks
basically contain structures of type ext3_dir_entry_2.
This structure contains the following fields:
• Inode number
• Directory entry length
• Name length
• Filetype
• Name
Directory entries are indexed using a 2-level hash tree
(htree) for fast retrieval.
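The sketch below illustrates how the directory entry length chains entries within a data block. It assumes the usual ext3_dir_entry_2 layout: a 4-byte inode number, a 2-byte entry length, a 1-byte name length, a 1-byte file type, then the name, with entries padded to a multiple of 4 bytes. The function names are hypothetical, and htree index blocks are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimum entry size: 8-byte header plus the name, rounded up to a multiple of 4. */
uint16_t dir_rec_len(uint8_t name_len)
{
    return (uint16_t)((name_len + 8u + 3u) & ~3u);
}

/* Count the in-use entries (inode != 0) in one directory block.
 * Stops at the first entry that looks corrupt. */
size_t count_dir_entries(const uint8_t *blk, size_t blk_size)
{
    size_t off = 0, count = 0;

    assert(blk != NULL);
    while (off + 8 <= blk_size) {
        uint32_t ino = (uint32_t)blk[off] | ((uint32_t)blk[off + 1] << 8) |
                       ((uint32_t)blk[off + 2] << 16) | ((uint32_t)blk[off + 3] << 24);
        uint16_t rec_len = (uint16_t)(blk[off + 4] | (blk[off + 5] << 8));
        uint8_t name_len = blk[off + 6];

        if (rec_len < 8 || rec_len % 4 != 0 || off + rec_len > blk_size ||
            dir_rec_len(name_len) > rec_len)
            break; /* corrupt entry: stop walking */
        if (ino != 0)
            count++;
        off += rec_len; /* the entry length leads to the next entry */
    }
    return count;
}
```

The last entry in a block typically has an entry length that extends to the end of the block, which is why the walk relies on the stored length rather than on the name length.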
9.
Ext4 Features – Extents
Extents replace the traditional indirect block mapping scheme, which
causes high metadata overhead and poor performance with
large files.
An extent is a single descriptor that represents a range of
contiguous blocks:
struct ext4_extent {
    __le32 ee_block;    /* first logical block covered by the extent */
    __le16 ee_len;      /* number of blocks covered by the extent */
    __le16 ee_start_hi; /* high 16 bits of physical block */
    __le32 ee_start_lo; /* low 32 bits of physical block */
};
The extent tree leads to efficient lookups and improves
performance on sequential I/O as well as on mail server workloads.
Ext4 supports both extents and indirect block mapping, and
files can be converted between the two formats.
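As a sketch of how a single extent is used, the code below maps a logical block to a physical block. It works on a host-endian copy of the on-disk fields (the hypothetical extent_host struct) and assumes the kernel's convention that an ee_len value above 32768 marks an uninitialized extent whose real length is ee_len - 32768. Tree traversal is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN 32768u /* lengths above this mark an uninitialized extent */

/* Hypothetical host-endian copy of struct ext4_extent. */
struct extent_host {
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

/* Combine the 16 high bits and 32 low bits into the 48-bit physical start block. */
uint64_t extent_start(const struct extent_host *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Actual number of blocks covered, for initialized and uninitialized extents. */
uint16_t extent_len(const struct extent_host *ex)
{
    return ex->ee_len <= EXT_INIT_MAX_LEN
               ? ex->ee_len
               : (uint16_t)(ex->ee_len - EXT_INIT_MAX_LEN);
}

/* Returns 1 and sets *pblk if lblk falls inside the extent, else returns 0. */
int extent_map(const struct extent_host *ex, uint32_t lblk, uint64_t *pblk)
{
    assert(ex != NULL && pblk != NULL);
    if (lblk < ex->ee_block || lblk - ex->ee_block >= extent_len(ex))
        return 0;
    *pblk = extent_start(ex) + (lblk - ex->ee_block);
    return 1;
}
```

One descriptor therefore covers up to 32768 contiguous blocks, where indirect mapping would need one pointer per block.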
11.
Ext4 features
Large FS support
• Ext3 uses 32-bit block numbers, so with a 4KB
block size the filesystem is limited to a maximum size
of 16TB.
• Ext4 uses 48-bit block numbers, which raises the limit
to 1EB with a 4KB block size. All on-disk structures
that store block numbers had to be changed to support this.
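The size limits follow directly from the block number width and the block size. A minimal sketch of the arithmetic, with a hypothetical helper name and assuming the block size is a power of two:

```c
#include <assert.h>
#include <stdint.h>

/* Maximum filesystem size in bytes: 2^block_number_bits blocks of block_size bytes. */
uint64_t max_fs_bytes(unsigned block_number_bits, uint32_t block_size)
{
    unsigned shift = 0;

    /* block_size must be a non-zero power of two */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    while ((1u << shift) < block_size)
        shift++;
    /* the result must fit in 64 bits */
    assert(block_number_bits + shift < 64);
    return 1ULL << (block_number_bits + shift);
}
```

With 4KB blocks, 32-bit block numbers give 2^44 bytes (16TB) and 48-bit block numbers give 2^60 bytes (1EB).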
Persistent preallocation (fallocate support)
• Apps such as large databases often write zeros to a
file for guaranteed and contiguous space reservation.
• Ext4 improves this scenario by skipping the zero-out
and marking the extents as uninitialized instead.
12.
Ext4 features
UNINIT_BG
• For very large filesystems, e2fsck times are becoming
unacceptable.
• The uninitialized block groups feature uses flags in the group
descriptor to indicate whether the block group is initialized or not.
e2fsck can simply skip block groups that are marked as
uninitialized.
• The flags marking the block group uninitialized and the high
watermark are checksummed so that corruption can be detected.
• We have seen a 2-10x speedup for e2fsck in many cases.
Nanosecond timestamp support
• Using the i_{atime, ctime, mtime, crtime}_extra fields.
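The sketch below shows one way the extra fields can be decoded and encoded. It assumes the layout used by the Linux kernel: the low 2 bits of an _extra field extend the signed 32-bit seconds value, and the upper 30 bits hold nanoseconds. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EPOCH_BITS 2u
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1u)

/* Hypothetical decoded timestamp. */
struct ts_decoded {
    int64_t sec;
    uint32_t nsec;
};

/* xtime is the classic 32-bit seconds field, extra is the matching _extra field. */
struct ts_decoded decode_xtime(uint32_t xtime, uint32_t extra)
{
    struct ts_decoded t;

    t.sec = (int64_t)(int32_t)xtime + ((int64_t)(extra & EPOCH_MASK) << 32);
    t.nsec = extra >> EPOCH_BITS;
    return t;
}

/* Build the _extra field; the low 32 bits of sec go into the classic field. */
uint32_t encode_extra(int64_t sec, uint32_t nsec)
{
    assert(nsec < 1000000000u);
    /* representable range with 2 epoch bits on top of a signed 32-bit value */
    assert(sec >= INT32_MIN && sec <= INT32_MAX + (3LL << 32));
    uint32_t epoch = (uint32_t)((sec - (int32_t)(uint32_t)sec) >> 32) & EPOCH_MASK;
    return epoch | (nsec << EPOCH_BITS);
}
```

Besides nanosecond resolution, the two epoch bits push the representable range of seconds well past the year 2038.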
13.
Ext4 features
Multi-block-allocator
• Allocates multiple blocks at once using a buddy data
structure.
• Includes inode and group preallocation
• Includes special allocation modes for small files and
GOAL blocks.
flex_bg
• This feature groups metadata (inode bitmaps, block bitmaps and
inode tables) from a series of groups at the beginning
of a "flex" group in order to improve performance
during heavy metadata operations.
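A small sketch of the group-to-flex-group mapping follows. It assumes the number of groups per flex group is a power of two stored as a log2 value (s_log_groups_per_flex in the ext4 superblock); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flex group that a block group belongs to. */
uint32_t flex_group_of(uint32_t block_group, uint8_t log_groups_per_flex)
{
    assert(log_groups_per_flex < 32);
    return block_group >> log_groups_per_flex;
}

/* First block group of that flex group, where the packed metadata starts. */
uint32_t flex_first_group(uint32_t block_group, uint8_t log_groups_per_flex)
{
    return flex_group_of(block_group, log_groups_per_flex) << log_groups_per_flex;
}
```

With 16 groups per flex group (a log2 value of 4), block groups 0 to 15 form flex group 0, and the metadata for block group 37 is packed at the start of block group 32.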
14.
Crash recovery - JBD/2
First, a copy of the blocks to be written is stored in the journal.
Then, once the I/O transfer to the journal has completed
(the commit block is written), the blocks are written to their final
locations in the filesystem. After a crash, committed transactions
are replayed from the journal.
Journaling modes:
• Journal – All data and metadata is journaled.
• Ordered – Only metadata changes are journaled. Data blocks are
written to disk before the metadata to avoid data corruption.
• Writeback – Only metadata is journaled. Fastest mode.
Journal checksums
• All blocks in a transaction are checksummed and the checksum is
stored in the commit header.
• While replaying the transaction (either by e2fsck or ext4), this
checksum ensures that corrupt or partial transactions are not
written to disk.
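The idea can be sketched as follows: compute one running checksum over every block of a transaction, store it at commit time, and recompute it before replay. This example uses a plain bitwise CRC-32 as a stand-in; the exact checksum algorithm and commit header format used by JBD2 are not shown here, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), usable incrementally:
 * pass 0 for the first call and the previous result for later calls. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum over all blocks of one transaction, in journal order. */
uint32_t transaction_checksum(const uint8_t *const *blocks, size_t nblocks,
                              size_t block_size)
{
    uint32_t crc = 0;

    assert(blocks != NULL || nblocks == 0);
    for (size_t i = 0; i < nblocks; i++)
        crc = crc32_update(crc, blocks[i], block_size);
    return crc;
}

/* Replay only if the recomputed checksum matches the one stored at commit. */
int transaction_is_valid(const uint8_t *const *blocks, size_t nblocks,
                         size_t block_size, uint32_t stored_checksum)
{
    return transaction_checksum(blocks, nblocks, block_size) == stored_checksum;
}
```

During recovery, a transaction for which transaction_is_valid() returns 0 would be treated as incomplete and skipped rather than replayed.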
15.
Latest filesystem design layouts
Trees
• Recent filesystems such as ZFS, Btrfs and Tux3 use indexed trees
for efficient storage of directories, blocks, objects (inodes, EAs)
and snapshots. With 64-bit or 128-bit pointers, practically all
limits imposed on filesystems are removed: number of inodes, EA
sizes, number of files within a directory.
Checksumming
• All data/metadata is checksummed for early detection and
possible correction.
Built-in volume manager
• Volume manager and filesystem are tightly coupled to take
advantage of mirroring and RAID-like functionality.
Built-in encryption and compression