Is there a way to tell the Linux kernel to only use a certain percentage of memory for the buffer cache? I know /proc/sys/vm/drop_caches can be used to clear the cache temporarily, but is there any permanent setting that prevents it from growing to more than e.g. 50% of main memory?

The reason I want to do this is that I have a server running a Ceph OSD which constantly serves data from disk and manages to use up the entire physical memory as buffer cache within a few hours. At the same time, I need to run applications that will allocate a large amount (several tens of GB) of physical memory. Contrary to popular belief (see the advice given on nearly all questions concerning the buffer cache), the automatic freeing of memory by discarding clean cache entries is not instantaneous: starting my application can take up to a minute when the buffer cache is full (*), while after clearing the cache (using echo 3 > /proc/sys/vm/drop_caches) the same application starts nearly instantaneously.

(*) During this minute of startup time, the application is faulting in new memory but spends 100% of its time in the kernel, according to VTune, in a function called pageblock_pfn_to_page. This function seems to be related to the memory compaction needed to find huge pages, which leads me to believe that fragmentation is actually the problem.
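A quick way to check that guess is to snapshot the kernel's compaction counters around the slow startup (a sketch; the counters assume a kernel built with compaction support, and the application path is a placeholder):

grep -E 'compact_(stall|fail|success)' /proc/vmstat > /tmp/compact.before
/path/to/my/application &      # start the memory-hungry app
sleep 60                       # let it finish faulting in its memory
grep -E 'compact_(stall|fail|success)' /proc/vmstat > /tmp/compact.after
diff /tmp/compact.before /tmp/compact.after

If compact_stall jumps sharply only when the buffer cache is full, compaction (rather than writeback) is the bottleneck.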

There is something called cache tiering; as an example: ceph osd pool set {cachepool} hit_set_count 1, ceph osd pool set {cachepool} hit_set_period 3600, ceph osd pool set {cachepool} target_max_bytes 1000000000000. See docs.ceph.com/docs/master/rados/operations/cache-tiering – Michael D. Jan 11 at 22:55
Since this problem apparently only affects the startup of the memory-intensive applications, maybe you could start the apps via a script that clears the cache before actually starting them (see the sketch below). That might start them faster while still leaving cache management to the kernel while they are running. – Thawn Jan 15 at 13:14
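A minimal sketch of such a wrapper (the application path is a placeholder; dropping caches requires root):

#!/bin/sh
# Flush dirty pages, drop clean page/dentry/inode caches, then start the app.
sync
echo 3 > /proc/sys/vm/drop_caches
exec /path/to/your/app "$@"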

If you do not want an absolute limit but just want to pressure the kernel to flush out the buffers faster, you should look at vm.vfs_cache_pressure:

This variable controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects (the VFS caches), versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed.

It ranges from 0 to 200; move it towards 200 for higher pressure. The default is 100. You can also analyze your memory usage using the slabtop command; in your case, the dentry and *_inode_cache values are probably high.
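For example, to raise the pressure and inspect the slab caches (the value 150 is just an illustration, not a recommendation):

# Apply immediately:
sudo sysctl -w vm.vfs_cache_pressure=150
# Make it persistent across reboots (on systems that read /etc/sysctl.d):
echo 'vm.vfs_cache_pressure = 150' | sudo tee /etc/sysctl.d/90-vfs-cache.conf
# Show the largest slab caches once (look for dentry and *_inode_cache):
sudo slabtop -o | head -n 20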

If you want an absolute limit, you should look up cgroups. Place the Ceph OSD server within a cgroup and limit the maximum memory it can use by setting the memory.limit_in_bytes parameter for the cgroup.

memory.memsw.limit_in_bytes sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units: k or K for kilobytes, m or M for megabytes, and g or G for gigabytes.
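A rough sketch of doing this by hand with the cgroup v1 memory controller (the group name cephcache and the 32G limit are made up for illustration, and the paths assume the controller is mounted at /sys/fs/cgroup/memory; on a systemd host you would more likely set MemoryLimit= in the ceph-osd unit instead):

# Create the group and set the hard limit (K/M/G suffixes are accepted):
sudo mkdir /sys/fs/cgroup/memory/cephcache
echo 32G | sudo tee /sys/fs/cgroup/memory/cephcache/memory.limit_in_bytes
# Move a running OSD into the group (first PID only, for illustration):
pidof ceph-osd | awk '{print $1}' | sudo tee /sys/fs/cgroup/memory/cephcache/cgroup.procs

Page-cache pages brought in by the OSD's reads are charged to its cgroup, so the limit also caps how much cache its I/O can occupy.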

References:

[1] - GlusterFS Linux Kernel Tuning

[2] - RHEL 6 Resource Management Guide

A cgroup with limit_in_bytes set seems to do it. Thanks! – Wim Jan 22 at 15:35

tuned is an adaptive system tuning daemon that adjusts system settings dynamically depending on usage.

 $ man tuned

See the related documentation and configuration files:

 /etc/tuned
 /etc/tuned/*.conf
 /usr/share/doc/tuned-2.4.1
 /usr/share/doc/tuned-2.4.1/TIPS.txt

This parameter may be useful for you.

# Set periodic writeback flushing to once every 30 seconds (the value is in centiseconds)
echo "3000" > /proc/sys/vm/dirty_writeback_centisecs
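If you want tuned itself to carry that setting, a sketch of a custom profile (the profile name lowcache is hypothetical; tuned 2.x applies the [sysctl] section when the profile is activated):

sudo mkdir -p /etc/tuned/lowcache
sudo tee /etc/tuned/lowcache/tuned.conf <<'EOF'
[main]
summary=More frequent dirty-page writeback

[sysctl]
vm.dirty_writeback_centisecs=3000
EOF
sudo tuned-adm profile lowcache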

Additional Info

The sync command flushes the buffer, i.e., forces all unwritten data to be written to disk, and can be used when one wants to be sure that everything is safely written. In traditional UNIX systems, there is a program called update running in the background which does a sync every 30 seconds, so it is usually not necessary to use sync. Linux has an additional daemon, bdflush, which does a more imperfect sync more frequently to avoid the sudden freeze due to heavy disk I/O that sync sometimes causes.

Under Linux, bdflush is started by update. There is usually no reason to worry about it, but if bdflush happens to die for some reason, the kernel will warn about this, and you should start it by hand (/sbin/update).

Isn't this only for dirty entries? I don't think that's the issue on my system as they are all clean -- the delay is not in writing back dirty pages but in defragmenting space left by removing clean ones. – Wim Jan 7 at 10:23
    
Yes, this is for dirty pages. I think you can also fix other performance problems by setting tuned to dynamic mode. – Ijaz Khan Jan 7 at 10:28

I don't know about a percentage limit, but you can set up a timer so that the cache is dropped every x minutes.

First, clear the current caches from a terminal:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Then make it a cron job: press Alt-F2, type gksudo gedit /etc/crontab, and add this line near the bottom:

 */15 *    * * *   root    sync && echo 3 > /proc/sys/vm/drop_caches

This clears the cache every 15 minutes. You can set it to 1 or 5 minutes if you really want to by changing the first field to * or */5 instead of */15.

To see your free RAM, excluding cache:

free -m | sed -n -e '3p' | grep -Po "\d+$"

I think your hunch at the very end of your question is on the right track. I'd suspect either (a) NUMA-aware memory allocation migrating pages between CPUs, or (b), more likely, the defrag code of transparent hugepages trying to find contiguous, aligned regions.

Hugepages and transparent hugepages have been identified both as a source of marked performance improvements on certain workloads and as responsible for consuming enormous amounts of CPU time without providing much benefit.

It'd help to know which kernel you're running, the contents of /proc/meminfo (or at least the HugePages_* values), and, if possible, more of the VTune profiler call graph referencing pageblock_pfn_to_page().
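For example, the relevant values can be pulled out with something like this (field names can differ slightly between kernel versions):

grep -E 'HugePages_|AnonHugePages|Hugepagesize' /proc/meminfo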

Also, if you'd indulge my guess, try disabling hugepage defrag with:

echo 'never' >/sys/kernel/mm/transparent_hugepage/defrag

(it may be this instead, depending on your kernel):

echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/defrag
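You can check which mode is currently active before and after the change; the bracketed value is the one in effect, and the exact set of options varies by kernel version:

cat /sys/kernel/mm/transparent_hugepage/defrag
# e.g. output: [always] madvise never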

Lastly, is this app that uses many tens of gigabytes of RAM something you wrote? What language is it written in?

Since you used the term "faulting in memory pages," I'm guessing you're familiar enough with operating system design and virtual memory. I struggle to envision a situation/application that faults so aggressively yet isn't reading in lots of I/O - almost always from the buffer cache that you're trying to limit.

(If you're curious, check out mmap(2) flags like MAP_ANONYMOUS and MAP_POPULATE, and mincore(2), which can be used to see which virtual pages actually have a mapped physical page.)
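mincore(2) has to be called from code, but a rough shell-level approximation is to compare mapped versus resident sizes from /proc/<pid>/smaps (a sketch; ceph-osd is just an example target, and it must be run as root or the process owner):

pid=$(pidof ceph-osd | awk '{print $1}')
awk '/^Size:/ {sz+=$2} /^Rss:/ {rss+=$2} END {printf "mapped: %d kB, resident: %d kB\n", sz, rss}' "/proc/$pid/smaps"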

Good Luck!

