Is there a way to tell the Linux kernel to only use a certain percentage of memory for the buffer cache? I know /proc/sys/vm/drop_caches can be used to clear the cache temporarily, but is there any permanent setting that prevents it from growing to more than e.g. 50% of main memory?

The reason I want to do this is that I have a server running a Ceph OSD which constantly serves data from disk and manages to use up the entire physical memory as buffer cache within a few hours. At the same time, I need to run applications that will allocate a large amount (several tens of GB) of physical memory. Contrary to popular belief (see the advice given on nearly all questions concerning the buffer cache), the automatic freeing of memory by discarding clean cache entries is not instantaneous: starting my application can take up to a minute when the buffer cache is full (*), while after clearing the cache (using echo 3 > /proc/sys/vm/drop_caches) the same application starts nearly instantaneously.

(*) During this minute of startup time, the application is faulting in new memory but spends 100% of its time in the kernel, according to VTune, in a function called pageblock_pfn_to_page. This function seems to be related to the memory compaction needed to find huge pages, which leads me to believe that fragmentation is actually the problem.
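A quick way to check that guess is to snapshot the kernel's compaction counters around the slow startup (a sketch; the counters assume a kernel built with compaction support, and the application path is a placeholder):

grep -E 'compact_(stall|fail|success)' /proc/vmstat > /tmp/compact.before
/path/to/my/application &      # start the memory-hungry app
sleep 60                       # let it finish faulting in its memory
grep -E 'compact_(stall|fail|success)' /proc/vmstat > /tmp/compact.after
diff /tmp/compact.before /tmp/compact.after

If compact_stall jumps sharply only when the buffer cache is full, compaction (rather than writeback) is the bottleneck.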

There is something called cache tiering; as an example: ceph osd pool set {cachepool} hit_set_count 1, ceph osd pool set {cachepool} hit_set_period 3600, ceph osd pool set {cachepool} target_max_bytes 1000000000000. See docs.ceph.com/docs/master/rados/operations/cache-tiering – Michael D. Jan 11 at 22:55
Since this problem apparently only affects the startup of the memory-intensive applications, maybe you could start the apps via a script that clears the cache before actually starting them (see the sketch below). That might start them faster while still leaving cache management to the kernel while they are running. – Thawn Jan 15 at 13:14
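A minimal sketch of such a wrapper (the application path is a placeholder; dropping caches requires root):

#!/bin/sh
# Flush dirty pages, drop clean page/dentry/inode caches, then start the app.
sync
echo 3 > /proc/sys/vm/drop_caches
exec /path/to/your/app "$@"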

If you do not want an absolute limit but just want to pressure the kernel to flush out the buffers faster, you should look at vm.vfs_cache_pressure:

This variable controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects (the VFS caches), versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed.

It ranges from 0 to 200; move it towards 200 for higher pressure. The default is 100. You can also analyze your memory usage using the slabtop command; in your case, the dentry and *_inode_cache values are probably high.
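For example, to raise the pressure and inspect the slab caches (the value 150 is just an illustration, not a recommendation):

# Apply immediately:
sudo sysctl -w vm.vfs_cache_pressure=150
# Make it persistent across reboots (on systems that read /etc/sysctl.d):
echo 'vm.vfs_cache_pressure = 150' | sudo tee /etc/sysctl.d/90-vfs-cache.conf
# Show the largest slab caches once (look for dentry and *_inode_cache):
sudo slabtop -o | head -n 20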

If you want an absolute limit, you should look up cgroups. Place the Ceph OSD server within a cgroup and limit the maximum memory it can use by setting the memory.limit_in_bytes parameter for the cgroup.

memory.memsw.limit_in_bytes sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units: k or K for kilobytes, m or M for megabytes, and g or G for gigabytes.
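A rough sketch of doing this by hand with the cgroup v1 memory controller (the group name cephcache and the 32G limit are made up for illustration, and the paths assume the controller is mounted at /sys/fs/cgroup/memory; on a systemd host you would more likely set MemoryLimit= in the ceph-osd unit instead):

# Create the group and set the hard limit (K/M/G suffixes are accepted):
sudo mkdir /sys/fs/cgroup/memory/cephcache
echo 32G | sudo tee /sys/fs/cgroup/memory/cephcache/memory.limit_in_bytes
# Move a running OSD into the group (first PID only, for illustration):
pidof ceph-osd | awk '{print $1}' | sudo tee /sys/fs/cgroup/memory/cephcache/cgroup.procs

Page-cache pages brought in by the OSD's reads are charged to its cgroup, so the limit also caps how much cache its I/O can occupy.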

References:

[1] - GlusterFS Linux Kernel Tuning

[2] - RHEL 6 Resource Management Guide

A cgroup with limit_in_bytes set seems to do it. Thanks! – Wim Jan 22 at 15:35

tuned is an adaptive system tuning daemon that adjusts system settings dynamically depending on usage.

 $ man tuned

See the related documentation and configuration files:

 /etc/tuned
 /etc/tuned/*.conf
 /usr/share/doc/tuned-2.4.1
 /usr/share/doc/tuned-2.4.1/TIPS.txt

This parameter may be useful for you.

# Set periodic writeback flushing to once every 30 seconds (the value is in centiseconds)
echo "3000" > /proc/sys/vm/dirty_writeback_centisecs
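If you want tuned itself to carry that setting, a sketch of a custom profile (the profile name lowcache is hypothetical; tuned 2.x applies the [sysctl] section when the profile is activated):

sudo mkdir -p /etc/tuned/lowcache
sudo tee /etc/tuned/lowcache/tuned.conf <<'EOF'
[main]
summary=More frequent dirty-page writeback

[sysctl]
vm.dirty_writeback_centisecs=3000
EOF
sudo tuned-adm profile lowcache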

Additional Info

The sync command flushes the buffer, i.e., forces all unwritten data to be written to disk, and can be used when one wants to be sure that everything is safely written. In traditional UNIX systems, there is a program called update running in the background which does a sync every 30 seconds, so it is usually not necessary to use sync. Linux has an additional daemon, bdflush, which does a more imperfect sync more frequently to avoid the sudden freeze due to heavy disk I/O that sync sometimes causes.

Under Linux, bdflush is started by update. There is usually no reason to worry about it, but if bdflush happens to die for some reason, the kernel will warn about this, and you should start it by hand (/sbin/update).

Isn't this only for dirty entries? I don't think that's the issue on my system as they are all clean -- the delay is not in writing back dirty pages but in defragmenting space left by removing clean ones. – Wim Jan 7 at 10:23
    
Yes, this is for dirty pages. I think you can also fix other performance problems by setting tuned to dynamic mode. – Ijaz Khan Jan 7 at 10:28

I don't know about a percentage limit, but you can set up a timer so that the cache is dropped every x minutes.

First, clear the current caches from a terminal:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Then make it a cron job: press Alt-F2, type gksudo gedit /etc/crontab, and add this line near the bottom:

 */15 *    * * *   root    sync && echo 3 > /proc/sys/vm/drop_caches

This clears the cache every 15 minutes. You can set it to 1 or 5 minutes if you really want to by changing the first field to * or */5 instead of */15.

To see your free RAM, excluding cache:

free -m | sed -n -e '3p' | grep -Po "\d+$"

I think your hunch at the very end of your question is on the right track. I'd suspect either (a) NUMA-aware memory allocation migrating pages between CPUs, or (b), more likely, the defrag code of transparent hugepages trying to find contiguous, aligned regions.

Hugepages and transparent hugepages have been identified both as a source of marked performance improvements on certain workloads and as responsible for consuming enormous amounts of CPU time without providing much benefit.

It'd help to know which kernel you're running, the contents of /proc/meminfo (or at least the HugePages_* values), and, if possible, more of the VTune profiler call graph referencing pageblock_pfn_to_page().
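For example, the relevant values can be pulled out with something like this (field names can differ slightly between kernel versions):

grep -E 'HugePages_|AnonHugePages|Hugepagesize' /proc/meminfo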

Also, if you'd indulge my guess, try disabling hugepage defrag with:

echo 'never' >/sys/kernel/mm/transparent_hugepage/defrag

(it may be this instead, depending on your kernel):

echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/defrag
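You can check which mode is currently active before and after the change; the bracketed value is the one in effect, and the exact set of options varies by kernel version:

cat /sys/kernel/mm/transparent_hugepage/defrag
# e.g. output: [always] madvise never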

Lastly, is this app that uses many tens of gigabytes of RAM something you wrote? What language is it written in?

Since you used the term "faulting in memory pages," I'm guessing you're familiar enough with operating system design and virtual memory. I struggle to envision a situation/application that faults so aggressively yet isn't reading in lots of I/O - almost always from the buffer cache that you're trying to limit.

(If you're curious, check out mmap(2) flags like MAP_ANONYMOUS and MAP_POPULATE, and mincore(2), which can be used to see which virtual pages actually have a mapped physical page.)
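mincore(2) has to be called from code, but a rough shell-level approximation is to compare mapped versus resident sizes from /proc/<pid>/smaps (a sketch; ceph-osd is just an example target, and it must be run as root or the process owner):

pid=$(pidof ceph-osd | awk '{print $1}')
awk '/^Size:/ {sz+=$2} /^Rss:/ {rss+=$2} END {printf "mapped: %d kB, resident: %d kB\n", sz, rss}' "/proc/$pid/smaps"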

Good Luck!

