Friday, April 8, 2011

Object cache for C#

I'm doing a document viewer for some document format. To make it easier, let's say this is a PDF viewer, a Desktop application. One requirement for the software is the speed in rendering. So, right now, I'm caching the image for the next pages while the user is scrolling through the document.

This works, the UI is very responsive and it seems like the application is able to render the pages almost instantly....at a cost : the memory usage sometimes goes to 600MB. I cache it all in memory.

Now, I can cache to disk, I know, but doing that all the time is noticeably slower. What I would like to do is implement some cache (LRU?), where some of the cached pages (image objects) are on memory and most of them are on disk.

Before I embark on this, is there something in the framework or some library out there that will do this for me? It seems a pretty common enough problem. (This is a desktop application, not ASP.NET)

Alternatively, do you have other ideas for this problem?

From stackoverflow
  • There's patterns & practices Enterprise Library (more specifically, Caching Application Block), but it IMO tends to be over-engineered and overly complex.

  • How are you implementing your cache?

    You can use the Cache class from System.Web.Caching, even in non-web applications, and it will purge items on an LRU basis if/when it needs the memory.

    In a non-web application you'll need to use HttpRuntime.Cache to access the Cache instance.

    Note that the documentation states that the Cache class isn't intended to be used outside of ASP.NET, although it's always worked for me. (I've never relied on it in any mission-critical app though.)

    moogs : oh wow, I just it was just for ASP.NET only. Will try this :) Right now it's just a straightforward hashtable.
    Rune Grimstad : The Asp.Net cache is not intended to be used outside Asp.Net, so be sure to test your application properly. But it should work well of course :-)
  • The .NET Framework has always had the ability to keep weak references to objects.

    Basically, weak references are references to objects that the runtime considers "unimportant" and that may be removed by a garbage collection run at any point in time. This can be used, for example, to cache things, but you'd have no control over what gets colected and what not.

    On the other hand, it's very simple to use and it may just be what you need.

    Dave

    moogs : i read about this a long time ago and forgot about this, thanks!
    Paul Stovell : these kinds of solutions can backfire in this scenario. Imagine you are pre-loading images, then putting them in a hashtable of weak references. If no request for it comes in, and the GC runs, it will be deleted. Then when the user scrolls you need to regenerate it. This could all happen within secs
    Dave Van den Eynde : Preloading and caching are not the same thing, and you should take care not to confuse them.
    Paul Stovell : you are right, but the original question sounds a lot more like a preloading scenario to me: "So, right now, I'm caching the image *for the next pages* while the user is scrolling through the document." Why would you want it to be collected before they scroll? And why keep them in memory after?
    Dave Van den Eynde : In that case, the Caching application block is also not going to be of much use. Someone should come up with a Preloading application block. ;)
    moogs : caching can be turned into preloading. the software accesses data the user will accesss in the future, or at least tries to predict so
    Dave Van den Eynde : Perhaps, but I don't think it's the same thing. The preloading of data, and the prediction of which data to load is very distinct from the decision which data to remove. They are inherently related: you can probably come up with a good algorithm to decide which data to preload, but you'd have no control over how long that data is kept.
  • Caching application block and ASP.NET cache are both options however, although they do LRU, the only kind of disk utilization that happens is by memory paging. I think there are ways you can optimize this that are more specific to your goal to get a better output. Here are some thoughts:

    • Since it's an ASP.NET app, why not generate the images, write them to disk, and when the browser requests the next page have IIS serve it up. That keeps your ASP.NET worker process lower while letting IIS do what it's good at.
    • Use statistics about how a user interacts. LRU as implemented in the ASP.NET cache will typically apply to the individual image files - not much use for this scenario by the sounds of it. Instead, maybe some logic like: "If the user has scrolled X pages in the last Y seconds, then general the next X*Y images". Users scrolling quickly will get more pages generated; users reading slowly will need less cached.
    • Have IIS serve images from disk, and use a custom HTTP handler for images you really want to control the caching of.
    • Have the browser request the files ahead of time too, and rely on browser caching.
    • Once an image has been served to the client, is it fair to say it can pretty much be removed from the cache? That could reduce the footprint substantially.

    I'd certainly avoid using a plain hash table though.

  • A classic trade-off situation. Keeping everything in memory will be fast at the cost of massively increased memory consumption, whilst retrieving from disc decreases memory consumption, but isn't as performant. However, you already know all this!

    The built-in System.Web.Caching.Cache class is great, and I've used it to good effect many times myself in my ASP.NET applications (although mostly for database record caching), however, the drawback is that the cache will only run on one machine (typically a sole web server) and cannot be distributed across multiple machines.

    If it's possible to "throw some hardware" at the problem, and it doesn't necessarily need to be expensive hardware, just boxes with plenty of memory, you could always go with a distributed caching solution. This will give you much more memory to play with whilst retaining (nearly) the same level of performance.

    Some options for a distributed caching solution for .NET are:

    Memcached.NET

    indeXus.Net

    or even Microsoft's own Velocity project.

    Dave Van den Eynde : So instead of caching it on disk, you'd do it over the network? Yup, that'll be a lot faster.
    DrJokepu : Dave Van den Eynde: Caching on a local, gigabit network is a very common approach. It is not feasible in the case, however.
  • I wrote an LRU Cache and some test cases, feel free to use it.

    You can read through the source on my blog.

    For the lazy (here it is minus the test cases):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace LRUCache {
        public class IndexedLinkedList<T> {
    
            LinkedList<T> data = new LinkedList<T>();
            Dictionary<T, LinkedListNode<T>> index = new Dictionary<T, LinkedListNode<T>>();
    
            public void Add(T value) {
                index[value] = data.AddLast(value);
            }
    
            public void RemoveFirst() {
                index.Remove(data.First.Value);
                data.RemoveFirst();
            }
    
            public void Remove(T value) {
                LinkedListNode<T> node;
                if (index.TryGetValue(value, out node)) {
                    data.Remove(node);
                    index.Remove(value);
                }
            }
    
            public int Count {
                get {
                    return data.Count;
                }
            }
    
            public void Clear() {
                data.Clear();
                index.Clear();
            }
    
            public T First {
                get {
                    return data.First.Value;
                }
            }
        }
    }
    

    LRUCache

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace LRUCache {
        public class LRUCache<TKey, TValue> : IDictionary<TKey, TValue> {
    
            object sync = new object();
            Dictionary<TKey, TValue> data;
            IndexedLinkedList<TKey> lruList = new IndexedLinkedList<TKey>();
            ICollection<KeyValuePair<TKey, TValue>> dataAsCollection;
            int capacity;
    
            public LRUCache(int capacity) {
    
                if (capacity <= 0) {
                    throw new ArgumentException("capacity should always be bigger than 0");
                }
    
                data = new Dictionary<TKey, TValue>(capacity);
                dataAsCollection = data;
                this.capacity = capacity;
            }
    
            public void Add(TKey key, TValue value) {
                if (!ContainsKey(key)) {
                    this[key] = value;
                } else {
                    throw new ArgumentException("An attempt was made to insert a duplicate key in the cache.");
                }
            }
    
            public bool ContainsKey(TKey key) {
                return data.ContainsKey(key);
            }
    
            public ICollection<TKey> Keys {
                get {
                    return data.Keys;
                }
            }
    
            public bool Remove(TKey key) {
                bool existed = data.Remove(key);
                lruList.Remove(key);
                return existed;
            }
    
            public bool TryGetValue(TKey key, out TValue value) {
                return data.TryGetValue(key, out value);
            }
    
            public ICollection<TValue> Values {
                get { return data.Values; }
            }
    
            public TValue this[TKey key] {
                get {
                    var value = data[key];
                    lruList.Remove(key);
                    lruList.Add(key);
                    return value;
                }
                set {
                    data[key] = value;
                    lruList.Remove(key);
                    lruList.Add(key);
    
                    if (data.Count > capacity) {
                        Remove(lruList.First);
                        lruList.RemoveFirst();
                    }
                }
            }
    
            public void Add(KeyValuePair<TKey, TValue> item) {
                Add(item.Key, item.Value);
            }
    
            public void Clear() {
                data.Clear();
                lruList.Clear();
            }
    
            public bool Contains(KeyValuePair<TKey, TValue> item) {
                return dataAsCollection.Contains(item);
            }
    
            public void CopyTo(KeyValuePair<TKey, TValue>[] array, int arrayIndex) {
                dataAsCollection.CopyTo(array, arrayIndex);
            }
    
            public int Count {
                get { return data.Count; }
            }
    
            public bool IsReadOnly {
                get { return false; }
            }
    
            public bool Remove(KeyValuePair<TKey, TValue> item) {
    
                bool removed = dataAsCollection.Remove(item);
                if (removed) {
                    lruList.Remove(item.Key);
                }
                return removed;
            }
    
    
            public IEnumerator<KeyValuePair<TKey, TValue>> GetEnumerator() {
                return dataAsCollection.GetEnumerator();
            }
    
    
            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
                return ((System.Collections.IEnumerable)data).GetEnumerator();
            }
    
        }
    }
    
    Ricardo Nolde : The link is down!
    Sam Saffron : I fixed the link, but I recall there is a bug in the code, I will try to fix it.
    Sam Saffron : I updated the code, and wrote a few tests which did not find any flaws, not to say it has no bugs
    jrista : I think it would be better to put the removal of "least recently used" items in the add and remove methods, rather than in the indexer. I wouldn't want to take the performance hit every time I accessed something in my cache, because I don't care if the cache is "full" when I'm simply accessing. However, when I add, I definitely care whether the cache is full, and do indeed want to purge the least recently used item(s) at that time.
  • @Sam S: Thanks for the code!

    This part seems incorrect:

                if (data.Count > capacity) {
                    Remove(lruList.First);
                    lruList.RemoveFirst();
                }
    

    Together with

    public bool Remove(TKey key) {
        bool existed = data.Remove(key);
        lruList.Remove(key);
        return existed;
    }
    

    Your lruList will be out of sync, as the oldest two are deleted from the lruList, and only the oldest from the data...

0 comments:

Post a Comment