[.net] Resources for working with Machine Learning in F#


Answers

In addition to what Tomas mentioned, I spent some time with Infer.NET about a year ago and found it was pretty good for continuous graphical models. I know it's improved quite a lot over the last year in both the scope of the library and F# support. I suggest checking it out and seeing if it has what you need.

Question

I took a Machine Learning course that used Matlab as a prototyping tool. Since I have become addicted to F#, I would like to continue my Machine Learning study in F#.

I may want to use F# for both prototyping and production, so a Machine Learning framework would be a great start. Otherwise, I can start with a collection of libraries:

  • Highly-optimized linear algebra library
  • Statistics package
  • Visualization library (which allows drawing and interacting with charts, diagrams...)
  • Parallel computing toolbox (similar to Matlab parallel computing toolbox)

And the most important resources (to me) are books, blog posts and online courses regarding Machine Learning in a functional programming language (F#/OCaml/Haskell...).

Can anyone suggest these kinds of resources? Thanks.


EDIT:

This is a summary based on the answers below:

Machine Learning frameworks:

  • Infer.NET: a .NET framework for Bayesian inference in graphical models with good F# support.
  • WekaSharper: an F# wrapper around the popular data mining framework Weka.
  • Microsoft Sho: a continuous development environment for data analysis (including matrix operations, optimization and visualization) on the .NET platform.

Related libraries:

Reading list:

Any other pointers or suggestions are also welcome.




  • The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
  • Loads of data are read into .NET DataTables and then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).

F# should be an excellent choice here. As a first-class language in Visual Studio, F# has quite good interoperability with C#.
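For example, an F# function can consume a DataTable filled on the C# side and hand back a plain Dictionary that C# code uses directly. This is only a sketch; the column names ("Category", "Amount") are hypothetical placeholders for your actual schema:

```fsharp
open System.Data
open System.Collections.Generic

/// Sum the "Amount" column per "Category" in a DataTable and return
/// the result as a plain Dictionary that C# callers can consume directly.
/// The column names are hypothetical; substitute your own schema.
let summarize (table: DataTable) : Dictionary<string, float> =
    let result = Dictionary<string, float>()
    for row in table.Rows |> Seq.cast<DataRow> do
        let category = row.["Category"] :?> string
        let amount = row.["Amount"] :?> float
        result.[category] <-
            match result.TryGetValue category with
            | true, total -> total + amount
            | false, _ -> amount
    result
```

From C# this is just a static method call taking a `DataTable` and returning a `Dictionary<string, double>`; no glue code is needed.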

  • Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.

Assuming that you start with a functional-first and immutable implementation, it should be easy to parallelize your app. Moreover, asynchronous workflows are a blessing for IO-bound applications like yours.
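As a sketch of how cheap that parallelization can be: when the per-item work is a pure function over immutable inputs, switching from `Array.map` to `Array.Parallel.map` is the whole change. The `score` function below is a made-up stand-in for your actual search logic:

```fsharp
/// A pure scoring function: it reads nothing but its arguments,
/// so running it on many inputs in parallel is safe by construction.
/// (This is a made-up stand-in for the real search logic.)
let score (term: string) (document: string) =
    document.Split(' ') |> Array.filter ((=) term) |> Array.length

let documents = [| "foo bar foo"; "bar bar"; "foo" |]

let sequentialScores = documents |> Array.map (score "foo")
// One-word change to fan the work out across all cores:
let parallelScores = documents |> Array.Parallel.map (score "foo")
```

For the IO-bound parts (the SQL round-trips), asynchronous workflows with `Async.Parallel` play a similar role.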

  • The language (and the compiler) should be mature.

I can't compare F# to Clojure and Scala on the JVM, but F# is much more mature than Clojure CLR or Scala on .NET. In choosing F#, you're sure to have long-term commitment from Microsoft and help from the ever-growing F# community.

When the user enters a search string it is parsed into an expression tree.

You can represent expression trees using discriminated unions. With the introduction of query expressions in F# 3.0, you are able to translate your logic into SQL queries easily. You can even push it further by defining a similar query language for your domain.

Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it.

F# 3.0 introduces type providers, which allow users to access non-structured data in a type-safe way; you may want to look at this "F# 3.0 - Information Rich Programming" video for more insights. If you would like to use F# as a programming language for data mining, I have asked a related question and got pretty good responses here.

That said, your first feelings about F# may not be correct. From my experience, you can always stay as close to the functional and immutable side as you want. Given that you already have an interesting application, I suggest getting your hands dirty to find out whether F# is the language for your purpose.

UPDATE:

Here is an F# prototype which demonstrates the idea:

/// You start by modeling your domain with a set of types.
/// FullText is a sequence of Records, which is processed on demand.
type Word = string
and Freq = int
and Record = {Occurrences: (Word * Freq) list; Id: string}
and FullText = Record seq

/// Model your expression tree by following the grammar closely.
type Expression =
    | Occur of Word
    | Near of Word * Word
    | And of Expression * Expression 
    | Or of Expression * Expression

/// Find whether a word w occurs in the occurrence list.
let occur w {Occurrences = xs} = xs |> Seq.map fst |> Seq.exists ((=) w)

/// Check whether two words are near each other.
/// Note that type annotation is only needed for the stub implementation.
let near (w1: Word) (w2: Word) (r: Record): bool = failwith "Not implemented yet"

/// Evaluate an expression tree.
/// The code is succinct and clear thanks to pattern matching. 
let rec eval expr r = 
    match expr with
    | Occur w -> occur w r
    | Near(w1, w2) -> near w1 w2 r
    | And(e1, e2) -> eval e1 r && eval e2 r
    | Or(e1, e2) -> eval e1 r || eval e2 r

/// Utility function which returns the second element of a 3-tuple.
let inline snd3 (_, x, _) = x

/// Get the rank of the record by adding up frequencies on the whole database.
let rank (r: Record) (ft: FullText): Freq = failwith "Not implemented yet"

/// Retrieve all records which match the expression tree.
let retrieve expr fullText =
    fullText |> Seq.filter (eval expr)
             |> Seq.map (fun r -> r, rank r fullText, r.Occurrences)
             |> Seq.sortBy snd3

/// An example query
let query = 
    And (Occur "transformer%", 
         Or (Or (Near ("100", "W"), Near ("100", "watts")), 
             Or (Occur "100W", Occur "0.1kW")))
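To see the evaluator in action, here is a self-contained run of the `Occur`/`And`/`Or` part of the tree (the types repeat the prototype so the snippet compiles on its own; the two sample records are made up, and `Near` is left out because its implementation is still a stub):

```fsharp
type Word = string
type Record = { Occurrences: (Word * int) list; Id: string }

type Expression =
    | Occur of Word
    | And of Expression * Expression
    | Or of Expression * Expression

let occur w { Occurrences = xs } = xs |> List.exists (fst >> (=) w)

let rec eval expr r =
    match expr with
    | Occur w -> occur w r
    | And (e1, e2) -> eval e1 r && eval e2 r
    | Or (e1, e2) -> eval e1 r || eval e2 r

// A made-up two-record database:
let records =
    [ { Id = "r1"; Occurrences = [ "transformer", 3; "100W", 1 ] }
      { Id = "r2"; Occurrences = [ "transformer", 1 ] } ]

// "transformer AND (100W OR 0.1kW)":
let sample = And (Occur "transformer", Or (Occur "100W", Occur "0.1kW"))
let hits = records |> List.filter (eval sample) |> List.map (fun r -> r.Id)
```

Only the first record satisfies the query, since the second never mentions a wattage.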



As far as multi-language integration goes, combining C and Haskell is remarkably easy, and I say this as someone who is (unlike dons) not really much of an expert on either. Any other language that integrates well with C shouldn't be much trickier; you can always fall back to a thin interface layer in C if nothing else. For better or worse, C is still the lingua franca of programming, so Haskell is more than acceptable for most cases.

...but. You say you're motivated by performance issues, and want to use "a functional language". From this I infer you're not previously familiar with the languages you ask about. Among Haskell's defining features are that it, by default, uses non-strict evaluation and immutable data structures--which are both incredibly useful in many ways, but it also means that optimizing Haskell for performance is often dramatically different from other languages, and well-honed instincts may lead you astray in baffling ways. You may want to browse performance-related topics on the Haskell wiki to get a feel for the issues.

Which isn't to say that you can't do what you want in Haskell--you certainly can. Both laziness and immutability can in fact be exploited for performance benefits (Chris Okasaki's thesis provides some nice examples). But be aware that there'll be a bit of a learning curve when it comes to dealing with performance.

Both Haskell and OCaml provide the lovely benefits of using an ML-family language, but for most programmers, OCaml is likely to offer a gentler learning curve and better immediate results.




Machine learning in OCaml or Haskell?

Hal Daume wrote several major machine learning algorithms during his Ph.D. (he is now an assistant professor and a rising star in the machine learning community).

On his web page there are an SVM, a simple decision tree and a logistic regression, all in OCaml. By reading this code, you can get a feel for how machine learning models are implemented in OCaml.

I'd also like to mention F#, a new .NET language similar to OCaml. Here's a factor graph model written in F# that analyzes Chess play data. This research also has a NIPS publication.

FP is suitable for implementing machine learning and data mining models, but what you gain from it is mostly NOT performance. It is true that FP supports parallel computing better than imperative languages like C# or Java, but implementing a parallel SVM or decision tree has very little to do with the language: parallel is parallel. The numerical optimizations behind machine learning and data mining are usually imperative; writing them purely functionally is usually hard and less efficient. Making these sophisticated algorithms parallel is a very hard task at the algorithm level, not at the language level. If you want to run 100 SVMs in parallel, FP helps. But I don't see any difficulty in running 100 libsvm instances in parallel in C++, not to mention that single-threaded libsvm is more efficient than a not-well-tested Haskell SVM package.

Then what do FP languages like F#, OCaml and Haskell give you?

  1. Easy to test your code. FP languages usually have a top-level interpreter (REPL), so you can test your functions on the fly.

  2. Few mutable states. Passing the same arguments to a function always gives the same result, which makes debugging easy in FP.

  3. Code is succinct. Type inference, pattern matching, closures, etc. mean you focus more on the domain logic and less on the language itself, so when you write code your mind is mainly on the problem you are solving.

  4. Writing code in FPs is fun.
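As a tiny illustration of points 2 and 3 together: the function below touches no mutable state, so equal inputs always give equal results, and pattern matching states the rule almost as compactly as its specification. (The divisibility rule itself is just an arbitrary example.)

```fsharp
/// A pure classification function: no mutable state is read or written,
/// so the same input always yields the same output.
let classify n =
    match n % 3, n % 5 with
    | 0, 0 -> "both"
    | 0, _ -> "three"
    | _, 0 -> "five"
    | _ -> "neither"

// Repeatable, and trivially testable on the fly in the REPL (fsi):
let first = classify 15
let second = classify 15
```

Because `classify` is referentially transparent, `first` and `second` are guaranteed to be equal, and any bug can be reproduced from the input alone.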