c# - right - linq to sql left join
Does “where” position in LINQ query matter when joining in-memory? (2)
For in memory lists (IEnumerables), no optimization is applied and query execution is made in chained order for in-memory lists.
I also tried result
by first casting it to IQueryable
then apply filtering but apparently casting time is pretty high for this big table.
I made a quick test for this case.
Console.WriteLine($"List Row Count = {list.Count()}");
Console.WriteLine($"JoinList Row Count = {joinList.Count()}");
var watch = Stopwatch.StartNew();
var result = list.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result.Dump();
watch.Stop();
Console.WriteLine($"Result1 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result2 = list
.Where(t => t.Prop3 == "Prop13")
.Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result2.Dump();
watch.Stop();
Console.WriteLine($"Result2 Elapsed = {watch.ElapsedTicks}");
watch.Restart();
var result3 = list.AsQueryable().Join(joinList, l => l.Prop3, i=> i.Prop3, (lst, inner) => new {lst, inner})
.Where(t => t.inner.Prop3 == "Prop13")
.Select(t => new { t.inner.Prop4, t.lst.Prop2});
result3.Dump();
watch.Stop();
Console.WriteLine($"Result3 Elapsed = {watch.ElapsedTicks}");
Findings:
List Count = 100
JoinList Count = 10
Result1 Elapsed = 27
Result2 Elapsed = 17
Result3 Elapsed = 591
List Count = 1000
JoinList Count = 10
Result1 Elapsed = 20
Result2 Elapsed = 12
Result3 Elapsed = 586
List Count = 100000
JoinList Count = 10
Result1 Elapsed = 603
Result2 Elapsed = 19
Result3 Elapsed = 1277
List Count = 1000000
JoinList Count = 10
Result1 Elapsed = 1469
Result2 Elapsed = 88
Result3 Elapsed = 3219
Situation: Say we are executing a LINQ query that joins two in-memory lists (so no DbSets or SQL-query generation involved) and this query also has a where
clause. This where
only filters on properties included in the original set (the from
part of the query).
Question: Does the linq query interpreter optimize this query in that it first executes the where
before it performs the join
, regardless of whether I write the where
before or after the join
? – so it does not have to perform a join on elements that are not included later anyways.
Example: For example, I have a categories
list I want to join with a products
list. However, I am just interested in the category
with ID
1. Does the linq interpreter internally perform the exact same operations regardless of whether I write:
from category in categories
join prod in products on category.ID equals prod.CategoryID
where category.ID == 1 // <------ below join
select new { Category = category.Name, Product = prod.Name };
or
from category in categories
where category.ID == 1 // <------ above join
join prod in products on category.ID equals prod.CategoryID
select new { Category = category.Name, Product = prod.Name };
Previous research: I already saw this question but the OP author stated that his/her question is only targeting non-in-memory cases with generated SQL. I am explicitly interested with LINQ executing a join on two lists in-memory.
Update: This is not a dublicate of "Order execution of chain linq query" question as the referenced question clearly refers to a dbset and my question explicitly addressed a non-db scenario. (Moreover, although similar, I am not asking about inclusions based on navigational properties here but about "joins".)
Update2: Although very similar, this is also not a dublicate of "Is order of the predicate important when using LINQ?" as I am asking explicitly about in-memory situations and I cannot see the referenced question explicitly addressing this case. Moreover, the question is a bit old and I am actually interested in linq in the context of .NET Core (which didn't exist in 2012), so I updated the tag of this question to reflect this second point.
Please note: With this question I am aiming at whether the linq query interpreter somehow optimizes this query in the background and am hoping to get a reference to a piece of documentation or source code that shows how this is done by linq. I am not interested in answers such as "it does not matter because the performance of both queries is roughly the same".
The LINQ query syntax will be compiled to a method chain. For details, read e.g. in this question.
The first LINQ query will be compiled to the following method chain:
categories
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { category, prod })
.Where(t => t.category.ID == 1)
.Select(t => new { Category = t.category.Name, Product = t.prod.Name });
The second one:
categories
.Where(category => category.ID == 1)
.Join(
products,
category => category.ID,
prod => prod.CategoryID,
(category, prod) => new { Category = category.Name, Product = prod.Name });
As you can see, the second query will cause less allocations (note only one anonymous type vs 2 in the first query, and note how many instances of those anonymous types will be created on performing the query).
Furthermore, it's clear that the first query will perform a join operation on lot more data than the second (already filtered) one.
There will be no additional query optimization in case of LINQ-to-objects queries.
So the second version is preferable.