Envmap #10796

marler8997 · 2022-02-05T06:19:32Z

This is my take on fixing #4603 which is an alternative to #10650

This PR includes commits from that PR and modifies it to a different design. In this PR, instead of having EnvMap store environment variables in both their original casing and a canonicalized casing, this one keeps the original casing and modifies the hash map to ignore case.

squeek502 · 2022-02-07T05:40:34Z

lib/std/process.zig

+                    if (upcase(c_a) != upcase(c_b))
+                        return false;
+                }
+                if (it_b.nextCodepoint()) |_| return false;


Suggested change

if (it_b.nextCodepoint()) |_| return false;

if (it_b.nextCodepoint()) |_| return false;

return true;

Without the return true here, it falls down to eqlString which is case sensitive.

lib/std/process.zig

squeek502 · 2022-02-07T09:07:27Z

Did some benchmarking comparing the implemention in this PR vs #10650 (some context: #10650 (comment))

This PR is in blue, #10650 is in orange:

Benchmark code is here: https://gist.github.com/squeek502/13fdaa4255c6c5e926ceb00a728e1ee1

Some notes:

On my machine, the only benchmark that this PR wins in is get of missing keys when using Unicode strings. For all the rest, 10650 either wins or has comparable performance.
10650 uses about 2x more memory across the board as expected.
The Unicode benchmarks insert random UTF-8 strings and call get with UTF-8 strings, and likewise the ASCII benchmarks insert random ASCII strings and call get with ASCII strings. That is, the benchmarks do not mix/match ASCII/UTF-8. Doing so might better simulate real-world use-cases (i.e. an EnvMap with mostly/all ASCII keys and then seeing how fast lookups of (missing) UTF-8 strings are).

EDIT: Here's a zip containing the .ods file if anyone wants to make their own graphs: envmap.zip

marler8997 · 2022-02-07T18:59:48Z

These benchmarks are good data. I feel I should also share some of the things I think matter when it comes to deciding on a design like this.

Perfomance
Code size
Runtime memory usage
Code simplicity/maintainability

In the case of environment variables, I would actually put runtime perfomance lower in the list of importance than with other types of modules because I don't think managing environment variables would usually be in the critical path of performance. Still important, but not the only criteria to consider. When I was younger I use to always put performance at the top of my list, but now I realize how complex performance is and now I tend to start with the "simplest" design and only optimize for performance when the simplest design isn't fast enough. This not only results in less work, it also ends up having faster code overall because once you have a working design, you don't need to guess what will be faster, you can actually benchmark the application that isn't performing fast enough. This focuses your efforts on optimizing code that will actually make your application faster instead of trying to shoot fish in a barrel.

All this being said, I find the benchmarks interesting and we can use that data to help guide us if we find that environment variables management performance becomes an issue. Adding a mechanism to cache string conversion like you've done trades higher memory usage and higher complexity for less calculations, which can increase performance in certain situations. But to reiterate, I think it would be best to leave a performance optimization like that out until we see that it has a noticeable benefit, and not at the cost of making other usages slower if those usages are just as common.

matu3ba · 2022-02-09T12:51:40Z

Unifying directory path traversal (without opening directories for perf) on Linux and Windows has the same problem, so it looks to me like the performance optimization could go into a lib for doing string stuff on paths (without the syscall overhead).

squeek502 · 2022-02-09T17:16:02Z

@matu3ba could you be more specific? I'm unsure what performance optimization you're talking about or how it would apply to path traversal.

matu3ba · 2022-02-09T18:23:10Z

@squeek502 My intention was to provide a faster iterator that works forward and backwards on paths /home/user/bla/blubb\/bla/../file represented as utf8 and utf16 strings (Linux + Windows) to get the directory names without asking the Kernel first.

squeek502 · 2022-02-09T18:33:40Z

@matu3ba, sorry still confused. Faster than what? What parts of these PRs could it use to go faster?

matu3ba · 2022-02-09T18:53:12Z

Faster than what?

I think the idea is stupid, so never mind. Sorry for noise.

lib/std/process.zig

lib/std/os/windows/ntdll.zig

squeek502 · 2022-02-19T23:54:22Z

@Vexu if the usage of ntdll's Rtl Unicode functions for Windows environment variables is accepted, then #10595 can act as a companion to these EnvMap changes to fully address #4603

marler8997 · 2022-02-20T03:44:43Z

Yeah I think we'll need clarification from @andrewrk on what he meant by 8bf7cff:

It is not planned to support full Unicode case-insensitivity
on Windows, and in fact relying on non-ASCII case-insensitive
environment variables is fundamentally problematic.

This PR uses RtlUpcaseUnicodeChar to make string comparison case insensitive for environment variable keys on Windows. I'm not sure why this is problematic. Either Andrew knows something I'm not aware of, or he had another solution in mind that does have a problem.

squeek502 · 2022-02-20T03:52:17Z

Either Andrew knows something I'm not aware of, or he had another solution in mind that does have a problem.

My assumption is that he was assuming it would require putting Unicode data in the standard library.

marler8997 · 2022-02-20T03:56:47Z

My assumption is that he was assuming it would require putting Unicode data in the standard library.

Yeah that could definitely be it. Out of curiosity, does anyone know how Windows implements RtlUpcaseUnicodeChar? Where is it keeping all the unicode data at runtime? Maybe it's in the kernel or something?

squeek502 · 2022-02-20T04:58:27Z

Yeah that could definitely be it. Out of curiosity, does anyone know how Windows implements RtlUpcaseUnicodeChar? Where is it keeping all the unicode data at runtime? Maybe it's in the kernel or something?

My guess is PEB.UnicodeCaseTableData.

andrewrk · 2022-05-11T20:41:31Z

I was unaware that NtDll had functions for environment variables when I made that comment. This changes my position.

What I do not want to do is have the Zig std lib depend on shipping with Unicode.txt or similar data. However, if NtDll provides us with an API that allows us to match the Windows-native method of accessing environment variables, then I am in favor of using that API. This really has nothing to do with proper Unicode support; really what's happening is that Windows has some set of weird rules about what environment variables do or don't match via string equality, and we just have to match those rules by calling the appropriate NtDll functions.

For example, the RtlUpcaseUnicodeChar function is completely wrong for trying to implement proper Unicode support because it lacks awareness of various things, but it may be correct in matching environment variable name string equality to be what is expected behavior on Windows.

In conclusion, I'm in favor of this change, assuming that it matches the equality checking properties of RtlEqualUnicodeString.

Can we confirm that is indeed the case?
What was the CI failure?

squeek502 · 2022-05-11T22:04:03Z

Can we confirm that is indeed the case?

Unsure if this is what you're looking for, but for ReactOS:

This PR uses RtlUpcaseUnicodeChar directly which should be equivalent.

EnvMap provides the same API as the previously used BufMap (besides `putMove` and `getPtr`), so usage sites of `getEnvMap` can usually remain unchanged. For non-Windows, EnvMap is a wrapper around BufMap. On Windows, it uses a new EnvMapWindows to handle some Windows-specific behavior: - Lookups use Unicode-aware case insensitivity (but `get` cannot return an error because EnvMapWindows has an internal buffer to use for lookup conversions) - Canonical names are returned when iterating the EnvMap Fixes ziglang#10561, closes ziglang#4603

…e applicable # Conflicts: # lib/std/build/RunStep.zig

Fixes a process.EnvMap compile error on 32-bit architectures

`key` was never being given a value, caused by b83cea1

Now that BufMap.BufMapHashMap is pub, we can just get Size directly

Co-authored-by: Veikka Tuominen <git@vexu.eu>

andrewrk · 2022-05-12T00:53:06Z

@squeek502 I am satisfied with this. Thank you for digging!

marler8997 · 2022-05-12T04:31:41Z

What was the CI failure?

I've rebased and repushed, looks like the Azure CI failure is no more. Probably an intermittent failure from long ago.

marler8997 force-pushed the envmap branch 7 times, most recently from 32c67a6 to 88066f2 Compare Feb 5, 2022

marler8997 marked this pull request as ready for review Feb 5, 2022

squeek502 reviewed Feb 7, 2022

View changes

lib/std/process.zig Show resolved Hide resolved

marler8997 force-pushed the envmap branch from a4d31ee to a7480bb Compare Feb 7, 2022

Vexu requested changes Feb 13, 2022

View changes

lib/std/process.zig Show resolved Hide resolved

marler8997 force-pushed the envmap branch from a8c5735 to b79b66e Compare Feb 18, 2022

Vexu approved these changes Feb 19, 2022

View changes

lib/std/os/windows/ntdll.zig Outdated Show resolved Hide resolved

squeek502 mentioned this pull request Feb 19, 2022

Add process.EnvMap, a cross-platform environment variable map #10650

Closed

andrewrk added standard library This issue involves writing Zig code for the standard library. os-windows labels May 11, 2022

squeek502 and others added 13 commits May 12, 2022

Update usages of process.getEnvMap and change BufMap -> EnvMap wher…

9e89000

…e applicable # Conflicts: # lib/std/build/RunStep.zig

Make the BufMap.count return value match its underlying HashMap's Size

a65be05

Fixes a process.EnvMap compile error on 32-bit architectures

Fix regression in RunStep.addPathDir

a206961

`key` was never being given a value, caused by b83cea1

Set EnvMap.Size to BufMap.BufMapHashMap.Size

b2b48fb

Now that BufMap.BufMapHashMap is pub, we can just get Size directly

EnvMapWindows: Fix putUtf8 not uppercasing keys

e70cb04

remove extra storage from EnvMap on windows

69f0a55

add unicode support

e65d8f8

reverse some of the now unneeded changes from squeek

1c874a8

some fixes to the EnvMap HashContext

71f6919

incorporate review changes from squeek

8492ced

document that on Windows, all key arguments in EnvMap must be valid utf8

a38e6a6

fix ntdll extern casing

2ddfe16

Co-authored-by: Veikka Tuominen <git@vexu.eu>

marler8997 force-pushed the envmap branch from 3e2e006 to 2ddfe16 Compare May 12, 2022

remove RtlUpcaseUnicodeString, no longer needed

aef642f

andrewrk merged commit 5c20c70 into ziglang:master May 12, 2022
3 of 4 checks passed

marler8997 deleted the envmap branch May 12, 2022

Envmap #10796

Envmap #10796

marler8997 commented Feb 5, 2022 •

edited

squeek502 Feb 7, 2022

squeek502 commented Feb 7, 2022 •

edited

marler8997 commented Feb 7, 2022 •

edited

matu3ba commented Feb 9, 2022

squeek502 commented Feb 9, 2022

matu3ba commented Feb 9, 2022 •

edited

squeek502 commented Feb 9, 2022 •

edited

matu3ba commented Feb 9, 2022 •

edited

squeek502 commented Feb 19, 2022

marler8997 commented Feb 20, 2022 •

edited

squeek502 commented Feb 20, 2022

marler8997 commented Feb 20, 2022

squeek502 commented Feb 20, 2022

andrewrk commented May 11, 2022 •

edited

squeek502 commented May 11, 2022

andrewrk commented May 12, 2022

marler8997 commented May 12, 2022 •

edited

	if (it_b.nextCodepoint()) \|_\| return false;
	if (it_b.nextCodepoint()) \|_\| return false;
	return true;

Envmap #10796

Envmap #10796

Conversation

marler8997 commented Feb 5, 2022 • edited

squeek502 Feb 7, 2022

Choose a reason for hiding this comment

squeek502 commented Feb 7, 2022 • edited

marler8997 commented Feb 7, 2022 • edited

matu3ba commented Feb 9, 2022

squeek502 commented Feb 9, 2022

matu3ba commented Feb 9, 2022 • edited

squeek502 commented Feb 9, 2022 • edited

matu3ba commented Feb 9, 2022 • edited

squeek502 commented Feb 19, 2022

marler8997 commented Feb 20, 2022 • edited

squeek502 commented Feb 20, 2022

marler8997 commented Feb 20, 2022

squeek502 commented Feb 20, 2022

andrewrk commented May 11, 2022 • edited

squeek502 commented May 11, 2022

andrewrk commented May 12, 2022

marler8997 commented May 12, 2022 • edited

marler8997 commented Feb 5, 2022 •

edited

squeek502 commented Feb 7, 2022 •

edited

marler8997 commented Feb 7, 2022 •

edited

matu3ba commented Feb 9, 2022 •

edited

squeek502 commented Feb 9, 2022 •

edited

matu3ba commented Feb 9, 2022 •

edited

marler8997 commented Feb 20, 2022 •

edited

andrewrk commented May 11, 2022 •

edited

marler8997 commented May 12, 2022 •

edited