Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envmap #10796

Merged
merged 14 commits into from May 12, 2022
Merged

Envmap #10796

merged 14 commits into from May 12, 2022

Conversation

marler8997
Copy link
Contributor

@marler8997 marler8997 commented Feb 5, 2022

This is my take on fixing #4603 which is an alternative to #10650

This PR includes commits from that PR and modifies it to a different design. In this PR, instead of having EnvMap store environment variables in both their original casing and a canonicalized casing, this one keeps the original casing and modifies the hash map to ignore case.

@marler8997 marler8997 force-pushed the envmap branch 7 times, most recently from 32c67a6 to 88066f2 Compare Feb 5, 2022
@marler8997 marler8997 marked this pull request as ready for review Feb 5, 2022
if (upcase(c_a) != upcase(c_b))
return false;
}
if (it_b.nextCodepoint()) |_| return false;
Copy link
Contributor

@squeek502 squeek502 Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (it_b.nextCodepoint()) |_| return false;
if (it_b.nextCodepoint()) |_| return false;
return true;

Without the return true here, it falls down to eqlString which is case sensitive.

lib/std/process.zig Show resolved Hide resolved
@squeek502
Copy link
Contributor

squeek502 commented Feb 7, 2022

Did some benchmarking comparing the implemention in this PR vs #10650 (some context: #10650 (comment))

This PR is in blue, #10650 is in orange:

insert-2022-02-07
get-found-2022-02-07
get-missing-2022-02-07
remove-2022-02-07
memory-2022-02-07

Benchmark code is here: https://gist.github.com/squeek502/13fdaa4255c6c5e926ceb00a728e1ee1

Some notes:

  • On my machine, the only benchmark that this PR wins in is get of missing keys when using Unicode strings. For all the rest, 10650 either wins or has comparable performance.
  • 10650 uses about 2x more memory across the board as expected.
  • The Unicode benchmarks insert random UTF-8 strings and call get with UTF-8 strings, and likewise the ASCII benchmarks insert random ASCII strings and call get with ASCII strings. That is, the benchmarks do not mix/match ASCII/UTF-8. Doing so might better simulate real-world use-cases (i.e. an EnvMap with mostly/all ASCII keys and then seeing how fast lookups of (missing) UTF-8 strings are).

EDIT: Here's a zip containing the .ods file if anyone wants to make their own graphs: envmap.zip

@marler8997
Copy link
Contributor Author

marler8997 commented Feb 7, 2022

These benchmarks are good data. I feel I should also share some of the things I think matter when it comes to deciding on a design like this.

  • Perfomance
  • Code size
  • Runtime memory usage
  • Code simplicity/maintainability

In the case of environment variables, I would actually put runtime perfomance lower in the list of importance than with other types of modules because I don't think managing environment variables would usually be in the critical path of performance. Still important, but not the only criteria to consider. When I was younger I use to always put performance at the top of my list, but now I realize how complex performance is and now I tend to start with the "simplest" design and only optimize for performance when the simplest design isn't fast enough. This not only results in less work, it also ends up having faster code overall because once you have a working design, you don't need to guess what will be faster, you can actually benchmark the application that isn't performing fast enough. This focuses your efforts on optimizing code that will actually make your application faster instead of trying to shoot fish in a barrel.

All this being said, I find the benchmarks interesting and we can use that data to help guide us if we find that environment variables management performance becomes an issue. Adding a mechanism to cache string conversion like you've done trades higher memory usage and higher complexity for less calculations, which can increase performance in certain situations. But to reiterate, I think it would be best to leave a performance optimization like that out until we see that it has a noticeable benefit, and not at the cost of making other usages slower if those usages are just as common.

@matu3ba
Copy link
Contributor

matu3ba commented Feb 9, 2022

Unifying directory path traversal (without opening directories for perf) on Linux and Windows has the same problem, so it looks to me like the performance optimization could go into a lib for doing string stuff on paths (without the syscall overhead).

@squeek502
Copy link
Contributor

squeek502 commented Feb 9, 2022

@matu3ba could you be more specific? I'm unsure what performance optimization you're talking about or how it would apply to path traversal.

@matu3ba
Copy link
Contributor

matu3ba commented Feb 9, 2022

@squeek502 My intention was to provide a faster iterator that works forward and backwards on paths /home/user/bla/blubb\/bla/../file represented as utf8 and utf16 strings (Linux + Windows) to get the directory names without asking the Kernel first.

@squeek502
Copy link
Contributor

squeek502 commented Feb 9, 2022

@matu3ba, sorry still confused. Faster than what? What parts of these PRs could it use to go faster?

@matu3ba
Copy link
Contributor

matu3ba commented Feb 9, 2022

Faster than what?

I think the idea is stupid, so never mind. Sorry for noise.

lib/std/process.zig Show resolved Hide resolved
Vexu
Vexu approved these changes Feb 19, 2022
lib/std/os/windows/ntdll.zig Outdated Show resolved Hide resolved
@squeek502
Copy link
Contributor

squeek502 commented Feb 19, 2022

@Vexu if the usage of ntdll's Rtl Unicode functions for Windows environment variables is accepted, then #10595 can act as a companion to these EnvMap changes to fully address #4603

@marler8997
Copy link
Contributor Author

marler8997 commented Feb 20, 2022

Yeah I think we'll need clarification from @andrewrk on what he meant by 8bf7cff:

It is not planned to support full Unicode case-insensitivity
on Windows, and in fact relying on non-ASCII case-insensitive
environment variables is fundamentally problematic.

This PR uses RtlUpcaseUnicodeChar to make string comparison case insensitive for environment variable keys on Windows. I'm not sure why this is problematic. Either Andrew knows something I'm not aware of, or he had another solution in mind that does have a problem.

@squeek502
Copy link
Contributor

squeek502 commented Feb 20, 2022

Either Andrew knows something I'm not aware of, or he had another solution in mind that does have a problem.

My assumption is that he was assuming it would require putting Unicode data in the standard library.

@marler8997
Copy link
Contributor Author

marler8997 commented Feb 20, 2022

My assumption is that he was assuming it would require putting Unicode data in the standard library.

Yeah that could definitely be it. Out of curiosity, does anyone know how Windows implements RtlUpcaseUnicodeChar? Where is it keeping all the unicode data at runtime? Maybe it's in the kernel or something?

@squeek502
Copy link
Contributor

squeek502 commented Feb 20, 2022

Yeah that could definitely be it. Out of curiosity, does anyone know how Windows implements RtlUpcaseUnicodeChar? Where is it keeping all the unicode data at runtime? Maybe it's in the kernel or something?

My guess is PEB.UnicodeCaseTableData.

@andrewrk
Copy link
Member

andrewrk commented May 11, 2022

I was unaware that NtDll had functions for environment variables when I made that comment. This changes my position.

What I do not want to do is have the Zig std lib depend on shipping with Unicode.txt or similar data. However, if NtDll provides us with an API that allows us to match the Windows-native method of accessing environment variables, then I am in favor of using that API. This really has nothing to do with proper Unicode support; really what's happening is that Windows has some set of weird rules about what environment variables do or don't match via string equality, and we just have to match those rules by calling the appropriate NtDll functions.

For example, the RtlUpcaseUnicodeChar function is completely wrong for trying to implement proper Unicode support because it lacks awareness of various things, but it may be correct in matching environment variable name string equality to be what is expected behavior on Windows.

In conclusion, I'm in favor of this change, assuming that it matches the equality checking properties of RtlEqualUnicodeString.

  1. Can we confirm that is indeed the case?
  2. What was the CI failure?

@andrewrk andrewrk added standard library This issue involves writing Zig code for the standard library. os-windows labels May 11, 2022
@squeek502
Copy link
Contributor

squeek502 commented May 11, 2022

Can we confirm that is indeed the case?

Unsure if this is what you're looking for, but for ReactOS:

This PR uses RtlUpcaseUnicodeChar directly which should be equivalent.

squeek502 and others added 13 commits May 12, 2022
EnvMap provides the same API as the previously used BufMap (besides `putMove` and `getPtr`), so usage sites of `getEnvMap` can usually remain unchanged.

For non-Windows, EnvMap is a wrapper around BufMap. On Windows, it uses a new EnvMapWindows to handle some Windows-specific behavior:

- Lookups use Unicode-aware case insensitivity (but `get` cannot return an error because EnvMapWindows has an internal buffer to use for lookup conversions)
- Canonical names are returned when iterating the EnvMap

Fixes ziglang#10561, closes ziglang#4603
…e applicable

# Conflicts:
#	lib/std/build/RunStep.zig
Fixes a process.EnvMap compile error on 32-bit architectures
`key` was never being given a value, caused by b83cea1
Now that BufMap.BufMapHashMap is pub, we can just get Size directly
Co-authored-by: Veikka Tuominen <git@vexu.eu>
@andrewrk
Copy link
Member

andrewrk commented May 12, 2022

@squeek502 I am satisfied with this. Thank you for digging!

@marler8997
Copy link
Contributor Author

marler8997 commented May 12, 2022

  1. What was the CI failure?

I've rebased and repushed, looks like the Azure CI failure is no more. Probably an intermittent failure from long ago.

@andrewrk andrewrk merged commit 5c20c70 into ziglang:master May 12, 2022
3 of 4 checks passed
@marler8997 marler8997 deleted the envmap branch May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
os-windows standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants