Why is hash(-1) == hash(-2) in Python?

omairmajid.com

95 points

alexmolas

3 days ago


86 comments

Terr_ 3 days ago

In a way this is an argument for languages where it's normal to have result-types [0] or exceptions, instead of -1 etc. It just feels wrong to have any potential collision or confusion between normal results and error results.

[0] https://en.wikipedia.org/wiki/Result_type

  • kazinator 5 hours ago

    1. But, Python has exceptions! The underlying C language doesn't, but Python's run-time has them and can use them in the C code.

    2. It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.

  • kevinventullo 11 hours ago

    We all know that types are not Pythonic.

    (I’m only mostly joking)

    • folkrav 11 hours ago

      Ackchyually, Python has pretty strong typing, as far as dynamic languages go.

      • Insanity 11 hours ago

If you use the actual type system and something like mypy, Python is a joy to work with. (For my definition of joy, which includes static typing).

        • IshKebab 11 hours ago

          Actually writing Python is reasonably nice in that case.

          But dealing with Python infrastructure is so awful as to make the whole experience just bad.

          uv fixes a lot of that, but I think it will be some time before it's used everywhere, and I have zero hope that the Python devs will ever do the right thing and officially replace Pip with uv.

        • arccy 10 hours ago

          you also have to trawl through the flags to make sure it actually checks types, e.g. check_untyped_defs

      • bhickey 11 hours ago

        Sure. Everything is a PyObject.

  • kstrauser 11 hours ago

    Not really, I don’t think. Python code doesn’t ever really use hash() for anything where specific outputs are expected. Even if hash(a)==hash(b), it’s not implied that a==b or a is b.
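
    For example, in a CPython REPL (this is implementation-specific behavior, not a language guarantee):

```python
# Equal hashes do not imply equal objects.
# In CPython, hash(-1) and hash(-2) are both -2.
a, b = -1, -2
assert hash(a) == hash(b)  # colliding hashes...
assert a != b              # ...but distinct, unequal values
```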

    • tmvphil 11 hours ago

      But -2 seems like just a bad choice here. -1 and -2 are very "common" integers. Seems they could have just picked a quasi-random large integer to reduce the likelihood of collision. I expect hash functions to have collisions, but I expect it to be "unlikely" in a way that scales with the size of the hash.

      • Ekaros 20 minutes ago

        Are they? At least as keys for most dictionaries. Zero and positive integers sure, but negative ones would be somewhat rare. I could maybe see something like error handling, but then do you need performance there?

      • kstrauser 11 hours ago

        I don't think that matters here, though. This specific hash function is literally used just for putting values in dict buckets. The docs say:

        > Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.

They're not to be used as a crypto digest. The downside here, then, is that if -1 and -2 are used as dict keys, they'll end up bucketed together. Apparently that hasn't been enough of an issue over the years to bother changing it. One advantage is that those values are common enough that it might break code that incorrectly expects hash() to have predictable outputs. If it used 0xffffffff as the value, lots of busted tests may never stumble across it.

        • itishappy 9 hours ago

          > The downside here, then is that if -1 and -2 are used as dict keys, they'll end up bucketed together.

          Note that they'll only end up bucketed together in the underlying hashmap. The dictionary will still disambiguate the keys.

              >>> a = -1
              >>> b = -2
              >>> {a: a, b: b}
              {-1: -1, -2: -2}
          • kstrauser 9 hours ago

            Yep, absolutely. End programmers will never see that unless they go digging into the CPython code. It's otherwise invisible.

            • daxfohl an hour ago

              Not only bucketed together, which means they could still be disambiguated in C by the full hash, but they'll actually share the full hash, which means they'd have to jump back into Python code to check `__eq__` to disambiguate IIUC. That seems like a huge perf issue, is it not?

              • kstrauser 41 minutes ago

                Without looking it up, I think that's right, but... that's just kind of inevitable with hash tables. Unless you have a small number of keys and get very lucky, it's exceedingly likely that you'll have multiple keys in the same bucket, and you'll have to use whatever language's `__eq__` equivalent mechanism.

A mitigating factor is that dict keys have to be hashable, which implies immutability, which generally implies a small number of native, fast types like str, numbers, or simple data structures like tuples. You could have some frozendict monstrosity as keys, but 1) that's not something you see often, and 2) don't do that. If you must, define a fast __eq__. For example, an object mapping to a database row might look like:

                  def __eq__(self, other):
                      if self.pk != other.pk:
                          return False # Fail quickly
                      # If that's true, only then do a deep comparison
                      return self.tuple_of_values == other.tuple_of_values
                
                But again, that's just not really something that's done.
      • nneonneo 10 hours ago

        I bet -2 is way less common than -1 in a typical codebase, particularly a C one.

    • TRiG_Ireland 11 hours ago

      Yes, but having result types would avoid the need for this special casing.

vhcr 9 hours ago

Somewhat related, hash() returns a result % sys.hash_info.modulus, so if you build a set or a dict with keys multiple of sys.hash_info.modulus, you run into quadratic behavior.

    >>> import sys
    >>> {i * sys.hash_info.modulus for i in range(5000)}
    0.40s
    >>> {i * sys.hash_info.modulus for i in range(50000)}
    29.34s
    >>> {i for i in range(50000)}
    0.06s
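
The collision behind those timings is easy to verify directly; a quick check, assuming CPython, where a non-negative int hashes to its value modulo `sys.hash_info.modulus`:

```python
import sys

m = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit CPython builds

# Every multiple of the modulus hashes to 0, so all such keys
# probe the same slots and inserts degrade toward quadratic time.
assert all(hash(i * m) == 0 for i in range(1, 100))
```
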
nneonneo 10 hours ago

Naturally, this extends to custom types:

  >>> class X:
  ...     def __hash__(self): return -1
  ... 
  >>> hash(X())
  -2
ks2048 11 hours ago

The next k for which hash(k) != k: 2**61-1

    print(hash(2**61 - 2))
    2305843009213693950

    print(hash(2**61 - 1))
    0
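
This falls out of CPython reducing int hashes modulo `sys.hash_info.modulus`; a sketch, assuming a typical 64-bit build where the modulus is 2**61 - 1:

```python
import sys

m = sys.hash_info.modulus
assert m == 2**61 - 1                # 64-bit build assumed (32-bit uses 2**31 - 1)
assert hash(2**61 - 2) == 2**61 - 2  # below the modulus: hashes to itself
assert hash(2**61 - 1) == 0          # exactly the modulus: wraps to 0
assert hash(2**61) == 1              # one past: wraps to 1
```
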
binary132 9 hours ago

The fact that a C function really only has its return value for communicating success or error status is why most fallible C functions use mutable parameters, also known as output arguments, for their result value or values. This is like, elementary C program design.

  • lifthrasiir 6 hours ago

In this case however it can be partly justified as a space-saving mechanism, as the hash also has to be cached in the object. It's like Rust's `Option<NonZeroU64>` done manually.

    • binary132 5 hours ago

      What I’m saying is either the value or the error can be discarded. If it was an error, the exception mechanism can be used since at that point you’re in the Python wrapper. It would probably be a little slower though, but Python is already slow. That way the error doesn’t need to be in the API at all.

daxfohl 10 hours ago

Another question is why do ints largely hash to themselves? Most hashing guides recommend hashes be far more distributed IIUC.
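
The identity behavior is easy to confirm in CPython (other implementations may differ), with -1 as the lone exception:

```python
# Small non-negative ints hash to themselves...
assert all(hash(i) == i for i in range(10_000))
# ...and so do negative ints, except the reserved C-level error value.
assert hash(-5) == -5
assert hash(-1) == -2
```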

  • kazinator 5 hours ago

    There could be an issue whereby some attacker controls the hash inputs which happen to be integers, and chooses a sequence which collides to the same hash value. It would be easy to identify such a sequence.

It just doesn't seem to have been identified as a real issue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)

  • woodruffw 9 hours ago

Python's `__hash__` is used primarily for hash map bucketing. In that context, the identity of an integer is a perfectly cromulent bucketing value.

    • zarzavat 9 hours ago

      If your keys are multiples of the table size then all your keys will hash to the same bucket, in a naive hash table at least.

      • daxfohl 8 hours ago

        More explicitly, since buckets in Python's Set implementation are the bottom N bits of the hash (where N depends on set size), this ensures linear sequences of ints (which are the most common type of int set) are all in different buckets, so zero collisions. This is the case except when the differences between integers in the set are multiples of powers of 2. But even then, the way Set is implemented, you don't see perf start to regress until like 2**20 or so. If everything in the set is multiples of 2**20, then they start to underperform better-distributed hashes. But it never gets more than like twice as slow.

        So, it ends up being a matter of optimizing for the common case, but still being reasonably performant for the worst case.
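
        A sketch of that masking, assuming CPython's power-of-two table sizes (8 is the smallest):

```python
TABLE_SIZE = 8
MASK = TABLE_SIZE - 1  # bucket index is hash(key) & MASK

# Consecutive ints land in distinct buckets: zero collisions.
assert {hash(i) & MASK for i in range(8)} == set(range(8))

# Multiples of the table size all land in bucket 0.
assert {hash(8 * i) & MASK for i in range(8)} == {0}
```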

Galanwe 10 hours ago

> So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value.

Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.

  • sgerenser 9 hours ago

    Functions using errno typically still use a return value of -1 to indicate an error has occurred, then errno just stores the specific error code. Otherwise, how do you even know that you need to check errno?

    • kazinator 4 hours ago

      There are examples of such an ambiguity in the ISO C library itself.

      In those situations, the application must begin by clearing errno to zero.

      Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.

      If one of those values occurs, and if errno is nonzero, then the error occurred.

      This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.

      strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).

      ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.

      If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.

    • BenjiWiebe 9 hours ago

      You could always check errno and reset it to zero before any function call.

  • kragen 9 hours ago

    Also you can longjmp().

  • moralestapia 8 hours ago

    >it has to return that error as its return value.

    This is the kind of crap ignorant C developers do and then pretend is the language's fault.

    You can jump, you can return a reference to a complex type, a tuple, whatever.

kazinator 5 hours ago

Object hashes that can fail or be unimplemented, where lower-level hashes have to be aware of this and stay away from the value that the parent uses for error signaling?

What else is unmitigated shit in Python? :)

TRiG_Ireland 11 hours ago

I enjoyed this, because it seems to be written by an interested and interesting human, at a level that I could actually follow.

snickerbockers 10 hours ago

>CPython is written in C, and unlike Python, C doesn’t have exceptions. So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value

This is just wrong. C doesn't feature 'error handling' as a dedicated form of branching the way many higher level languages do and you're in no way required to use return codes as an error signal. This is a case of bad API design and is entirely python's fault.

jiggawatts 11 hours ago

This seems like a loaded footgun: anyone writing a custom hash function has to know about -1 being treated like an error and handle it in a similar way.

I can’t think of any other language where this kind of thing happens, which means other developers won’t expect it either.

I can see the bug report now: “certain records cause errors, occurs only for about one in a few billion.”

  • dagw 10 hours ago

    If you're writing a custom hash function in python for a custom python class, then this won't affect you since it's only treated as an error in the underlying C code.

    If you're adding a new type to the core python language then you have to be aware of this, but if you're hacking the C implementation to change the core language then you're probably pretty well versed in the cpython internals, or at least surrounded by people that are.

    • dagw 10 hours ago

      interestingly enough though, it doesn't seem to be possible to write a custom hash function that returns -1

         >>> class Foo:
         ...     def __hash__(self):
         ...         return -1
         ...
         >>> f = Foo()
         >>> print(hash(f))
         -2
      
      So if you're doing something where you have a custom __hash__ function that you're expecting to return -1 for a certain value and then are testing for value of the hash of an object rather than testing the property of the object directly, then this might bite you. But I cannot think of any reasonable case where you might want to do that.
      • dragonwriter 9 hours ago

        You can write a custom hash method that returns -1 (and the one you wrote verifiably does.)

        The standard (not custom) hash function, in CPython, which calls your custom method and returns the actual value CPython uses for bucketing, will return -2 in that case, though.

Which is necessary, so that, e.g., following the single mathematical function for hashing of numeric types published in the Python Language Reference will preserve x==y implies hash(x)==hash(y) for custom and built-in numeric types in CPython, where hash(-1) does not follow that pattern.

        Even though for -1 itself, the divergence from the rule is in __hash__ itself. This is also consistent with the basic rule that dunder methods are for communication with the standard machinery, but external consumers should use the standard machinery, not the dunder method.

      • omoikane 10 hours ago

Perhaps the worry was that because the observed behavior up until now was that hash never returns -1, if it were possible to write a custom hash function that returns -1, that might break some unsuspecting code due to Hyrum's Law.

        • dragonwriter 9 hours ago

          If the built-in hash function returned -1 when a custom __hash__ method did so, then custom numeric classes following the hashing guidelines in the Python Language Reference designed to maintain the x==y => hash(x)==hash(y) invariant would fail to maintain that invariant on the reference implementation of Python due to a CPython implementation detail, which would be a Bad Thing™.

  • krab 10 hours ago

    I don't think so. This is true for people writing cpython. But if you implement custom `__hash__` method, you should be fine.

intalentive 11 hours ago

    >>> d = {-2: None}
    >>> -1 in d
    False

What gives?

  • krab 11 hours ago

    Because -1 is not equal to -2. They will fall into the same bucket of the hash map, but the hash collisions are expected and must be accounted for.

  • hyperpape 10 hours ago

    Since hashes aren't guaranteed to be unique, a dictionary should use equality to actually confirm that the object with the given hash matches the key present in the dictionary.
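
    A toy class (hypothetical, not from the article) with a deliberately constant hash shows the dict falling back to equality:

```python
class Collider:
    """Every instance shares one hash, so lookups must rely on __eq__."""
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, Collider) and self.v == other.v

d = {Collider(1): "one", Collider(2): "two"}
assert d[Collider(1)] == "one"  # same hash as the other key; __eq__ picks the right one
assert d[Collider(2)] == "two"
```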

  • nayuki 10 hours ago

    > The __hash__() method should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects ...

    -- https://docs.python.org/3/reference/datamodel.html#object.__...

    > The general contract of hashCode is:

    > * If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.

    > * It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

    -- https://docs.oracle.com/en/java/javase/23/docs/api/java.base...

  • Arnavion 10 hours ago

    If hashmaps only looked at hash(k1) == hash(k2) and not also k1 == k2 they would be quite useless as maps.

kstrauser 11 hours ago

Because the hash function is only reliably useful for laying out dict keys, and its outputs are totally opaque and coincidental.

The pigeonhole principle says there are an infinite number of inputs with the same hash output. Trying to figure out how that happens can be fun and enlightening, but why? “Because it just does, that’s all.”

  • Etheryte 11 hours ago

    I think this is not constructive criticism. There's a good, clear cut reason for this specific overlap (error handling), saying it collides just because is not really useful.

    • kstrauser 11 hours ago

      But even that’s an implementation detail that happens to be true today, but could easily disappear tomorrow. There’s nothing in the docs saying it has to be that way, or that you can infer anything from a hash output other than that it’s going to be consistent within a single execution of that Python program.

      • tooltower 11 hours ago

        This article is not about the API contract of the hash function, or the abstraction it provides. If you are just trying to hash things, you don't need any info here.

        It's very much trying to go _under_ the abstraction layer to investigate its behavior. Because it's interesting.

        This is very similar to how people investigate performance quirks or security issues.

        • kstrauser 11 hours ago

          I read the article and it's interesting! I learned something from it. From an end user POV it explains the mechanism behind how hash(-1)==hash(-2), which is neat. There's not really a why behind it, though. It wasn't really planned or designed for -1 to have an unexpected hash value. That's just the value the devs randomly picked as a flag. It could've been -837 just as easily.

      • alain94040 10 hours ago

        If the article title had been "why does the hash function never return -1", you would agree it's not a random behavior and serves a very clear purpose. It just so happens that the original author didn't know that until they did all the digging (nice job by the way).

      • sedatk 11 hours ago

        It could have been a bug. There's nothing wrong with the question and seeking answers to it. It was an interesting read too.

pmarreck 11 hours ago

A better question might be, why are people still using OO dynamic languages without any typing?

  • nayuki 11 hours ago

    > without any typing?

        >>> "abc" + 321
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: can only concatenate str (not "int") to str
    
    I rest my case.
  • kstrauser 11 hours ago

    Not sure. Is anyone using a language like that? Python certainly isn’t one.

    • worik 11 hours ago

      > Python certainly isn’t one.

      I am not a Python expert, but...

Python is described as OO and I thought it is dynamically typed

      Not quite "without any typing" but close

      • dragonwriter 9 hours ago

        There are different definitions of most of the terms around typing which make most typing conversations confusing as people talk past each other.

Particularly, there is a common definition of “typed” which is exactly equivalent to “statically typed”, under which all so-called “dynamically typed” languages are actually “untyped”, and within that system strong and weak typing are either a meaningless distinction or one within the set of statically typed languages.

        There's also, of course, a common usage within which “dynamically typed” is meaningful and strong vs. weak typing is a separate distinction, usually mostly discussed within languages with dynamic typing, though languages like C being both statically typed and weakly typed has been discussed.

      • kstrauser 11 hours ago

        Not really. Python's strongly typed, but dynamically. You can think of Python variables as untyped references to strongly typed values: the values are typed, but their names are not.

        • tuveson 10 hours ago

          > untyped references to strongly typed values

          I would call that weakly typed since there is no step before your code runs that validates (or even tries to validate) that your program satisfies a "type system". But since the definition of "strong" and "weak" typing has always been vague, I will just say that it is stupidly typed[1], and that I am a pedantic nerd.

          1. https://danieltuveson.github.io/programming/languages/stupid...

        • wruza 10 hours ago

          This is just hair splitting. What everyone means by "typing" today is typing members in structures and co. Every time you reach for python after a typed language, it feels like going through a barbed minefield blindfolded. The fact that its "world" almost entirely consists of little moving subclasses in subnamespaces doesn't help either. All you can think of python is how to write as few lines of it as possible to return to DX faster. It is my only use case for AI-based development because I cannot stand an extra minute in this abomination without cursing into the editor.

          • kragen 9 hours ago

            It's not hairsplitting, and it's not what everyone means, just the people you talk to. Languages without typing are things like assembly language(s), which underlie everything you do on a computer and is (are) still in the TIOBE Index's top 20 https://www.tiobe.com/tiobe-index, as well as much more obscure languages. The distinction between untyped languages and dynamically typed languages is night and day.

            I'm pretty competent in Python, C, Java, JavaScript, assembly, and a number of other languages†, and my experience in Python is nothing like what you're describing, although I do have my own frustrations with it. It's possible that when you're more experienced you'll have a different perspective, but it sounds like you're stuck inside a pretty small bubble right now.

            ______

            † Last year I also programmed in C++, Lua, C#, bash, Tcl, Emacs Lisp, Forth, Golang, OCaml, Scheme, Perl, Common Lisp, and the ngspice scripting language.

            • wruza 9 hours ago

              I have a similar list, no need to worry about my YoE in various things. I left full-time python many years ago. You're right about my current bubble though. I find it productive and useful, compared to python which I have to touch from time to time. My worst gripe is that ML is my new interest and it is all python. I'm basically screwed.

          • dragonwriter 9 hours ago

            > Every time you reach for python after a typed language, it feels like going through a barbed minefield blindfolded.

Using both Python and mandatorily statically typed languages regularly professionally (my main working languages being C#, Python, and a mix of JS and TS), that's not my experience at all.

(Of course, Python has a variety of optional static typecheckers with differing degrees of type inference, as well as supporting explicit type specifications; it's not untyped unless you choose to use it that way.)

        • Dylan16807 11 hours ago

          I agree with your last sentence, but I think we shouldn't say python itself is "strongly typed".

          • kstrauser 11 hours ago

            Why, specifically? It's commonly described as strongly typed and I'm not aware of an argument against that.

            • Dylan16807 10 hours ago

              Is the specific reason not clear? You just said it. The variables are untyped.

              Strong versus weak is not super solidly defined as referring only to values and not variables.

              • kstrauser 10 hours ago

                > Is the specific reason not clear?

                Correct.

                > Strong versus weak is not super solidly defined as referring only to values and not variables.

                You're right. It's not super solidly defined as referring only to variables and not values, either. If dynamic typing is the same as weak typing, there'd be no point in having both phrases.

                • Dylan16807 9 hours ago

                  >> Is the specific reason not clear?

                  > Correct.

                  Is it still unclear with the clarification you didn't quote? I can't tell if I need to explain more or not.

                  > It's not super solidly defined as referring only to variables and not values, either.

                  I didn't mean to suggest it was.

                  > If dynamic typing is the same as weak typing, there'd be no point in having both phrases.

                  It's not the same. There's a bunch of things referred to as "weak".

        • worik 10 hours ago

          > Not really. Python's strongly typed, but dynamically

          Yea, nah!

          You cannot have both.

          • kstrauser 10 hours ago

            That's an interesting and uncommon claim.

  • homebrewer 11 hours ago

TaPL should be required reading for self-proclaimed "engineers". Do untyped languages even exist, or could they? Maybe something purely theoretical for ivory tower academics? Even POSIX shell and assembler type their values.

    • nayuki 10 hours ago

      > Do untyped languages even exist, or could they? Even POSIX shell and assembler type their values.

      I program in assembly language. Memory is just a big array of bytes. Values don't have types, but each instruction chooses what type to interpret memory as - e.g. uint8, int16, float32, x86 real-mode pointer, long-mode pointer, etc.

    • int_19h 9 hours ago

      I think one can reasonably argue that, if the language only has a single type, it is effectively "untyped".

      That would mean languages like Tcl. Or, for that matter, B.