UPDATE (mirroring the state-of-the-art knowledge level) status: 2017-05-12
The reason for this update is the fact that at the time I was asking this question I was not aware that I have discovered something about how Python3 works "under the hood".
The conclusion from all what will follow is:
If you write own Python3 code for an iterator and care about speed of execution you should write it as a generator function and not as an iterator class.
Below a minimalistic code example demonstrating that the same algorithm (here: self-made version of Pythons range()) expressed as a generator function runs much faster than if expressed as an iterator class: 
def   gnrtYieldRange(startWith, endAt, step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step
class iterClassRange:
    def __init__(self, startWith, endAt, step=1): 
        self.startWith = startWith - 1
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration
N = 10000000
print("    Size of created list N = {} elements (ints 1 to N)".format(N))
from time import time as t
from customRange import gnrtYieldRange as cthnYieldRange
from customRange import cintYieldRange
from customRange import iterClassRange as cthnClassRange
from customRange import cdefClassRange
iterPythnRangeObj =          range(1, N+1)
gnrtYieldRangeObj = gnrtYieldRange(1, N)
cthnYieldRangeObj = cthnYieldRange(1, N)
cintYieldRangeObj = cintYieldRange(1, N)
iterClassRangeObj = iterClassRange(1, N)
cthnClassRangeObj = cthnClassRange(1, N)
cdefClassRangeObj = cdefClassRange(1, N)
sEXECs = [ 
    "liPR = list(iterPythnRangeObj)",
    "lgYR = list(gnrtYieldRangeObj)",
    "lcYR = list(cthnYieldRangeObj)",
    "liGR = list(cintYieldRangeObj)",
    "liCR = list(iterClassRangeObj)",
    "lcCR = list(cthnClassRangeObj)",
    "ldCR = list(cdefClassRangeObj)"
 ]
sCOMMENTs = [ 
    "Python3 own range(1, N+1) used here as reference for timings  ",
    "self-made range generator function using yield (run as it is) ",
    "self-made range (with yield) run from module created by Cython",
    "Cython-optimized self-made range (using yield) run from module",
    "self-made range as iterator class using __next__() and return ",
    "self-made range (using __next__) from module created by Cython",
    "Cython-optimized self-made range (using __next__) from module "
 ]
for idx, sEXEC in enumerate(sEXECs): 
    s=t();exec(sEXEC);e=t();print("{} takes: {:3.1f} sec.".format(sCOMMENTs[idx], e-s))
print("All created lists are equal:", all([liPR == lgYR, lgYR == lcYR, lcYR == liGR, liGR == liCR, liCR == lcCR, lcCR == ldCR]) )
print("Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'")
The code above put into a file and runned prints to stdout:
>python3.6 -u "gnrtFunction-fasterThan-iterClass_runMe.py"
    Size of created list N = 10000000 elements (ints 1 to N)
Python3 own range(1, N+1) used here as reference for timings   takes: 0.2 sec.
self-made range generator function using yield (run as it is)  takes: 1.1 sec.
self-made range (with yield) run from module created by Cython takes: 0.5 sec.
Cython-optimized self-made range (using yield) run from module takes: 0.3 sec.
self-made range as iterator class using __next__() and return  takes: 3.9 sec.
self-made range (using __next__) from module created by Cython takes: 3.3 sec.
Cython-optimized self-made range (using __next__) from module  takes: 0.2 sec.
All created lists are equal: True
Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'
>Exit code: 0
From the timings above you can see that the generator function variant of the self-made range() iterator runs faster than the iterator class variant and when no optimization of code is involved this behavior propagates also into C-code level of C-code created by Cython. 
If you are curious why in detail it is that way you can read through the provided answer(s) or play yourself a bit with the provided code yourself.
Below the missing pieces of code necessary to run the code above:
customRange.pyx - the file Cython creates the customRange module from: 
def gnrtYieldRange(startWith, endAt, step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step
class iterClassRange:
    def __init__(self, startWith, endAt, step=1): 
        self.startWith = startWith - 1
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration
def cintYieldRange(int startWith, int endAt, int step=1): 
    while startWith <= endAt: 
        yield startWith
        startWith += step
cdef class cdefClassRange:
    cdef int startWith
    cdef int endAt
    cdef int step
    def __init__(self, int startWith, int endAt, int step=1): 
        self.startWith = startWith - 1
        self.endAt     = endAt
        self.step      = step
    def __iter__(self): 
        return self
    def __next__(self): 
        self.startWith += self.step
        if self.startWith <= self.endAt:
            return self.startWith
        else:
            raise StopIteration
and the setup file customRange-setup.py used to create the Python customRange module: 
import sys
sys.argv += ['build_ext',  '--inplace']
from distutils.core import setup
from Cython.Build   import cythonize
setup(
  name = 'customRange',
  ext_modules = cythonize("customRange.pyx"),
)
At the time I have asked this question I was busy with a quite complex
algorithm for generating unique combinations from a non-unique list available in form of a generator function using yield. My goal was to create a Python module written in C using this algorithm to make it run faster. For this purpose I have rewritten the generator function which used yield to an iterator class using __next__() and return. As I compared the speed of both variants of the algorithm I was surprised that the iterator class was two times slower than the generator function and I had (wrongly) assumed it has something to do with the way I have rewritten the algorithm (you need to know this if you want to better understand what the answers here are about) and had therefore
Originally asked how to make the iterator class version run at the same speed as the generator function and where the speed difference comes from?.
Below some more about the HISTORY of the question:
In the below provided Python script code exactly the same algorithm for creating unique combinations from a non-unique list of elements was implemented using a Python function with yield and using a class with __next__. The code is ready to run after copy/paste, so you can see it for yourself what I am speaking about. 
The same phenomenon observed for pure Python code propagates into C code of a Python extension module created out of the script code by Cython, so it is not limited to Python level code because it doesn't vanish at the C code level.
The question is:
Where does the huge difference in speed of execution come from? Is there anything that can be done to get both code variants to run at comparable speed? Is there something went wrong with the class/next implementation compared to the function/yield variant? Both are to my knowledge exactly the same code ...
Here the code (tweaking the number in the highlighted line changes the level of uniqueness of elements in the list the combinations are generated from what has a huge impact on the running time):
def uniqCmboYieldIter(lstItems, lenCmbo):
    dctCounter = {}
    lenLstItems = len(lstItems)
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    lenUniqs   = len(lstUniqs)
    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
    idxIntoCmbo, idxIntoUniqs = 0, 0
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1
    if idxIntoCmbo != lenCmbo:
        return
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
class uniqCmboClassIter:
    # ----------------------------------------------------------------------------------------------
    def __iter__(self):
       return self
    # ----------------------------------------------------------------------------------------------
    def __init__(self, lstItems, lenCmbo):
        dctCounter = {}
        lenLstItems = len(lstItems)
        for idx in range(lenLstItems):
            item = lstItems[idx]
            if item in dctCounter.keys(): 
                dctCounter[item] += 1
            else: 
                dctCounter[item]  = 1
            #:if
        #:for     
        self.lstUniqs   = sorted(dctCounter.keys())
        self.lenUniqs   = len(self.lstUniqs)
        self.lstCntRpts = [dctCounter[item] for item in self.lstUniqs]
        self.lenCmbo        = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = [0] * self.lenUniqs
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo  += count
            self.idxIntoUniqs += 1
            # print("self.multiplicities:", self.multiplicities)
            # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
        if self.idxIntoCmbo != self.lenCmbo:
            return
        self.stopIteration = False
        self.x = None
        self.y = None
        return
    # ----------------------------------------------------------------------------------------------
    def __next__(self):
        if self.stopIteration is True:
            raise StopIteration
            return
        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1
            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo
        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)
            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1
            if self.y == self.lenUniqs:
                break
        return nextCmbo
# ============================================================================================================================================
lstSize   = 48 # 48
uniqLevel = 12 # (7 ~60% unique) higher level => more unique items in the generated list
aList = []
from random import randint
for _ in range(lstSize):
    aList.append( ( randint(1,uniqLevel), randint(1,uniqLevel) ) )
lenCmbo = 6
percUnique = 100.0 - 100.0*(lstSize-len(set(aList)))/lstSize
print("========================  lenCmbo:", lenCmbo, 
      "   sizeOfList:", len(aList), 
      "   noOfUniqueInList", len(set(aList)), 
      "   percUnique",  int(percUnique) ) 
import time
from itertools import combinations
# itertools.combinations
# ---
# def   uniqCmboYieldIter(lstItems, lenCmbo):
# class uniqCmboClassIter: def __init__(self, lstItems, lenCmbo):
# ---
start_time = time.time()
print("Combos:%9i"%len(list(combinations(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(     combinations(aList, lenCmbo)))):",  "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboYieldIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):",  "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboClassIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboClassIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
and the timings on my box:
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 32    percUnique 66
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.04635 seconds.
Combos:  1296058  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):   3.25447 seconds.
Combos:  1296058  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):   5.97371 seconds.
>Exit code: 0
  [2017-05-02_03:23]  207474 <-Chrs,Keys-> 1277194 OnSave(): '/home/claudio/CgMint18/_Cg.DIR/ClaudioOnline/at-stackoverflow/bySubject/uniqueCombinations/nonRecursiveUniqueCombos_Cg.py'
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 22    percUnique 45
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.05199 seconds.
Combos:   191072  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):   0.47343 seconds.
Combos:   191072  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):   0.89860 seconds.
>Exit code: 0
  [2017-05-02_03:23]  207476 <-Chrs,Keys-> 1277202 OnSave(): '/home/claudio/CgMint18/_Cg.DIR/ClaudioOnline/at-stackoverflow/bySubject/uniqueCombinations/nonRecursiveUniqueCombos_Cg.py'
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
========================  lenCmbo: 6    sizeOfList: 48    noOfUniqueInList 43    percUnique 89
Combos: 12271512  print(len(list(     combinations(aList, lenCmbo)))):   2.17285 seconds.
Combos:  6560701  print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):  16.72573 seconds.
Combos:  6560701  print(len(list(uniqCmboClassIter(aList, lenCmbo)))):  31.17714 seconds.
>Exit code: 0
UPDATE (status 2017-05-07):
At the time of asking the question and offering a bounty it was not known to me that there is a way to easily create C code of an extension module for an iterator object out of Python script code using Cython and that such C code can be created also from an iterator function using
yield.
Considering that the generated faster version of the C extension module is still not fast enough to compete with itertools.combinations it doesn't make much sense to dive deeply into knowing what exactly is causing the slowdown when using an iterator class compared to an iterator function and how to overcome this. It makes much more sense to find a way to speed up the faster version using Cython, especially because I am a total novice in writing Python extension modules failing to create a working code after hours and hours of intense focused work spend on tweaking existing C code of itertools.combinations with own modifications because of Segmentation Fault errors for which I was not able to grasp the reason of. 
Currently I think that there is still room to speed up the by me used Cython code and no need to go the harder way of writing the C code myself.
Below Cython code that runs OK and for speed optimized Cython code which changes somehow (I can't currently see the reason for that) the way the algorithm works and produce therefore wrong results. The idea behind the Cython optimization was to use in Cython code Python/Cython arrays instead of a Python lists. Any hints how to get a faster running Python extension module out of the used algorithm in a for a novice "safe" way are welcome.
def subbags_by_loops_with_dict_counter(lstItems, int lenCmbo):
    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    cdef int lenUniqs   = len(lstUniqs)
    cmboAsIdxUniqs = [None] * lenCmbo
    multiplicities = [0] * lenUniqs
    cdef int idxIntoCmbo
    cdef int idxIntoUniqs
    cdef int count        
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1
    if idxIntoCmbo != lenCmbo:
        return
    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
Below OPTIMIZED CYTHON CODE which produces wrong results:
def subbags_loops_dict_cython_optimized(lstItems, int lenCmbo):
    dctCounter = {}
    cdef int lenLstItems = len(lstItems)
    cdef int idx = 0
    for idx in range(lenLstItems):
        item = lstItems[idx]
        if item in dctCounter.keys(): 
            dctCounter[item] += 1
        else: 
            dctCounter[item]  = 1
        #:if
    #:for     
    lstUniqs   = sorted(dctCounter.keys())
    lstCntRpts = [dctCounter[item] for item in lstUniqs]
    cdef int lenUniqs   = len(lstUniqs)
    cdef array.array cmboAsIdxUniqs = array.array('i', [])
    array.resize(cmboAsIdxUniqs, lenCmbo)
    # cmboAsIdxUniqs = [None] * lenCmbo 
    cdef array.array multiplicities = array.array('i', [])
    array.resize(multiplicities, lenUniqs)
    # multiplicities = [0] * lenUniqs
    cdef int idxIntoCmbo
    cdef int maxIdxCmbo
    cdef int curIdxCmbo
    cdef int idxIntoUniqs
    cdef int count        
    while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
        count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
        maxIdxCmbo = idxIntoCmbo + count
        curIdxCmbo = idxIntoCmbo
        while curIdxCmbo < maxIdxCmbo: 
            cmboAsIdxUniqs[curIdxCmbo] = idxIntoUniqs
            curIdxCmbo += 1
        multiplicities[idxIntoUniqs] = count
        idxIntoCmbo  += count
        idxIntoUniqs += 1
    # print("multiplicities:", multiplicities)
    # print("cmboAsIdxUniqs:", cmboAsIdxUniqs)
    if idxIntoCmbo != lenCmbo:
        return
    cdef int x
    cdef int y
    while True:
        yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
        for idxIntoCmbo in reversed(range(lenCmbo)):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            y = x + 1
            if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
                break
        else:
            return
        for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
            x = cmboAsIdxUniqs[idxIntoCmbo]
            cmboAsIdxUniqs[idxIntoCmbo] = y
            multiplicities[x] -= 1
            multiplicities[y] += 1
            # print("# multiplicities:", multiplicities)
            while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
                y += 1
            if y == lenUniqs:
                break
I made some experiences when I rewrote some of the recipes of the itertools documentation as C extensions. I think I may have some insights that could help you.
When you write pure Python code it's a tradeoff between speed (generator) and features (iterator).
The yield functions (known as generators) are for speed and generally they can be written without bothering about internal state. So it's less effort to write them and they are fast because Python just manages all the "state". 
The reason generators are faster (or at least not slower) is mostly because:
They implement the __next__-slot directly (typically tp_iternext) besides the __next__-method. In that case Python doesn't have to lookup the __next__ method - that's essentially what makes it faster in the following example:
from itertools import islice
def test():
    while True:
        yield 1
class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1
%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it's almost 3 times faster just because generators directly populate the __next__-slot.
A yield-function and the class have a state, but the yield function saves and loads the state much faster than you could with a class and attribute access:
def test():
    i = 0
    while True:
        yield i
        i += 1
class Test(object):
    def __init__(self):
        self.val = 0
    def __iter__(self):
        return self
    def __next__(self):
        current = self.val
        self.val += 1
        return current
%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is already 4 times slower (compared to the almost 3 times, when no state was involved). That is a cumulative effect: so the more "state" you have, the slower the class variant will be.
So much for the yield vs. class approach. Note that the actual timing will depend on the kind of operations. For example if the actual code that is run when next is called is slow (i.e. time.sleep(1)) then there's almost no difference between generator and class!
If you want a cython iterator class that is fast it has to be a cdef class. Otherwise you don't get the really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPythons %%cython to compile the code (so I don't have to include the setup):
%%cython
def test():
    while True:
        yield 1
class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1
cdef class Test_cdef(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and basic class are faster than the pure Python equivalent, but their relative performance roughly stayed the same. However the cdef class variant beats both of them and that's mainly because the tp_iternext slot was used instead of just implementing the __next__ method. (Inspect the Cython generated C code if you don't trust me :) )
However it's just 2 times faster than the Python generator, that's not bad but it's not exactly overwhelming. To get really amazing speedups, you'll need to find a way to express your program without Python objects (the less Python objects the more speedup). For example if you use a dictionary for storing the item and it's multiplicity you still store Python objects and any lookup has to be done using python dictionary methods - even if you can call them by C API function instead of having to look up the real methods:
%%cython
cpdef cython_count(items):
    cdef dict res = dict()
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res
import random
def count(items):
    res = {}
    for item in items:
        if item in res:
            res[item] += 1
        else:
            res[item] = 1
    return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There's one catch here, you didn't use collections.Counter which has an optimized C code (at least in python-3) for this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A quick note here: Don't use something in some_dict.keys() because the keys() are list-like in Python2 and ony implement O(n) contains operations while something in some_dict is typically O(1) (both Pythons)! That will make things faster in both versions but especially on Python2:
def count2(items):
    res = {}
    for item in items:
        if item in res.keys():  # with "keys()"
            res[item] += 1
        else:
            res[item] = 1
    return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop  <--- WHOOPS!!!
That shows that you can only hope for something like 3-4 times speedup with Cython (and C extensions) when you use python structures but even minor mistakes like using ".keys()" can cost you much more in terms of performance if used incorrectly.
So what can you do if you want it faster? The answer is relativly easy: Create your own data structure based on C types instead of Python types.
That means you have to think about the design:
uniqComb**? Do you want integers (the examples say so, but I suppose you want arbitary Python objects).uniqComb** function to be sortable? You used sorted but you could also use a OrderedDict and keep keys in the order of appearance instead of by numerical value.The answers to these questions (these are only the question I immediatly asked myself, there are probably many more!) can help you decide which structure you can use internally. For example with Cython you can interface to C++ and you could use a map containing integer keys and integer values instead of a dictionary. It's sorted by default so you don't need to manually sort them yourself and you operate on native integers instead of Python objects. But you loose the ability to process arbitary python objects in your uniqComb and you need to know how to operate with C++ types in Cython. It could be amazingly fast though!
I don't go down that path because I assume you want to support arbitary orderable python types and I stick with the Counter as starting point but I'll save the multiplicities as integer array.arrays instead of as list. Let's call it the "least invasive" optimization. It actually doesn't matter much in terms of performance if you use a list or the array for lstCntRpts and multiplicities because they aren't a bottleneck - but it's a bit faster and saves a bit memory and more importantly it shows how you can include homogeneous arrays with cython:
%%cython
from cpython.list cimport PyList_Size  # (most) C API functions can be used with cython!
from array import array
from collections import Counter
cdef class uniqCmboClassIter:
    cdef list lstUniqs
    cdef Py_ssize_t lenUniqs
    cdef int[:] lstCntRpts   # memoryview
    cdef Py_ssize_t lenCmbo
    cdef list cmboAsIdxUniqs
    cdef int[:] multiplicities  # memoryview
    cdef Py_ssize_t idxIntoCmbo
    cdef Py_ssize_t idxIntoUniqs
    cdef bint stopIteration
    cdef Py_ssize_t x
    cdef Py_ssize_t y
    def __init__(self, lstItems, lenCmbo):
        dctCounter = Counter(lstItems)
        self.lstUniqs = sorted(dctCounter)
        self.lenUniqs = PyList_Size(self.lstUniqs)
        self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
        self.lenCmbo        = lenCmbo
        self.cmboAsIdxUniqs = [None] * lenCmbo
        self.multiplicities = array('i', [0] * self.lenUniqs)
        self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
        while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
            count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
            self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
            self.multiplicities[self.idxIntoUniqs] = count
            self.idxIntoCmbo += count
            self.idxIntoUniqs += 1
            # print("self.multiplicities:", self.multiplicities)
            # print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
        if self.idxIntoCmbo != self.lenCmbo:
            return
        self.stopIteration = False
        self.x = 0
        self.y = 0
        return
    def __iter__(self):
        return self
    def __next__(self):
        if self.stopIteration is True:
            raise StopIteration
        nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
        for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.y = self.x + 1
            if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
                break
        else:
            self.stopIteration = True
            return nextCmbo
        for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
            self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
            self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
            self.multiplicities[self.x] -= 1
            self.multiplicities[self.y] += 1
            # print("# multiplicities:", multiplicities)
            while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
                self.y += 1
            if self.y == self.lenUniqs:
                break
        return nextCmbo
You actually didn't share your parameters for the timings but I tried it with some of mine:
from itertools import combinations
import random
import time
def create_values(maximum):
    vals = [random.randint(0, maximum) for _ in range(48)]
    print('length: ', len(vals))
    print('sorted values: ', sorted(vals))
    print('uniques: ', len(set(vals)))
    print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
    return vals
class Timer(object):
    def __init__(self):
        pass
    def __enter__(self):
        self._time = time.time()
    def __exit__(self, *args, **kwargs):
        print(time.time() -  self._time)
vals = create_values(maximum=50)  # and 22 and 75 and 120
n = 6
with Timer():
    list(combinations(vals, n))
with Timer():
    list(uniqCmboClassIter(vals, n))
with Timer():
    list(uniqCmboClassIterOriginal(vals, n))
with Timer():
    list(uniqCmboYieldIterOriginal(vals, n))
length: 48 sorted values: [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22] uniques: 21 uniques in percent: 43.750000% 6.250450611114502 0.4217393398284912 4.250436305999756 2.7186365127563477 length: 48 sorted values: [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50] uniques: 33 uniques in percent: 68.750000% 6.2034173011779785 4.343803882598877 42.39261245727539 26.65750527381897 length: 48 sorted values: [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99] uniques: 39 uniques in percent: 81.250000% 6.859697341918945 10.437987327575684 104.12988543510437 65.25306582450867 length: 48 sorted values: [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147] uniques: 41 uniques in percent: 85.416667% 6.484673023223877 13.610010623931885 136.28764533996582 84.73834943771362
It definetly performs much better than the original approaches, actually several times faster with just type declarations. There's probably a lot more that could be optimized (disable bounds checking, using Python C API function calls, using unsigned integers or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...) - but the fact that it's not much slower than itertools.combinations even for 80% unique items and much faster than any original implementation is good enough for me. :-)
When you write a generator function using yield, the overhead of saving off and restoring state is handled by the CPython internals (implemented in C). With __iter__/__next__, you have to manage saving and restoring state on each call. In CPython, Python level code is slower than C level built-ins so the extr Python level code involved in the state management (including stuff as simple as accessing attributes of self via dict lookups rather than loading local variables, with only array indexing overhead) ends up costing you a lot.
If you implement your own iterator protocol supporting type in a C extension module, you'll bypass this overhead; saving and restoring state should be a matter of a few C level variable accesses (with similar or lesser overhead compared to what Python generator functions incur, which is to say, very little). Effectively, that's what generator functions are, a C extension type that saves and restores the Python frame on each call to tp_iternext (the C level equivalent of __next__).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With