I want a fast way to grab a URL and parse it while streaming. Ideally this should be super fast. My language of choice is Python. I have an intuition that Twisted can do this, but I'm at a loss to find an example.
If you need to handle HTTP responses in a streaming fashion, there are a few options.
You can do it via downloadPage:
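    from xml.sax import make_parser
    from twisted.web.client import downloadPage

    class StreamingXMLParser:
        """File-like object that parses the data written to it as XML."""

        def __init__(self):
            self._parser = make_parser()

        def write(self, data):
            # Called with each chunk of the response body as it arrives.
            self._parser.feed(data)

        def close(self):
            # Called once the whole body has been written; finish the parse.
            self._parser.close()

    parser = StreamingXMLParser()
    d = downloadPage(url, parser)
    # d fires when the response is completely received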
This works because downloadPage writes the response body to the file-like object passed to it. Here, passing in an object with write and close methods satisfies that requirement, but incrementally parses the data as XML instead of putting it on a disk.
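The parser above has no content handler attached, so it parses but never reports anything. As a rough sketch of how the same write/close object could collect results, here is a version that runs without Twisted by writing the chunks by hand. The names TitleCollector and StreamingXMLSink and the sample feed are made up for illustration; only the xml.sax calls come from the standard library.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # text pieces seen while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


class StreamingXMLSink:
    """File-like object: write() feeds the parser, close() finishes it."""

    def __init__(self, handler):
        self._parser = make_parser()
        self._parser.setContentHandler(handler)

    def write(self, data):
        self._parser.feed(data)

    def close(self):
        self._parser.close()


handler = TitleCollector()
sink = StreamingXMLSink(handler)
# Simulate a response body arriving in arbitrary chunks (assumed sample data).
for chunk in (b"<feed><title>Fir", b"st</title><title>Second</ti", b"tle></feed>"):
    sink.write(chunk)
sink.close()
print(handler.titles)
```

An object like sink could then be passed to downloadPage in place of parser, and the handler inspected once the Deferred fires.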
Another approach is to hook into things at the HTTPPageGetter level. HTTPPageGetter is the protocol used internally by getPage.
    from xml.sax import make_parser
    from twisted.internet import reactor
    from twisted.web.client import HTTPClientFactory, HTTPPageGetter

    class StreamingXMLParsingHTTPClient(HTTPPageGetter):
        def connectionMade(self):
            HTTPPageGetter.connectionMade(self)
            self._parser = make_parser()

        def handleResponsePart(self, data):
            # Called with each chunk of the response body.
            self._parser.feed(data)

        def handleResponseEnd(self):
            self._parser.close()
            # Whatever you pass to handleResponse will be the result of
            # the Deferred below.
            self.handleResponse(None)

    factory = HTTPClientFactory(url)
    factory.protocol = StreamingXMLParsingHTTPClient
    # host and port are the address of the server that url points at
    reactor.connectTCP(host, port, factory)
    d = factory.deferred
    # d fires when the response is completely received
Finally, there will be a new HTTP client API soon. Since this isn't part of any release yet, it's not as directly useful as the previous two approaches, but it's somewhat nicer, so I'll include it to give you an idea of what the future will bring. :) The new API lets you specify a protocol to receive the response body. So you'd do something like this:
    from xml.sax import make_parser
    from twisted.internet import reactor
    from twisted.internet.defer import Deferred
    from twisted.internet.protocol import Protocol
    from twisted.web.client import Agent

    class StreamingXMLParser(Protocol):
        def __init__(self):
            self.done = Deferred()

        def connectionMade(self):
            self._parser = make_parser()

        def dataReceived(self, data):
            # Called with each chunk of the response body.
            self._parser.feed(data)

        def connectionLost(self, reason):
            # Called once no more body data will arrive.
            self._parser.close()
            self.done.callback(None)

    agent = Agent(reactor)
    d = agent.request('GET', url, headers, None)

    def cbRequest(response):
        # You can look at the response headers here if you like.
        protocol = StreamingXMLParser()
        response.deliverBody(protocol)
        return protocol.done

    # d fires when the response is fully received and parsed
    d.addCallback(cbRequest)
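All three approaches hand the body to the parser in arbitrary chunks. If you would rather act on each element as soon as it is complete instead of writing SAX callbacks, one alternative (not what the code above uses) is the standard library's XMLPullParser. The sketch below assumes a made-up document of item elements and a hypothetical helper iter_items; in a Twisted protocol the feed and read_events calls would sit in dataReceived.

```python
from xml.etree.ElementTree import XMLPullParser


def iter_items(chunks):
    """Yield the text of each <item> as soon as its end tag has been fed."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        # read_events() returns only the events completed so far.
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                yield elem.text
                elem.clear()  # release the element's content once handled
    parser.close()


# Assumed sample data, split at awkward places on purpose.
chunks = [b"<items><item>a</it", b"em><item>b</item>", b"</items>"]
items = list(iter_items(chunks))
print(items)
```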