Below is a list of major projects we know people are currently pursuing. If you have thoughts on these or want to help, please let us know.
Replication
Replication is currently the major focus for a number of us. This will turn Kafka into a fully replicated message log.
What is replication? Messages are currently written to a single broker with no replication between brokers. We would like to add replication between brokers and expose an option that lets the producer block until a configurable number of replicas have acknowledged the message, so that the client can control its fault-tolerance semantics.
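The producer-side "block until N replicas have acknowledged" idea can be sketched with a countdown latch. This is purely illustrative: the class and method names below are hypothetical and not part of Kafka's API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed acknowledgement policy: a send is
// considered committed once `requiredAcks` replicas confirm the write.
class ReplicatedSend {
    private final CountDownLatch acks;

    ReplicatedSend(int requiredAcks) {
        this.acks = new CountDownLatch(requiredAcks);
    }

    /** Called once per replica that has durably written the message. */
    void acknowledge() {
        acks.countDown();
    }

    /** Blocks the producer until enough replicas have acknowledged, or the timeout expires. */
    boolean awaitCommit(long timeoutMs) {
        try {
            return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Setting the required ack count to 1 would give today's single-broker durability, while higher values trade latency for fault tolerance.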
You can see more details on this plan here.
Improved Stream Processing Libraries
Kafka supports partitioning data by key and doing distributed stream consumption and publication. It would be nice to have a small library for common processing operations such as joins, filtering, and grouping.
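As a rough sketch of what such a library's helpers might look like, here is a combined filter-and-group operation over keyed messages. The method name and shape are assumptions for illustration, not a proposed API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative helper: filter a stream of (key, value) messages, then
// group the surviving values by key — the kind of reusable operation a
// small stream-processing library could provide.
class StreamOps {
    static Map<String, List<String>> filterAndGroup(
            List<Map.Entry<String, String>> messages, Predicate<String> keep) {
        return messages.stream()
                .filter(m -> keep.test(m.getValue()))
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }
}
```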
Below is a list of projects which would be great to have but haven't yet been started. Ping the mailing list if you are interested in working on any of these.
Clients In Other Languages
We offer a JVM-based client for production and consumption. It would be great to implement the client in other languages.
Convert Hadoop InputFormat or OutputFormat to Scala
We have a Hadoop InputFormat and OutputFormat that were contributed and are in use at LinkedIn. This code is in Java, though, which means it doesn't fit well with the rest of the project. It would be good to convert it to Scala to keep things consistent.
Syslog Producer
We currently have a custom producer and a log4j appender for "logging"-type applications. Outside the Java world, however, the standard for logging is syslogd. It would be great to have an asynchronous producer that works with syslogd to support these kinds of applications.
Hierarchical Topics
Currently streams are divided into only two levels: topics and partitions. This is unnecessarily limited. We should add support for hierarchical topics and allow subscribing to an arbitrary subset of paths. For example, one could have /events/clicks and /events/logins and subscribe to either of these alone, or get the merged stream by subscribing to the parent directory /events.
In this model, partitions are naturally just subtopics (for example /events/clicks/0 might be one partition). This reduces the conceptual weight of the system and adds some power.
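The subscription rule described above — a subscriber to a path receives everything at or beneath it — comes down to a simple prefix match on topic paths. A minimal sketch (the class and method names are illustrative):

```java
// Illustrative path matcher for hierarchical topics: a subscription to
// /events covers /events itself plus /events/clicks, /events/logins, and
// (since partitions are just subtopics) /events/clicks/0.
class TopicTree {
    static boolean matches(String subscription, String topic) {
        return topic.equals(subscription)
                || topic.startsWith(subscription + "/");
    }
}
```

Note the trailing "/" in the prefix check, which keeps /events from accidentally matching a sibling like /eventsx.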
Pluggable Consumer Offset Storage Strategies
Currently consumer offsets are persisted in ZooKeeper, which works well for many use cases. There is no inherent reason the offsets must be stored there, however. We should expose a pluggable interface to allow alternate storage mechanisms.
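A pluggable offset store might look something like the interface below, with ZooKeeper as one implementation among several. The interface shape and names here are an assumption for illustration; an in-memory implementation is shown because it is handy for tests.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical pluggable interface for consumer offset storage. A
// ZooKeeper-backed implementation would preserve today's behavior;
// alternatives (a database, a local file) could plug in the same way.
interface OffsetStore {
    void commit(String group, String topicPartition, long offset);

    /** Returns the last committed offset, or -1 if none has been committed. */
    long fetch(String group, String topicPartition);
}

// Trivial in-memory implementation, useful for unit tests.
class InMemoryOffsetStore implements OffsetStore {
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();

    public void commit(String group, String topicPartition, long offset) {
        offsets.put(group + "/" + topicPartition, offset);
    }

    public long fetch(String group, String topicPartition) {
        return offsets.getOrDefault(group + "/" + topicPartition, -1L);
    }
}
```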
REST Proxy
It would be great to have a REST proxy for Kafka to ease integration with languages that lack first-class clients. It would also make it easier for web applications to produce data to Kafka.