Google Cloud Dataflow: Can't parse proto using TextIO.Read
Here's my code:
PCollection<MyProto> pCollection = p.apply(TextIO.Read.from("gs://my_bucket/*").withCoder(Proto2Coder.of(MyProto.class)));
But it fails with this error:
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
The file parses fine when downloaded locally.
I've tried the same thing using StringUtf8Coder and ByteArrayCoder, no dice.
Any help? Should I not be using TextIO? What other options do I have?
TextIO splits the file into lines and applies the coder to each line. Naturally, this doesn't work for formats that are not line-based. I suppose your files contain a single serialized proto each, correct? In that case you have two options:
- Create your own Source and Reader classes (see the generic documentation on creating sources and sinks) by subclassing FileBasedSource.
- Treat the act of processing the files as a ParDo: create an in-memory PCollection containing the filenames to process (using Create.of()), pipe it through a ParDo that takes a filename and parses that file as a protobuf, and pipe the result to the rest of your pipeline.
The second is easier, but the first will work better if you have a lot of files.
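To make the difference concrete, here is a minimal sketch in plain Java, not actual Dataflow code. It shows why splitting on newlines damages a serialized proto (the tag byte for field 1 with a length-delimited value is 0x0A, the same byte as '\n') and what the per-file logic of the second option might look like. The class name WholeFileParseSketch, both method names, and the parser argument are hypothetical; in a real pipeline the body of parseWholeFile would sit inside the ParDo's DoFn, the parser would be something like MyProto.parseFrom wrapped to handle its checked exception, and reading gs:// paths would need a GCS-aware reader rather than java.nio.file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class WholeFileParseSketch {

    /** Splits raw bytes on '\n' (0x0A), roughly what a line-based reader does. */
    static List<byte[]> splitOnNewline(byte[] data) {
        List<byte[]> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                pieces.add(Arrays.copyOfRange(data, start, i));
                start = i + 1;
            }
        }
        if (start < data.length) {
            pieces.add(Arrays.copyOfRange(data, start, data.length));
        }
        return pieces;
    }

    /** Reads one local file completely and hands all of its bytes to the parser. */
    static <T> T parseWholeFile(String filename, Function<byte[], T> parser)
            throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return parser.apply(bytes);
    }

    public static void main(String[] args) throws IOException {
        // Hand-encoded message: tag 0x0A (field 1, length-delimited), length 2, "hi".
        byte[] encoded = {0x0A, 0x02, 'h', 'i'};

        // Line-based reading cuts the message at its very first byte.
        List<byte[]> pieces = splitOnNewline(encoded);
        System.out.println("pieces after line split: " + pieces.size());

        // Whole-file reading keeps the message intact for the parser.
        Path file = Files.createTempFile("proto-sketch", ".bin");
        Files.write(file, encoded);
        int length = parseWholeFile(file.toString(), bytes -> bytes.length);
        System.out.println("bytes seen by parser: " + length);
        Files.delete(file);
    }
}
```

Here the stand-in parser only counts bytes, which is enough to see that the whole-file path delivers the complete message while the line split leaves an empty piece and a fragment with no leading tag.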